> This is probably the wrong list for this question, but is anyone
> willing to give me a clue why
>
> $line =~ tr/+/ /;
>
> would clip out the lead bytes of a shift-JIS string in a cgi script?
That's a badly-formed regular expression. "+" means "one or more of
what was just expressed", but you haven't expressed anything so far,
so god knows what it will match.
I think you meant to say ".+", but that will just delete the whole
string in this context. What did you want to do?
Doug McNutt - 01 Dec 2007 02:19 GMT
>>This is probably the wrong list for this question, but is anyone willing to give me a clue why
>>
[quoted text clipped - 5 lines]
>
>I think you meant to say ".+", but that will just delete the whole string in this context. What did you want to do?
+ is a special representation of a space in URL encoding but it's the one-or-more quantifier in perl regular expressions. My guess is that the intent is to replace pluses with spaces, which requires escaping the +. How about
$line =~ s/\+/ /g;
where I have added the g to get them all.
But what the devil is the thingy (missing char) repeated at least once? Should it not have produced a compile error or at least a warning? Was the -w switch used?

Signature
--> On the eighth day, about 6 kiloyears ago, the Lord realized that free will would make man ask what existed before the Creation. So He installed a few gigayears of history complete with a big bang and a fossilized record of evolution. <--
> This is probably the wrong list for this question, but is anyone
> willing to give me a clue why
>
> $line =~ tr/+/ /;
>
> would clip out the lead bytes of a shift-JIS string in a cgi script?
snip
The only thing that should do is replace occurrences of '+' with ' '
in $line. You can read more about the tr/// operator in perldoc
perlop.
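A small sketch of that point, using a made-up sample value rather than anything from the original script. It only shows that tr/// treats '+' as a literal character and that the regex form needs the backslash and the g modifier to do the same job:

```perl
use strict;
use warnings;

# Hypothetical sample value; any URL-encoded fragment would do.
my $line = 'this+is+a+test.+%82%B1';
my $copy = $line;

# tr/// (also spelled y///) takes a list of characters, not a pattern,
# so '+' is a literal plus sign here. It returns how many characters it changed.
my $count = ( $line =~ tr/+/ / );

# The regex equivalent needs the backslash and the /g modifier.
$copy =~ s/\+/ /g;

print "$count replaced: $line\n";
print "same result with s///\n" if $copy eq $line;
```

Neither form touches the %XX escapes, so any byte loss would have to come from a later step.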
I guess it would help if I posted my code and what it puts out.
> This is probably the wrong list for this question, but is anyone
> willing to give me a clue why
[quoted text clipped - 10 lines]
> here. Which is why a newbies list would probably be better for this
> question.)
# The code that grabs the parameters:
my $qString = $ENV{'QUERY_STRING'};
my @list = split( '&', $qString, 10 );
my %queries = ();
foreach my $pair ( @list )
{ my ( $key, $value ) = split( '=', $pair, 2 );
# Really should just give in and use CGI.
#$key =~ tr/+/ /;
$key =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;
$queries{ $key . '_0' } = $value;
my $value_y = my $value_tr = $value;
$value_y =~ y/+/ /;
$queries{ $key . '_y1' } = $value_y;
$value_tr =~ tr/+/ /;
$queries{ $key . '_tr1' } = $value_tr;
$value_y =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;
$queries{ $key . '_y' } = $value_y;
$value_tr =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;
$queries{ $key . '_tr' } = $value_tr;
$value =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;
$queries{ $key } = $value;
}
# For reference, the code being used to dump the parameters to html:
sub dumpQueries
{ if ( $debugParams )
{
print "language = $language, function = $function\n";
print "<table border='1'>\n";
my @keys = keys ( %queries );
foreach my $key ( @keys )
{ my $value = $queries{ $key };
if ( !defined ( $value ) )
{ $value = "UNDEF";
}
print "<tr><td align='right'>$key</td><td align='left'>$value</td></tr>\n";
}
print "</table>\n";
}
}
# ------------results-------------
chatter this+is+a+test.+これはテストです。
function_tr eキ
chatter_y1 this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
who_tr1 daddy
function 投稿する
who_y1 daddy
function_tr1 %93%8A%8De%82%B7%82%E9
who daddy
who_0 daddy
who_tr daddy
who_y daddy
chatter_0 this+is+a+test.+%82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
chatter_y this is a test. アヘeXgナキB
function_y eキ
chatter_tr this is a test. アヘeXgナキB
function_y1 %93%8A%8De%82%B7%82%E9
function_0 %93%8A%8De%82%B7%82%E9
chatter_tr1 this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
# ------------results-sorted-------------
chatter_0 this+is+a+test.+%82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
chatter_y1 this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
chatter_tr1 this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
chatter_y this is a test. アヘeXgナキB
chatter_tr this is a test. アヘeXgナキB
chatter this+is+a+test.+これはテストです。
function_0 %93%8A%8De%82%B7%82%E9
function_y1 %93%8A%8De%82%B7%82%E9
function_tr1 %93%8A%8De%82%B7%82%E9
function_y eキ
function_tr eキ
function 投稿する
who_0 daddy
who_y1 daddy
who_tr1 daddy
who_y daddy
who_tr daddy
who daddy
# ------------results-html-------------
language = j, function = talk
<table border='1'>
<tr><td align='right'>chatter</td><td align='left'>this+is+a+test.+これはテストです。</td></tr>
<tr><td align='right'>function_tr</td><td align='left'>eキ</td></tr>
<tr><td align='right'>chatter_y1</td><td align='left'>this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B</td></tr>
<tr><td align='right'>who_tr1</td><td align='left'>daddy</td></tr>
<tr><td align='right'>function</td><td align='left'>投稿する</td></tr>
<tr><td align='right'>who_y1</td><td align='left'>daddy</td></tr>
<tr><td align='right'>function_tr1</td><td align='left'>%93%8A%8De%82%B7%82%E9</td></tr>
<tr><td align='right'>who</td><td align='left'>daddy</td></tr>
<tr><td align='right'>who_0</td><td align='left'>daddy</td></tr>
<tr><td align='right'>who_tr</td><td align='left'>daddy</td></tr>
<tr><td align='right'>who_y</td><td align='left'>daddy</td></tr>
<tr><td align='right'>chatter_0</td><td align='left'>this+is+a+test.+%82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B</td></tr>
<tr><td align='right'>chatter_y</td><td align='left'>this is a test. アヘeXgナキB</td></tr>
<tr><td align='right'>function_y</td><td align='left'>eキ</td></tr>
<tr><td align='right'>chatter_tr</td><td align='left'>this is a test. アヘeXgナキB</td></tr>
<tr><td align='right'>function_y1</td><td align='left'>%93%8A%8De%82%B7%82%E9</td></tr>
<tr><td align='right'>function_0</td><td align='left'>%93%8A%8De%82%B7%82%E9</td></tr>
<tr><td align='right'>chatter_tr1</td><td align='left'>this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B</td></tr>
</table>
# ------------results-extract-------------
chatter this+is+a+test.+これはテストです。
chatter_tr this is a test. アヘeXgナキB
function 投稿する
function_tr eキ
# ------------results-extract-hexdump-lined-up-------------
00000000 63 68 61 74 74 65 72 09 |chatter.|
00000008 74 68 69 73 2b 69 73 2b 61 2b 74 65 73 74 2e 2b |this+is+a+test.+|
00000018 82 b1 82 ea 82 cd 83 65 83 58 83 67 82 c5 82 b7 81 42 0a |.......e.X.g.....B.|
0000002b 63 68 61 74 74 65 72 5f 74 72 09 |chatter_tr.|
00000036 74 68 69 73 20 69 73 20 61 20 74 65 73 74 2e 20 |this is a test. |
00000046 b1 cd 65 58 67 c5 b7 42 0a |..eXg..B.|
0000004f 66 75 6e 63 74 69 6f 6e 09 |function.|
00000058 93 8a 8d 65 82 b7 82 e9 0a |...e.....|
00000061 66 75 6e 63 74 69 6f 6e 5f 74 72 09 |function_tr.|
0000006d 65 b7 0a |e..|
# ------------results-extract-hexdump-straight-------------
00000000 63 68 61 74 74 65 72 09 74 68 69 73 2b 69 73 2b |chatter.this+is+|
00000010 61 2b 74 65 73 74 2e 2b 82 b1 82 ea 82 cd 83 65 |a+test.+.......e|
00000020 83 58 83 67 82 c5 82 b7 81 42 0a 63 68 61 74 74 |.X.g.....B.chatt|
00000030 65 72 5f 74 72 09 74 68 69 73 20 69 73 20 61 20 |er_tr.this is a |
00000040 74 65 73 74 2e 20 b1 cd 65 58 67 c5 b7 42 0a 66 |test. ..eXg..B.f|
00000050 75 6e 63 74 69 6f 6e 09 93 8a 8d 65 82 b7 82 e9 |unction....e....|
00000060 0a 66 75 6e 63 74 69 6f 6e 5f 74 72 09 65 b7 0a |.function_tr.e..|
# ------------results-end-------------
00000018 82 b1 82 ea 82 cd 83 65 83 58 83 67 82 c5 82 b7 81 42 0a |.......e.X.g.....B.|
00000046 b1 cd 65 58 67 c5 b7 42 0a |..eXg..B.|
and
00000058 93 8a 8d 65 82 b7 82 e9 0a |...e.....|
0000006d 65 b7 0a |e..|
tell the tale.
Okay, so it looks like it isn't just stripping the lead bytes; every
now and then I'm losing a full JIS character.
Joel Rees
(waiting for a 3+GHz ARM processor to come out,
to test Steve's willingness to switch again.)
Chas. Owens - 01 Dec 2007 03:04 GMT
> I guess it would help if I posted my code and what it puts out.
snip
Whoa, way too much information. Try to reproduce your issue with the
least amount of code and data. Offhand I would say your problem is
probably with the encoding of your data (and Perl's lack of knowledge
about it). Try using the locale or encoding pragmas.
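A minimal test case along those lines might look like the sketch below. The sample value is a shortened, hypothetical cut of the posted data, and the script assumes no encoding pragma is in effect. Printing hex rather than the raw string keeps the terminal or browser encoding out of the picture:

```perl
use strict;
use warnings;

# Hypothetical minimal input: a fragment of the 'chatter' value from the dump.
my $value = 'a+test.+%82%B1%82%EA';

$value =~ tr/+/ /;                                        # '+' means space
$value =~ s/%([0-9A-Fa-f]{2})/pack( 'C', hex($1) )/eg;    # %XX to one byte

# Dump every byte as hex so missing lead bytes are easy to spot.
my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $value;
print "$hex\n";
```

If the 82 lead bytes show up in this output but not in the real script, the difference between the two is where to look next.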
Joel Rees - 01 Dec 2007 03:27 GMT
>> I guess it would help if I posted my code and what it puts out.
> snip
>
> Whoa, way too much information.
:-)
> Try to reproduce your issue with the
> least amount of code and data.
Actually, after cleaning out some of the extraneous stuff, I can see
it is not tr/// doing the dirty deed after all (which is a relief, in
a way):
> chatter_0 this+is+a+test.+%82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
> chatter_tr1 this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
> chatter_tr this is a test. アヘeXgナキB
> chatter this+is+a+test.+これはテストです。
Even though stripping the '+' out forces what had been intermittent
behavior, I can see that tr/// is doing its job right.
> Off hand I would say your problem is
> probably with the encoding of your data (and Perl's lack of knowledge
> about it). Try using the locale or encoding pragmas.
Yeah, I'll have to go back to that black magic.
Thanks everybody for being listening ears.
Joel Rees
Joel Rees - 01 Dec 2007 08:03 GMT
Okay, given the following (without all the debugging code I had in
earlier):
> # The code that grabs the parameters:
>
[quoted text clipped - 8 lines]
>
> $value =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;
> $queries{ $key } = $value;
> }
Anyone know why commenting out the transliteration will recover the
shift-JIS characters from the url-encoded stream (leaving spaces as
'+', of course), but leaving the transliteration in will induce the
code to drop shift-JIS lead bytes and every now and then whole
characters?
I had a similar problem with
$value =~ s/\+/ /g;
but it was an intermittent problem. (Haven't tried it today to see
whether it only kills the shift-JIS characters when there is 8-bit
space in the stream, but that may have been what was happening.)
Joel Rees
Doug McNutt - 01 Dec 2007 18:59 GMT
>Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>Message-Id: <567C7534-688F-4335-9555-96412988E2A8@sannet.ne.jp>
>Content-Transfer-Encoding: 7bit
>From: Joel Rees <joel_rees@sannet.ne.jp>
>Content-Type: text/plain; charset=ISO-2022-JP; delsp=yes; format=flowed
>Message-Id: <89A94C5D-6617-4789-880C-8CD2FC825C85@sannet.ne.jp>
>Content-Transfer-Encoding: 7bit
>From: Joel Rees <joel_rees@sannet.ne.jp>
I had some intermittent problems reading your postings using Eudora-5 on this Mac 8500 running OS9.1 which I prefer for email.
The $line =~ tr/+/ /; showed up as $line =? tr/+/ /; and I got a couple of yen marks. I blamed it on lack of Unicode support. Looking back I see a couple of Content headers in your email that bother me. They both say simple 7-bit ASCII but then they also have diverse encodings stated which really are about how to use the eighth bit.
There is also the big-endian / little-endian consideration which has reared its ugly head with the introduction of Intel machines running Mac OS.
Is it possible that some of the failure to decode %xx encoded stuff is associated with development on one machine followed by execution on another? Is UTF-8 input coming from the likes of Apache a possible source of failure? Pack may need to allow for endian-ness of a specific machine.

Signature
--> From the U S of A, the only socialist country that refuses to admit it. <--
Joel Rees - 03 Dec 2007 09:41 GMT
For the record --
> Is UTF-8 input coming from the likes of Apache a possible source of
> failure? Pack may need to allow for endian-ness of a specific machine.
Well, it depends on how one looks at things, perhaps. I think one of
the probable reasons for the failure in the DWIM machinery was that I
am insisting on using shift-JIS characters in the source file instead
of utf-8 in strings and comments. But, no, Apache wasn't filtering
shift-JIS to utf-8 for me. Byte order also was not the problem.
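On the byte order question, here is a quick sketch of which pack templates can be affected at all. The value 0x82B1 is just the first double-byte character from the test string, used for illustration:

```perl
use strict;
use warnings;

# 'C' packs a single unsigned byte, so there is no byte order to get wrong.
my $one = pack 'C', 0x82;

# 'n' is always big-endian 16-bit; 'S' follows the machine running the script.
my $be     = pack 'n', 0x82B1;
my $native = pack 'S', 0x82B1;

printf "C: %s\n", unpack 'H*', $one;
printf "n: %s\n", unpack 'H*', $be;
printf "S: %s\n", unpack 'H*', $native;    # 82b1 or b182, depending on the CPU
```

So decoding %XX one byte at a time with 'C' is safe on any machine, while anything built with 'S' would only put the lead byte first on big-endian hardware.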
After several hours of analysis (using more of the stuff that made
the original posting of the source somewhat opaque), I determined
that the problem derived from perl sometimes being stricter about
shift-JIS than I wanted it to be.
I don't know why the '+' substitute for space would switch to strict
character interpretation, but it seems to have been doing so.
Shift-JIS is a variable byte width encoding, one or two bytes. Lead
bytes are inherently not valid as single-byte characters. Trailing
bytes are sometimes valid as single-byte characters and sometimes
not. If the regular expression engine is not checking for valid
bytes, all you have to do is string the decoded bytes together. But
if it is checking for valid bytes, you have to put the decoded bytes
into something other than a char. (Blame C for folding the type of a
byte onto the type of a character.)
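The byte layout described above can be sketched as a pattern over raw bytes. The lead byte ranges (0x81-0x9F and 0xE0-0xFC) are an assumption matching the loose checks in this post, trail bytes are deliberately not validated, and the input string is a hypothetical sample:

```perl
use strict;
use warnings;

# One Shift-JIS character: a lead byte followed by any byte, or else a
# single byte. Assumes the string holds raw bytes, not decoded characters.
my $sjis_char = qr/[\x81-\x9F\xE0-\xFC][\x00-\xFF]|[\x00-\xFF]/;

# Hypothetical input: "te" followed by two double-byte characters.
my $bytes = "te\x82\xB1\x82\xEA";

my @chars = $bytes =~ /($sjis_char)/g;

print scalar(@chars), " characters\n";
print join( ' ', map { unpack 'H*', $_ } @chars ), "\n";
```

Trying the two-byte alternative first is what keeps a trail byte such as 0x65 ('e') from being split off as a character of its own.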
But if you are collecting into 16-bit words, you have to actually
check for the lead bytes yourself. I'm sure someone could put an RE
together that would do it, but I just decided it was going to be
simpler to check and build the string by hand.
So, for anybody who's curious, here's what I'm doing for now:
-----------------------------------------
my $qString = $ENV{'QUERY_STRING'};
my @list = split( '&', $qString, 10 );
my %queries = ();
foreach my $pair ( @list )
{ my ( $key, $value ) = split( '=', $pair, 2 );
# Really should just give in and use CGI.
# $key =~ tr/+/ /; # You don't expect space in identifiers, but, ...
$key =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;
# $queries{ $key . '_' } = $value; # dbg
$value =~ tr/+/ /;
my ( $byteAccm, $hexAccm, $conv ) = ( 0, undef, '' );
while ( $value =~ m/%([\dA-Fa-f][\dA-Fa-f])|(.)/g )
{ if ( defined ( $1 ) )
{ my $hexValue = $1;
my $decValue = hex ( $hexValue );
if ( ! defined ( $hexAccm ) )
{ if ( $decValue <= 0x80 || ( $decValue >= 0xa0 && $decValue < 0xe0 ) || $decValue >= 0xfd )
{ $conv .= pack( 'C', $decValue );
}
else # Lead byte -- loose checks all around.
{ $byteAccm = $decValue;
$hexAccm = $hexValue;
}
}
else
{ # if ( $decValue >= 0x40 || ( $decValue > 0xa0 && $hexValue < 0xe0 ) )
$conv .= pack( 'S', ( $byteAccm << 8 ) + $decValue );
$byteAccm = 0;
$hexAccm = undef;
}
}
else
{ my $cValue = $2;
my $decValue = ord ( $cValue );
if ( ! defined ( $hexAccm ) )
{ $conv .= $cValue;
}
else
{ # if ( $decValue >= 0x40 || ( $decValue > 0xa0 && $hexValue < 0xe0 ) )
$conv .= pack( 'S', ( $byteAccm << 8 ) + $decValue );
$byteAccm = 0;
$hexAccm = undef;
}
}
}
$queries{ $key } = $conv;
}
-----------------------------------------
If this were production code, I should check some more gaps in the
lead byte (and check where the newest JIS adds the extra several
thousand characters) and uncomment the checks on the trailing bytes
(and add some trailing byte checks specific to certain lead bytes,
geagh). But then I have to figure out what to do with bad bytes.
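One possible policy for bad bytes, as a sketch only: decode to raw bytes first, then reject the whole value if a lead byte lacks a plausible trail byte. The helper name decode_sjis_param is hypothetical, the byte ranges are the commonly cited Shift-JIS ones (stricter than the loose checks above), and it assumes no encoding pragma is changing how the script handles bytes:

```perl
use strict;
use warnings;

# Single bytes (ASCII, half-width kana) or lead + trail pairs, nothing else.
my $valid = qr/\A(?: [\x00-\x7F\xA1-\xDF] | [\x81-\x9F\xE0-\xFC] [\x40-\x7E\x80-\xFC] )*\z/x;

# Hypothetical helper: returns the decoded bytes, or undef for a bad sequence.
sub decode_sjis_param {
    my ($value) = @_;
    $value =~ tr/+/ /;
    $value =~ s/%([0-9A-Fa-f]{2})/pack( 'C', hex($1) )/eg;
    return $value =~ $valid ? $value : undef;
}

# First value is the 'function' parameter from the dump; second is truncated.
for my $raw ( '%93%8A%8De%82%B7%82%E9', 'bad%82' ) {
    my $bytes = decode_sjis_param($raw);
    my $shown = defined $bytes ? unpack( 'H*', $bytes ) : 'rejected';
    print "$shown\n";
}
```

Rejecting the whole value is the bluntest choice; substituting a placeholder character for the bad pair would be another option.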
Joel Rees