Mac Forum / Programming / Perl / December 2007




tr question (probably wrong list to ask, but ...)

Joel Rees - 01 Dec 2007 00:33 GMT
This is probably the wrong list for this question, but is anyone  
willing to give me a clue why

$line =~ tr/+/ /;

would clip out the lead bytes of a shift-JIS string in a cgi script?

Come to think of it, I think it's being applied while the string is  
still hex-encoded, so it makes even less sense to me.

(I know, I should be letting the CGI module decode the url-encoded  
string. But I seem to be mis-understanding something fundamental  
here. Which is why a newbies list would probably be better for this  
question.)

Joel Rees
(waiting for a 3+GHz ARM processor to come out,
to test Steve's willingness to switch again.)
Andy Holyer - 01 Dec 2007 01:56 GMT
> This is probably the wrong list for this question, but is anyone  
> willing to give me a clue why
>
> $line =~ tr/+/ /;
>
> would clip out the lead bytes of a shift-JIS string in a cgi script?

That's a badly-formed regular expression. "+" means "one or more of  
what was just expressed", but you haven't expressed anything so far,  
so god knows what it will match.

I think you meant to say ".+", but that will just delete the whole  
string in this context. What did you want to do?
Doug McNutt - 01 Dec 2007 02:19 GMT
>>This is probably the wrong list for this question, but is anyone  willing to give me a clue why
>>
[quoted text clipped - 5 lines]
>
>I think you meant to say ".+", but that will just delete the whole  string in this context. What did you want to do?

+ is a special representation of a space in URL encoding, but it's the one-or-more collective char in Perl.  My guess is that the intent is to replace pluses with spaces, which requires escaping the +. How about

$line =~ tr/\+/ /g;

where I have added the g to get them all.

But what the devil is the thingy (missing char) repeated at least once? Should it not have produced a compile error or at least a warning?  Was the -w switch used?

Signature

--> On the eighth day, about 6 kiloyears ago, the Lord realized that free will would make man ask what existed before the Creation. So He installed a few gigayears of history complete with a big bang and a fossilized record of evolution. <--

Chas. Owens - 01 Dec 2007 02:02 GMT
> This is probably the wrong list for this question, but is anyone
> willing to give me a clue why
>
> $line =~ tr/+/ /;
>
> would clip out the lead bytes of a shift-JIS string in a cgi script?
snip

The only thing that should do is replace occurrences of '+' with ' '
in $line.  You can read more about the tr/// operator in perldoc
perlop.
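
A minimal sketch of that point, with a made-up sample string: tr/// takes lists of characters rather than a regular expression, so '+' is literal there, needs no escaping, and there is no /g modifier because the whole string is always scanned. The s/// form is shown only for comparison.

```perl
use strict;
use warnings;

# tr/// works on character lists, not a regex: '+' is just a character.
my $line = 'this+is+a+test.';
(my $with_tr = $line) =~ tr/+/ /;

# s/// does use a regex, so here '+' must be escaped and /g is required
# to replace every occurrence.
(my $with_s = $line) =~ s/\+/ /g;

print "$with_tr\n";   # this is a test.
print "$with_s\n";    # this is a test.

# With an empty replacement list and no /d, tr/// only counts matches
# and leaves the string unchanged.
my $count = ($line =~ tr/+//);
print "$count\n";     # 3
```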
Joel Rees - 01 Dec 2007 02:43 GMT
I guess it would help if I posted my code and what it puts out.

> This is probably the wrong list for this question, but is anyone  
> willing to give me a clue why
[quoted text clipped - 10 lines]
> here. Which is why a newbies list would probably be better for this  
> question.)

# The code that grabs the parameters:

my $qString = $ENV{'QUERY_STRING'};
my @list = split( '&', $qString, 10 );
my %queries = ();
foreach my $pair ( @list )
{    my ( $key, $value ) = split( '=', $pair, 2 );
    # Really should just give in and use CGI.
    #$key =~ tr/+/ /;
    $key =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;
    $queries{ $key . '_0' } = $value;
   
    my $value_y = my $value_tr = $value;
    $value_y =~ y/+/ /;
    $queries{ $key . '_y1' } = $value_y;
    $value_tr =~ tr/+/ /;
    $queries{ $key . '_tr1' } = $value_tr;
   
    $value_y =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;
    $queries{ $key . '_y' } = $value_y;
    $value_tr =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;
    $queries{ $key . '_tr' } = $value_tr;
   
    $value =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;
    $queries{ $key } = $value;
}

# For reference, the code being used to dump the parameters to html:

sub dumpQueries
{    if ( $debugParams )
    {
        print "language = $language, function = $function\n";

        print "<table border='1'>\n";

        my @keys = keys ( %queries );
        foreach my $key ( @keys )
        {    my $value = $queries{ $key };
            if ( !defined ( $value ) )
            {    $value = "UNDEF";
            }
            print "<tr><td align='right'>$key</td><td align='left'>$value</td></tr>\n";
        }

        print "</table>\n";
    }
}

# ------------results-------------
chatter    this+is+a+test.+これはテストです。
function_tr    eキ
chatter_y1    this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
who_tr1    daddy
function    投稿する
who_y1    daddy
function_tr1    %93%8A%8De%82%B7%82%E9
who    daddy
who_0    daddy
who_tr    daddy
who_y    daddy
chatter_0    this+is+a+test.+%82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
chatter_y    this is a test. アヘeXgナキB
function_y    eキ
chatter_tr    this is a test. アヘeXgナキB
function_y1    %93%8A%8De%82%B7%82%E9
function_0    %93%8A%8De%82%B7%82%E9
chatter_tr1    this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
# ------------results-sorted-------------
chatter_0    this+is+a+test.+%82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
chatter_y1    this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
chatter_tr1    this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
chatter_y    this is a test. アヘeXgナキB
chatter_tr    this is a test. アヘeXgナキB
chatter    this+is+a+test.+これはテストです。
function_0    %93%8A%8De%82%B7%82%E9
function_y1    %93%8A%8De%82%B7%82%E9
function_tr1    %93%8A%8De%82%B7%82%E9
function_y    eキ
function_tr    eキ
function    投稿する
who_0    daddy
who_y1    daddy
who_tr1    daddy
who_y    daddy
who_tr    daddy
who    daddy
# ------------results-html-------------
language = j, function = talk
<table border='1'>
<tr><td align='right'>chatter</td><td align='left'>this+is+a+test.+これはテストです。</td></tr>
<tr><td align='right'>function_tr</td><td align='left'>eキ</td></tr>
<tr><td align='right'>chatter_y1</td><td align='left'>this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B</td></tr>
<tr><td align='right'>who_tr1</td><td align='left'>daddy</td></tr>
<tr><td align='right'>function</td><td align='left'>投稿する</td></tr>
<tr><td align='right'>who_y1</td><td align='left'>daddy</td></tr>
<tr><td align='right'>function_tr1</td><td align='left'>%93%8A%8De%82%B7%82%E9</td></tr>
<tr><td align='right'>who</td><td align='left'>daddy</td></tr>
<tr><td align='right'>who_0</td><td align='left'>daddy</td></tr>
<tr><td align='right'>who_tr</td><td align='left'>daddy</td></tr>
<tr><td align='right'>who_y</td><td align='left'>daddy</td></tr>
<tr><td align='right'>chatter_0</td><td align='left'>this+is+a+test.+%82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B</td></tr>
<tr><td align='right'>chatter_y</td><td align='left'>this is a test. アヘeXgナキB</td></tr>
<tr><td align='right'>function_y</td><td align='left'>eキ</td></tr>
<tr><td align='right'>chatter_tr</td><td align='left'>this is a test. アヘeXgナキB</td></tr>
<tr><td align='right'>function_y1</td><td align='left'>%93%8A%8De%82%B7%82%E9</td></tr>
<tr><td align='right'>function_0</td><td align='left'>%93%8A%8De%82%B7%82%E9</td></tr>
<tr><td align='right'>chatter_tr1</td><td align='left'>this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B</td></tr>
</table>
# ------------results-extract-------------
chatter    this+is+a+test.+これはテストです。
chatter_tr    this is a test. アヘeXgナキB
function    投稿する
function_tr    eキ
# ------------results-extract-hexdump-lined-up-------------
00000000  63 68 61 74 74 65 72 09  |chatter.|
00000008  74 68 69 73 2b 69 73 2b  61 2b 74 65 73 74 2e 2b  |this+is+a+test.+|
00000018  82 b1 82 ea 82 cd 83 65  83 58 83 67 82 c5 82 b7  81 42 0a  |.......e.X.g.....B.|
0000002b  63 68 61 74 74 65 72 5f  74 72 09 |chatter_tr.|
00000036  74 68  69 73 20 69 73 20  61 20 74 65 73 74 2e 20  |this is a test. |
00000046  b1 cd  65 58 67 c5 b7 42 0a  |..eXg..B.|
0000004f  66 75 6e 63 74 69 6f 6e  09  |function.|
00000058  93 8a 8d 65 82 b7 82 e9  0a  |...e.....|
00000061  66 75 6e 63 74 69 6f 6e  5f 74 72 09  |function_tr.|
0000006d  65 b7 0a  |e..|
# ------------results-extract-hexdump-straight-------------
00000000  63 68 61 74 74 65 72 09  74 68 69 73 2b 69 73 2b  |chatter.this+is+|
00000010  61 2b 74 65 73 74 2e 2b  82 b1 82 ea 82 cd 83 65  |a+test.+.......e|
00000020  83 58 83 67 82 c5 82 b7  81 42 0a 63 68 61 74 74  |.X.g.....B.chatt|
00000030  65 72 5f 74 72 09 74 68  69 73 20 69 73 20 61 20  |er_tr.this is a |
00000040  74 65 73 74 2e 20 b1 cd  65 58 67 c5 b7 42 0a 66  |test. ..eXg..B.f|
00000050  75 6e 63 74 69 6f 6e 09  93 8a 8d 65 82 b7 82 e9  |unction....e....|
00000060  0a 66 75 6e 63 74 69 6f  6e 5f 74 72 09 65 b7 0a  |.function_tr.e..|
# ------------results-end-------------

00000018  82 b1 82 ea 82 cd 83 65  83 58 83 67 82 c5 82 b7  81 42 0a  |.......e.X.g.....B.|
00000046  b1 cd  65 58 67 c5 b7 42 0a  |..eXg..B.|

and

00000058  93 8a 8d 65 82 b7 82 e9  0a  |...e.....|
0000006d  65 b7 0a  |e..|

tell the tale.

Okay, so it looks like it isn't just stripping the lead bytes, every  
now and then I'm losing a full JIS character.
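
One way to narrow this down without a hexdump tool is to decode the escapes into raw bytes and print them as hex from inside Perl. This is only a sketch: the helper name url_decode_bytes is made up for illustration, and it assumes nothing else in the script (a pragma or an I/O layer) is changing how the string is interpreted.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-decode a query value into raw bytes.
sub url_decode_bytes {
    my ($value) = @_;
    $value =~ tr/+/ /;
    $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $value;
}

my $encoded = '%82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B';
my $decoded = url_decode_bytes($encoded);

# unpack 'H*' renders every byte as two hex digits, so a dropped lead
# byte shows up immediately.
my $hex = unpack('H*', $decoded);
print "$hex\n";                  # 82b182ea82cd83658358836782c582b78142
print length($decoded), "\n";    # 18
```

If the hex printed here is complete but the page still shows damaged text, the loss is happening somewhere after the decode step.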

Joel Rees
(waiting for a 3+GHz ARM processor to come out,
to test Steve's willingness to switch again.)
Chas. Owens - 01 Dec 2007 03:04 GMT
> I guess it would help if I posted my code and what it puts out.
snip

Whoa, way too much information.  Try to reproduce your issue with the
least amount of code and data.  Offhand I would say your problem is
probably with the encoding of your data (and Perl's lack of knowledge
about it).  Try using the locale or encoding pragmas.
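
As a rough sketch of what telling Perl about the encoding could look like, here is the core Encode module used explicitly instead of a pragma. The 'shiftjis' name is an assumption about the data; if the browser is really sending the Windows variant, 'cp932' may be the better choice. The sample value is cut down from the one in the post.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $value = 'this+is+a+test.+%82%B1%82%EA%82%CD';
$value =~ tr/+/ /;
$value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;   # still raw Shift-JIS bytes

# Tell Perl what the bytes mean; the result is a string of characters.
my $text = decode('shiftjis', $value);

print length($value), "\n";   # 22 bytes
print length($text), "\n";    # 19 characters

# Encode again for whatever the output side expects (UTF-8 assumed here).
binmode STDOUT, ':raw';
print encode('UTF-8', $text), "\n";
```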
Joel Rees - 01 Dec 2007 03:27 GMT
>> I guess it would help if I posted my code and what it puts out.
> snip
>
> Whoa, way too much information.

:-)

> Try to reproduce your issue with the
> least amount of code and data.

Actually, after cleaning out some of the extraneous stuff, I can see  
it is not tr/// doing the dirty deed after all (which is a relief, in  
a way):

> chatter_0    this+is+a+test.+%82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
> chatter_tr1    this is a test. %82%B1%82%EA%82%CD%83e%83X%83g%82%C5%82%B7%81B
> chatter_tr    this is a test. アヘeXgナキB
> chatter    this+is+a+test.+これはテストです。

Even though stripping the '+' out forces what had been intermittent  
behavior, I can see that tr/// is doing its job right.

>   Off hand I would say your problem is
> probably with the encoding of your data (and Perl's lack of knowledge
> about it).  Try using the locale or encoding pragmas.

Yeah, I'll have to go back to that black magic.

Thanks everybody for being listening ears.

Joel Rees
(waiting for a 3+GHz ARM processor to come out,
to test Steve's willingness to switch again.)
Joel Rees - 01 Dec 2007 08:03 GMT
Okay, given the following (without all the debugging code I had in  
earlier):

> # The code that grabs the parameters:
>
[quoted text clipped - 8 lines]
>    
>     $value =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;

>     $queries{ $key } = $value;
> }

Anyone know why commenting out the transliteration will recover the  
shift-JIS characters from the url-encoded stream (leaving spaces as  
'+', of course), but leaving the transliteration in will induce the  
code to drop shift-JIS lead bytes and every now and then whole  
characters?

I had a similar problem with

    $value =~ s/\+/ /g;

but it was an intermittent problem. (Haven't tried it today to see  
whether it only kills the shift-JIS characters when there is 8-bit  
space in the stream, but that may have been what was happening.)

Joel Rees
(waiting for a 3+GHz ARM processor to come out,
to test Steve's willingness to switch again.)
Doug McNutt - 01 Dec 2007 18:59 GMT
>Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>Message-Id: <567C7534-688F-4335-9555-96412988E2A8@sannet.ne.jp>
>Content-Transfer-Encoding: 7bit
>From: Joel Rees <joel_rees@sannet.ne.jp>

>Content-Type: text/plain; charset=ISO-2022-JP; delsp=yes; format=flowed
>Message-Id: <89A94C5D-6617-4789-880C-8CD2FC825C85@sannet.ne.jp>
>Content-Transfer-Encoding: 7bit
>From: Joel Rees <joel_rees@sannet.ne.jp>

I had some intermittent problems reading your postings using Eudora-5 on this Mac 8500 running OS9.1 which I prefer for email.

The $line =~ tr/+/ /; showed up as $line =? tr/+/ /;  and I got a couple of yen marks.  I blamed it on lack of Unicode support. Looking back, I see a couple of Content headers in your email that bother me. They both say simple 7-bit ASCII, but then they also have divers encodings stated which really are about how to use the eighth bit.

There is also the big-endian / little-endian consideration which has reared its ugly head with the introduction of Intel machines running Mac OS.

Is it possible that some of the failure to decode %xx encoded stuff is associated with development on one machine followed by execution on another? Is UTF-8 input coming from the likes of Apache a possible source of failure? Pack may need to allow for endian-ness of a specific machine.
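
For what it's worth, the pack question can be checked in a few lines. This is just a sketch with an arbitrary 16-bit value: the 'C' template packs a single byte, so byte order cannot affect it, while the 16-bit templates do differ.

```perl
use strict;
use warnings;

# 'C' packs one unsigned byte, so byte order never enters into it.
my $one = pack('C', 0x82);

# 16-bit templates differ: 'n' is always big-endian, 'v' is always
# little-endian, and 'S' follows whatever the host CPU uses.
my $big    = pack('n', 0x82b1);
my $little = pack('v', 0x82b1);
my $native = pack('S', 0x82b1);

printf "C: %s\n", unpack('H*', $one);      # 82
printf "n: %s\n", unpack('H*', $big);      # 82b1
printf "v: %s\n", unpack('H*', $little);   # b182
printf "S: %s\n", unpack('H*', $native);   # 82b1 or b182, host dependent
```

So decoding %xx escapes one byte at a time with 'C' should behave the same on PowerPC and Intel; only code that packs two bytes at once with 'S' would be sensitive to the machine.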

Signature

--> From the U S of A, the only socialist country that refuses to admit it. <--

Joel Rees - 03 Dec 2007 09:41 GMT
For the record --

> Is UTF-8 input coming from the likes of Apache a possible source of  
> failure? Pack may need to allow for endian-ness of a specific machine.

Well, it depends on how one looks at things, perhaps. I think one of  
the probable reasons for the failure in the DWIM machinery was that I  
am insisting on using shift-JIS characters in the source file instead  
of utf-8 in strings and comments. But, no, Apache wasn't filtering  
shift-JIS to utf-8 for me. Byte order also was not the problem.

After several hours of analysis (using more of the stuff that made  
the original posting of the source somewhat opaque), I determined  
that the problem derived from perl sometimes being stricter about  
shift-JIS than I wanted it to be.

I don't know why the '+' substitute for space would switch to strict  
character interpretation, but it seems to have been doing so.

Shift-JIS is a variable byte width encoding, one or two bytes. Lead  
bytes are inherently not valid as single-byte characters. Trailing  
bytes are sometimes valid as single-byte characters and sometimes  
not. If the regular expression engine is not checking for valid  
bytes, all you have to do is string the decoded bytes together. But  
if it is checking for valid bytes, you have to put the decoded bytes  
into something other than a char. (Blame C for folding the type of a  
byte onto the type of a character.)

But if you are collecting into 16-bit words, you have to actually  
check for the lead bytes yourself. I'm sure someone could put an RE  
together that would do it, but I just decided it was going to be  
simpler to check and build the string by hand.
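
For comparison, a regular expression along those lines might look like the sketch below. The byte ranges are assumptions in the same loose spirit as the hand-built loop (lead bytes 0x81-0x9F and 0xE0-0xFC, trail bytes 0x40-0x7E and 0x80-0xFC), and the sample string is made up; it works on an already-decoded byte string.

```perl
use strict;
use warnings;

# Assumed, loose ranges: a lead byte followed by a plausible trail byte
# is one character; any other byte is taken as a character on its own.
my $sjis_char = qr/[\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]|[\x00-\xff]/;

my $bytes = "\x82\xb1\x82\xea\x83\x65\x83\x58\x83\x67 ok";
my @chars = $bytes =~ /($sjis_char)/g;

print scalar(@chars), "\n";                                # 8
print join(' ', map { unpack('H*', $_) } @chars), "\n";    # 82b1 82ea 8365 8358 8367 20 6f 6b
```

Note that a lead byte with no valid trail byte simply falls through to the single-byte branch here, so this does not reject anything.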

So, for anybody who's curious, here's what I'm doing for now:

-----------------------------------------
my $qString = $ENV{'QUERY_STRING'};
my @list = split( '&', $qString, 10 );
my %queries = ();
foreach my $pair ( @list )
{    my ( $key, $value ) = split( '=', $pair, 2 );
    # Really should just give in and use CGI.
    # $key =~ tr/+/ /;    # You don't expect space in identifiers, but, ...
    $key =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;

    # $queries{ $key . '_' } = $value; # dbg
   
    $value =~ tr/+/ /;
   
    my ( $byteAccm, $hexAccm, $conv ) = ( 0, undef, '' );
    while ( $value =~ m/%([\dA-Fa-f][\dA-Fa-f])|(.)/g )
    {    if ( defined ( $1 ) )
        {    my $hexValue = $1;
            my $decValue = hex ( $hexValue );
            if ( ! defined ( $hexAccm ) )
            {    if ( $decValue <= 0x80 || ( $decValue >= 0xa0 && $decValue < 0xe0 ) || $decValue >= 0xfd )
                {    $conv .= pack( 'C', $decValue );
                }
                else    # Lead byte -- loose checks all around.
                {    $byteAccm = $decValue;
                    $hexAccm = $hexValue;
                }
            }
            else
            {    # if ( $decValue >= 0x40 || ( $decValue > 0xa0 && $decValue < 0xe0 ) )
                $conv .= pack( 'S', ( $byteAccm << 8 ) + $decValue );
                $byteAccm = 0;
                $hexAccm = undef;
            }
        }
        else
        {    my $cValue = $2;
            my $decValue = ord ( $cValue );
            if ( ! defined ( $hexAccm ) )
            {    $conv .= $cValue;
            }
            else
            {    # if ( $decValue >= 0x40 || ( $decValue > 0xa0 && $decValue < 0xe0 ) )
                $conv .= pack( 'S', ( $byteAccm << 8 ) + $decValue );
                $byteAccm = 0;
                $hexAccm = undef;
            }
        }
    }

    $queries{ $key } = $conv;
}
-----------------------------------------

If this were production code, I should check some more gaps in the  
lead byte (and check where the newest JIS adds the extra several  
thousand characters) and uncomment the checks on the trailing bytes  
(and add some trailing byte checks specific to certain lead bytes,  
geagh). But then I have to figure out what to do with bad bytes.
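
On the bad-bytes question, one possible approach (only a sketch, and assuming the core Encode module is acceptable here) is to let Encode do the validation and decide per call whether invalid input should be fatal. The function name decode_sjis_strict is hypothetical, and 'shiftjis' is an assumption about which variant the data uses.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded text, or undef when the bytes
# are not valid Shift-JIS according to Encode.
sub decode_sjis_strict {
    my ($bytes) = @_;    # work on a copy: a CHECK value may modify its argument
    my $text = eval { decode('shiftjis', $bytes, Encode::FB_CROAK) };
    return $text;
}

my $good = decode_sjis_strict("\x93\x8a\x8d\x65\x82\xb7\x82\xe9");
my $bad  = decode_sjis_strict("\x93\x8a\xff");

print defined $good ? "good: ok\n" : "good: rejected\n";
print defined $bad  ? "bad: ok\n"  : "bad: rejected\n";
```

Leaving the third argument off gives the default behaviour, which substitutes a replacement character for anything that does not map instead of dying.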

Joel Rees
(waiting for a 3+GHz ARM processor to come out,
to test Steve's willingness to switch again.)
 