




UTF-8 support?

John Blumel - 30 Apr 2005 19:10 GMT
I'm using Perl 5.8.1 on Panther (the stock version) and it seems as
though not everything handles utf8 gracefully. (I haven't worked in
Perl for a while; not since MacPerl 5.0.3, I think it was.) Regular
expressions seem to work, as do input and output, but some of the
modules and built-in functions don't seem to work as well as one
would hope.

In particular I've noticed that "substr" doesn't seem to work correctly
when dealing with wide characters. For example:

use utf8;
...
$blah =~ m/<wide_regex>/g;
$position = pos $blah;

seems to give the correct character position but,

$matched = substr($blah,
                  $position - length($blah),
                  length($blah));

doesn't put the matched text into $matched when there are wide
characters in $blah -- i.e., it seems to work off bytes rather than
characters.

Are these issues documented somewhere and are there standard techniques
for dealing with them?

John Blumel
Sherm Pendley - 30 Apr 2005 21:08 GMT
> use utf8;
> ...
[quoted text clipped - 9 lines]
> doesn't put the matched text into $matched when there are wide  
> characters in $blah

Nor should it, even if the text is plain old ASCII - there's a bug in  
the above code that has nothing to do with string encoding.

Take a ten-character ASCII string: 'abcdefghij'. Match it against 'fgh',
and $position will be 8 (the offset just past the 'h'), as expected.

The length of $blah is 10 and $position is 8, so the offset passed to
substr is 8 - 10 = -2, and the above call amounts to:

    $matched = substr('abcdefghij', -2, 10);

Which returns 'ij'; everything in the string *after* what was matched  
by the regex.

As the docs for substr() state, if the offset (second argument) is  
negative, the offset is taken from the end of the string, and if the  
combination of offset and length is partially outside of the string,  
the portion inside the string is returned. With an offset of -2, it's  
obviously impossible to take ten characters beginning two from the  
end, so only the remaining two are returned.
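
To see this for yourself, here is a minimal sketch of the ten-character example. The variable names follow the original post; the corrected call at the end is only one possible fix, assuming you really do want substr(), and it relies on $-[0] and $+[0] (the start and end offsets of the last successful match).

```perl
use strict;
use warnings;

my $blah = 'abcdefghij';

# m//g outside list context records where the match ended, so pos() works.
$blah =~ m/fgh/g;
my $position = pos $blah;                      # 8, just past the 'h'

# The call from the original post: the offset is 8 - 10 = -2.
my $tail = substr($blah, $position - length($blah), length($blah));

# One possible fix: start at the match start, take (end - start) characters.
my $matched = substr($blah, $-[0], $+[0] - $-[0]);

print "position: $position\n";                 # position: 8
print "tail:     $tail\n";                     # tail:     ij
print "matched:  $matched\n";                  # matched:  fgh
```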

But really, why bother with substr() at all? Just parenthesize the  
regex, and store the results in a list:

    my @matched = ($blah =~ m/(<regex>)/g);

That will return a list of all the strings matching the expression.
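
As a rough illustration with wide characters, the sketch below runs the same list-context match over a string containing Greek letters. The sample string and the character-class regex are made up for the example, and the letters are written as \x{...} escapes so the script does not depend on the source file's encoding (with literal UTF-8 text in the source you would need "use utf8;" as in the original post).

```perl
use strict;
use warnings;

# Encode wide characters as UTF-8 when printing.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample data: "x=" + alpha beta gamma, "; y=" + delta epsilon.
my $blah = "x=\x{3b1}\x{3b2}\x{3b3}; y=\x{3b4}\x{3b5}";

# List context plus /g returns every captured substring.
my @matched = ($blah =~ m/([\x{370}-\x{3ff}]+)/g);

print "number of matches: ", scalar(@matched), "\n";
for my $m (@matched) {
    # length() counts characters here, not bytes.
    print "'$m' has ", length($m), " characters\n";
}
```

No offset arithmetic is involved, so the bytes-versus-characters question never comes up.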

If, after such a match you need to know the positions of the matched  
strings in $blah, have a look at @- or @+.
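
One caveat: @- and @+ describe only the most recent successful match, so after a list-context /g match they reflect the last hit alone. A sketch of one way to collect every position is to loop in scalar context instead; the sample string and the @spans array are hypothetical names used for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data: "x=" + alpha beta gamma, "; y=" + delta epsilon.
my $blah = "x=\x{3b1}\x{3b2}\x{3b3}; y=\x{3b4}\x{3b5}";

my @spans;
while ($blah =~ m/([\x{370}-\x{3ff}]+)/g) {
    # $-[1] and $+[1] are the start and end offsets of capture group 1,
    # counted in characters for a character string.
    push @spans, [ $-[1], $+[1] ];
}

for my $span (@spans) {
    my ($start, $end) = @$span;
    printf "start=%d end=%d length=%d\n", $start, $end, $end - $start;
}
```

Each pair can be fed straight back into substr($blah, $start, $end - $start) if the matched text is needed again later.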

sherm--

Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
 