Mac Forum / Programming / Perl / June 2005



Parsing UTF8 files with wide characters

Robin - 15 Jun 2005 19:48 GMT
I thought I'd understood how to use Unicode support in Perl, but
evidently not. In the script below, I'm stumped as to:

1) why the regex won't match ''.
2) why the substitution is carried out, but the result isn't in UTF-8,
nor is it UTF-8 re-encoded as UTF-8 (uncomment #require Encode;
and #Encode::decode_utf8($_); to test this)

TIA

Robin

#!/usr/bin/perl -w

use strict;
use diagnostics -verbose;
#require Encode;

binmode (DATA,":utf8");

binmode (STDOUT,":utf8");

for (<DATA>) {
    if (/(<!--\@hidden-->)/gs) {
        print "match: ", $1, "\n";
        #Encode::decode_utf8($_);
        s/$1/=AC=AE/gs;
    } elsif (/()/gs) {
        print "match: ", $1, "\n";
        s/$1/12/gs;
    }

    print;
}
   

__DATA__
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=utf-8">
    <TITLE> A Web Page</TITLE>   
</HEAD>
<BODY>
<BLOCKQUOTE>
<H3>=AC=E8=9E=AEnews<FONT COLOR=#FF3300>1</FONT></H3>
... and this is a web page.
<P>
<IMG ALT="A Filler" WIDTH="450" HEIGHT="296">
<P>
hidden marker here -----><FONT
COLOR=#FF3300><!--@hidden--></FONT><------<BR>
</BLOCKQUOTE>
</BODY>
</HTML>
Andrew Mace - 15 Jun 2005 19:54 GMT
Try "use utf8" - it lets Perl know that your script contains utf8 chars.

More info: http://perlpod.com/5.9.1/lib/utf8.html

Andrew

> I thought I'd understood how to use unicode support in perl, but  
> evidently not. In the script below, I'm stumped as to:
[quoted text clipped - 55 lines]
> </BODY>
> </HTML>
Robin - 15 Jun 2005 20:26 GMT
Thanks, Andrew and Sherm.

I went back to look at perluniintro because I was sure I remembered
reading that the "use utf8" pragma was no longer needed. Right under
where it says this, it continues: "Only one case remains where an
explicit "use utf8" is needed: if your Perl script itself is encoded in
UTF-8...."
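
For illustration, here is a minimal sketch of that remaining case, assuming the script file itself is saved as UTF-8. The variable names and the sample character (é, U+00E9) are made up for the example and do not come from the script above. The same literal is compiled once under "use utf8" and once under "no utf8", then each is matched against text that has already been decoded, the way a :utf8 layer would deliver it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

my ($as_chars, $as_bytes);
{
    use utf8;    # literal is decoded: one character, U+00E9
    $as_chars = "é";
}
{
    no utf8;     # literal stays as the raw UTF-8 bytes 0xC3 0xA9
    $as_bytes = "é";
}

# Hypothetical stand-in for a line read through a :utf8 layer (already decoded).
my $line = Encode::decode('UTF-8', "caf\xC3\xA9");

my $chars_len   = length $as_chars;
my $bytes_len   = length $as_bytes;
my $chars_match = $line =~ /\Q$as_chars\E/ ? 1 : 0;
my $bytes_match = $line =~ /\Q$as_bytes\E/ ? 1 : 0;

print "length with use utf8: $chars_len\n";
print "length with no utf8:  $bytes_len\n";
print "match with use utf8:  $chars_match\n";
print "match with no utf8:   $bytes_match\n";
```

Only the literal compiled under "use utf8" should line up with the decoded input, which is consistent with a regex in the script failing to match its own non-ASCII literal.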

*sigh*

Robin
Sherm Pendley - 15 Jun 2005 20:05 GMT
> I thought I'd understood how to use unicode support in perl, but  
> evidently not. In the script below, I'm stumped as to:
[quoted text clipped - 3 lines]
> UTF8, nor is it UTF8 re-encoded in UTF8 (uncomment #require  
> Encode; ........... #Encode::decode_utf8($_); to test this )

The binmode() calls you've included tell Perl that the data coming
from and going to those file handles is UTF-8 encoded.

But you have UTF-8-encoded text in your code, too. To tell Perl about
that, you need the "use utf8;" pragma.
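
A minimal sketch of the binmode() point, using an in-memory filehandle as a hypothetical stand-in for DATA (the variable names and sample bytes are made up for the example). It suggests two things. Text read through a :utf8 layer is already decoded, so it should not need Encode::decode_utf8() again. And decode_utf8() returns the decoded string, so calling it without keeping the return value, as in the commented-out line of the script, should leave the original variable unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# UTF-8 bytes for U+00E9 followed by a newline (sample data, not from the thread).
my $octets = "\xC3\xA9\n";

# Reading through a :utf8 layer yields characters, not bytes.
open my $in, '<:utf8', \$octets or die "open failed: $!";
my $line = <$in>;
close $in;
chomp $line;
my $len_layer = length $line;

# decode_utf8() in void context: the result is thrown away.
my $raw = "\xC3\xA9";
Encode::decode_utf8($raw);
my $len_void = length $raw;

# Keeping the return value gives the decoded string.
my $decoded     = Encode::decode_utf8($raw);
my $len_decoded = length $decoded;

print "length via :utf8 layer:       $len_layer\n";
print "length after void decode:     $len_void\n";
print "length of returned decode:    $len_decoded\n";
```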

sherm--

Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
 