Mac Forum / Programming / Perl / June 2006




Writing UTF-8 files

Tommy Nordgren - 22 Jun 2006 18:48 GMT
How do I write proper UTF-8 characters to a file? I write only two
characters, and they come out as four garbage characters when I view
the file in an editor.
-------------------------------------
This sig is dedicated to the advancement of Nuclear Power
Tommy Nordgren
tommy.nordgren@chello.se
Sherm Pendley - 22 Jun 2006 19:15 GMT
> How do I write proper utf 8 characters to a file? I write only two  
> characters, and they come out as four
> garbage characters when I view the file in an editor.

Quick answer:

    open FH, ">:utf8", "file" or die $!;

Complete answer:

    perldoc perluniintro
    perldoc PerlIO

sherm--

Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
Tommy Nordgren - 22 Jun 2006 19:29 GMT
On 22 Jun 2006, at 20:15, Sherm Pendley wrote:

>> How do I write proper utf 8 characters to a file? I write only two  
>> characters, and they come out as four
[quoted text clipped - 13 lines]
> Cocoa programming in Perl: http://camelbones.sourceforge.net
> Hire me! My resume: http://www.dot-app.org

    I've already tried that. That was what I was doing when I got garbage.
Sherm Pendley - 22 Jun 2006 20:21 GMT
> On 22 Jun 2006, at 20:15, Sherm Pendley wrote:
>
[quoted text clipped - 13 lines]
>     I've already tried that. That was what i was doing when I got  
> garbage.

Well, the above is correct as far as Perl goes, but it doesn't rule
out other problems. Are you certain that the editor you're using is
interpreting the file correctly, as UTF-8? Also, are you certain that
your input really is UTF-8?

For instance, I ran this script to generate a test file:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use utf8; # This allows utf8 in string literals, like below

    open FH, '>:utf8', '/Users/sherm/hello.txt' or die $!;
    print FH "Hëllö, wörld!\n";
    close FH;

When I open the file in BBEdit, I see gibberish, because BBEdit can't
determine that it's UTF-8 (there's no BOM) and misinterprets it as
the default, Mac OS Roman, instead. But if I change BBEdit's default
encoding, or use the "Reopen Using Encoding" function, BBEdit
displays the file correctly.
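
One way to take the editor out of the picture is to look at the bytes on disk. The following is only a sketch: the path is a made-up example, and the string uses \x{...} escapes instead of literal characters so that the script does not depend on its own encoding or on "use utf8".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical path, chosen only for this example.
my $path = "/tmp/hello-utf8.txt";

# Write two accented characters through the :utf8 layer.
open my $out, '>:utf8', $path or die "Cannot open $path: $!";
print {$out} "H\x{EB}ll\x{F6}\n";   # e-umlaut and o-umlaut as escapes
close $out or die "Cannot close $path: $!";

# Read the file back as raw bytes, with no decoding at all.
open my $in, '<:raw', $path or die "Cannot open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

# Print each byte in hex; this is what is really in the file.
print join(' ', map { sprintf('%02X', ord($_)) } split //, $bytes), "\n";
# prints: 48 C3 AB 6C 6C C3 B6 0A
```

If the output shows C3 AB and C3 B6, the file holds valid UTF-8 for the two characters, and any gibberish comes from how the editor chose to interpret it.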

sherm--

Tommy Nordgren - 22 Jun 2006 20:28 GMT
On 22 Jun 2006, at 20:29, Tommy Nordgren wrote:

> On 22 Jun 2006, at 20:15, Sherm Pendley wrote:
>
[quoted text clipped - 18 lines]
>     I've already tried that. That was what i was doing when I got  
> garbage.

    I found the problem. It is necessary to:
1) use the utf8 pragma ("use utf8");
2) explicitly write a BOM byte sequence immediately after opening the
file.
Point 2 is where I erred. I expected the BOM to be added automatically
when opening a file for writing with the UTF-8 encoding.
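
A minimal sketch of point 2, not the exact script from this thread: the path is hypothetical, and the text is written with \x{...} escapes, so "use utf8" is not needed here. The BOM is simply the character U+FEFF printed first; the encoding layer turns it into the bytes EF BB BF.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical output path, chosen only for this example.
my $path = "/tmp/bom-demo.txt";

# The encoding layer converts characters to UTF-8 bytes on output.
# It does not add a BOM by itself.
open my $fh, '>:encoding(UTF-8)', $path or die "Cannot open $path: $!";

# U+FEFF is the byte order mark; it must be the first thing written.
print {$fh} "\x{FEFF}";

# The actual text follows the BOM.
print {$fh} "H\x{EB}ll\x{F6}\n";

close $fh or die "Cannot close $path: $!";
```

Editors that use the BOM as a hint should then open the file as UTF-8 without any change to their default encoding.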
Sherm Pendley - 22 Jun 2006 20:44 GMT
> On 22 Jun 2006, at 20:29, Tommy Nordgren wrote:
>
[quoted text clipped - 18 lines]
>     I found the problem it is necessary to
> 1) use the use utf8 pragma;

That's only needed if your actual Perl code is UTF-8 encoded, like my  
example was. If your UTF-8 data is coming from an external source,  
"use utf8" has no effect.
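
For data from an external source, the usual step is to decode the bytes into characters. A small sketch using the core Encode module; the two bytes below stand in for whatever was read from a file or socket:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for o-umlaut, standing in for external input.
my $bytes = "\xC3\xB6";

# Undecoded, Perl sees two separate one-byte characters.
print length($bytes), "\n";   # prints 2

# Decoding turns the two bytes into the single character U+00F6.
my $chars = decode('UTF-8', $bytes);
print length($chars), "\n";   # prints 1
```

Neither result depends on "use utf8", since the script contains no non-ASCII literals.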

sherm--

John Delacour - 22 Jun 2006 21:02 GMT
>>>>How do I write proper utf 8 characters to a file? I write only
>>>>two characters, and they come out as four
>>>>garbage characters when I view the file in an editor.

The only reason for that can be that you have your editor set to open
files as MacRoman or some non-utf-8 charset.  Provided your editor
prefs are set to open as utf-8 or you opt for utf-8 in the open file
dialog you will not get this problem.

>    I found the problem it is necessary to
>1) use the use utf8 pragma;
>2) Explicitly write a BOM byte sequence immediately after opening the file.
>point 2 is where I erred. I expected the BOM to be added automatically,
>when opening a file for write with the utf-8 encoding.

You would need to give an example of what you are doing, but neither
of those things should be necessary, nor should it be necessary to
specify utf-8 when opening the filehandle, as Sherm suggested doing.

The following script will write "ö", utf8-encoded to "trash.txt" on
the desktop:

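#!/usr/bin/perl
use strict;
use warnings;
# The script itself must be saved as UTF-8. Without "use utf8", the
# literal below is two raw bytes, which are written to the file unchanged.
my $text = "ö";
my $f = "$ENV{HOME}/desktop/trash.txt";
open F, ">$f" or die $!;
print F $text;
close F;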

If you open the file as utf-8 you will see "ö" and if you open it as
MacRoman you will see "√∂".  You could also open it as Traditional
Chinese or Simplified Chinese or many other things and see other
things.  UTF-8 byte order is always the same, so there is no need for
a BOM, though some editors might use it as a hint.
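
To see why the same two bytes show up as "ö" or as "√∂", they can be decoded both ways with the core Encode module. This is only an illustration; it prints code points rather than the characters themselves, so the output does not depend on the terminal's encoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of U+00F6 (o-umlaut): the bytes C3 B6.
my $bytes = "\xC3\xB6";

# Interpret the same bytes under two different encodings.
my $as_utf8     = decode('UTF-8',    $bytes);
my $as_macroman = decode('MacRoman', $bytes);

# Show the resulting characters as Unicode code points.
printf "UTF-8:    %s\n",
    join ' ', map { sprintf('U+%04X', ord($_)) } split //, $as_utf8;
printf "MacRoman: %s\n",
    join ' ', map { sprintf('U+%04X', ord($_)) } split //, $as_macroman;
# UTF-8:    U+00F6
# MacRoman: U+221A U+2202
```

U+221A is the square root sign and U+2202 is the partial differential sign, which is the "√∂" an editor shows when it opens the file as MacRoman.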

JD
Joel Rees - 23 Jun 2006 15:03 GMT
> If you open the file as utf-8 you will see "ö" and if you open it
> as MacRoman you will see "√∂".  You could also open it as
> Traditional Chinese or Simplified Chinese or many other things and
> see other things.  UTF-8 byte order is always the same, so there is
> no need for a BOM, though some editors might use it as a hint.

Given that his editor seems to have interpreted the file as utf-8  
with the BOM in place and as something else without the BOM, we might  
guess that his editor recognizes the BOM.

We could also, of course, guess that his login account is set to  
default to something other than utf-8, which is also in keeping with  
my experience with Mac OS X when the user has not deliberately messed  
around with things.
 