exotic charsets

Allen Brunson - 27 Sep 2005 07:28 GMT
a guy recently e-mailed me saying there are messages in certain newsgroups
encoded in chinese that he would like to read.  they indicate their charset
like this:

 Content-type: text/plain; charset=GB2312

i've never heard of GB2312.  a little googling turns up this line in a source
file, apparently from the darwin sources.  it's from a big struct that maps
charset names to encodings.

 { "gb2312", kCFStringEncodingEUC_CN_DOSVariant, NoEncodingFlags },

a kCFStringEncoding<whatever> enum should be enough for me to know how to
decode and encode that charset.  alas, the master encoding list, in
CFStringEncodingExt.h, does not list that one.  so i do a little bit more
googling and turn up this:

 #define kCFStringEncodingEUC_CN_DOSVariant \
   (kTextEncodingEUC_CN | (kEUC_CN_DOSVariant << 16))

and amazingly, that compiles.  in theory, my program should now be able to
deal with that charset, just as easily as it can handle iso-8859-1, koi8-r,
and so on.

this sure makes me queasy, though.  i basically just made up a number and used
it as if it were an official enum.  is there any hope that this will work?
how should i document it?  the user is probably going to need some extra
language support library installed before this will work, right?  where can i
get such a thing for testing?

i realize this question is probably WAY beyond the scope of this newsgroup.
can anybody suggest one of apple's mailing lists that might be appropriate?
Michael Ash - 27 Sep 2005 12:47 GMT
> a guy recently e-mailed me saying there are messages in certain newsgroups
> encoded in chinese that he would like to read.  they indicate their charset
[quoted text clipped - 11 lines]
> decode and encode that charset.  alas, the master encoding list, in
> CFStringEncodingExt.h, does not list that one.

However, it does list kCFStringEncodingGB_2312_80. Is this not what you
want?

Failing that, the CFStringConvertIANACharSetNameToEncoding() function
would seem to come in handy here.
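
For anyone following along, here is a minimal sketch of pulling the charset name out of a header line like the Content-type example in the original post, so that the name can be handed to an IANA-name lookup. The helper name extract_charset is hypothetical (it is not part of any Apple API), and the parsing is deliberately simplistic: it finds the first "charset=" it sees and does not handle header comments or parameter continuations.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any Apple API: copies the value of the
   charset= parameter from a Content-Type header line into out.
   Returns 1 on success, 0 if no charset is found or it does not fit. */
static int extract_charset(const char *header, char *out, size_t outlen)
{
    static const char key[] = "charset=";
    const size_t keylen = sizeof key - 1;
    const char *p;
    size_t n = 0;

    if (header == NULL || out == NULL || outlen == 0)
        return 0;

    /* Case-insensitive search for "charset=" */
    for (p = header; *p != '\0'; p++) {
        size_t i = 0;
        while (i < keylen && p[i] != '\0' &&
               tolower((unsigned char)p[i]) == key[i])
            i++;
        if (i == keylen)
            break;
    }
    if (*p == '\0')
        return 0;               /* no charset parameter present */
    p += keylen;

    /* The value may be wrapped in double quotes */
    if (*p == '"')
        p++;

    /* Copy up to a closing quote, a ';', or whitespace */
    while (*p != '\0' && *p != '"' && *p != ';' &&
           !isspace((unsigned char)*p)) {
        if (n + 1 >= outlen)
            return 0;           /* value too long for the buffer */
        out[n++] = *p++;
    }
    out[n] = '\0';
    return n > 0;
}
```

On Mac OS X the extracted name could then be wrapped in a CFString and passed to CFStringConvertIANACharSetNameToEncoding(), which returns kCFStringEncodingInvalidId for names it does not recognize, so the caller can fall back to a default encoding.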

>  so i do a little bit more
> googling and turn up this:
[quoted text clipped - 5 lines]
> deal with that charset, just as easily as it can handle iso-8859-1, koi8-r,
> and so on.

Of course it compiles. You can do:

#define kCFStringEncodingEUC_CN_DOSVariant I am a pretty princess, \
    would you like some tea?

And it will still compile (until you try to actually use the #define,
anyway).

> this sure makes me queasy, though.  i basically just made up a number and used
> it as if it were an official enum.  is there any hope that this will work?

I would be fairly shocked if it worked. Every constant in
CFStringEncodingExt.h is 16 bits, but you just created a 32-bit one out of
thin air. Trying to use it will probably at best fail, and at worst crash.

> how should i document it?  the user is probably going to need some extra
> language support library installed before this will work, right?  where can i
> get such a thing for testing?

Mac OS X comes with support for all of this stuff out of the box. For
testing, just find a web page or text file encoded with this encoding,
save it, and run it through your code to see if it gets read correctly.

Michael Ash
Rogue Amoeba Software

Allen Brunson - 27 Sep 2005 16:09 GMT
> However, it does list kCFStringEncodingGB_2312_80. Is this not what you
> want?

i know i want a character set called GB2312.  i wasn't sure how macosx refers
to that character set.  in other words, i don't know *what* i want (heh).

> Failing that, the CFStringConvertIANACharSetNameToEncoding() function
> would seem to come in handy here.

holy cow, you're right.  that does *exactly* what i want.  i didn't know there
was such a thing.  i've just been doing manual substitutions up to now, for
charsets i was familiar with.  that gets me most of the way there.

> I would be fairly shocked if it worked.

prepare to be fairly shocked: it worked.  a formerly garbled message file in
chinese displayed properly with that weird #define i googled up.  i had
established that much before i posted.  i figured it had worked for me but
wouldn't for most because of a unicode font i installed awhile back that has
eleventy-jillion glyphs in it.  until you pointed it out, i hadn't noticed
that they were all 16-bit defines, and the thing i googled up is 32 bits.
perhaps the CFString conversion methods ignore the top 16 bits.

i had just assumed the user wouldn't have a clue as to what a character set
even *is*, let alone be able to help me pick the correct one.  i was wrong
about that.  he has since written back, and he gave me a whole tutorial on the
history of various charsets used for chinese.  yay for clueful users.  i've
definitely got enough information to make intelligent choices now.
Ben Artin - 27 Sep 2005 22:00 GMT
> > I would be fairly shocked if it worked.
>
> prepare to be fairly shocked: it worked.  

The reason that this works is that a CFStringEncoding is really a TextEncoding;
TextEncodings are, indeed, 32-bit types (it just so happens that the most common
ones fit in 16 bits); and, of course, the CFStringEncoding you made up is a
valid TextEncoding.

The right way to create a TextEncoding would have been to use the appropriate
constants and functions in TextEncoding.h, but for your purposes you really
should be using the IANA->Encoding conversion anyway.
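
To make the bit layout concrete, here is a small sketch of the packing that the #define from earlier in the thread performs: the base encoding sits in the low 16 bits and the variant is shifted up by 16. The SKETCH_ constants and sketch_ functions are hypothetical stand-ins so the code builds without Apple's headers; the two numeric values are believed to match TextCommon.h, but that is an assumption and should be checked against your SDK. Real code should use the constants and functions in TextEncoding.h rather than this.

```c
#include <stdint.h>

/* Stand-in values so the sketch builds without Apple's headers.
   Assumed to match TextCommon.h; verify against your SDK. */
enum {
    SKETCH_kTextEncodingEUC_CN = 0x0930,
    SKETCH_kEUC_CN_DOSVariant  = 1
};

/* Pack a base encoding and a variant the same way the #define does:
   base in the low 16 bits, variant shifted left by 16. */
static uint32_t sketch_make_encoding(uint32_t base, uint32_t variant)
{
    return (base & 0xFFFFu) | (variant << 16);
}

/* The low 16 bits: the base encoding. */
static uint32_t sketch_base(uint32_t enc)
{
    return enc & 0xFFFFu;
}

/* Everything above the low 16 bits. For the #define in this thread
   that is just the variant number. */
static uint32_t sketch_upper_bits(uint32_t enc)
{
    return enc >> 16;
}
```

With the stand-in values above, sketch_make_encoding(SKETCH_kTextEncodingEUC_CN, SKETCH_kEUC_CN_DOSVariant) gives 0x00010930, which no longer fits in 16 bits but still carries the plain EUC_CN base in its low half.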

Ben

If this message helped you, consider buying an item
from my wish list: <http://artins.org/ben/wishlist>

I changed my name: <http://periodic-kingdom.org/People/NameChange.php>

Michael Ash - 27 Sep 2005 22:05 GMT
>> However, it does list kCFStringEncodingGB_2312_80. Is this not what you
>> want?
>
> i know i want a character set called GB2312.  i wasn't sure how macosx refers
> to that character set.  in other words, i don't know *what* i want (heh).

The name of the constant would seem to indicate that it's GB2312 of
course, but I don't know what the "80" is for.

>> Failing that, the CFStringConvertIANACharSetNameToEncoding() function
>> would seem to come in handy here.
[quoted text clipped - 6 lines]
>
> prepare to be fairly shocked: it worked.

I am suitably shocked. :)

>  a formerly garbled message file in
> chinese displayed properly with that weird #define i googled up.  i had
> established that much before i posted.  i figured it had worked for me but
> wouldn't for most because of a unicode font i installed awhile back that has
> eleventy-jillion glyphs in it.

I believe that all Mac OS X installs come with at least one basic Chinese
font, so it should work anywhere.


Allen Brunson - 28 Sep 2005 06:26 GMT
> The name of the constant would seem to indicate that it's GB2312 of
> course, but I don't know what the "80" is for.

my new user did.  he said that particular variant was created in 1980, and has
since been superseded by a later version.  there are several more chinese
charsets i will have to add support for.

nobody should take what i say about this stuff as gospel, though.  my program
has been "chinese-enabled" (heh) for about 12 hours now.
 