Mac Forum / Country Specific / UK Mac Group / May 2008




TeXShop (Cocoa app) and Unicode trouble.

Rowland McDonnell - 30 May 2008 05:41 GMT
MacTeX 2007, and TeXShop v2.14

I've just pasted some Unicode text into a tex document open in TeXShop.

I've saved the file with Western MacOS X encoding.  Everything looks
fine.

But when I try to LaTeX it, pdfTeX gets as far as the pasted in Unicode
stuff and barfs like this:

! Text line contains an invalid character.
l.48 I^^@
         ^^@a^^@m^^@ ^^@a^^@l^^@s^^@o^^@
^^@p^^@l^^@e^^@a^^@s^^@e^^@d^^@ ^^...

It seems that TeX thinks that I^^@ ^^@a^^@m^^@ ^^@a^^@l^^@s^^@o^^@
^^@p^^@l^^@e^^@a^^@s^^@e^^@d^^@ is what the line of text looks like - "I
am pleased", is how it begins in `reality'.

(^^@ means `charcode 0' in TeX logfile speak.)
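To illustrate what that log line is showing, here is a minimal Python sketch. It assumes the pasted text was UTF-16 little-endian (an assumption at this point in the thread), and the helper name `tex_caret` is made up for illustration. It renders control bytes roughly the way a TeX log does.

```python
def tex_caret(data: bytes) -> str:
    """Render bytes roughly as a TeX log would: control bytes below 0x20
    become ^^ followed by the character 64 positions higher (0 -> ^^@)."""
    out = []
    for b in data:
        if b < 0x20:
            out.append("^^" + chr(b + 64))
        else:
            out.append(chr(b))
    return "".join(out)

# Assumption: the text arrived as UTF-16 little-endian, one NUL after each ASCII byte.
raw = "I am".encode("utf-16-le")
print(tex_caret(raw))  # I^^@ ^^@a^^@m^^@
```

The output has the same shape as the line TeX complained about: each ordinary letter is followed by a zero byte.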

So I tried pasting the suspect text into Textedit, saved that as Western
MacOS X encoding, closed the file, opened it again in Textedit, tried
cut-and-paste into TeXShop - and got the same again.

Does anyone have a clue how I could deal with this one?  Something -
anything - to turn that Unicode into something that looks a bit like
ASCII would do me.

Rowland.

Signature

Remove the animal for email address: rowland.mcdonnell@dog.physics.org
                                           Sorry - the spam got to me
http://www.mag-uk.org                             http://www.bmf.co.uk
UK biker?   Join MAG and the BMF and stop the Eurocrats banning biking

Chris Ridd - 30 May 2008 07:53 GMT
> Does anyone have a clue how I could deal with this one?  Something -
> anything - to turn that Unicode into something that looks a bit like
> ASCII would do me.

TextWrangler will do the job. It guesses the encoding when it opens a
file, so if it guesses wrong you need to use File > Reopen Using
Encoding and force the right one. Then the bottom of each window has a
popup menu which lets you switch the encoding used, which will also be
the encoding used when you save.

You can probably also make it search/replace the NUL bytes instead.

Cheers,

Chris
Rowland McDonnell - 30 May 2008 17:30 GMT
> (Rowland McDonnell) said:
>
[quoted text clipped - 3 lines]
>
> TextWrangler will do the job.

I tried that before posting.

It didn't.

> It guesses the encoding when it opens a
> file, so if it guesses wrong you need to use File > Reopen Using
> Encoding and force the right one.

Ah!  That's what that command does.

> Then the bottom of each window has a
> popup menu which lets you switch the encoding used, which will also be
> the encoding used when you save.

Yeah, well, I can see TextWrangler telling me that I'm looking at
Western MacOS encoding, and that's *AFTER* I've saved and re-opened the
file.

You know what?  TeX still barfs because it's seeing two-byte characters
when I cut and paste from that, just like it did earlier.

Yep, I have just tried again, following your suggestion and checking the
signs in the pop-up menu.

> You can probably also make it search/replace the NUL bytes instead.

There might be a way of doing that, but I can't see it.

I hate to admit to this, but I found a way round it.

I thought `All these apps I've been playing with are fancy Cocoa apps,
and they're all broken the same way when attempting to perform the same
job - and they're all clearly using OS support to do it, so let's assume
that Cocoa's broken.'

MS is famous for writing software that ignores OS support for many
things and `just does it its own way'.  So I used MS Word, opened the
file, saved it as text.

And that works perfectly well.

Grr.

Annoying that it took bloody MS to dig me out of that hole, but there
you go.  It wouldn't be installed at all if we hadn't had access to the
`free to us but officially licensed for us to do this' installer from
work.  Too bloody expensive otherwise and up until today, I would have
said I had no use for it at all.

Rowland.


Chris Ridd - 30 May 2008 17:41 GMT
>> You can probably also make it search/replace the NUL bytes instead.
>
> There might be a way of doing that, but I can't see it.

Find, choose the "Use Grep" option, and then search for "\0" without
the quotes. Replace it with whatever, eg blank.

The number after the \ is octal (historical reasons). You can also do
\x followed by hex. The value's the byte in the file.
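The same search and replace can be sketched outside TextWrangler. This is a rough Python equivalent, not TextWrangler's own behaviour; the function name `replace_nuls` is hypothetical. Python's `re` module accepts both the octal escape `\0` and the hex escape `\x00` for the zero byte.

```python
import re

def replace_nuls(data: bytes, replacement: bytes = b"") -> bytes:
    """Replace every zero byte; rb"\\0" (octal) and rb"\\x00" (hex) match the same byte."""
    return re.sub(rb"\x00", replacement, data)

# Assumption: a UTF-16 little-endian file with no BOM, read as raw bytes.
sample = "abc".encode("utf-16-le")
print(replace_nuls(sample, b";"))  # b'a;b;c;'
print(replace_nuls(sample))        # b'abc'
```

Stripping zero bytes only gives sensible text when every character is in the ASCII range; anything else needs a real encoding conversion.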

> MS is famous for writing software that ignores OS support for many
> things and `just does it its own way'.  So I used MS Word, opened the
> file, saved it as text.

Heh, sometimes being dumb and stupid (Word) pays off.

Cheers,

Chris
Rowland McDonnell - 30 May 2008 19:00 GMT
> (Rowland McDonnell) said:
>
[quoted text clipped - 4 lines]
> Find, choose the "Use Grep" option, and then search for "\0" without
> the quotes. Replace it with whatever, eg blank.

Yes, but since the zero value characters are not visible, but seem to be
considered as part of two-byte characters whatever I do, I'd not expect
that to do a thing.

I tried it, and sure enough: no effect at all.  I got a `didn't find
anything' bong and no changes to file contents (yes, I looked).

> The number after the \ is octal (historical reasons). You can also do
> \x followed by hex. The value's the byte in the file.
[quoted text clipped - 4 lines]
>
> Heh, sometimes being dumb and stupid (Word) pays off.

I'm slightly astonished at the way all these other alleged encoding
conversions appear to do nothing whatsoever.

I'd still like to get to the bottom of this one.

Rowland.


Chris Ridd - 30 May 2008 19:31 GMT
>> (Rowland McDonnell) said:
>>
[quoted text clipped - 11 lines]
> I tried it, and sure enough: no effect at all.  I got a `didn't find
> anything' bong and no changes to file contents (yes, I looked).

You've got some other sort of file then, because if I replace \0 with a
semi-colon in a file saved as UTF-16 (no BOM) and force read-in as
MacOS Roman, it puts a ; in between each char. I made sure there were
NUL bytes in there by doing a Hex dump file, as you rightly note that
TW doesn't show the NUL bytes.

> I'm slightly astonished at the way all these other alleged encoding
> conversions appear to do nothing whatsoever.

It definitely makes a difference, so you're doing something wrong. Ah -
you're not using TextWrangler's Hex dump *front window* option are you?
That dumps out the in-memory representation of each char, and as TW
uses UTF16 for that you'll see a load of NULs...

Cheers,

Chris
Rowland McDonnell - 30 May 2008 20:50 GMT
> (Rowland McDonnell) said:
>
[quoted text clipped - 19 lines]
> NUL bytes in there by doing a Hex dump file, as you rightly note that
> TW doesn't show the NUL bytes.

Ah - I made the attempt when I had the file in `allegedly Western MacOS
Roman' encoding according to the operation performed by `something else,
might well be TextEdit'.  Back to the original file.

Okay.  Unicode UTF 16 little endian is what TextWrangler says it is,
with CRLF line endings.

Converted to Western (MacOS Roman), saved it - that works fine in
TeXShop.  All is well.  No fiddling required, `it just worked without
fuss as you'd expect'.
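For anyone who wants to script the same conversion, here is a minimal Python sketch. It assumes the input is UTF-16 (little-endian when there is no BOM, as TextWrangler reported here) with CRLF line endings. The function name is hypothetical, and normalising to LF is a choice, not something TeX requires.

```python
import codecs

def utf16_to_mac_roman(data: bytes) -> bytes:
    """Decode UTF-16 bytes and re-encode them as Mac OS Roman."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = data.decode("utf-16")     # the BOM selects the byte order and is dropped
    else:
        text = data.decode("utf-16-le")  # assumption: little-endian when no BOM is present
    text = text.replace("\r\n", "\n")    # optional: CRLF -> LF
    # Characters with no Mac OS Roman equivalent become "?" instead of raising an error.
    return text.encode("mac_roman", errors="replace")

original = codecs.BOM_UTF16_LE + "caf\u00e9\r\n".encode("utf-16-le")
print(utf16_to_mac_roman(original))  # b'caf\x8e\n'
```

The result has one byte per character and no zero bytes, which is what TeX needs to see.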

(the original file was created on a Mac by a Mac application - but by a
very bad Mac app badly ported from Windoze by the look of it)

btw, which way round is little endian?  I understand that it means
`Least significant bit at the end of the binary word', but the phrase
gives no hint as to which end.

> > I'm slightly astonished at the way all these other alleged encoding
> > conversions appear to do nothing whatsoever.
>
> It definitely makes a difference, so you're doing something wrong. Ah -
> you're not using TextWrangler's Hex dump *front window* option are you?

Nope.  Just opening the file as a text file for editing and saving it
again - Save As... and selecting the encoding I want was my first
attempt; using the pop-up menu in the bottom left and then saving was my
second attempt.

> That dumps out the in-memory representation of each char, and as TW
> uses UTF16 for that you'll see a load of NULs...

What I was doing wrong was using not the original of the troublesome
file, but a version that had been messed up by something else first.

Now I've gone back to the original, I find that TextWrangler does indeed
work as advertised - TextEdit and TeXShop do not.

Rowland.


Chris Ridd - 30 May 2008 21:14 GMT
> btw, which way round is little endian?  I understand that it means
> `Least significant bit at the end of the binary word', but the phrase
> gives no hint as to which end.

Just replace bit with byte; a little-endian version of ' ' (0x20) is
stored as  0x20 0x00. A big-endian version is stored as 0x00 0x20.
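A short Python sketch makes the two byte orders visible. The variable names are only for illustration; the euro sign is added as an example where neither byte is zero.

```python
space = " "      # U+0020
euro = "\u20ac"  # U+20AC, a value where both bytes are non-zero

space_le = space.encode("utf-16-le")
space_be = space.encode("utf-16-be")
euro_le = euro.encode("utf-16-le")
euro_be = euro.encode("utf-16-be")

print(space_le.hex(" "), "|", space_be.hex(" "))  # 20 00 | 00 20
print(euro_le.hex(" "), "|", euro_be.hex(" "))    # ac 20 | 20 ac
```

Little-endian puts the least significant byte first; big-endian puts the most significant byte first.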

> Now I've gone back to the original, I find that TextWrangler does indeed
> work as advertised - TextEdit and TeXShop do not.

Success!

Cheers,

Chris
Rowland McDonnell - 31 May 2008 01:17 GMT
> (Rowland McDonnell) said:
>
[quoted text clipped - 4 lines]
> Just replace bit with byte; a little-endian version of ' ' (0x20) is
> stored as  0x20 0x00. A big-endian version is stored as 0x00 0x20.

From which I deduce that the information I need is that `little endian'
means `least significant first', and `big endian' means `most
significant first'.

(why do these confusing terms come into being?)

> > Now I've gone back to the original, I find that TextWrangler does indeed
> > work as advertised - TextEdit and TeXShop do not.
>
> Success!

That's what I thought.  Bloody pain, some of this stuff, innit?

Rowland.


Chris Ridd - 31 May 2008 06:53 GMT
>> (Rowland McDonnell) said:
>>
[quoted text clipped - 8 lines]
> means `least significant first', and `big endian' means `most
> significant first'.

Yes.

> (why do these confusing terms come into being?)

Blame Jonathan Swift.

The Byte Order Mark that robert mentioned is a single UTF-16 value
(0xFEFF) whose byte-swapped form (0xFFFE) Unicode defines to be an
illegal character, so the order of the first two bytes unambiguously
indicates if the rest of the file is little- or big-endian.
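As a rough sketch of how a program might use the mark: the BOM is the code point U+FEFF, so a file starting with the bytes FF FE is little-endian and one starting with FE FF is big-endian. The function name below is hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM also begins FF FE.

```python
import codecs

def utf16_byte_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from the first two bytes, if a BOM is present."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big-endian"
    return "unknown (no BOM)"

print(utf16_byte_order(b"\xff\xfeI\x00"))          # little-endian
print(utf16_byte_order(b"\xfe\xff\x00I"))          # big-endian
print(utf16_byte_order("I".encode("utf-16-le")))   # unknown (no BOM)
```

Without a BOM the reader has to guess or be told, which is why a forced "Reopen Using Encoding" step can be necessary.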

> That's what I thought.  Bloody pain, some of this stuff, innit?

It is a bit, but at least Unicode's somewhat sane compared to previous
attempts at defining "mega" sets of funny characters. ISO 2022, I'm
talking about you.

Cheers,

Chris
Rowland McDonnell - 31 May 2008 18:33 GMT
> (Rowland McDonnell) said:
>
[quoted text clipped - 12 lines]
>
> Yes.

So why can't they just say that?  LSB first, MSB first - the terms that
used to be used - were impossible to misunderstand even if you'd never
met them before.  Little endian and big endian - well, I need to have
those terms written down on a piece of paper sat in front of me so I can
learn them, and even then I'll forget in a year or two if I don't use
the info.

The problem is really whatever cultural instinct makes US engineers
refer to pressure in pounds and specific impulse in seconds.

(what's wrong with the latter?  Specific impulse is `mpg for rockets',
and the real units are m/s - the Yanks get `s' by dividing the real
number by `gravitational field strength at the Earth's surface'.  It's a
rocket engine equation.  Gravitational field strength at the Earth's
surface cannot have a part to play in any valid function relating to
general rocket engine operation.   But these people aren't stupid - so
why do they do it?)

> > (why do these confusing terms come into being?)
>
> Blame Jonathan Swift.

He didn't mis-apply those terms in a confusing way, did he?  Nope.

> The Byte Order Mark that robert mentioned is a single UTF-16 value
> (0xFEFF) whose byte-swapped form (0xFFFE) Unicode defines to be an
> illegal character, so the order of the first two bytes unambiguously
> indicates if the rest of the file is little- or big-endian.

Can't you say LSB first or MSB first instead of those confusing terms?
If people start to do that sort of thing a bit more, we might be able to
wipe out the confusing terms.

> > That's what I thought.  Bloody pain, some of this stuff, innit?
>
> It is a bit, but at least Unicode's somewhat sane

The problem is more the software that claims to do conversions but
doesn't do them properly.

> compared to previous
> attempts at defining "mega" sets of funny characters. ISO 2022, I'm
> talking about you.

I don't see why the `this numbered slot in the encoding means this
particular glyph' method has to be used at all.

Why can't there be a list of verbally defined glyphs, with the standard
including the standard *names*, and a standard mechanism for defining
encodings on the fly.

Do it right, and that way you can have arbitrary character sets using
eight bit numerical encodings if you like - just spread across several
sets of eight bit encodings.

(TeX can do things a bit like that using the virtual fount mechanism)

ISO 2022 seems to be thinking in that part of the spectrum, but it does
seem to be a bit of a dog's dinner to me from the Wikipedia article.

Rowland.


Peter Flynn - 31 May 2008 15:55 GMT
> I'd still like to get to the bottom of this one.

As I understand it (which may be faulty), the original document seems to
be using the UTF-16 encoding, so each character is forced to a two-byte
representation with a null high-order byte where necessary (in a
little-endian file that null is the second byte of each pair). UTF-8
would have been a better choice, because characters that can be
represented by a single low-order byte (eg ASCII) are left as single
bytes, and only those requiring two or more bytes are encoded as
multibyte sequences.
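The size difference is easy to check with a small Python sketch (variable names are illustrative only):

```python
text = "I am pleased"  # plain ASCII

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
print(len(utf8), len(utf16))    # 12 24
print(utf16.count(b"\x00"))     # 12

accented = "\u00e9"  # e-acute, outside ASCII
print(len(accented.encode("utf-8")), len(accented.encode("utf-16-le")))  # 2 2
```

For ASCII text, UTF-8 contains no zero bytes at all, which is why a UTF-8 file of plain English passes through TeX without this error.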

At a wild random guess, the original was generated from some XML system
which uses UTF-16 by default. There are several culprits (see
microsoft.public.dotnet.xml and similar _passim_) and they tend to be
large corporate database systems which are just covering their lardy
white corporate a.ses against the odd two-byte character.

///Peter
Rowland McDonnell - 31 May 2008 18:33 GMT
> > I'd still like to get to the bottom of this one.
>
[quoted text clipped - 4 lines]
> single low-order byte (eg ASCII) are left as single bytes, and only
> those requiring two or more bytes are encoded as multibyte sequences.

I didn't have much choice over text format.

> At a wild random guess, the original was generated from some XML system
> which uses UTF-16 by default. There are several culprits (see
> microsoft.public.dotnet.xml and similar _passim_) and they tend to be
> large corporate database systems which are just covering their lardy
> white corporate a.ses against the odd two-byte character.

Plenty of the people doing that sort of thing have lardy arses which are
brown or black or whatever.

As I mentioned elsewhere in this thread, the text was generated by an
OCR application - a not very good OCR app, but the one I happen to have.

As mentioned elsewhere in this thread, I've figured out how to solve the
problem but I've not quite worked out how come my earlier attempts
failed. On the other hand, who cares if I have a reliable solution?

Thanks for the thoughts.

Rowland.


Rowland McDonnell - 30 May 2008 18:32 GMT
[snip]

> > Then the bottom of each window has a
> > popup menu which lets you switch the encoding used, which will also be
[quoted text clipped - 9 lines]
> Yep, I have just tried again, following your suggestion and checking the
> signs in the pop-up menu.

[snip]

I've just asked TextWrangler to convert to ASCII for me.  ASCII - can't
go wrong, right?

Wrong.

It *says* it's converted to ASCII, but HexEdit shows clearly that it's
still a two-byte encoding in use when I look at the guts of this
allegedly ASCII encoded file.

Anyone got a clue?

Rowland.


robert - 30 May 2008 18:34 GMT
Rowland McDonnell <real-address-in-sig@flur.bltigibbet>:
> MacTeX 2007, and TeXShop v2.14
>
> I've just pasted some Unicode text into a tex document open in TeXShop.

Where did that Unicode text originate from? If I paste some (UTF-16)
text from TextEdit to TeXShop and save it from there as Mac OS Roman,
the resulting .tex seems to be correct (accented characters are
converted from their UTF-16 representation to Mac OS Roman).

Signature

                                                                  robert

Rowland McDonnell - 30 May 2008 19:20 GMT
> Rowland McDonnell <real-address-in-sig@flur.bltigibbet>:
> > MacTeX 2007, and TeXShop v2.14
> >
> > I've just pasted some Unicode text into a tex document open in TeXShop.
>
> Where did that Unicode text originate from?

An OCR package.

> If I paste some (UTF-16)
> text from TextEdit to TeXShop and save it from there as Mac OS Roman,
> the resulting .tex seems to be correct (accented characters are
> converted from their UTF-16 representation to Mac OS Roman).

Now try running the text past TeX.

Go on, try it.

Everything looks fine here until I try TeXing it.  I've opened up the
text in TextWrangler, it claims that it's in Western (Mac OS) encoding,
but there's a null in between each `byte representing a character' if
that's the case.  That null seems to be ignored by all the text editors
I've tried, including TeXShop.  But TeX can see the nulls.

Admittedly, one could probably make some changes so that TeX would
ignore the nulls at least for the purpose in hand - but that's not
really the point.

Rowland.


robert - 30 May 2008 21:13 GMT
Rowland McDonnell <real-address-in-sig@flur.bltigibbet>:

>> Rowland McDonnell <real-address-in-sig@flur.bltigibbet>:
>> > MacTeX 2007, and TeXShop v2.14
[quoted text clipped - 5 lines]
>
> An OCR package.

Since it's the responsibility of the application from which you copy text to
encode it in a sane manner, my guess would be that the culprit is that OCR
package.

If it dumps a UTF-16 string to the pasteboard whilst not actually declaring
it to be UTF-16-encoded, Cocoa might assume the string is ISO-Latin-1-
encoded (or whatever the default encoding would be).
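One way this guess could play out, sketched in Python. This is an illustration of the hypothesis only, not a description of what Cocoa or the OCR package actually does. If UTF-16 bytes are read as a single-byte encoding, every zero byte becomes a character of its own and survives a later "conversion" to Mac OS Roman.

```python
# Assumption: the source application supplied UTF-16 little-endian bytes.
raw = "I am".encode("utf-16-le")

# Hypothetical wrong guess: treat each byte as one ISO-Latin-1 character.
misread = raw.decode("latin-1")

# Saving the misread text as Mac OS Roman keeps the zero bytes.
resaved = misread.encode("mac_roman")

print(len(misread))  # 8
print(resaved)       # b'I\x00 \x00a\x00m\x00'
```

The editor now holds eight characters, four of them invisible, and the saved file is labelled Mac OS Roman while still containing a zero byte after every letter.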

>> If I paste some (UTF-16) text from TextEdit to TeXShop and save it
>> from there as Mac OS Roman, the resulting .tex seems to be correct
[quoted text clipped - 4 lines]
>
> Go on, try it.

I did, and it works as expected. That is, both 'latex' and 'pdflatex'
run just fine, they just don't handle files in Mac OS Roman too well
because they expect - I think - ISO-Latin-1 as character encoding for
input files. This results in some unexpected characters in the resulting
PDF, but no errors like the one you got.

The .tex file doesn't contain any NULL-bytes either.

> Everything looks fine here until I try TeXing it. I've opened up
> the text in TextWrangler, it claims that it's in Western (Mac OS)
> encoding, but there's a null in between each `byte representing a
> character' if that's the case. That null seems to be ignored by
> all the text editors I've tried, including TeXShop. But TeX can
> see the nulls.

If you forced TextWrangler to use UTF-16 input encoding, but it didn't
'take', it might have to do with a missing UTF-16 Byte Order Mark in
front of the text. UTF-16 files/strings are usually recognised by
starting with the bytes FF FE (little-endian) or FE FF (big-endian).

Signature

                                                                    robert

Rowland McDonnell - 31 May 2008 01:17 GMT
> Rowland McDonnell <real-address-in-sig@flur.bltigibbet>:
> >
[quoted text clipped - 15 lines]
> it to be UTF-16-encoded, Cocoa might assume the string is ISO-Latin-1-
> encoded (or whatever the default encoding would be).

Actually, the situation was more like this - quoting myself from another
post in this thread:

--------------------------------------------------------------------

Ah - I made the attempt when I had the file in `allegedly Western MacOS
Roman' encoding according to the operation performed by `something else,
might well be TextEdit'.  Back to the original file.

Okay.  Unicode UTF 16 little endian is what TextWrangler says it is,
with CRLF line endings.

Converted to Western (MacOS Roman), saved it - that works fine in
TeXShop.  All is well.  No fiddling required, `it just worked without
fuss as you'd expect'.
--------------------------------------------------------------------

[snip]

Thanks for the post - I've removed loads of useful stuff that aids
understanding and so on.  Cheers!

Rowland.


Franck Pastor - 31 May 2008 06:21 GMT
> MacTeX 2007, and TeXShop v2.14
>
[quoted text clipped - 24 lines]
> anything - to turn that Unicode into something that looks a bit like
> ASCII would do me.

You first have to convert that bit of Unicode stuff (UTF-8? UTF-16?) to
MacOS Roman encoding (or, vice versa, your TeX file to a Unicode encoding)
before including it in your TeX file. TeXShop unfortunately hasn't got
the tools for that, but some other text editors, such as BBEdit and
SubEthaEdit, have. Perhaps TextWrangler (free, as in free beer) too.
Franck Pastor - 31 May 2008 06:23 GMT
>> MacTeX 2007, and TeXShop v2.14
>>
[quoted text clipped - 30 lines]
> the tools for that, but some other text editors, such as BBEdit and
> SubEthaEdit, have. Perhaps TextWrangler (free, as in free beer) too.

Sorry, someone has already answered the same thing before!
 