TeXShop (Cocoa app) and Unicode trouble.
Rowland McDonnell - 30 May 2008 05:41 GMT
MacTeX 2007, and TeXShop v2.14
I've just pasted some Unicode text into a tex document open in TeXShop.
I've saved the file with Western MacOS X encoding. Everything looks fine.
But when I try to LaTeX it, pdfTeX gets as far as the pasted in Unicode stuff and barfs like this:
! Text line contains an invalid character.
l.48 I^^@ ^^@a^^@m^^@ ^^@a^^@l^^@s^^@o^^@ ^^@p^^@l^^@e^^@a^^@s^^@e^^@d^^@ ^^...
It seems that TeX thinks that I^^@ ^^@a^^@m^^@ ^^@a^^@l^^@s^^@o^^@ ^^@p^^@l^^@e^^@a^^@s^^@e^^@d^^@ is what the line of text looks like - "I am also pleased" is how it begins in `reality'.
(^^@ means `charcode 0' in TeX logfile speak.)
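A minimal sketch, in Python (not a tool anyone in the thread used), of why a line of plain English can show up in the log with ^^@ after every character. It assumes the pasted text was UTF-16 little-endian; `tex_log_view` is a hypothetical helper that only imitates the log notation for NUL.

```python
# Encode a short ASCII phrase as UTF-16 little-endian (no byte order mark).
text = "I am"
raw = text.encode("utf-16-le")

# Every ASCII character becomes two bytes: the character, then a NUL (0x00).
print(raw)  # b'I\x00 \x00a\x00m\x00'


def tex_log_view(data: bytes) -> str:
    """Imitate the TeX log notation: show each NUL byte as ^^@."""
    return "".join("^^@" if b == 0 else chr(b) for b in data)


print(tex_log_view(raw))  # I^^@ ^^@a^^@m^^@
```

The second line of output has the same shape as the start of the error message quoted above.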
So I tried pasting the suspect text into Textedit, saved that as Western MacOS X encoding, closed the file, opened it again in Textedit, tried cut-and-paste into TeXShop - and got the same again.
Does anyone have a clue how I could deal with this one? Something - anything - to turn that Unicode into something that looks a bit like ASCII would do me.
Rowland.
 Signature Remove the animal for email address: rowland.mcdonnell@dog.physics.org Sorry - the spam got to me http://www.mag-uk.org http://www.bmf.co.uk UK biker? Join MAG and the BMF and stop the Eurocrats banning biking
Chris Ridd - 30 May 2008 07:53 GMT
> Does anyone have a clue how I could deal with this one? Something -
> anything - to turn that Unicode into something that looks a bit like
> ASCII would do me.
TextWrangler will do the job. It guesses the encoding when it opens a file, so if it guesses wrong you need to use File > Reopen Using Encoding and force the right one. Then the bottom of each window has a popup menu which lets you switch the encoding used, which will also be the encoding used when you save.
You can probably also make it search/replace the NUL bytes instead.
Cheers,
Chris
Rowland McDonnell - 30 May 2008 17:30 GMT
> (Rowland McDonnell) said:
> [quoted text clipped - 3 lines]
>
> TextWrangler will do the job.
I tried that before posting.
It didn't.
> It guesses the encoding when it opens a
> file, so if it guesses wrong you need to use File > Reopen Using
> Encoding and force the right one.
Ah! That's what that command does.
> Then the bottom of each window has a
> popup menu which lets you switch the encoding used, which will also be
> the encoding used when you save.
Yeah, well, I can see TextWrangler telling me that I'm looking at Western MacOS encoding, and that's *AFTER* I've saved and re-opened the file.
You know what? TeX still barfs because it's seeing two-byte characters when I cut and paste from that, just like it did earlier.
Yep, I have just tried again, following your suggestion and checking the signs in the pop-up menu.
> You can probably also make it search/replace the NUL bytes instead.
There might be a way of doing that, but I can't see it.
I hate to admit to this, but I found a way round it.
I thought `All these apps I've been playing with are fancy Cocoa apps, and they're all broken the same way when attempting to perform the same job - and they're all clearly using OS support to do it, so let's assume that Cocoa's broken.'
MS is famous for writing software that ignores OS support for many things and `just does it its own way'. So I used MS Word, opened the file, saved it as text.
And that works perfectly well.
Grr.
Annoying that it took bloody MS to dig me out of that hole, but there you go. It wouldn't be installed at all if we hadn't had access to the `free to us but officially licenced for us to do this' installer from work. Too bloody expensive otherwise and up until today, I would have said I had no use for it at all.
Rowland.
Chris Ridd - 30 May 2008 17:41 GMT
>> You can probably also make it search/replace the NUL bytes instead.
>
> There might be a way of doing that, but I can't see it.
Find, choose the "Use Grep" option, and then search for "\0" without the quotes. Replace it with whatever, eg blank.
The number after the \ is octal (historical reasons). You can also do \x followed by hex. The value's the byte in the file.
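The same search and replace can be sketched outside the editor. This uses Python's `re` module on raw bytes, which is an assumption standing in for TextWrangler's grep engine; both the octal `\0` and the hex `\x00` spellings are shown. Blindly deleting NULs is only safe when the text is pure ASCII stored as UTF-16, so treat it as a quick check rather than a proper conversion.

```python
import re

# Sample data: ASCII text stored as UTF-16 little-endian, so it is full of NULs.
data = "I am".encode("utf-16-le")

# Octal escape: \0 matches the byte with value 0.
octal_form = re.sub(rb"\0", b"", data)

# Hex escape: \x00 matches the same byte.
hex_form = re.sub(rb"\x00", b"", data)

print(octal_form, hex_form)
```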
> MS is famous for writing software that ignores OS support for many
> things and `just does it its own way'. So I used MS Word, opened the
> file, saved it as text.
Heh, sometimes being dumb and stupid (Word) pays off.
Cheers,
Chris
Rowland McDonnell - 30 May 2008 19:00 GMT
> (Rowland McDonnell) said:
> [quoted text clipped - 4 lines]
> Find, choose the "Use Grep" option, and then search for "\0" without
> the quotes. Replace it with whatever, eg blank.
Yes, but since the zero value characters are not visible, but seem to be considered as part of two-byte characters whatever I do, I'd not expect that to do a thing.
I tried it, and sure enough: no effect at all. I got a `didn't find anything' bong and no changes to file contents (yes, I looked).
> The number after the \ is octal (historical reasons). You can also do
> \x followed by hex. The value's the byte in the file.
[quoted text clipped - 4 lines]
>
> Heh, sometimes being dumb and stupid (Word) pays off.
I'm slightly astonished at the way all these other alleged encoding conversions appear to do nothing whatsoever.
I'd still like to get to the bottom of this one.
Rowland.
Chris Ridd - 30 May 2008 19:31 GMT
>> (Rowland McDonnell) said:
>> [quoted text clipped - 11 lines]
> I tried it, and sure enough: no effect at all. I got a `didn't find
> anything' bong and no changes to file contents (yes, I looked).
You've got some other sort of file then, because if I replace \0 with a semi-colon in a file saved as UTF-16 (no BOM) and force read-in as MacOS Roman, it puts a ; in between each char. I made sure there were NUL bytes in there by doing a Hex dump file, as you rightly note that TW doesn't show the NUL bytes.
> I'm slightly astonished at the way all these other alleged encoding
> conversions appear to do nothing whatsoever.
It definitely makes a difference, so you're doing something wrong. Ah - you're not using TextWrangler's Hex dump *front window* option are you? That dumps out the in-memory representation of each char, and as TW uses UTF16 for that you'll see a load of NULs...
Cheers,
Chris
Rowland McDonnell - 30 May 2008 20:50 GMT
> (Rowland McDonnell) said:
> [quoted text clipped - 19 lines]
> NUL bytes in there by doing a Hex dump file, as you rightly note that
> TW doesn't show the NUL bytes.
Ah - I made the attempt when I had the file in `allegedly Western MacOS Roman' encoding according to the operation performed by `something else, might well be TextEdit'. Back to the original file.
Okay. Unicode UTF 16 little endian is what TextWrangler says it is, with CRLF line endings.
Converted to Western (MacOS Roman), saved it - that works fine in TeXShop. All is well. No fiddling required, `it just worked without fuss as you'd expect'.
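For anyone without TextWrangler to hand, the conversion described here can be sketched in Python. The function name `utf16le_to_mac_roman` is made up for illustration, and the newline normalisation is an optional extra that the post does not say was needed; the codec names `utf-16-le` and `mac_roman` are the ones Python's standard library provides.

```python
def utf16le_to_mac_roman(raw: bytes, normalize_newlines: bool = True) -> bytes:
    """Decode UTF-16 little-endian bytes and re-encode them as Mac OS Roman."""
    text = raw.decode("utf-16-le")
    # Drop a leading byte order mark if the data happens to start with one.
    if text.startswith("\ufeff"):
        text = text[1:]
    if normalize_newlines:
        # Optional (an assumption): turn CRLF line endings into plain LF.
        text = text.replace("\r\n", "\n")
    # Characters with no Mac OS Roman equivalent become "?" instead of raising.
    return text.encode("mac_roman", errors="replace")


# Stand-in for the OCR output: UTF-16 LE text with CRLF line endings.
sample = "I am also pleased\r\ncaf\u00e9\r\n".encode("utf-16-le")
converted = utf16le_to_mac_roman(sample)
print(converted)
```

The key point is that the bytes are decoded with the encoding they really have before being re-encoded, which is what removes the NULs.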
(the original file was created on a Mac by a Mac application - but by a very bad Mac app badly ported from Windoze by the look of it)
btw, which way round is little endian? I understand that it means `Least significant bit at the end of the binary word', but the phrase gives no hint as to which end.
> > I'm slightly astonished at the way all these other alleged encoding
> > conversions appear to do nothing whatsoever.
>
> It definitely makes a difference, so you're doing something wrong. Ah -
> you're not using TextWrangler's Hex dump *front window* option are you?
Nope. Just opening the file as a text file for editing and saving it again - Save As... and selecting the encoding I want was my first attempt; using the pop-up menu in the bottom left and then saving was my second attempt.
> That dumps out the in-memory representation of each char, and as TW
> uses UTF16 for that you'll see a load of NULs...
What I was doing wrong was using not the original of the troublesome file, but a version that had been messed up by something else first.
Now I've gone back to the original, I find that TextWrangler does indeed work as advertised - TextEdit and TeXShop do not.
Rowland.
Chris Ridd - 30 May 2008 21:14 GMT
> btw, which way round is little endian? I understand that it means
> `Least significant bit at the end of the binary word', but the phrase
> gives no hint as to which end.
Just replace bit with byte; a little-endian version of ' ' (0x20) is stored as 0x20 0x00. A big-endian version is stored as 0x00 0x20.
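The space example can be checked in a couple of lines of Python, shown here only as an illustration of the two byte orders:

```python
space = " "  # code point 0x20

little = space.encode("utf-16-le")  # least significant byte first
big = space.encode("utf-16-be")     # most significant byte first

print(little.hex(" "))  # 20 00
print(big.hex(" "))     # 00 20

# The same ordering applies to any 16-bit integer, not just characters.
print((0x20).to_bytes(2, "little") == little)  # True
print((0x20).to_bytes(2, "big") == big)        # True
```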
> Now I've gone back to the original, I find that TextWrangler does indeed
> work as advertised - TextEdit and TeXShop do not.
Success!
Cheers,
Chris
Rowland McDonnell - 31 May 2008 01:17 GMT
> (Rowland McDonnell) said:
> [quoted text clipped - 4 lines]
> Just replace bit with byte; a little-endian version of ' ' (0x20) is
> stored as 0x20 0x00. A big-endian version is stored as 0x00 0x20.
From which I deduce that the information I need is that `little endian' means `least significant first', and `big endian' means `most significant first'.
(why do these confusing terms come into being?)
> > Now I've gone back to the original, I find that TextWrangler does indeed
> > work as advertised - TextEdit and TeXShop do not.
>
> Success!
That's what I thought. Bloody pain, some of this stuff, innit?
Rowland.
Chris Ridd - 31 May 2008 06:53 GMT
>> (Rowland McDonnell) said:
>> [quoted text clipped - 8 lines]
> means `least significant first', and `big endian' means `most
> significant first'.
Yes.
> (why do these confusing terms come into being?)
Blame Jonathan Swift.
The Byte Order Mark that robert mentioned is a single UTF-16 value (0xfffe) that Unicode defines to be an illegal character, yet the way you write it in the first two bytes unambiguously indicates if the rest of the file is little- or big-endian.
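A small sketch of BOM sniffing in Python, using the constants from the standard `codecs` module. One detail worth pinning down: in the Unicode standard the byte order mark itself is code point U+FEFF, and U+FFFE is the noncharacter a reader would see if it guessed the byte order wrong, which is what makes the mark unambiguous. The function name `utf16_byte_order` is hypothetical, and the check ignores the fact that a UTF-32 little-endian file also begins with FF FE.

```python
import codecs


def utf16_byte_order(raw: bytes) -> str:
    """Guess the UTF-16 byte order from the first two bytes, if a BOM is present."""
    if raw.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little-endian"
    if raw.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big-endian"
    return "unknown (no BOM)"


le_file = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
be_file = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
bare = "hi".encode("utf-16-le")  # no BOM, like the file discussed earlier

for name, blob in [("le_file", le_file), ("be_file", be_file), ("bare", bare)]:
    print(name, "->", utf16_byte_order(blob))
```

Without a BOM the function cannot tell, which is the situation where an editor has to guess or be told the encoding.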
> That's what I thought. Bloody pain, some of this stuff, innit?
It is a bit, but at least Unicode's somewhat sane compared to previous attempts at defining "mega" sets of funny characters. ISO 2022, I'm talking about you.
Cheers,
Chris
Rowland McDonnell - 31 May 2008 18:33 GMT
> (Rowland McDonnell) said:
> [quoted text clipped - 12 lines]
>
> Yes.
So why can't they just say that? LSB first, MSB first - the terms that used to be used - were impossible to misunderstand even if you'd never met them before. Little endian and big endian - well, I need to have those terms written down on a piece of paper sat in front of me so I can learn them, and even then I'll forget in a year or two if I don't use the info.
The problem is really whatever cultural instinct makes US engineers refer to pressure in pounds and specific impulse in seconds.
(what's wrong with the latter? Specific impulse is `mpg for rockets', and the real units are m/s - the Yanks get `s' by dividing the real number by `gravitational field strength at the Earth's surface'. It's a rocket engine equation. Gravitational field strength at the Earth's surface cannot have a part to play in any valid function relating to general rocket engine operation. But these people aren't stupid - so why do they do it?)
> > (why do these confusing terms come into being?)
>
> Blame Jonathan Swift.
He didn't mis-apply those terms in a confusing way, did he? Nope.
> The Byte Order Mark that robert mentioned is a single UTF-16 value
> (0xfffe) that Unicode defines to be an illegal character, yet the way
> you write it in the first two bytes unambiguously indicates if the rest
> of the file is little- or big-endian.
Can't you say LSB first or MSB first instead of those confusing terms? If people start to do that sort of thing a bit more, we might be able to wipe out the confusing terms.
> > That's what I thought. Bloody pain, some of this stuff, innit?
>
> It is a bit, but at least Unicode's somewhat sane
The problem is more the software that claims to do conversions but doesn't do them properly.
> compared to previous
> attempts at defining "mega" sets of funny characters. ISO 2022, I'm
> talking about you.
I don't see why the `this numbered slot in the encoding means this particular glyph' method has to be used at all.
Why can't there be a list of verbally defined glyphs, with the standard including the standard *names*, and a standard mechanism for defining encodings on the fly?
Do it right, and that way you can have arbitrary character sets using eight bit numerical encodings if you like - just spread across several sets of eight bit encodings.
(TeX can do things a bit like that using the virtual fount mechanism)
ISO 2022 seems to be thinking in that part of the spectrum, but it does seem to be a bit of a dog's dinner to me from the Wikipedia article.
Rowland.
Peter Flynn - 31 May 2008 15:55 GMT
> I'd still like to get to the bottom of this one.
As I understand it (which may be faulty), the original document seems to be using the UTF-16 encoding, so each character is forced to a two-byte representation with a null first byte where necessary. UTF-8 would have been a better choice, because characters that can be represented by a single low-order byte (eg ASCII) are left as single bytes, and only those requiring two or more bytes are encoded as multibyte sequences.
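The size difference between the two encodings can be seen with a few sample characters. This is only an illustrative Python sketch; `byte_lengths` is a made-up helper, and little-endian UTF-16 without a BOM is assumed, in which case the null is the second byte of each ASCII pair rather than the first.

```python
def byte_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in ["a", "\u00e9", "\u20ac"]:  # a, e-acute, euro sign
    utf8 = ch.encode("utf-8").hex(" ")
    utf16 = ch.encode("utf-16-le").hex(" ")
    print(f"U+{ord(ch):04X}  utf-8: {utf8:<9} utf-16-le: {utf16}")
```

Only the plain ASCII letter picks up a NUL byte in UTF-16, which is why mostly-English text ends up with a null after almost every character.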
At a wild random guess, the original was generated from some XML system which uses UTF-16 by default. There are several culprits (see microsoft.public.dotnet.xml and similar _passim_) and they tend to be large corporate database systems which are just covering their lardy white corporate a.ses against the odd two-byte character.
///Peter
Rowland McDonnell - 31 May 2008 18:33 GMT
> > I'd still like to get to the bottom of this one.
> [quoted text clipped - 4 lines]
> single low-order byte (eg ASCII) are left as single bytes, and only
> those requiring two or more bytes are encoded as multibyte sequences.
I didn't have much choice over text format.
> At a wild random guess, the original was generated from some XML system
> which uses UTF-16 by default. There are several culprits (see
> microsoft.public.dotnet.xml and similar _passim_) and they tend to be
> large corporate database systems which are just covering their lardy
> white corporate a.ses against the odd two-byte character.
Plenty of the people doing that sort of thing have lardy arses which are brown or black or whatever.
As I mentioned elsewhere in this thread, the text was generated by an OCR application - a not very good OCR app, but the one I happen to have.
As mentioned elsewhere in this thread, I've figured out how to solve the problem but I've not quite worked out how come my earlier attempts failed. On the other hand, who cares if I have a reliable solution?
Thanks for the thoughts.
Rowland.
Rowland McDonnell - 30 May 2008 18:32 GMT
[snip]
> > Then the bottom of each window has a
> > popup menu which lets you switch the encoding used, which will also be
[quoted text clipped - 9 lines]
> Yep, I have just tried again, following your suggestion and checking the
> signs in the pop-up menu.
[snip]
I've just asked TextWrangler to convert to ASCII for me. ASCII - can't go wrong, right?
Wrong.
It *says* it's converted to ASCII, but HexEdit shows clearly that it's still a two-byte encoding in use when I look at the guts of this allegedly ASCII encoded file.
Anyone got a clue?
Rowland.
robert - 30 May 2008 18:34 GMT
Rowland McDonnell <real-address-in-sig@flur.bltigibbet>:
> MacTeX 2007, and TeXShop v2.14
>
> I've just pasted some Unicode text into a tex document open in TeXShop.
Where did that Unicode text originate from? If I paste some (UTF-16) text from TextEdit to TeXShop and save it from there as Mac OS Roman, the resulting .tex seems to be correct (accented characters are converted from their UTF-16 representation to Mac OS Roman).
 Signature robert
Rowland McDonnell - 30 May 2008 19:20 GMT
> Rowland McDonnell <real-address-in-sig@flur.bltigibbet>:
> > MacTeX 2007, and TeXShop v2.14
> >
> > I've just pasted some Unicode text into a tex document open in TeXShop.
>
> Where did that Unicode text originate from?
An OCR package.
> If I paste some (UTF-16)
> text from TextEdit to TeXShop and save it from there as Mac OS Roman,
> the resulting .tex seems to be correct (accented characters are
> converted from their UTF-16 representation to Mac OS Roman).
Now try running the text past TeX.
Go on, try it.
Everything looks fine here until I try TeXing it. I've opened up the text in TextWrangler, it claims that it's in Western (Mac OS) encoding, but there's a null in between each `byte representing a character' if that's the case. That null seems to be ignored by all the text editors I've tried, including TeXShop. But TeX can see the nulls.
Admittedly, one could probably make some changes so that TeX would ignore the nulls at least for the purpose in hand - but that's not really the point.
Rowland.
robert - 30 May 2008 21:13 GMT
Rowland McDonnell <real-address-in-sig@flur.bltigibbet>:
>> Rowland McDonnell <real-address-in-sig@flur.bltigibbet>:
>> > MacTeX 2007, and TeXShop v2.14
[quoted text clipped - 5 lines]
>
> An OCR package.
Since it's the responsibility of the application from which you copy text to encode it in a sane manner, my guess would be that the culprit is that OCR package.
If it dumps a UTF-16 string to the pasteboard whilst not actually declaring it to be UTF-16-encoded, Cocoa might assume the string is ISO-Latin-1- encoded (or whatever the default encoding would be).
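The effect of that kind of mislabelling can be imitated in Python. This is only a sketch of the hypothesis: whether Cocoa or the OCR package actually behaves this way is not established anywhere in the thread, and ISO-Latin-1 is used as the assumed fallback encoding.

```python
# What the source application really produced: UTF-16 little-endian bytes.
raw = "I am".encode("utf-16-le")

# What a receiver does if it wrongly assumes one byte per character:
# every NUL byte becomes a real U+0000 character in the string.
misread = raw.decode("latin-1")
print(repr(misread))  # 'I\x00 \x00a\x00m\x00'

# "Converting" that string to another 8-bit encoding keeps the NULs,
# because U+0000 is a perfectly valid character in ASCII and Mac OS Roman.
resaved_ascii = misread.encode("ascii")
resaved_mac = misread.encode("mac_roman")
print(resaved_ascii == raw, resaved_mac == raw)  # True True
```

That would explain a file that editors report as Western (Mac OS) or ASCII while a hex dump still shows a null after every character.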
>> If I paste some (UTF-16) text from TextEdit to TeXShop and save it
>> from there as Mac OS Roman, the resulting .tex seems to be correct
[quoted text clipped - 4 lines]
>
> Go on, try it.
I did, and it works as expected. That is, both 'latex' and 'pdflatex' run just fine, they just don't handle files in Mac OS Roman too well because they expect - I think - ISO-Latin-1 as character encoding for input files. This results in some unexpected characters in the resulting PDF, but no errors like the one you got.
The .tex file doesn't contain any NULL-bytes either.
> Everything looks fine here until I try TeXing it. I've opened up
> the text in TextWrangler, it claims that it's in Western (Mac OS)
> encoding, but there's a null in between each `byte representing a
> character' if that's the case. That null seems to be ignored by
> all the text editors I've tried, including TeXShop. But TeX can
> see the nulls.
If you forced TextWrangler to use UTF-16 inputencoding, but it didn't 'take', it might have to do with a missing UTF-16 Byte Order Mark in front of the text. UTF-16 files/strings are recognised by starting with either a 0xFFFE or 0xFEFF character.
 Signature robert
Rowland McDonnell - 31 May 2008 01:17 GMT
> Rowland McDonnell <real-address-in-sig@flur.bltigibbet>:
>
> [quoted text clipped - 15 lines]
> it to be UTF-16-encoded, Cocoa might assume the string is ISO-Latin-1-
> encoded (or whatever the default encoding would be).
Actually, the situation was more like this - quoting myself from another post in this thread:
--------------------------------------------------------------------
Ah - I made the attempt when I had the file in `allegedly Western MacOS Roman' encoding according to the operation performed by `something else, might well be TextEdit'. Back to the original file.
Okay. Unicode UTF 16 little endian is what TextWrangler says it is, with CRLF line endings.
Converted to Western (MacOS Roman), saved it - that works fine in TeXShop. All is well. No fiddling required, `it just worked without fuss as you'd expect'.
--------------------------------------------------------------------
[snip]
Thanks for the post - I've removed loads of useful stuff that aids understanding and so on. Cheers!
Rowland.
Franck Pastor - 31 May 2008 06:21 GMT
> MacTeX 2007, and TeXShop v2.14
> [quoted text clipped - 24 lines]
> anything - to turn that Unicode into something that looks a bit like
> ASCII would do me.
You first have to convert that bit of Unicode stuff (UTF-8? UTF-16?) to MacOS Roman encoding (or, vice versa, your TeX file to a Unicode encoding) before including it in your TeX file. TeXShop unfortunately hasn't got the tools for that, but some other text editors, such as BBEdit and SubEthaEdit, have. Perhaps TextWrangler (free, as in free beer) too.
Franck Pastor - 31 May 2008 06:23 GMT
>> MacTeX 2007, and TeXShop v2.14
>>
>> [quoted text clipped - 30 lines]
> the tools for that, but some other text editor as BBEdit, SubEthaEdit
> have. Perhaps TextWrangler (free, as in free beer) too.
Sorry, someone has already answered the same thing before!