




Regex and Mac vs UNIX line endings

Andrew Brosnan - 20 Jul 2006 01:33 GMT
I'm processing a string with embedded newlines. For testing I was
storing the text in __DATA__ and slurping it into a string. This works
fine. However, when I read in a file, I'm having trouble with the line
endings. Matching the beginning/end of logical lines is not working as I
expect. Regexes like the one below match when using the DATA filehandle,
but don't when opening other text files on my Mac.

   $text =~ s/^Text to match.*$//m;

Is this due to UNIX '\n' vs. Mac '\r' line endings? I assumed the 'm'
modifier would recognize any line ending.

Oh what to do?

Andrew
Robert Hicks - 20 Jul 2006 02:51 GMT
> I'm processing a string with embedded newlines. For testing I was
> storing the text in __DATA__ and slurping it into a string. This works
[quoted text clipped - 11 lines]
>
> Andrew

What version of the Mac? Anything in the OS X family is Unix and uses the
standard "\n" line ending. If you brought the files over from an older
system, then yes, you are going to have the '\r' line ending.

:Robert
Andrew Brosnan - 20 Jul 2006 03:01 GMT
> > I'm processing a string with embedded newlines. For testing I was
> > storing the text in __DATA__ and slurping it into a string. This
[quoted text clipped - 14 lines]
> >
> What version of the Mac?

10.3.9

> Anything in the OSX family is Unix and uses the
> standard "\n" line ending

I don't think that is the case. These are text files created on 10.3.9
and they use \r for line endings. The problem is that /^.*$/ won't match
lines ending with \r even with the m modifier.

Andrew
Doug McNutt - 20 Jul 2006 03:32 GMT
If you want to adjust the line ends in the files have a look at:

<ftp://ftp.macnauchtan.com/Software/LineEnds/FixEndsFolder.sit>  52 kB
<ftp://ftp.macnauchtan.com/Software/LineEnds/ReadMe_fixends.txt>  4 kB

Yeah. It's pretty easy in perl too.

I have, on occasion, read the first few hundred characters of a file and searched for \n, \r, and \r\n. From that I make a guess and reopen the file for line-by-line reading after setting $/ to what I found.

If you slurp in the whole string you can play with

my @option1 = split /\n/, $thedata;   # pieces if the eol is LF
my @option2 = split /\r/, $thedata;   # pieces if the eol is CR

Which option has the most elements?

split /(\r|\n)/, $thedata; # is an idea I just had. I wonder?
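
The "sample the start of the file, then set $/" approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: detect_eol() is a hypothetical helper name, the 512-byte sample size is arbitrary, and the file is assumed to use one line ending consistently.

```perl
use strict;
use warnings;

# Guess a file's line ending from its first bytes.
# detect_eol() is a hypothetical helper, not part of any module.
sub detect_eol {
    my ($path, $sample_size) = @_;
    $sample_size ||= 512;              # arbitrary default sample size

    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;                       # read raw bytes, no newline translation
    my $sample = '';
    defined(read($fh, $sample, $sample_size))
        or die "Cannot read $path: $!";
    close $fh;

    return "\015\012" if $sample =~ /\015\012/;   # CRLF, checked first
    return "\015"     if $sample =~ /\015/;       # lone CR
    return "\012";                                # LF, or nothing found
}

# Usage: pass a file name on the command line.
if (my $path = shift @ARGV) {
    local $/ = detect_eol($path);      # read records using the detected eol
    open my $in, '<', $path or die "Cannot open $path: $!";
    binmode $in;
    while (my $line = <$in>) {
        chomp $line;                   # chomp removes whatever $/ holds
        print "line $.: $line\n";
    }
    close $in;
}
```

One known weakness of this sketch: if the sample happens to end exactly between the CR and the LF of the first CRLF pair, it will report a lone CR.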


--> Science is the business of discovering and codifying the rules and methods employed by the Intelligent Designer. Religions provide myths to mollify the anxiety experienced by those who choose not to participate. <--

Bruce Van Allen - 20 Jul 2006 16:14 GMT
Peter gave some good examples, so I shortened this to supplement his
suggestions.

I prefer to determine what the end-of-line (eol) "character" is using
something less slippery than \r and \n. In Perl, \n is the logical
newline for the OS that Perl is executing under, so the byte it stands
for varies: \012 on UNIX, \015 under MacPerl on classic Mac OS, and on
Windows it is translated to and from \015\012 in text mode.

Instead, use the octal characters, which for this are:

Mac                 CR (Carriage Return)  "\015"
UNIX, Linux, VMS    LF (Line Feed)        "\012"
Win                 CRLF                  "\015\012"

BTW, many apps in Mac OS X (Excel, Filemaker Pro) continue to use the
eol used in OS 9 and before (CR), not the UNIX eol (LF).

Here's my favorite way to get the eol and convert it to native, no
matter what's in the original file (at least in the popular OSes):

   $text       =~ s/(\015?\012|\015)/\n/gs;

You could also specify what you want, if that isn't simply the native
eol:
   my $new_eol     = "\015";  # or "\012" or "\015\012"
   $text           =~ s/(\015?\012|\015)/$new_eol/gs;
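
To see why the conversion matters, here is a small sketch that applies the substitution above to CR-only text and then runs the original /m regex. The sample string and variable names are made up for illustration, and it assumes a Perl where "\n" is "\012" (any UNIX-like system, including Mac OS X).

```perl
use strict;
use warnings;

# Made-up sample text with classic Mac (CR-only) line endings.
my $text = "First line\015Text to match here\015Last line\015";

# With /m, ^ and $ anchor only around "\n", so nothing matches here:
# the string contains no "\n" and the target is not at the very start.
my $matched_before = ($text =~ /^Text to match.*$/m) ? 1 : 0;

# Convert CRLF, lone LF, or lone CR to the native "\n".
(my $converted = $text) =~ s/\015?\012|\015/\n/g;

# Now the line anchors behave as expected.
my $matched_after = ($converted =~ s/^Text to match.*$//m) ? 1 : 0;

print "before conversion: $matched_before\n";
print "after conversion:  $matched_after\n";
```

Note that the substitution removes only the text of the line and leaves its "\n" behind, so an empty line remains where the match was.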
   
If the file is large, then you may need to use a heuristic (that is,
test some of the text trying to detect a pattern), as Doug suggests,
testing the first x characters of the file to find one of the above eol
constructs, and then seeing whether it shows up again, and then backing
up and processing the whole file. Or use the look-ahead/behind
approaches that Peter suggests.

1;

- Bruce

__bruce__van_allen__santa_cruz__ca__
Peter N Lewis - 20 Jul 2006 07:34 GMT
>I'm processing a string with embedded newlines. For testing I was
>storing the text in __DATA__ and slurping it into a string. This works
[quoted text clipped - 9 lines]
>
>Oh what to do?

You have several possibilities, depending on what you are trying to do.

You could explicitly use either line ending, as in:

$text =~ s/(\012|\015|\A)Text to match[^\012\015]*(\012|\015|\z)/$1$2/;

or using backward/forward assertions:

$text =~ s/(?:\A|(?<=\012|\015))Text to match[^\012\015]*(?=\012|\015|\z)//;

(the convoluted backward assertion is required because backward
assertions must be of fixed length, so the zero-width \A cannot go
inside the look-behind alongside the one-character alternatives)

Or you could convert $text to \n line endings:

$text =~ s/(\015\012|\012|\015)/\n/g;
$text =~ s/^Text to match.*$//m;

Or you could detect the line ending and explicitly use it.

Enjoy,
   Peter.


Check out Interarchy 8.1.1, just released, now with Amazon S3 support.
<http://www.stairways.com/>  <http://download.stairways.com/>

Kurtz Le Pirate - 20 Jul 2006 18:25 GMT
In article
<r02010500-1039-5A2AE072178711DBB98900039344BB98@[10.0.0.2]>,

> I'm processing a string with embedded newlines. For testing I was
> storing the text in __DATA__ and slurping it into a string. This works
[quoted text clipped - 11 lines]
>
> Andrew

Hmm... is the 'end of line' character important?

If not, you can do something like this:
while (<FILE>) {
 chomp;
 if (/?????/) { ... }
 }

yes ? no ?


klp

Andrew Brosnan - 20 Jul 2006 18:33 GMT
All set with this. Converting the line endings worked fine. Thanks.

Andrew

On 7/20/06 at 7:25 PM, kurtzlepirate@yahoo.fr (kurtz le pirate) wrote:

> In article
> <r02010500-1039-5A2AE072178711DBB98900039344BB98@[10.0.0.2]>,
[quoted text clipped - 25 lines]
>
> yes ? no ?
Peter N Lewis - 21 Jul 2006 08:29 GMT
At 19:25 +0200 20/7/06, kurtz le pirate wrote:
>Hmm... is the 'end of line' character important?
>
[quoted text clipped - 5 lines]
>
>yes ? no ?

Not really, because if the file has Mac line endings, then that will
read the entire file in a single gulp.  Also, if the file has DOS line
endings, then the chomp will remove only the linefeed and leave the
carriage return (unless you have changed $/ to CRLF, in which case it
will not remove a lone linefeed).
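
Both effects can be reproduced without any files by reading from in-memory strings. This is a rough sketch only: read_records() is a hypothetical helper, the sample data is invented, and it assumes a UNIX-like Perl (5.8 or later for in-memory filehandles) where no CRLF translation layer is active.

```perl
use strict;
use warnings;

# Read all records from a string using the given record separator,
# then chomp them. read_records() is a hypothetical helper.
sub read_records {
    my ($data, $separator) = @_;
    local $/ = $separator;
    open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
    my @records = <$fh>;
    close $fh;
    chomp @records;            # strips one trailing copy of $/ per record
    return @records;
}

my $mac = "one\015two\015three\015";     # CR line endings
my $dos = "one\015\012two\015\012";      # CRLF line endings

my @gulp   = read_records($mac, "\012");       # whole file as one record
my @fixed  = read_records($mac, "\015");       # three records
my @dos_lf = read_records($dos, "\012");       # records keep a trailing CR
my @dos_ok = read_records($dos, "\015\012");   # records are clean

printf "Mac data, \$/ = LF:   %d record(s)\n", scalar @gulp;
printf "Mac data, \$/ = CR:   %d record(s)\n", scalar @fixed;
printf "DOS data, \$/ = LF:   first record is %d chars\n", length $dos_lf[0];
printf "DOS data, \$/ = CRLF: first record is %d chars\n", length $dos_ok[0];
```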

If you first check the file and determine the line endings (and the
file has consistent line endings, which is not always the case) and
set $/ appropriately, then what you suggest will work.

Enjoy,
   Peter.


 