
Properly detect line-endings of a file in Perl?

Tags: newline, perl

Problem: I have data (mostly in CSV format) produced on both Windows and *nix, and processed mostly on *nix. Windows uses CRLF for line endings and Unix uses LF. For any particular file, I don't know whether it has Windows or *nix line endings. Up until now, I've been writing something like this to handle the difference:

while (<$fh>){
    tr/\r\n//d;
    my @fields = split /,/, $_;
    # ...
}

On *nix, deleting \n is equivalent to chomping, and deleting \r additionally gets rid of the CR if it's a Windows-produced file.

But now I want to use Text::CSV_XS because I'm starting to get weirder data files with quoted data, potentially with embedded line breaks, etc. In order to get this module to read such files, Text::CSV_XS::getline() requires that you specify the end-of-line characters. (I can't read each line as above, apply tr/\n\r//d, and then parse it with Text::CSV, because that wouldn't handle embedded line breaks properly.) How do I properly detect whether an arbitrary file uses Windows or *nix style line endings, so I can tell Text::CSV_XS::eol() how to chomp()?

I couldn't find a module on CPAN that simply detects line endings. I don't want to first convert all my data files via dos2unix, because the files are huge (hundreds of gigabytes), and spending 10+ minutes on each file to deal with something so simple seems silly. I thought about writing a function that reads the first several hundred bytes of a file and counts LFs vs. CRLFs, but I refuse to believe this doesn't have a better solution.
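For reference, the sampling idea mentioned above could look roughly like the sketch below. The helper name detect_eol and the 4096-byte default sample size are assumptions made for illustration, not part of any module. It reads the start of the file in raw mode, counts CRLF pairs against bare LFs, and returns whichever style is more common in the sample.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the line-ending style from the first bytes of a file.
# Returns "\r\n", "\n", or undef if the sample contains no line feed at all.
sub detect_eol {
    my ( $path, $sample_size ) = @_;
    $sample_size ||= 4096;    # assumed default; a few hundred bytes would also do

    # :raw disables newline translation, so the bytes are seen as stored on disk
    open( my $fh, '<:raw', $path ) or die "Cannot open $path: $!";
    my $got = read( $fh, my $sample, $sample_size );
    defined $got or die "Cannot read $path: $!";
    close $fh;

    my $crlf = () = $sample =~ /\r\n/g;    # number of CRLF pairs
    my $lf   = () = $sample =~ /\n/g;      # every LF, including those inside CRLF pairs
    my $bare = $lf - $crlf;                # LFs not preceded by a CR

    return undef unless $lf;

    # Majority vote, so a stray embedded line break of the other style
    # inside a quoted field does not flip the result
    return $crlf > $bare ? "\r\n" : "\n";
}
```

The returned string could then be passed as the eol attribute when constructing the Text::CSV_XS object. Only the sampled bytes are read, so the cost does not depend on the size of the file.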

Any help?

Note: every file has either entirely Windows line endings or entirely *nix line endings, i.e., the two are never mixed within a single file.

asked Aug 28 '12 22:08 by user1481


2 Answers

You could just open the file using the :crlf PerlIO layer and then tell Text::CSV_XS to use \n as the line ending character. On input, this layer silently maps every CR/LF pair to a single line feed and leaves bare line feeds untouched, so the same code handles both kinds of file. That is presumably what you want.

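use strict;
use warnings;
use Text::CSV_XS;

# binary => 1 allows line breaks embedded in quoted fields
my $csv = Text::CSV_XS->new( { binary => 1, eol => "\n" } );

# The :crlf layer translates CRLF to LF as the file is read
open( my $fh, '<:crlf', 'data.csv' ) or die "Cannot open data.csv: $!";

while ( my $row = $csv->getline( $fh ) ) {
     # do something with $row (an array reference holding one record's fields)
}
close $fh;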
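To see the effect of the :crlf layer in isolation, here is a small sketch that uses only core modules. The helper name read_lines_crlf and the sample records are made up for illustration. It writes the same two records once with CRLF and once with LF endings, then reads both files back through the :crlf layer.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read all lines of a file through the :crlf layer,
# stripping the trailing "\n" from each one.
sub read_lines_crlf {
    my ($path) = @_;
    open( my $fh, '<:crlf', $path ) or die "Cannot open $path: $!";
    chomp( my @lines = <$fh> );
    close $fh;
    return @lines;
}

# Write the same two records once with CRLF and once with LF endings.
# :raw on output keeps Perl from altering the bytes on any platform.
my %path;
for my $style ( [ windows => "\r\n" ], [ unix => "\n" ] ) {
    my ( $name, $eol ) = @$style;
    my ( $out, $file ) = tempfile( UNLINK => 1 );
    binmode $out, ':raw';
    print {$out} "a,b$eol", "c,d$eol";
    close $out;
    $path{$name} = $file;
}

my @windows = read_lines_crlf( $path{windows} );
my @unix    = read_lines_crlf( $path{unix} );

print "identical\n" if "@windows" eq "@unix";
```

Because the translation happens inside the I/O layer, nothing needs to be detected up front and no converted copy of the file is written.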
answered Oct 23 '22 10:10 by Ilmari Karonen


Since Perl 5.10, the \R escape matches a generic line ending (a CRLF pair as well as a lone LF), so you can use this to strip line endings of either kind:

s/\R//g;

It should work in all cases, both *nix and Windows.
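As a quick illustration, the sketch below applies \R to one Windows-style and one *nix-style line. The helper name strip_eol is hypothetical. Note that it anchors the match with \z so that only the trailing line ending is removed, whereas s/\R//g as written above removes every line break in the string, including any embedded in quoted fields.

```perl
use strict;
use warnings;
use 5.010;    # \R requires Perl 5.10 or later

# Hypothetical helper: remove one trailing line ending of any style.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\R\z//;    # \R matches "\r\n" as a unit, as well as a lone "\n" or "\r"
    return $line;
}

my $windows = strip_eol("a,b,c\r\n");
my $unix    = strip_eol("a,b,c\n");

say "same result" if $windows eq $unix;
```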

answered Oct 23 '22 09:10 by squiguy