Problem: I have data (mostly in CSV format) produced on both Windows and *nix, and processed mostly on *nix. Windows uses CRLF for line endings and Unix uses LF. For any particular file, I don't know whether it has Windows or *nix line endings. Up until now, I've been writing something like this to handle the difference:
while (<$fh>){
tr/\r\n//d;
my @fields = split /,/, $_;
# ...
}
On *nix the \n part is equivalent to chomping, and the \r part additionally gets rid of the CR if it's a Windows-produced file.
But now I want to use Text::CSV_XS b/c I'm starting to get weirder data files with quoted data, potentially with embedded line breaks, etc. In order to get this module to read such files, Text::CSV_XS::getline() requires that you specify the end-of-line characters. (I can't read each line as above, tr/\r\n//d, and then parse it with Text::CSV b/c that wouldn't handle embedded line breaks properly.) How do I properly detect whether an arbitrary file uses Windows or *nix style line endings, so I can tell Text::CSV_XS::eol() how to chomp()?
I couldn't find a module on CPAN that simply detects line endings. I don't want to first convert all my data files via dos2unix, b/c the files are huge (hundreds of gigabytes), and spending 10+ minutes on each file to deal with something so simple seems silly. I thought about writing a function which reads the first several hundred bytes of a file and counts LFs vs CRLFs, but I refuse to believe this doesn't have a better solution.
Any help?
Note: every file has either entirely Windows line endings or entirely *nix line endings, i.e., the two are never mixed in a single file.
You could just open the file using the :crlf PerlIO layer and then tell Text::CSV_XS to use \n as the line ending character. This will silently map any CR/LF pairs to single line feeds, including pairs inside quoted fields, but that's presumably what you want.
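use strict;
use warnings;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new( { binary => 1, eol => "\n" } );
# The :crlf layer maps CRLF to LF on read; files that already use LF pass through unchanged.
open( my $fh, '<:crlf', 'data.csv' ) or die "Cannot open data.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
# do something with $row (an array reference holding the fields)
}
close $fh;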
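If you do still want to know which convention a file uses, the "read the first few hundred bytes" idea from the question can be sketched roughly as below. This is only an illustration: detect_eol is a hypothetical helper name, the 4096-byte sample size is an arbitrary assumption, and it relies on the stated guarantee that a file never mixes the two styles.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any CPAN module): guess the line ending
# from the first chunk of a file. Returns "\r\n", "\n", or undef when the
# sample contains no line feed at all.
sub detect_eol {
    my ( $path, $sample_size ) = @_;
    $sample_size //= 4096;    # assumed default; raise it for very long first lines

    # Read raw bytes so that no layer translates the line endings first.
    open( my $fh, '<:raw', $path ) or die "Cannot open $path: $!";
    my $read = read( $fh, my $buf, $sample_size );
    die "Cannot read $path: $!" unless defined $read;
    close $fh;

    # Find the first line feed and check whether a CR comes right before it.
    return $buf =~ /(\r?\n)/ ? $1 : undef;
}
```

The result could then be passed as the eol value to Text::CSV_XS->new. Only the sample is read, so the cost does not depend on the size of the file.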
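Since Perl 5.10, \R matches a generic line ending (CRLF, LF, a lone CR, and a few other vertical whitespace characters), so you can strip line endings with:
s/\R//g;
It should work for both *nix and Windows files without knowing in advance which one you have.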
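As a small sketch of how that behaves, the snippet below applies the same substitution to a Windows-style and a *nix-style line. The strip_eol name is made up for this example. Like the tr approach in the question, it works line by line, so on its own it does not help with line breaks embedded in quoted fields.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: remove any generic line ending
# (\R needs Perl 5.10 or later) and return the cleaned-up copy.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\R//g;
    return $line;
}

# The same call handles both conventions.
for my $line ( "a,b,c\r\n", "a,b,c\n" ) {
    my @fields = split /,/, strip_eol($line);
    print scalar(@fields), " fields, last one is '$fields[-1]'\n";
}
```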