I have a CSV file where some of the cells contain newline characters. For example:
id,name
01,"this is
with newline"
02,no newline
I want to remove all the newline characters inside cells.
How to do it with regex or with other terminal tools generically without knowing number of columns in advance?
This is actually a harder problem than it looks, and in my opinion that means regex isn't the right solution. Because you're dealing with quoted/escaped strings spanning multiple 'lines', you end up with a complicated and difficult-to-read regex. (It's not impossible, it's just messy.)
I would suggest using a parser instead. Perl has one in Text::CSV, and it goes a bit like this:
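#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

# binary => 1 allows embedded newlines in fields; eol sets the output line ending
my $csv = Text::CSV->new( { binary => 1, eol => "\n" } );
while ( my $row = $csv->getline( \*ARGV ) ) {
    # replace every newline inside each cell with a space
    s/\n/ /g for @$row;
    $csv->print( \*STDOUT, $row );
}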
This will take files piped in or specified on the command line. That's what \*ARGV does: it's a special filehandle that lets you do basically what sed does:
somecommand.sh | myscript.pl
myscript.pl filename_to_process
The ARGV filehandle does either automagically. (You could explicitly open a file or use \*STDIN if you prefer.)
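For comparison, here is a minimal Python sketch of the same "files on the command line, or stdin" behaviour. It is not the Perl answer's method, just an analogue: it assumes the standard `fileinput` module as a stand-in for `\*ARGV`, and the names `clean_rows` and `main` are made up for illustration.

```python
import csv
import fileinput
import sys


def clean_rows(files=None):
    """Yield CSV rows with newlines inside cells replaced by spaces.

    With files=None, fileinput reads the paths in sys.argv[1:], or stdin
    when no paths are given (roughly what Perl's ARGV filehandle does).
    """
    with fileinput.input(files=files) as lines:
        # csv.reader pulls extra physical lines itself when a quoted
        # field spans more than one line.
        for row in csv.reader(lines):
            yield [cell.replace("\n", " ") for cell in row]


def main():
    # csv.writer re-quotes any cell that still needs quoting (commas, quotes).
    writer = csv.writer(sys.stdout, lineterminator="\n")
    writer.writerows(clean_rows())
```

Calling `main()` at the bottom of a script should then allow both `somecommand.sh | myscript.py` and `myscript.py filename_to_process`.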
I suspect that instead of removing the newline you actually want to replace it with a space. If your input file is as simple as it looks this should do it for you:
$ awk '{ORS=( (c+=gsub(/"/,"&"))%2 ? FS : RS )} 1' file
id,name
01,"this is with newline"
02,no newline
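The one-liner works by keeping a running count of double quotes: while the count is odd, the current line ended inside a quoted cell, so it is followed by a space instead of a newline. A rough Python sketch of the same idea is below. The function name `join_quoted_lines` is hypothetical, and like the awk version it assumes double quotes only appear as field delimiters or as doubled `""` escapes.

```python
def join_quoted_lines(lines, replacement=" "):
    """Yield logical CSV records, joining physical lines that end inside quotes."""
    quote_count = 0  # running total of double quotes seen so far
    pending = []     # physical lines belonging to the current record
    for line in lines:
        line = line.rstrip("\n")
        pending.append(line)
        quote_count += line.count('"')
        # An even total means every opened quote has been closed,
        # so the record is complete.
        if quote_count % 2 == 0:
            yield replacement.join(pending)
            pending = []
    if pending:
        # Unbalanced quotes at end of input: emit what is left.
        yield replacement.join(pending)
```

Unlike a real CSV parser, this never interprets the fields, so it leaves the quoting in the output untouched.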
How to do it with regex or with other terminal tools generically without knowing number of columns in advance?
I don't think a regex is the most appropriate approach, and it might end up being quite complicated. Instead, I think a separate program to process the files would be easier to maintain in the long term.
Since you're OK with any terminal tools, I've chosen python, and the code's below:
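#!/usr/bin/python3 -B
import csv
import sys

# newline='' is what the csv module expects when it is given a file object
with open(sys.argv[1], newline='') as csvfile:
    reader = csv.reader(csvfile)
    # csv.writer re-quotes cells that contain commas or quotes,
    # which a plain ','.join() would silently corrupt
    writer = csv.writer(sys.stdout, lineterminator='\n')
    for row in reader:
        stripped = [col.replace('\r\n', ' ').replace('\n', ' ') for col in row]
        writer.writerow(stripped)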
I think the code above is very straightforward and easy to understand, without a need for complicated regular expressions.
The input file here has the following contents:
id,name
01,"this is
with newline"
02,no newline
To prove it works, its output is reproduced below:
➜ ~ ./test.py input.csv
id,name
01,this is with newline
02,no newline
You could call the python script from some other program and feed filenames to it. You just need to add a minor update for the python program to write out files, if that's what you really need.
I've replaced the newlines with spaces to avoid a potentially unwanted concatenation (e.g. this iswith newline), but you can replace the newline with whatever you want, including the empty string ''.
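As a rough sketch of that "minor update", the function below writes the cleaned rows to a second file and takes the replacement string as a parameter. The name `strip_cell_newlines` and its arguments are assumptions made for illustration, not part of the script above.

```python
import csv


def strip_cell_newlines(src_path, dst_path, replacement=" "):
    """Copy a CSV file, replacing newlines inside cells with `replacement`."""
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        # csv.writer adds quotes back wherever a cell still needs them.
        writer = csv.writer(dst, lineterminator="\n")
        for row in reader:
            writer.writerow(
                [cell.replace("\r\n", replacement).replace("\n", replacement)
                 for cell in row]
            )
```

Something like `strip_cell_newlines("input.csv", "output.csv", replacement="")` would then concatenate the pieces of each cell instead of separating them with a space.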