
How to remove newlines inside csv cells using regex/terminal tools?

I have a CSV file where some of the cells contain newline characters. For example:

id,name 
01,"this is
with newline"
02,no newline 

I want to remove all the newline characters inside cells.

How to do it with regex or with other terminal tools generically without knowing number of columns in advance?

Mert Nuhoglu asked Nov 30 '15 08:11


3 Answers

This is actually a harder problem than it looks and, in my opinion, regex isn't the right tool for it. Because you're dealing with quoted/escaped strings that span multiple 'lines', you end up with a complicated, hard-to-read regex. (It's not impossible, just messy.)

I would suggest instead - use a parser. Perl has one in Text::CSV and it goes a bit like this:

#!/usr/bin/env perl

use strict;
use warnings;

use Text::CSV;

my $csv = Text::CSV->new( { binary => 1, eol => "\n" } );

while ( my $row = $csv->getline( \*ARGV ) ) {
    s/\n/ /g for @$row;
    $csv->print( \*STDOUT, $row );
}

This reads from files named on the command line or from piped input. That is what \*ARGV does: it's a special filehandle that lets you do basically what sed does:

somecommand.sh | myscript.pl
myscript.pl filename_to_process

The ARGV filehandle does either automagically. (You could explicitly open a file or use \*STDIN if you prefer.)
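
For comparison, Python's standard library has a rough analogue of the \*ARGV handle in fileinput.input(), which reads the files named on the command line, or standard input when none are given. The following is only a minimal sketch of the same parser-based idea; the names flatten_rows and main are hypothetical and not part of the answer above.

```python
import csv
import fileinput
import sys


def flatten_rows(lines):
    """Yield parsed CSV rows with newlines inside cells replaced by spaces.

    `lines` is any iterable of text lines; csv.reader pulls further lines
    on its own when a quoted field spans a line break.
    """
    for row in csv.reader(lines):
        yield [cell.replace("\n", " ") for cell in row]


def main():
    # fileinput.input() reads the files in sys.argv[1:], or stdin if none.
    writer = csv.writer(sys.stdout, lineterminator="\n")
    writer.writerows(flatten_rows(fileinput.input()))
```

Calling main() at the bottom of a script should give the same two invocation styles shown above (piped input or a filename argument).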

Sobrique answered Oct 19 '22 11:10


I suspect that instead of removing the newline you actually want to replace it with a space. If your input file is as simple as it looks this should do it for you:

$ awk '{ORS=( (c+=gsub(/"/,"&"))%2 ? FS : RS )} 1' file
id,name
01,"this is with newline"
02,no newline
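
The one-liner above appears to work by keeping a running count of double quotes: while the count is odd, the current line ends inside a quoted cell, so it is joined to the next line with a space instead of a newline. A rough Python rendering of that quote-parity idea, assuming quotes only ever appear as CSV delimiters or doubled escapes (the function name join_broken_lines is hypothetical):

```python
def join_broken_lines(lines):
    """Join physical lines that end inside a quoted CSV cell.

    Mirrors the quote-counting trick: an odd running total of double
    quotes means the current line stops in the middle of a quoted cell.
    """
    quotes_seen = 0
    pieces = []
    for line in lines:
        line = line.rstrip("\n")
        quotes_seen += line.count('"')
        # Odd total: still inside a cell, so join with a space.
        # Even total: the record is complete, so end the line.
        pieces.append(line + (" " if quotes_seen % 2 else "\n"))
    return "".join(pieces)
```

An escaped quote written as "" adds two to the count, so it does not disturb the parity.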

Ed Morton answered Oct 19 '22 10:10


How to do it with regex or with other terminal tools generically without knowing number of columns in advance?

I don't think a regex is the most appropriate approach; it would likely end up being quite complicated. Instead, I think a separate program to process the files would be easier to maintain in the long term.

Since you're OK with any terminal tools, I've chosen Python, and the code is below:

#!/usr/bin/python3 -B

import csv
import sys

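# newline='' is what the csv module recommends for files it reads
with open(sys.argv[1], newline='') as csvfile:
    reader = csv.reader(csvfile)
    # csv.writer re-quotes cells that contain commas or quotes,
    # which a plain ','.join() would silently break
    writer = csv.writer(sys.stdout, lineterminator='\n')
    for row in reader:
        stripped = [col.replace('\r\n', ' ').replace('\n', ' ') for col in row]
        writer.writerow(stripped)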

I think the code above is very straightforward and easy to understand, without a need for complicated regular expressions.

The input file here has the following contents:

id,name
01,"this is
with newline"
02,no newline

Running the script on that file produces the output below:

➜  ~  ./test.py input.csv
id,name
01,this is with newline
02,no newline

You could call the Python script from some other program and feed filenames to it. If you need it to write files instead of printing to the terminal, that only takes a minor change to the script.

I've replaced the newlines with spaces to avoid a potentially unwanted concatenation (e.g. this iswith newline), but you can replace the newline with whatever you want, including the empty string ''.
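
As a rough illustration of both points (writing to a file and choosing the replacement), here is a minimal sketch. The function name clean_csv and its parameters are hypothetical, and it assumes the whole cleaned file should go to a second path.

```python
import csv


def clean_csv(src_path, dst_path, replacement=" "):
    """Copy src_path to dst_path, replacing newlines inside cells."""
    # newline="" on both files leaves line-ending handling to the csv module.
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        for row in csv.reader(src):
            writer.writerow(
                [
                    cell.replace("\r\n", replacement).replace("\n", replacement)
                    for cell in row
                ]
            )
```

For example, clean_csv("input.csv", "output.csv", replacement="") would concatenate the broken lines directly instead of inserting a space.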

code_dredd answered Oct 19 '22 09:10