
Processing a headered CSV file with GNU parallel

Is it possible to invoke gnu parallel in a way that it would repeat the first line of original input to the STDIN of each child job?

I have a CSV file that contains a header line at the top. For example:

> cat large.csv
id,count
abc,123
def,456

I have a tool that can extract columns by name rather than position:

> csv_extract large.csv count
123
456

I can sum the values serially as:

> csv_extract large.csv count | awk '{ SUM += $1 } END { print SUM }'
579

The actual file is much larger, and the operation is more complex than summing, but the same principles apply. I'd like to use GNU parallel to process the file, but I don't know whether it can be told to repeat the CSV header for each job.

Ideally I could run the operation with something like:

> cat large.csv | parallel --pipe --repeat-first-line "csv_extract /dev/stdin count | awk '{ SUM += \$1 } END { print SUM }'"
579

I've made up the --repeat-first-line option above to represent the functionality I cannot figure out. I've watched the YouTube videos, and read the man page, but I'm just not able to see how it can be done, if at all possible.

Thanks!

danboo, asked Nov 04 '11 20:11


1 Answer

Today you can use --skip-first-line and add the header back with echo:

seq 10 | parallel --skip-first-line --pipe '(echo hea,der; cat) | my_prog'
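
The `(echo header; cat)` idiom simply puts a header line back in front of a headerless chunk before the program reads it. As a rough sketch of that idea without GNU Parallel itself, the following splits the question's sample file by hand and re-attaches the header to each chunk. The `sum_count` function is a hypothetical awk stand-in for the asker's `csv_extract` pipeline, and the chunk size of one line is chosen only to force more than one chunk:

```shell
work=$(mktemp -d)
printf 'id,count\nabc,123\ndef,456\n' > "$work/large.csv"

# Hypothetical stand-in for "csv_extract ... count | awk ...":
# locate the "count" column from the header line, then sum that column.
sum_count() {
  awk -F, 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "count") col = i; next }
           { sum += $col }
           END { print sum + 0 }'
}

# Remember the header, then split the remaining lines into 1-line chunks.
header=$(head -n 1 "$work/large.csv")
tail -n +2 "$work/large.csv" | split -l 1 - "$work/chunk."

# Put the header back in front of every chunk, as (echo header; cat) does.
partials=$(for chunk in "$work"/chunk.*; do
  { printf '%s\n' "$header"; cat "$chunk"; } | sum_count
done)

# Each chunk yields a partial sum; combine them in a final step.
total=$(printf '%s\n' "$partials" | awk '{ t += $1 } END { print t + 0 }')
echo "$total"
rm -rf "$work"
```

With the sample data this prints 579. Note that every chunk produces its own partial result, so a final combining step is needed however the chunks are dispatched.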

In a future version you will have the option '--header', which will be a regexp that matches the end of your header (e.g. '\n' for one line, '\n.*\n' for two lines, or '---' for everything up to and including the first ---).

-- Edit --

The newest version of GNU Parallel can now do:

parallel --pipe --header : my_program
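
Applied to the CSV from the question, this might look like the sketch below. It assumes an awk script (`sum_count.awk`, a hypothetical stand-in for `csv_extract`) that finds the `count` column from the header, and it falls back to a plain serial run when GNU Parallel is not on the PATH, purely so the snippet stays runnable anywhere:

```shell
work=$(mktemp -d)
printf 'id,count\nabc,123\ndef,456\n' > "$work/large.csv"

# Hypothetical stand-in for csv_extract: sum the column named "count".
# It needs the header line to find the column, hence --header below.
cat > "$work/sum_count.awk" <<'EOF'
NR == 1 { for (i = 1; i <= NF; i++) if ($i == "count") col = i; next }
{ sum += $col }
END { print sum + 0 }
EOF

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # --header : repeats the first input line at the start of each job's stdin.
  # Every job prints a partial sum, so a final awk adds them together.
  total=$(parallel --pipe --header : awk -F, -f "$work/sum_count.awk" \
            < "$work/large.csv" |
          awk '{ t += $1 } END { print t + 0 }')
else
  # Fallback (assumption for portability): process the file serially.
  total=$(awk -F, -f "$work/sum_count.awk" < "$work/large.csv")
fi

echo "$total"
rm -rf "$work"
```

For the sample file this prints 579. On a large input, GNU Parallel splits the body into several blocks, which is why the trailing awk that combines the per-job sums matters.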
Ole Tange, answered Sep 28 '22 18:09