Processing a headered CSV file with GNU parallel

Question

Is it possible to invoke GNU parallel so that it repeats the first line of the original input on the STDIN of each child job?

I have a CSV file that contains a header line at the top. For example:

> cat large.csv
id,count
abc,123
def,456

I have a tool that can extract columns by name rather than position:

> csv_extract large.csv count
123
456

I can sum the values serially as:

> csv_extract large.csv count | awk '{ SUM += $1 } END { print SUM }'
579

The actual file I have is much larger, and the operation is more complex than summing, but the same principles apply. I'd like to use GNU parallel to process the file, but I don't know whether it can be told to repeat the CSV header for each job.

Ideally I could run the operation with something like:

> cat large.csv | parallel --pipe --repeat-first-line "csv_extract /dev/stdin count | awk '{ SUM += \$1 } END { print SUM }'"
579

I've made up the --repeat-first-line option above to represent the functionality I cannot figure out. I've watched the YouTube videos, and read the man page, but I'm just not able to see how it can be done, if at all possible.

Thanks!

danboo

Ole Tange · Accepted Answer

Today you can use --skip-first-line and add the header back with echo:

seq 10 | parallel --skip-first-line --pipe '(echo hea,der; cat) | my_prog'
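To make the `(echo hea,der; cat)` part concrete, here is a minimal sketch of what it does to a single headerless chunk, run without GNU parallel. The `with_header` function and the `header` variable are illustrative names, not part of any tool, and the sample rows are the ones from the question.

```shell
# Header line to put back on top of every chunk (sample value from the question).
header='id,count'

# Print the header first, then pass stdin through unchanged.
# This is the same idea as "(echo hea,der; cat)" inside each job.
with_header() {
    echo "$header"
    cat
}

# One headerless chunk, roughly as a job would receive it.
chunk_out=$(printf 'abc,123\ndef,456\n' | with_header)
echo "$chunk_out"
```

In the answer's command the header text is hard-coded; it could equally be captured from the file first with `header=$(head -n 1 large.csv)`.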

In a future version you will have the option '--header', which will be a regexp that matches the end of your header (e.g. '\n' for one line, '\n.*\n' for two lines, or '---' for everything up to and including the first ---).

-- Edit --

The newest version of GNU Parallel can now do:

parallel --pipe --header : my_program
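Applied to the question's example, this could look roughly like the sketch below. Since `csv_extract` is the asker's own tool, a small awk script (`sum_count.awk`, a made-up name) stands in for it here, and the input is the three-line sample from the question. Each job prints only the sum for its own chunk, so a final awk pass adds the partial sums together. The serial fallback branch is an addition for machines without GNU parallel, not something the answer describes.

```shell
#!/bin/sh
# Sketch: per-chunk sums with a repeated header, then one final reduction.
workdir=$(mktemp -d)

# Sample input from the question.
printf 'id,count\nabc,123\ndef,456\n' > "$workdir/large.csv"

# Stand-in for csv_extract + awk: locate the "count" column in the header
# line, then sum that column over the remaining lines.
cat > "$workdir/sum_count.awk" <<'EOF'
NR == 1 { for (i = 1; i <= NF; i++) if ($i == "count") col = i; next }
{ sum += $col }
END { print sum + 0 }
EOF

if command -v parallel >/dev/null 2>&1; then
    # --header : repeats the first line at the top of every chunk.
    partials=$(parallel --pipe --header : awk -F, -f "$workdir/sum_count.awk" < "$workdir/large.csv")
else
    # Serial fallback so the sketch still runs without GNU parallel.
    partials=$(awk -F, -f "$workdir/sum_count.awk" < "$workdir/large.csv")
fi

# Each job prints one partial sum; add them up for the final answer.
total=$(printf '%s\n' "$partials" | awk '{ s += $1 } END { print s + 0 }')
echo "$total"
rm -rf "$workdir"
```

On a file this small there is only one chunk, so the final pass has a single number to add. On a large file there would be one partial sum per chunk, which is why the reduction step is needed.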
