Is it possible to invoke GNU Parallel so that it repeats the first line of the original input on the STDIN of each child job?
I have a CSV file that contains a header line at the top. For example:
> cat large.csv
id,count
abc,123
def,456
I have a tool that can extract columns by name rather than position:
> csv_extract large.csv count
123
456
I can sum the values serially as:
> csv_extract large.csv count | awk '{ SUM += $1 } END { print SUM }'
579
The actual file is much larger and the operation more complex than summing, but the same principles apply. I'd like to use GNU Parallel to process the file, but I don't know whether it can be told to repeat the CSV header for each job.
Ideally I could run the operation with something like:
> cat large.csv | parallel --pipe --repeat-first-line "csv_extract /dev/stdin count | awk '{ SUM += \$1 } END { print SUM }'"
579
I've made up the --repeat-first-line option above to represent the functionality I cannot figure out. I've watched the YouTube videos, and read the man page, but I'm just not able to see how it can be done, if at all possible.
Thanks!
Today you can use --skip-first-line and add the header using echo:
seq 10 | parallel --skip-first-line --pipe '(echo hea,der; cat) | my_prog'
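To see what each job receives with this approach, the grouping can be tried on a single chunk without GNU Parallel. This is only a minimal sketch: the header text and the two data rows are taken from the sample file in the question, and the variable name is made up for illustration.

```shell
# Simulate one chunk of data rows (no header) arriving on stdin,
# as a child job would see it after the first line has been skipped.
# The subshell prints the header first, then cat passes the chunk through.
chunk_with_header=$(printf 'abc,123\ndef,456\n' | (echo 'id,count'; cat))

# Show what the downstream program would read.
echo "$chunk_with_header"
```

The parentheses matter: they make echo and cat share one stdout, so the program after the pipe reads the header followed by the chunk as a single stream.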
A future version will add a '--header' option that takes a regexp matching the end of your header (e.g. '\n' for one line, '\n.*\n' for two lines, or '---' for everything up to and including the first ---).
-- Edit --
The newest version of GNU Parallel can now do:
parallel --pipe --header : my_program
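Applied to the example in the question, this could look roughly like the sketch below. It assumes GNU Parallel is installed and recent enough to support --header. Since csv_extract is the asker's own tool, a small awk script stands in for it here; that script, the temporary directory, and the variable names are assumptions made only so the example runs on its own. Note the \$1 inside the double-quoted command, which keeps the calling shell from expanding it before awk sees it.

```shell
# Work in a temporary directory with a copy of the sample data.
tmp=$(mktemp -d)
printf 'id,count\nabc,123\ndef,456\n' > "$tmp/large.csv"

# Hypothetical stand-in for csv_extract FILE COLUMN:
# look up COLUMN in the header row, then print that column for every data row.
cat > "$tmp/csv_extract" <<'EOF'
#!/bin/sh
awk -F, -v name="$2" '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
  { print $col }
' "$1"
EOF
chmod +x "$tmp/csv_extract"

# Each job gets the header line plus a block of rows and prints a partial sum;
# the final awk adds the partial sums together.
total=$(parallel --pipe --header : \
    "$tmp/csv_extract /dev/stdin count | awk '{ s += \$1 } END { print s }'" \
    < "$tmp/large.csv" | awk '{ s += $1 } END { print s }')
echo "$total"

rm -rf "$tmp"
```

With a file this small there is likely only one job, but on a large input every job prints its own partial result, so a final combining step like the last awk is needed for any operation that aggregates across the whole file.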