Read file line by line with GNU parallel

Question

I have a file that looks like the following:

chr1  1  5  ID1 HK1
chr2  2  8  ID2 HK3
...

I want to extract all lines for each ID and write them to a file named after that ID. The following code works fine, but I would like to parallelize it with GNU parallel, because it is too slow on a single core (and I have 72):

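while IFS= read -r line
do
    # field 4 (the ID) names the output file
    a=$(echo "$line" | cut -f 4- | cut -f -1)
    # fields 1-3 are the content to write
    b=$(echo "$line" | cut -f -3)
    # $b is unquoted, so the tabs between fields become single spaces
    echo $b >> "$a.bed"
done < "file"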

I did this before with grep, but since some of the files have >800M lines, that was too slow as well. How would I pass this to GNU parallel the right way? Thank you!

tirichl · Accepted Answer

Turns out GNU parallel has an option to read a file line by line and pass the line as an argument: parallel -a. I changed my code to:

parallel -j 60 -a temp ./make_file.sh {}
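The answer does not show make_file.sh. As a rough sketch only, assuming the input is tab-separated (as the cut -f calls in the question imply) and that parallel hands the whole line over as the single argument $1, such a script might look like this. The function name append_line is hypothetical, and unlike the loop in the question it keeps the tabs between the three output columns:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of make_file.sh; the real script is not shown in the answer.
# Assumes tab-separated input with the ID in column 4.

append_line() {
    line=$1
    # Column 4 is the ID that names the output file.
    id=$(printf '%s\n' "$line" | cut -f 4)
    # Columns 1-3 are the content to keep.
    coords=$(printf '%s\n' "$line" | cut -f 1-3)
    # Append, so repeated calls for the same ID accumulate in one file.
    printf '%s\n' "$coords" >> "$id.bed"
}

# When run as a script by parallel, the line arrives as the first argument.
if [ "$#" -gt 0 ]; then
    append_line "$1"
fi
```

Because the jobs run concurrently, lines for the same ID may end up in a different order than in the input file.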

Ole Tange · Answer

If you have 800M lines, I think you need something faster than running a job for each line.

So how about:

sort --parallel=100 -k4 input.tsv |
  parallel --pipe --group-by 4 --colsep '\s+' -kN1 'cat > num{#}.bed'

newname() {
    head -n1 "$1" | parallel --colsep '\s+' mv "$1" {4}.bed
}
export -f newname
ls num*bed | parallel newname

On my system this does 100M lines in 15 minutes.
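After a split like this, one quick sanity check is to compare line counts. The sketch below is not part of the answer: check_split is a hypothetical helper, and it assumes the output directory holds only the .bed files produced by the split.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the number of lines in the input file with
# the total number of lines across all .bed files in a directory.
# Assumes the directory contains only the files written by the split.

check_split() {
    input=$1
    outdir=$2
    # $(( ... )) strips the padding some wc implementations add.
    expected=$(( $(wc -l < "$input") ))
    actual=$(( $(cat "$outdir"/*.bed | wc -l) ))
    if [ "$expected" -eq "$actual" ]; then
        echo "OK: $actual lines"
    else
        echo "MISMATCH: input has $expected lines, output has $actual" >&2
        return 1
    fi
}
```

For example, check_split input.tsv . would check the files in the current directory. Matching counts only show that no lines were lost or duplicated, not that each line landed in the right file.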
