
Read file line by line with GNU parallel

I have a file that looks like the following:

chr1  1  5  ID1 HK1
chr2  2  8  ID2 HK3
...

I want to extract all lines for each ID and write them to a file named after that ID. The following code works fine, but I would like to parallelize it with GNU parallel, because it is too slow on a single core (and I have 72):

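while IFS= read -r line
do
    # Column 4 is the ID; columns 1-3 are written to the ID's file.
    a=$(echo "$line" | cut -f 4)
    b=$(echo "$line" | cut -f -3)
    # Quoting "$b" keeps the tab separators intact.
    echo "$b" >> "$a.bed"
done < "file"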

I did this before with grep, but since some of the files have more than 800M lines, that was too slow as well. What is the right way to pass this to GNU parallel? Thank you!

tirichl asked May 30 '26 09:05


2 Answers

It turns out GNU parallel has an option that reads a file line by line and passes each line as an argument: parallel -a. I changed my code to:

parallel -j 60 -a temp ./make_file.sh {}
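
The answer does not show make_file.sh. As a rough sketch only, assuming it does the same work as the loop body in the question (tab-separated input, ID in column 4), its logic might look like the function below. The name make_file and the scratch-directory demo are illustrative, not taken from the original.

```shell
# Hypothetical stand-in for make_file.sh (the real script is not shown above).
# Takes one tab-separated input line as its only argument.
make_file() {
    line=$1
    id=$(printf '%s\n' "$line" | cut -f 4)            # column 4: the ID
    printf '%s\n' "$line" | cut -f -3 >> "$id.bed"    # append columns 1-3
}

# Demo on the two sample lines from the question, in a scratch directory.
cd "$(mktemp -d)" || exit 1
make_file "$(printf 'chr1\t1\t5\tID1\tHK1')"
make_file "$(printf 'chr2\t2\t8\tID2\tHK3')"
```

Saved as a script, the function body would be the whole file. Because many jobs then append to the same files at once, lines within a .bed file may end up in a different order than in the input.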
tirichl answered Jun 02 '26 20:06


If you have 800M lines, I think you need something faster than running a job for each line.

So how about:

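# Sort on the ID column only (field 4) so all lines for one ID are adjacent,
# then write each group of lines sharing that ID to its own numbered file.
sort --parallel=100 -k4,4 input.tsv |
  parallel --pipe --group-by 4 --colsep '\s+' -kN1 'cat > num{#}.bed'

# Rename each numbered file after the ID found in column 4 of its first line.
newname() {
    head -n1 "$1" | parallel --colsep '\s+' mv "$1" {4}.bed
}
export -f newname
ls num*bed | parallel newname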

On my system this does 100M lines in 15 minutes.
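
To illustrate why sorting first helps, here is a small serial sketch of the same sort-then-group idea. It does not use GNU parallel and makes no claim about how --group-by works internally. It only shows that once the input is sorted by column 4, an output file needs to be opened just once per ID. The third sample line is made up for the demo.

```shell
# Serial illustration of sort-then-group; GNU parallel is not used here.
cd "$(mktemp -d)" || exit 1
# Two lines from the question plus one invented line that shares ID1.
printf 'chr1\t1\t5\tID1\tHK1\nchr2\t2\t8\tID2\tHK3\nchr3\t4\t9\tID1\tHK2\n' > input.tsv

prev=
sort -k4,4 input.tsv | while IFS= read -r line; do
    id=$(printf '%s\n' "$line" | cut -f 4)
    if [ "$id" != "$prev" ]; then
        exec 3> "$id.bed"      # ID changed: open the next output file once
        prev=$id
    fi
    printf '%s\n' "$line" >&3  # whole line, as in the pipeline above
done
```

After this runs, ID1.bed holds the two ID1 lines and ID2.bed holds the single ID2 line.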

Ole Tange answered Jun 02 '26 21:06