I have a file that looks like the following:
chr1 1 5 ID1 HK1
chr2 2 8 ID2 HK3
...
I want to extract all lines for each ID and write them to a file named after that ID. The following code works, but I would like to parallelize it with GNU parallel, because it is too slow on a single core (and I have 72):
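while IFS= read -r line
do
    # column 4 (the ID) names the output file
    a=$(echo "$line" | cut -f 4)
    # columns 1-3 are the fields to keep
    b=$(echo "$line" | cut -f -3)
    # $b is unquoted, so its tabs are written as single spaces
    echo $b >> "$a.bed"
done < "file"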
I did this before with grep, but since some of the files have >800M lines, that was too slow as well. How would I pass this to GNU parallel the right way? Thank you!
Turns out GNU parallel has an option to read a file line by line and pass the line as an argument: parallel -a.
I changed my code to:
parallel -j 60 -a temp ./make_file.sh {}
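make_file.sh itself is not shown here. A minimal sketch of what it might contain, assuming the input is tab-separated and the whole line arrives as a single argument (the function name append_line is hypothetical, and unlike the loop in the question it writes tabs rather than spaces):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of make_file.sh; the original script is not shown.
# Assumes tab-separated input, with the whole line passed as one argument.
append_line() {
    local line=$1 chrom start end id rest
    # Split the line on tabs with shell builtins instead of spawning cut
    IFS=$'\t' read -r chrom start end id rest <<< "$line"
    # Append the first three columns to the file named after column 4
    printf '%s\t%s\t%s\n' "$chrom" "$start" "$end" >> "$id.bed"
}
```

The script would then end with append_line "$1". This still starts one job per input line, so it only removes the cut subprocesses, not the per-line overhead.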
If you have 800M lines, I think you need something faster than running a job for each line.
So how about:
sort --parallel=100 -k4 input.tsv |
parallel --pipe --group-by 4 --colsep '\s+' -kN1 'cat > num{#}.bed'
newname() {
    # rename numN.bed after the ID in column 4 of its first line
    head -n1 "$1" | parallel --colsep '\s+' mv "$1" {4}.bed
}
export -f newname
ls num*bed | parallel newname
On my system this does 100M lines in 15 minutes.
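For comparison, the same sort-then-group idea can be sketched in a single process with sort and awk instead of GNU parallel. This is a different technique from the commands above, not a replacement for them: split_by_id is a hypothetical helper, it assumes whitespace-separated columns with the ID in column 4, it keeps only the first three columns (as the loop in the question does), and I have not timed it on large inputs.

```shell
#!/usr/bin/env bash
# Sketch only: split_by_id is a hypothetical helper, not part of the answer above.
# Assumes whitespace-separated columns with the ID in column 4.
split_by_id() {
    # A stable sort on column 4 makes each ID's lines contiguous while
    # keeping their input order, so awk needs only one output file open.
    LC_ALL=C sort -s -k4,4 "$1" |
    awk '{
        out = $4 ".bed"
        if (out != prev) {              # a new ID starts here
            if (prev != "") close(prev) # release the previous file
            prev = out
        }
        print $1, $2, $3 > out          # first three columns, space-separated
    }'
}
```

Called as split_by_id input.tsv, it writes one ID.bed file per distinct ID into the current directory and overwrites any existing file with the same name.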