
while loops in parallel with input from a split file

I am stuck on this. I have a while-read loop in my code that takes very long, and I would like to run it on multiple processors. I'd like to split the input file and run 14 loops (because I have 14 threads), one for each split file, in parallel. The problem is that I don't know how to tell each while loop which file to get and work with.

For example, in a regular while-read loop I would code:

while read line
do
   <some code>
done < input_file   # or a variable

But in this case I would like to split the above input file into 14 files and run 14 while loops in parallel, one for each split file. I tried:

split -n 14 input_file
find . -name "xa*" | \
        parallel -j 14 | \
        while read line
        do
        <lot of stuff>
        done

also tried

split -n 14 input_file
function loop {
            while read line
            do
                <lot of stuff>
            done
}
export -f loop
parallel -j 14 ::: loop 

But in neither case was I able to tell parallel which file should be the input to each loop, i.e. to make it "take each of those xa* files and feed each one to an individual loop in parallel".
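For reference, a minimal sketch of the pattern both attempts were missing (file names, chunk counts, and the worker body here are placeholders, not the real pipeline): the loop goes into a function that takes the chunk file as its positional argument, and each chunk is then handed to one invocation of that function:

```shell
# Sketch: each split chunk becomes the argument of one worker.
set -eu
tmp=$(mktemp -d); cd "$tmp"
printf '%s\n' A B C D E F G > input_file
split -n l/3 input_file chunk_            # use -n l/14 for 14 chunks

work() {                                  # $1 = the chunk this worker owns
    while read -r line; do
        echo "$1 -> $line"
    done < "$1"                           # the loop's input is the argument
}

for f in chunk_*; do work "$f" & done     # one background worker per chunk
wait                                      # wait for all workers to finish
```

With GNU parallel, export the function first (`export -f work`) and replace the for-loop with `parallel -j 14 work ::: chunk_*`.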

An example of the input file (a list of strings)

AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580

EDIT

This is the code. The output is a table (741100 lines) with some statistics on DNA sequence alignments made previously. The loop takes an input_file (no broken lines; varies from 500 to ~45000 lines, 800 Kb) with DNA sequence accessions, reads it line by line, and looks up the corresponding full taxonomy for each accession in a databank (~45000 lines). Then it does a few sums/divisions. The output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):

Rate of taxonomies for this sequence in %:        KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment     num_ocurrences_in_databank    %alingment/databank
D_6__Bacillus_atrophaeus   50%     1       20      5%
D_6__Bacillus_amyloliquefaciens    50%     1       154     0.649351%



$ head input_file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580

Two additional files are also used inside the loop. They are not the loop input. 1) A file called alnout_file that only serves to find how many hits (or alignments) a given sequence had against the databank. It was also made outside this loop beforehand. It can vary in the number of lines from hundreds to thousands. Only columns 1 and 2 matter here: column 1 is the name of a sequence and column 2 holds the names of all sequences it matched in the databank. It looks like this:

$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0   431     0       0       1       431     1       431     -1      0
KF625180.1.1799 KP143082.1.1457 99.3    431     1       2       1       431     1       429     -1      0
KP143082.1.1457 KF625180.1.1799 99.3    431     1       2       1       429     1       431     -1      0    

2) a databank .tsv file containing ~45000 taxonomies corresponding to the DNA sequences, one taxonomy per line:

$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus

So, given sequence KF625180.1.1799: I previously aligned it against a databank containing ~45000 other DNA sequences and got an output which has the accessions of all sequences it matched. What the loop does is find the taxonomies for all those sequences and calculate the "statistics" I mentioned previously. The code does this for every DNA sequence accession I have.

TAXONOMY=path/taxonomy.file.tsv
while read -r line
do
        # find hits of this accession in the alignment output (column 2)
        hits=$(grep "$line" alnout_file | cut -f 2)
        completename=$(grep "$line" "$TAXONOMY" | sed 's/D_0.*D_4/D_4/g')
        printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
        printf "Taxonomy\t%%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%%alingment/databank\n"

        # find hits and calculate the frequency (%) of each taxonomy in the alignment output
        # ex.: Bacillus_subtilis 33
        freqHits=$(grep "$hits" "$TAXONOMY" | \
                cut -f 2 | \
                awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
                sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
                sort -k2 -hr)

        # print the frequency of each taxonomy in the databank
        freqBank=$(while read -r tax; do grep -c "$tax" "$TAXONOMY"; done < <(echo "$freqHits" | cut -f 1))

        # print columns with taxonomy and calculations
        paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'

done < input_file

It is a lot of greps and parsing, so it takes about ~12 h on one processor to do this for all 45000 DNA sequence accessions. So I would like to split input_file and run it on all 14 processors I have, because that would reduce the time spent. Thank you all for being so patient with me =)

Asked Dec 20 '18 by Leonardo

2 Answers

You are looking for --pipe. In this case you can even use the optimized --pipepart (GNU parallel version >20160621):

export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read -r line
do
        # find hits of this accession in the alignment output (column 2)
        hits=$(grep "$line" alnout_file | cut -f 2)
        completename=$(grep "$line" "$TAXONOMY" | sed 's/D_0.*D_4/D_4/g')
        printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
        printf "Taxonomy\t%%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%%alingment/databank\n"

        # find hits and calculate the frequency (%) of each taxonomy in the alignment output
        # ex.: Bacillus_subtilis 33
        freqHits=$(grep "$hits" "$TAXONOMY" | \
                cut -f 2 | \
                awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
                sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
                sort -k2 -hr)

        # print the frequency of each taxonomy in the databank
        freqBank=$(while read -r tax; do grep -c "$tax" "$TAXONOMY"; done < <(echo "$freqHits" | cut -f 1))

        # print columns with taxonomy and calculations
        paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'

done
}
export -f doit
parallel -a input_file --pipepart doit

This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit on its stdin, and run ncpu jobs in parallel.

That said, I think your real problem is spawning too many programs: if you rewrite doit in Perl or Python, I expect you will see a major speedup.
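To illustrate that point with a sketch (the sample records here are invented; the real files are taxonomy.file.tsv and input_file): a single awk pass can load the whole taxonomy into an associative array once, turning each lookup into a constant-time hash access instead of one grep over ~45000 lines per accession:

```shell
# Sketch: replace per-accession greps with one awk pass over both files.
set -eu
tmp=$(mktemp -d); cd "$tmp"
printf 'KF625180.1.1799\tD_6__Bacillus_atrophaeus\n'        >  taxonomy.tsv
printf 'KP143082.1.1457\tD_6__Bacillus_amyloliquefaciens\n' >> taxonomy.tsv
printf 'KF625180.1.1799\n' > input_file

awk -F'\t' '
    NR==FNR   { tax[$1] = $2; next }        # 1st file: accession -> taxonomy
    $1 in tax { print $1 "\t" tax[$1] }     # 2nd file: constant-time lookup
' taxonomy.tsv input_file
```

The same idea carries over to counting hits and databank occurrences: build the counts in arrays during one pass rather than re-scanning the files inside the loop.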

Answered Oct 02 '22 by Ole Tange


As an alternative, I threw together a quick test.

#!/usr/bin/env bash
mkfifo PIPELINE             # create a single queue
cat "$1" > PIPELINE &       # supply it with records
{ declare -i cnt=0 max=14
  while (( ++cnt <= max ))  # spawn loop creates worker jobs
  do printf -v fn "%02d" $cnt
     while read -r line     # each work loop reads common stdin...
     do echo "$fn:[$line]"
        sleep 1
     done >$fn.log 2>&1 &   # these run in background in parallel
  done                      # this one exits
} < PIPELINE                # *all* read from the same queue
wait
cat [0-9][0-9].log

Doesn't need split, but does need a mkfifo.

Obviously, change the code inside the internal loop.
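A stripped-down version of the same queue idea (worker count and records are placeholders) shows why no split is needed: all background readers drain one fifo, and each record is consumed by exactly one of them, though which worker gets which record is nondeterministic:

```shell
# Sketch: two workers share one fifo; each record is read by one worker.
set -eu
tmp=$(mktemp -d); cd "$tmp"
mkfifo queue
printf '%s\n' rec1 rec2 rec3 rec4 > queue &   # writer blocks until readers open
{ for w in 1 2; do
      while read -r line; do
          echo "worker$w got: $line"
      done >> out.log &                       # workers inherit the shared stdin
  done
} < queue
wait                                          # writer EOF ends both read loops
cat out.log
```

Four lines come out in total, one per record; the split of work between the two workers depends on scheduling.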

Answered Oct 02 '22 by Paul Hodges