I am a beginner with bash. I need some help in making this job more efficient.
while read line
do
echo "$line"
file="Species.$line"
grep -A 1 "$line" /project/ag-grossart/ionescu/DB/rRNADB/SILVA_123.1_SSURef_one_line.fasta > $file
done < species1
The file species contains about 100,000 species names. The file in which I am searching is 24 GB fasta (text) file.
The format of the large file is:
Domain;Phylum;Class;Order;Family;Genus;Species
AGCT----AGCT (50,000 characters per line)
Here is a sample of the species file (no empty lines in between)
Alkanindiges_illinoisensis
Alkanindiges_sp._JJ005
Alligator_sinensis
Allisonella_histaminiformans
'Allium_cepa'
Alloactinosynnema_album
Alloactinosynnema_sp._Chem10
Alloactinosynnema_sp._CNBC1
Alloactinosynnema_sp._CNBC2
Alloactinosynnema_sp._FMA
Alloactinosynnema_sp._MN08-A0205
Allobacillus_halotolerans
Allochromatium_truperi
Allochromatium_vinosum
Here is the first line of the large file:
HP451749.6.1794_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Pucciniomycotina;Pucciniomycetes;Pucciniales;Pucciniaceae;Puccinia;Puccinia_triticina.............................................................................-UC-U-G--G-U---------------------------
(this goes one for 50,000 characters per line)
Here are some more headers:
>EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
>X96499.1.1810_Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta;Embryophyta;Marchantiophyta;Jungermanniales;Calypogeia;Plagiochila_adiantoides
>AB034906.1.1763_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Citeromyces;Citeromyces_siamensis
>AY290717.1.1208_Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;Methanosarcinaceae;Methanohalophilus;Methanohalophilus_portucalensis_FDF-1
>EF164984.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_pulli
>AY291120.1.1477_Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Lampropedia;Lampropedia_hyalina
>EF164987.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
>JQ838073.1.1461_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS01
>EF164989.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
>JQ838076.1.1460_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS04
>AB035584.1.1789_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Tremellomycetes;Tremellales;Trichosporonaceae;Trichosporon;Trichosporon_debeurmannianum
>JQ838080.1.1457_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS11
>EF165015.1.1527_Bacteria;Firmicutes;Clostridia;Clostridiales;Family_XI;Tepidimicrobium;Clostridium_sp._PML3-1
>U85867.1.1424_Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Marinobacter;Marinobacter_sp.
>EF165044.1.1398_Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Methylobacteriaceae;Methylobacterium;Methylobacterium_sp._CBMB38
>U85870.1.1458_Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas_sp.
>EF165046.1.1380_Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Pantoea;Pantoea_sp._CBMB55
I need for each species a file containing all the matching sequences.
The code above works but in 16 hours it managed to do less than 2000 species.
I would like to run this in parallel to speed this up. Any other tips on making this search more efficient are welcome as well.
Thank
(1) Run grep on multiple files in parallel, in this case all files in a directory and its subdirectories. Add /dev/null to force grep to prepend the filename to the matching line, because you're gonna want to know what file matched. Adjust the number of process -P for your machine.
Replace df -h with the command you want to use, and replace head -1 and grep '/$' with the two commands you want to apply to it. The output of both will be displayed in your terminal, though it may be that the output of the former command is displayed after the latter.
You can use grep with -A n option to print N lines after matching lines. Using -B n option you can print N lines before matching lines. Using -C n option you can print N lines before and after matching lines.
A little trickier than I first thought since matched lines need to go to separate files - please post performance if you get the chance - this solution can be used in parallel too - the species list file can be chunked and/or the fasta file can be chunked and fed to parallel runs of the script
This takes about 1 minute on an Intel Xeon E5 with a 6GB fake data file checked for 10,000 species - but increasing the species list to 100,0000 even in chunks of 10,000 was problematic as I ran into disk issues with that many files being created and appended to in one directory - the problems began when the species list crossed 50,000 - this number will be different on other systems - I modified the script to create 100 subdirectories and limited each directory to 1000 files - this worked well and all 100,000 files were generated without having to chunk the species list or the 6GB datafile
Also to give you an idea of how fast grep is - it took 6 seconds to match 100,000 species in the 6GB file
specieslist=$1
nspecies=$(wc -l $specieslist|cut -f1 -d' ')
echo -e "grep $nspecies species from $specieslist\n"
grep -A1 -F -f $specieslist|
awk '
# skip context marker
/^--$/{next}
# process pair of lines
# first line is matching species header line
# species is semicolon-delimited field 7 of first line
# second line is sequence - both lines are written to a file with sanitized species name
{
split($0, flds, ";")
species=flds[7]
filekey=gensub(/\W/,".","g",species)
file="fastaout." filekey
if(!(filekey in outfiles)) {
outfiles[filekey]=file
printf("species \"%s\" outfile \"%s\" first match line %d: \"%s\"\n", species, file, NR, $0)
print >file
}
getline; print >>file
# close may be needed on systems where awk cannot juggle too many open files
close(outfile)
}
'
outfiles=(fastaout.*)
noutfiles=${#outfiles[*]}
echo -e "\ncreated $noutfiles fastaout.* files"
head -5 fastaout*
output and slightly modified test inputs follow - species list has some actual matches - fasta file sequence line prefixed with lowercased species to verify correctness and avoid matching species again
output
$ head out.*
==> out.Brachyspira_innocens <==
brachyspira_innocens.1:-UC-U-G--G-U---------------------------
brachyspira_innocens.2:-UC-U-G--G-U---------------------------
==> out.Methanohalophilus_portucalensis_FDF-1 <==
methanohalophilus_portucalensis_fdf-1:-UC-U-G--G-U---------------------------
==> out.Pucciniomycotina <==
pucciniomycotina:-UC-U-G--G-U---------------------------
species list
Allobacillus_halotolerans
Allochromatium_truperi
Allochromatium_vinosum
Methanohalophilus_portucalensis_FDF-1
Brachyspira_innocens
Pucciniomycotina
fasta file
HP451749.6.1794_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Pucciniomycotina;Pucciniomycetes;Pucciniales;Pucciniaceae;Puccinia;Puccinia_triticina;.............................................................................
pucciniomycotina:-UC-U-G--G-U---------------------------
>EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
brachyspira_innocens.1:-UC-U-G--G-U---------------------------
>X96499.1.1810_Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta;Embryophyta;Marchantiophyta;Jungermanniales;Calypogeia;Plagiochila_adiantoides
plagiochila_adiantoides:-UC-U-G--G-U---------------------------
>AB034906.1.1763_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Citeromyces;Citeromyces_siamensis
citeromyces_siamensis:-UC-U-G--G-U---------------------------
>AY290717.1.1208_Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;Methanosarcinaceae;Methanohalophilus;Methanohalophilus_portucalensis_FDF-1
methanohalophilus_portucalensis_fdf-1:-UC-U-G--G-U---------------------------
>EF164984.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_pulli
brachyspira_pulli:-UC-U-G--G-U---------------------------
>AY291120.1.1477_Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Lampropedia;Lampropedia_hyalina
lampropedia_hyalina:-UC-U-G--G-U---------------------------
>EF164987.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
brachyspira_alvinipulli:-UC-U-G--G-U---------------------------
>JQ838073.1.1461_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS01
streptomyces_sp._qls01:-UC-U-G--G-U---------------------------
>EF164989.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
brachyspira_alvinipulli:-UC-U-G--G-U---------------------------
>JQ838076.1.1460_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS04
streptomyces_sp._qls04:-UC-U-G--G-U---------------------------
>AB035584.1.1789_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Tremellomycetes;Tremellales;Trichosporonaceae;Trichosporon;Trichosporon_debeurmannianum
trichosporon_debeurmannianum:-UC-U-G--G-U---------------------------
>JQ838080.1.1457_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS11
streptomyces_sp._qls11:-UC-U-G--G-U---------------------------
>EF165015.1.1527_Bacteria;Firmicutes;Clostridia;Clostridiales;Family_XI;Tepidimicrobium;Clostridium_sp._PML3-1
clostridium_sp._pml3-1:-UC-U-G--G-U---------------------------
>U85867.1.1424_Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Marinobacter;Marinobacter_sp.
Marinobacter_sp.:-UC-U-G--G-U---------------------------
>EF165044.1.1398_Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Methylobacteriaceae;Methylobacterium;Methylobacterium_sp._CBMB38
methylobacterium_sp._cbmb38:-UC-U-G--G-U---------------------------
>U85870.1.1458_Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas_sp.
pseudomonas_sp.:-UC-U-G--G-U---------------------------
>EF165046.1.1380_Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Pantoea;Pantoea_sp._CBMB55
pantoea_sp._cbmb55:-UC-U-G--G-U---------------------------
>EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
brachyspira_innocens.2:-UC-U-G--G-U---------------------------
I would probably reach for something other than shell + grep for this.but certainly parallelizing it would be a big first step. Here's a bash4 + awk solution:
# read all 100,000 species names into a shell array
mapfile -t species <species1
# turn the names into a single big regular expression
regex=${species[0]}$(printf '|%s' "${species[@]:1}")
# use awk to print the matching lines into the respective files
awk -F';' '($7 ~ /^('"$regex"')$/) { print >"Species."$7 }' bigfile.txt
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With