I have the following (big) file with 30233088 strings:
head mystringfile.txt:
GAATGAACACGAAGAA
GAATGAACACGAAGAC
GAATGAACACGAAGAG
GAATGAACACGAAGCA
cat sequence.txt
AAATAGAGGGCGGTCCAGGCGTGTCGAAACACTGGGTCCAGGGCAAGAGCGGTTCGGGTGTCAGGAAAGCCCCCAAGGGGGTTCGCGCGGTTTGCAGTGAGGTAGAGGCCGGTGTATGGGTAGACAATTGGGGTCCCAAAGAAAAAGGCTCGTCCAACATCATAATAAACCCAAGCACGATAAAAAGCAAACGCAGACTTCAATAGGGTACGAGCAATTGTGGCAGGGTGCTCGCTGTCAGGGTTAGATCTTCTTGGAGTCGCGTCGCTCGGGGGGGCAAGGCCAACGTAAGATCGTGGCTGATCGCTGGCAATGCGGTCGGTTGGGTGGTCGCTAGTAGGGGCACGGCGGTCTCTTATGGCGTCGTAAAATGCGTCTCCAAAGCGAAAAGGGGCGGCAGACAAGTCACCGGGCAAGCTTAGAGGTCTGGGGCCCGTGGCTTTAGGGGAATGAACACGAAGACGCGAAACGAAGTCGTGTTTCTTGTTGGCTGTAGAGGGGAAAACCGTCTGGGGCGATCTGGCGTAGTAGTGCGTGTCTTGCAGTGAGCTCCCCGTCCGTAAGGATTCGCAGGAATCCTGCGTGAAGCTCGGTCGTCTCGGCCGTGTCTCGGGGTTTGATTGCGGGTTCAGATTGGAAAGGTCTCCTCGGGTCGTTTGCTGCATTTGCTCGCAACCCTGACGTGAAAGGGGTGAGCTGTCTCCAATCTGCCACGCTGGGTGTTGCGTCGTCAGTAAAAGACTTGGTCAAGCTGGGACCTCGCAAGATCGCGAGAGGGTTAAGCACAAAAGGTATGGCGAAGCTCCCGGGTGCTCTTGTGGCCACCCAGAATCATGGTGACGTAGGTTTTGCGAAGCCATCAAAAATTCAGGCGGCAAAACGAGCCAGTAGGGTCCTGGGCAGCTGGGCTTGTAGTGGGTAGGCGGCAAAACGCAAAGAATGAACACGAAGCAACTCCGTAGTGTGACGGGGGTTCTGACAAACGTCCTGCAAGAAGTTCGTCTTGGG
Each of these strings needs to be grepped in the sequence file to determine the position of the match, which I currently do as follows:
while read line; do grep -b -o "$line" sequence.txt >>sequence.txt.count; done<mystringfile.txt
Running the code like this takes a very long time and only uses part of a single thread, so how can I modify it (with parallel or xargs?) so that it runs on as many threads as I want to specify?
Using a shell loop to process text like this is the wrong approach. For every one of the 30233088 iterations over your input file you spawn a new grep process and open a new file descriptor to redirect to the output file, which costs a lot of performance and can even run you out of open file descriptors.
Use the right tool for the job: awk is your friend here. Since sequence.txt is just one huge pattern, as you say, you can slurp it into a variable and do a plain substring search with index(), as below. This has no memory overhead of storing the 30 million search strings in RAM:
awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n":"$1}' mystringfile.txt
This should already be considerably faster than your current approach. To speed things up further, change your locale settings to the C locale:
LC_ALL=C awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n":"$1}' mystringfile.txt
To match grep's -b option, which reports the byte offset where the match starts, print n-1 instead of n in the commands above (awk's index() is 1-based, while grep -b offsets are 0-based).
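For example, the byte-offset version of the locale-tuned command above would be:
LC_ALL=C awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n-1":"$1}' mystringfile.txt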
If you still want to use GNU parallel, use --pipepart to physically split the file into parts and specify with --block how many MB of file content each chunk should contain:
parallel -a mystringfile.txt --pipepart --block=20M -q awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n":"$1}'
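By default GNU parallel starts one job per CPU core. Since you asked about controlling the thread count explicitly, you can pin the number of jobs with its -j option, for example eight jobs (the number here is just an illustration):
parallel -j 8 -a mystringfile.txt --pipepart --block=20M -q awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n":"$1}'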
If all your search strings in mystringfile.txt are of the same length (16 bytes, as in your sample file), you could store every 16 byte substring of sequence.txt in an associative array (if you have the memory) and speed up the searching. Let's try that. First we need some test material; creating a 2400000 byte sequence.txt takes about a second:
$ awk -v seed=$RANDOM 'BEGIN{a[0]="A";a[1]="C";a[2]="G";a[3]="T";srand(seed);for(i=1;i<=2400000;i++)printf "%s", a[int(4*rand())];print ""}' > sequence.txt
and a mystringfile.txt with 30233088 16 byte search strings (4 mins 50 secs):
$ awk -v seed=$RANDOM 'BEGIN{a[0]="A";a[1]="C";a[2]="G";a[3]="T";srand(seed);for(i=1;i<=30233088;i++){for(j=1;j<=16;j++)printf "%s", a[int(4*rand())];print ""}}' > mystringfile.txt
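If you want a quick sanity check of the generated test data: wc -c sequence.txt should report 2400001 bytes (2400000 bases plus the trailing newline) and wc -l mystringfile.txt should report 30233088 lines:
$ wc -c sequence.txt
$ wc -l mystringfile.txt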
Then the script:
$ awk '
NR==FNR {                             # process the sequence file first
    l=length($0)-15                   # number of 16 byte windows in the sequence
    for(i=1;i<=l;i++) {               # loop over each window
        s=substr($0,i,16)
        a[s]=a[s] (a[s]==""?"":",") i # hash start indexes to a, string as key
    }                                 # eg. a["ACTGTGCACGTATAGC"]="3,141,592"
    next
}                                     # then the search part
$1 in a {                             # if the search string is found in the hash
    print a[$1], $1                   # output index list and string
}' sequence.txt mystringfile.txt
Storing the 2400000 byte sequence.txt in the hash took 13 seconds and used 721 MB of memory on my mini laptop. The whole script ran in 35 seconds and found around 17000 hits. Note that, unlike the index() approach, this version reports every start position of a search string (comma-separated), not just the first one.
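If you also want to spread this version across cores, a rough sketch (untested at full scale, chunk size chosen arbitrarily) is to let GNU parallel's --pipepart feed chunks of mystringfile.txt to the same awk program on stdin; be aware that each job then rebuilds the hash from sequence.txt, so the roughly 13 second setup cost is paid once per job:
$ parallel -a mystringfile.txt --pipepart --block=100M -q awk 'NR==FNR{l=length($0)-15;for(i=1;i<=l;i++){s=substr($0,i,16);a[s]=a[s] (a[s]==""?"":",") i};next} $1 in a{print a[$1], $1}' sequence.txt -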