I have a large text file with over 100 million rows, called reads.fastq. I also have another file, called takeThese.txt, that contains row numbers (one per row) from reads.fastq that should be printed. Currently I use:
awk 'FNR == NR { h[$1]; next } (FNR in h)' takeThese.txt reads.fastq > subsample.fastq
That takes a really long time. Is there any way to extract rows from a text file using row numbers stored in another file? Would it speed things up if takeThese.txt were sorted?
EDIT: A few example rows of the files I have:
reads.fastq:
@HWI-1KL157:36:C2468ACXX
TGTTCAGTTTCTTCGTTCTTTTTTTGGAC
+
@@@DDDDDFF>FFGGC@F?HDHIHIFIGG
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
@HWI-1KL157:36:C2468ACXX
TCATATTTTCTGATTTCTCCGTCACTCAA
takeThese.txt:
5
6
7
8
So that the output would look like this:
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
EDIT: Comparison of the suggested scripts:
$ time perl AndreasWederbrand.pl takeThese.txt reads.fastq > /dev/null
real 0m1.928s
user 0m0.819s
sys 0m1.100s
$ time ./karakfa takeThese_numbered.txt reads_numbered.fastq > /dev/null
real 0m8.334s
user 0m9.973s
sys 0m0.226s
$ time ./EdMorton takeThese.txt reads.fastq > /dev/null
real 0m0.695s
user 0m0.553s
sys 0m0.130s
$ time ./ABrothers takeThese.txt reads.fastq > /dev/null
real 0m1.870s
user 0m1.676s
sys 0m0.186s
$ time ./GlenJackman takeThese.txt reads.fastq > /dev/null
real 0m1.414s
user 0m1.277s
sys 0m0.147s
$ time ./DanielFischer takeThese.txt reads.fastq > /dev/null
real 0m1.893s
user 0m1.744s
sys 0m0.138s
Thanks for all suggestions and efforts!
The script in your question will be extremely fast since all it does is a hash lookup of the current line number in the array h. This will be faster still, though (unless you want to print the very last line of reads.fastq), since it exits after the last desired line number is printed instead of continuing to read the rest of reads.fastq:
awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; if (!--c) exit}' takeThese.txt reads.fastq
You could throw in a delete h[FNR]; after the print; to reduce the array size and so MAYBE speed up the lookup time, but I don't know if that will really improve performance: the array access is a hash lookup and therefore already extremely fast, so adding a delete may end up slowing the script down overall.
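For reference, a minimal sketch of that variant (whether the delete actually helps is something to benchmark on your own data):
awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; delete h[FNR]; if (!--c) exit}' takeThese.txt reads.fastq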
Actually, this will be faster still since it avoids testing NR==FNR for every line in both files:
awk -v nums='takeThese.txt' '
BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
NR in h{print; if (!--c) exit}
' reads.fastq
Whether that is faster than the script @glennjackman posted depends on how many lines are in takeThese.txt and how close to the end of reads.fastq they occur. Since Glenn's reads the whole of reads.fastq no matter what takeThese.txt contains, it will run in roughly constant time, while mine will be significantly faster the further from the end of reads.fastq the last line number in takeThese.txt occurs. For example:
$ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq
$ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThese.txt
$ time awk -v nums=takeThese.txt '
function next_index() {
("sort -n " nums) | getline i
return i
}
BEGIN { linenum = next_index() }
NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real 0m28.720s
user 0m27.876s
sys 0m0.450s
$ time awk -v nums=takeThese.txt '
BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null
real 0m50.060s
user 0m47.564s
sys 0m0.405s
$ awk 'BEGIN {for(i=1;i<=100;i++) print i*100}' > takeThat.txt
$ time awk -v nums=takeThat.txt '
function next_index() {
("sort -n " nums) | getline i
return i
}
BEGIN { linenum = next_index() }
NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real 0m26.738s
user 0m23.556s
sys 0m0.310s
$ time awk -v nums=takeThat.txt '
BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null
real 0m0.094s
user 0m0.015s
sys 0m0.000s
But you can have the best of both worlds with:
$ time awk -v nums=takeThese.txt '
function next_index() {
if ( ( ("sort -n " nums) | getline i) > 0 ) {
return i
}
else {
exit
}
}
BEGIN { linenum = next_index() }
NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real 0m28.057s
user 0m26.675s
sys 0m0.498s
$ time awk -v nums=takeThat.txt '
function next_index() {
if ( ( ("sort -n " nums) | getline i) > 0 ) {
return i
}
else {
exit
}
}
BEGIN { linenum = next_index() }
NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real 0m0.094s
user 0m0.030s
sys 0m0.062s
which, if we assume takeThese.txt is already sorted, can be reduced to just:
$ time awk -v nums=takeThese.txt '
BEGIN { getline linenum < nums }
NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null
real 0m27.362s
user 0m25.599s
sys 0m0.280s
$ time awk -v nums=takeThat.txt '
BEGIN { getline linenum < nums }
NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null
real 0m0.047s
user 0m0.030s
sys 0m0.016s
I think the solution in the question stores all the line numbers from takeThese.txt in the array h[] and then, for each line of reads.fastq, searches h[] for that line number.
There are several simple improvements to that in different languages. I would try Perl if you're not comfortable with Java.
Basically, make sure takeThese.txt is sorted, then go through reads.fastq one line at a time, scanning for a line number that matches the next line number from takeThese.txt, then pop that and continue.
Since rows are of different lengths, you have no choice but to scan for newline characters (the basic for-each-line construct in most languages).
Example in Perl; quick and dirty, but it works:
open(F1, "reads.fastq")   or die "reads.fastq: $!";
open(F2, "takeThese.txt") or die "takeThese.txt: $!";
$f1_pos = 1;                       # number of the next line to read from F1
foreach $index (<F2>) {            # $index is the next wanted line number
    while ($f1_pos <= $index) {    # read forward until line $index has been consumed
        $out = <F1>; $f1_pos++;
    }
    print $out;                    # $out now holds line $index
}
I would try one of these:
This may result in false positives (a way to avoid them is sketched after these three options):
cat -n reads.fastq | grep -Fwf takeThese.txt | cut -d$'\t' -f2-
This requires one of {bash,ksh,zsh}:
sed -n -f <(sed 's/$/p/' takeThese.txt) reads.fastq
This is similar to Andreas Wederbrand's Perl answer, implemented in awk:
awk -v nums=takeThese.txt '
function next_index() {
("sort -n " nums) | getline i
return i
}
BEGIN { linenum = next_index() }
NR == linenum { print; linenum = next_index() }
' reads.fastq
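As for the false positives in the first option: a rough sketch of one way to avoid them is to anchor each number to the line-number field that cat -n produces (this assumes GNU sed for \t in the replacement and bash/ksh/zsh for the process substitution, and it will be slower than grep -F when takeThese.txt is large):
cat -n reads.fastq | grep -f <(sed 's/^/^ */; s/$/\t/' takeThese.txt) | cut -d$'\t' -f2-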
But, if you're dealing with a lot of data, text processing tools will take time. Your other option is to import the data into a proper database and use SQL to extract it: database engines are built for this kind of stuff.
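As a rough illustration of that route (a sketch only: it assumes sqlite3 is available and that the two files have already been imported into tables reads(lineno, line) and wanted(lineno); those table and column names are made up here):
sqlite3 reads.db 'SELECT r.line FROM reads AS r JOIN wanted AS w ON w.lineno = r.lineno ORDER BY r.lineno;' > subsample.fastq
With lineno as an integer primary key in both tables, the join becomes an indexed lookup rather than a full scan.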
As I have the flu and am bored, I tested some approaches to try to speed up the original solution. The test files:
$ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq
and
$ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThat.txt
The first file just contains the numbers from 1 to 100000000. It does not represent the real data, but I was curious about the execution times of the awk solutions, so I assumed the real data would only multiply the run times by a constant (until memory starts to run out).
The second file represents a one percent, evenly distributed hit ratio against the first file:
100
200
300
...
First, the OP's original script:
$ time awk 'FNR==NR {h[$1]; next} (FNR in h)' takeThese.txt reads.fastq > /dev/null
real 0m52.901s
user 0m52.596s
sys 0m0.284s
My solution:
BEGIN {
    j = 1
    while ((getline a[++i] < "takeThese.txt") > 0);  # read the row numbers into a[]
}
NR < a[j] { next }  # skip rows before the next match
j++                 # NR == a[j]: j++ is truthy, so the row is printed (default action) and j advances
j == i   { exit }   # exit after the last hit
Timed run:
$ time awk -f program.awk reads.fastq > /dev/null
real 0m25.894s
user 0m25.676s
sys 0m0.208s
The row number file takeThese.txt is expected to be ordered.
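If it is not, it can be sorted numerically first (the output file name here is just for illustration):
sort -n takeThese.txt > takeThese_sorted.txt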