 

Print many specific rows from a text file using an index file

Tags: bash, unix, sed, awk

I have a large text file with over 100 million rows, called reads.fastq. I also have another file, called takeThese.txt, that contains the row numbers of reads.fastq that should be printed (one per row).

Currently I use

awk 'FNR == NR { h[$1]; next } (FNR in h)' takeThese.txt reads.fastq > subsample.fastq

Apparently that takes really long. Is there a faster way to extract rows from a text file using row numbers stored in another file? Would it speed things up if takeThese.txt were sorted?

EDIT:

A few example rows of the files I have:

reads.fastq:

@HWI-1KL157:36:C2468ACXX
TGTTCAGTTTCTTCGTTCTTTTTTTGGAC
+
@@@DDDDDFF>FFGGC@F?HDHIHIFIGG
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
@HWI-1KL157:36:C2468ACXX
TCATATTTTCTGATTTCTCCGTCACTCAA

takeThese.txt:

5
6
7
8

So that the output would look like this:

@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
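Using the sample data above, the question's command can be checked end to end (a quick sanity check; these are the question's own filenames and command):

```shell
# Recreate the sample files from the question
cat > reads.fastq <<'EOF'
@HWI-1KL157:36:C2468ACXX
TGTTCAGTTTCTTCGTTCTTTTTTTGGAC
+
@@@DDDDDFF>FFGGC@F?HDHIHIFIGG
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
@HWI-1KL157:36:C2468ACXX
TCATATTTTCTGATTTCTCCGTCACTCAA
EOF
printf '%s\n' 5 6 7 8 > takeThese.txt

# First pass (FNR==NR) stores the wanted line numbers as hash keys;
# second pass prints any line of reads.fastq whose number is a key.
awk 'FNR == NR { h[$1]; next } (FNR in h)' takeThese.txt reads.fastq
# prints the second record, i.e. file lines 5-8
```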

EDIT: Comparison of the suggested scripts:

$ time perl AndreasWederbrand.pl takeThese.txt reads.fastq  > /dev/null

real    0m1.928s
user    0m0.819s
sys     0m1.100s

$ time ./karakfa  takeThese_numbered.txt reads_numbered.fastq  > /dev/null

real    0m8.334s
user    0m9.973s
sys     0m0.226s

$ time ./EdMorton takeThese.txt reads.fastq  > /dev/null

real    0m0.695s
user    0m0.553s
sys     0m0.130s

$ time ./ABrothers  takeThese.txt reads.fastq  > /dev/null

real    0m1.870s
user    0m1.676s
sys     0m0.186s

$ time ./GlenJackman takeThese.txt reads.fastq  > /dev/null

real    0m1.414s
user    0m1.277s
sys     0m0.147s

$ time ./DanielFischer takeThese.txt reads.fastq  > /dev/null

real    0m1.893s
user    0m1.744s
sys     0m0.138s

Thanks for all suggestions and efforts!

Daniel Fischer asked Nov 14 '16

4 Answers

The script in your question will already be fast, since all it does per line is a hash lookup of the current line number in the array h. The following will be faster still (unless the last desired line number is the last line of reads.fastq), because it exits as soon as the last desired line has been printed instead of reading the rest of reads.fastq:

awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; if (!--c) exit}' takeThese.txt reads.fastq

You could add a delete h[FNR]; after the print; to shrink the array and so possibly speed up the lookups, but I don't know whether that will really improve performance: the array access is a hash lookup, which is already extremely fast, so the extra delete may end up slowing the script down overall.
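For reference, that delete variant would look like this (a sketch with hypothetical demo file names; whether the delete actually helps is exactly the open question above):

```shell
# Hypothetical tiny stand-ins for the real files
printf '%s\n' a b c d e > reads_demo.txt
printf '%s\n' 2 4      > nums_demo.txt

# Same as the script above, plus "delete h[FNR]" after each print so the
# array shrinks as matches are consumed.
awk 'FNR==NR{h[$1]; c++; next}
     FNR in h{print; delete h[FNR]; if (!--c) exit}' nums_demo.txt reads_demo.txt
# prints:
# b
# d
```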

Actually, this will be faster still since it avoids testing NR==FNR for every line in both files:

awk -v nums='takeThese.txt' '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq

Whether that or the script @glennjackman posted is faster depends on how many lines takeThese.txt contains and how close to the end of reads.fastq they occur. Since Glenn's reads the whole of reads.fastq regardless of the contents of takeThese.txt, it runs in roughly constant time, while mine gets faster the further from the end of reads.fastq the last line number in takeThese.txt occurs. For example:

$ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq


$ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThese.txt

$ time awk -v nums=takeThese.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m28.720s
user    0m27.876s
sys     0m0.450s

$ time awk -v nums=takeThese.txt '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null
real    0m50.060s
user    0m47.564s
sys     0m0.405s


$ awk 'BEGIN {for(i=1;i<=100;i++) print i*100}' > takeThat.txt

$ time awk -v nums=takeThat.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m26.738s
user    0m23.556s
sys     0m0.310s

$ time awk -v nums=takeThat.txt '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null
real    0m0.094s
user    0m0.015s
sys     0m0.000s

But you can have the best of both worlds with:

$ time awk -v nums=takeThese.txt '
    function next_index() {
        if ( ( ("sort -n " nums) | getline i) > 0 ) {
            return i
        }
        else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m28.057s
user    0m26.675s
sys     0m0.498s


$ time awk -v nums=takeThat.txt '
    function next_index() {
        if ( ( ("sort -n " nums) | getline i) > 0 ) {
            return i
        }
        else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m0.094s
user    0m0.030s
sys     0m0.062s

which, if we assume takeThese.txt is already sorted, can be reduced to just:

$ time awk -v nums=takeThese.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null
real    0m27.362s
user    0m25.599s
sys     0m0.280s

$ time awk -v nums=takeThat.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null
real    0m0.047s
user    0m0.030s
sys     0m0.016s
Ed Morton answered Nov 16 '22

I think the solution in the question stores all line numbers from takeThese.txt in the array h[], and then for each line in reads.fastq looks up that line's number in h[].

There are several simple improvements to that in different languages. I would try Perl if you're not comfortable with Java.

Basically you should make sure takeThese.txt is sorted, then go through reads.fastq one line at a time, advancing until the current line number matches the next line number from takeThese.txt, then print it, pop the next number and continue.

Since rows have different lengths, you have no choice but to scan for newline characters (the basic for-each-line construct in most languages).

An example in Perl, quick and dirty but it works:

open(F1, "<", "reads.fastq")   or die "reads.fastq: $!";
open(F2, "<", "takeThese.txt") or die "takeThese.txt: $!";
$f1_pos = 1;
foreach $index (<F2>) {          # each wanted line number, in ascending order
   while ($f1_pos <= $index) {   # read forward until that line is reached
      $out = <F1>; $f1_pos++;
   }
   print $out;
}
Andreas Wederbrand answered Nov 16 '22

I would try one of these

  1. may result in false positives:

    cat -n reads.fastq | grep -Fwf takeThese.txt | cut -d$'\t' -f2-
    
  2. requires one of {bash,ksh,zsh}:

    sed -n -f <(sed 's/$/p/' takeThese.txt) reads.fastq
    
  3. this is similar to Andreas Wederbrand's perl answer, implemented in awk

    awk -v nums=takeThese.txt '
        function next_index() {
            ("sort -n " nums) | getline i
            return i
        }
        BEGIN { linenum = next_index() }
        NR == linenum { print; linenum = next_index() }
    ' reads.fastq
    

But, if you're dealing with a lot of data, text processing tools will take time. Your other option is to import the data into a proper database and use SQL to extract it: database engines are built for this kind of stuff.
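A minimal sketch of that route, assuming the sqlite3 command-line tool is installed (file and table names here are made up for illustration, with tiny stand-in inputs):

```shell
# Hypothetical tiny stand-ins for the real files
printf '%s\n' '@r1' 'ACGT' '+' 'IIII' '@r2' 'TTTT' '+' 'JJJJ' > reads_demo.fastq
printf '%s\n' 5 6 7 8 > take_demo.txt

# Number each line ("lineno<TAB>text"), load both files, join on line number
cat -n reads_demo.fastq | sed 's/^[[:space:]]*//' > reads_demo.tsv
rm -f demo.db
sqlite3 demo.db <<'EOF'
CREATE TABLE reads(lineno INTEGER PRIMARY KEY, line TEXT);
CREATE TABLE wanted(lineno INTEGER PRIMARY KEY);
.mode tabs
.import reads_demo.tsv reads
.import take_demo.txt wanted
SELECT r.line FROM reads r JOIN wanted w USING (lineno) ORDER BY r.lineno;
EOF
```

For the real 100-million-row file you would do the .import once and then reuse the indexed table for any number of extractions.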

glenn jackman answered Nov 16 '22

As I have the flu and am bored, I tested some approaches to speed up the original solution. The test files:

$ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq 

and

$ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThese.txt

The first file is just the numbers from 1 to 100000000. It does not represent the real data, but since I was curious about the relative execution times of the awk solutions, I assumed that real data would only scale the timings by a constant factor (until memory starts to run out).

The second file gives an evenly distributed one-percent hit ratio against the first file:

100
200
300
...

First, the OP's original script:

$ time awk 'FNR==NR {h[$1]; next} (FNR in h)' takeThese.txt reads.fastq > /dev/null

real    0m52.901s
user    0m52.596s
sys     0m0.284s

My solution:

BEGIN {
    j=1
    while((getline a[++i] < "takeThese.txt") > 0 );  # read row numbers into a[]
}
NR<a[j] { next }   # skip rows before the next match
j++                # wanted row: print it (default action) and advance j
j==i    { exit }   # exit after the last hit

Timed run:

$ time awk -f program.awk reads.fastq > /dev/null

real    0m25.894s
user    0m25.676s
sys     0m0.208s

The row-number file takeThese.txt is expected to be sorted in ascending order.
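If takeThese.txt might not be ordered, a numeric sort beforehand is cheap compared to the main scan. A self-contained sketch with the program above inlined and hypothetical demo file names:

```shell
# Hypothetical demo inputs: five rows, wanted line numbers given out of order
printf '%s\n' r1 r2 r3 r4 r5 > demo_reads.txt
printf '%s\n' 4 2 > demo_nums.txt

# sort -n puts the numbers in ascending order, as the scan requires
sort -n -o demo_nums.txt demo_nums.txt

awk 'BEGIN { j=1
             while ((getline a[++i] < "demo_nums.txt") > 0 ); }
     NR<a[j] { next }   # skip rows before the next match
     j++                # wanted row: print (default action) and advance j
     j==i   { exit }' demo_reads.txt
# prints:
# r2
# r4
```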

James Brown answered Nov 16 '22