Need suggestions with reading text files by every n-th line in Raku [closed]

Tags:

raku

I am looking for suggestions on how to read text files by every n-th line in Raku (Perl 6).

In bioinformatics research, we sometimes need to parse text files in a less than straightforward manner. FASTQ files, for example, store data in groups of 4 lines at a time. What is more, these FASTQ files come in pairs. So to parse such files we may need to read 4 lines from the first FASTQ file, then 4 lines from the second FASTQ file, then the next 4 lines from the first file, then the next 4 lines from the second file, and so on.

May I have some suggestions on the best way to approach this problem? Raku's "IO.lines" approach seems to handle lines one at a time, but I am not sure how to handle every n-th line.

An example FASTQ file pair: https://github.com/wtwt5237/perl6-for-bioinformatics/tree/master/Come%20on%2C%20sister/fastq

What we tried before with "IO.lines": https://github.com/wtwt5237/perl6-for-bioinformatics/blob/master/Come%20on%2C%20sister/script/benchmark2.p6

Tao Wang asked Nov 09 '19 04:11



2 Answers

Reading 4 lines at a time from 2 files and processing them as a single unit can easily be done with zip and batch:

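my @filenames = <file1 file2>;
# Split each file's lines into groups of 4, then pair the groups up across files.
for zip @filenames.map: *.IO.lines.batch(4) {
    # $_ holds one 4-line group per file: ((a,b,c,d),(e,f,g,h))
}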

This keeps producing pairs of groups until one of the files is exhausted. Because batch also returns a final group with fewer than 4 lines, the last pair may be incomplete. An alternative to batch is rotor, which drops an incomplete final group, so the loop only continues while both files supply 4 complete lines. Other ways to control how the loop ends are passing the :partial flag to rotor, and using roundrobin instead of zip. YMMV.
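
The difference between these options is easiest to see on small in-memory lists. The following is a minimal sketch, not part of the original answer: the arrays @a and @b are hypothetical stand-ins for the two .IO.lines sequences, holding 6 and 10 lines respectively.

```raku
# Hypothetical stand-ins for "file1".IO.lines and "file2".IO.lines:
# 6 lines in the first "file", 10 in the second.
my @a = (1..6).map:  { "a$_" };
my @b = (1..10).map: { "b$_" };

# batch(4) keeps the short final group, so zip pairs it with a full one.
my @batched = zip @a.batch(4), @b.batch(4);
say @batched.elems;        # 2 paired groups
say @batched[1][0].elems;  # 2 lines in the last group taken from @a

# rotor(4) drops the incomplete final group of @a.
my @strict = zip @a.rotor(4), @b.rotor(4);
say @strict.elems;         # 1

# rotor(4, :partial) keeps the short group, like batch(4).
my @partial = zip @a.rotor(4, :partial), @b.rotor(4, :partial);
say @partial.elems;        # 2

# roundrobin continues until every input is exhausted.
my @all = roundrobin @a.batch(4), @b.batch(4);
say @all.elems;            # 3
```

For paired FASTQ files, where a truncated record usually signals a problem, rotor without :partial may be the safer default, though that choice depends on how the data should be validated.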

Elizabeth Mattijsen answered Dec 31 '22 08:12



You can use the lines method. Raku Sequences are lazy. This means that iterating over an expression like "somefile".IO.lines will only ever read one line into memory, never the whole file. To read the whole file into memory, you would need to assign the Sequence to an Array.

The pairs method gives you the index of each line. In combination with the divisibility operator %% we can write

"somefile".IO.lines.pairs.grep({ .key && .key %% 4 }).map({ .value })

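in order to get a sequence of every 4th line in a file. Note that the keys are zero-based and the .key && test skips index 0, so this selects the lines at indices 4, 8, 12, and so on.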
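
To make the indexing concrete, here is a small sketch on an in-memory list. The array @lines is a hypothetical stand-in for "somefile".IO.lines, and the second expression is a variation, not part of the original answer, that counts lines from 1 instead.

```raku
# Hypothetical stand-in for "somefile".IO.lines: 12 numbered lines.
my @lines = (1..12).map: { "line$_" };

# The expression from the answer: zero-based keys 4 and 8 match.
my @from-index = @lines.pairs.grep({ .key && .key %% 4 }).map({ .value });
say @from-index;   # [line5 line9]

# Variation: treat line numbers as 1-based, so keys 3, 7 and 11 match.
my @one-based = @lines.pairs.grep({ (.key + 1) %% 4 }).map({ .value });
say @one-based;    # [line4 line8 line12]
```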

Holli answered Dec 31 '22 07:12
