
Reduce several large tab delimited matrices based on rows and columns using awk

Tags: bash, awk

I have several very large (several gigabytes) tab-delimited files with named rows (4.5e6 rows) and columns (ranging from 10 to several hundred).

For example, InputFile1.txt:

            A           B           C          D
Row1        1           2           1          3
Row2        2           4           5          3
Row3        3           6           6          4
Row4        4           8           9          4
Row5        5           2           0          1

InputFile2.txt

            E           F           G        
Row1        7           1           5          
Row2        7           5           5          
Row3        6           4           7          
Row4        5           4           8          
Row5        4           9           0        

I also have two index files, one for rows and one for columns, for example:

IndexRows.txt (all of these rows are going to be in all files)

Row1
Row3
Row4

IndexCols.txt (no duplicate columns across the files)

B
C
F

I want to efficiently extract the rows and columns specified in the index files from the tab-delimited files and then merge all the columns into one file. I'm experienced with R and would be able to do this in R, but these files are very large, and R would be pushing its limits, if it could handle them at all.

Can anyone suggest an efficient way to do this, using bash/awk?

In this example, output would look like this:

            B       C       F  
Row1        2       1       1
Row3        6       6       4
Row4        8       9       4

Thanks

Floris asked Dec 17 '25 16:12


1 Answer

I would approach the problem as follows.

library(data.table)

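# File names are placeholders; fread() needs them as quoted strings.
DT   <- fread("f.txt",          sep="\t",  header=TRUE)
ROWS <- fread("file_rows.txt",  sep="\t",  header=FALSE)
COLS <- fread("file_cols.txt",  sep="\t",  header=FALSE)

# Assumes the row-name column of DT is called `id`.
setkey(DT, id)
setkey(ROWS) # sets key to the single column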

## Note that this filters DT to only those rows with `id` in ROWS$V1
DT[ROWS]

Finally, to filter columns and rows:

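# Keep only the index columns present in this file; .SDcols errors on unknown names.
DT[ROWS, .SD, .SDcols=intersect(COLS$V1, names(DT))]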
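
Since the question asks for bash/awk, the same row and column filtering can also be sketched in plain awk. This is a minimal sketch, not a tuned implementation, and it rests on a few assumptions: each matrix header starts with an empty first field (a leading tab), so column names line up with the data fields; both index files are non-empty with one name per line; and the selected rows (not the whole matrices) fit in memory. The function name `reduce_matrices` is made up for illustration.

```shell
#!/bin/sh
# reduce_matrices ROWS_INDEX COLS_INDEX MATRIX...
# Hypothetical helper: prints the wanted rows and columns of all
# matrices, merged side by side, as one tab-delimited table.
reduce_matrices() {
    rows_file=$1
    cols_file=$2
    shift 2
    awk -F'\t' -v OFS='\t' '
        # Count input files as they start (assumes no file is empty).
        FNR == 1 { file_no++ }

        # File 1: wanted row names, remembered in index order.
        file_no == 1 {
            if ($1 != "") { want_row[$1] = 1; row_order[++n_rows] = $1 }
            next
        }

        # File 2: wanted column names.
        file_no == 2 { if ($1 != "") want_col[$1] = 1; next }

        # Header of a matrix: record which field numbers to keep.
        # Field 1 is assumed to be the empty slot above the row names.
        FNR == 1 {
            n_keep = 0
            for (i = 2; i <= NF; i++)
                if ($i in want_col) { keep[++n_keep] = i; header = header OFS $i }
            next
        }

        # Data line: append the kept fields to the output for that row.
        ($1 in want_row) {
            for (k = 1; k <= n_keep; k++)
                merged[$1] = merged[$1] OFS $(keep[k])
        }

        # Print the merged table with rows in index-file order.
        END {
            print header
            for (r = 1; r <= n_rows; r++)
                print row_order[r] merged[row_order[r]]
        }
    ' "$rows_file" "$cols_file" "$@"
}
```

With the example files from the question, it could be called as `reduce_matrices IndexRows.txt IndexCols.txt InputFile1.txt InputFile2.txt > Reduced.txt` (the output name is arbitrary). Each matrix is read once, line by line, and only the selected values are held in memory, so the cost is dominated by the size of the row index rather than the size of the input files. Columns come out in the order they appear across the input files, not in the order of IndexCols.txt.
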
Ricardo Saporta answered Dec 20 '25 15:12


