Reduce several large tab delimited matrices based on rows and columns using awk

Question

I have several very large (multi-gigabyte) tab-delimited files with named rows (4.5e6 rows) and named columns (ranging from 10 to several hundred).

For example, InputFile1.txt

            A           B           C          D
Row1        1           2           1          3
Row2        2           4           5          3
Row3        3           6           6          4
Row4        4           8           9          4
Row5        5           2           0          1

InputFile2.txt

            E           F           G        
Row1        7           1           5          
Row2        7           5           5          
Row3        6           4           7          
Row4        5           4           8          
Row5        4           9           0

I also have two index files, one for rows and one for columns. For example:

IndexRows.txt (all of these rows are going to be in all files)

Row1
Row3
Row4

IndexCols.txt (no duplicate columns across the files)

B
C
F

I want to efficiently extract the rows and columns specified in the index files from the tab-delimited files and then merge all the columns into one file. I'm experienced with R and could do this in R, but these files are very large, and using R would push its limits, if it is possible at all.

Can anyone suggest an efficient way to do this using bash/awk?

In this example, output would look like this:

            B       C       F  
Row1        2       1       1
Row3        6       6       4
Row4        8       9       4

Thanks

Ricardo Saporta · Accepted Answer

I would approach the problem as follows.

library(data.table)

## File names below are placeholders for your own paths
DT   <- fread("f.txt",         sep="\t", header=TRUE)
ROWS <- fread("file_rows.txt", sep="\t", header=FALSE)
COLS <- fread("file_cols.txt", sep="\t", header=FALSE)

setkey(DT, id)  # assumes the row-name column of DT is called `id`
setkey(ROWS) # sets key to the single column

## Note that this filters DT to only those rows with `id` in ROWS$V1
DT[ROWS]

Finally, to filter columns and rows:

DT[ROWS, .SD, .SDcols=COLS$V1]
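
For the bash/awk route the question asks about, the same row and column filter, plus the merge across files, could be sketched roughly as below. This is not part of the data.table approach above, only a minimal sketch under some assumptions: every header line starts with an empty first field (a leading tab) so that column names line up with the data fields, row names are in the first field, and every indexed row appears in every file (as the question states). The function name reduce_matrices is made up for illustration. Only the selected rows and columns are held in memory; rows come out in IndexRows.txt order, and columns come out in the order they occur across the input files, not in IndexCols.txt order.

```shell
#!/usr/bin/env bash
# reduce_matrices ROWS_INDEX COLS_INDEX FILE...   (hypothetical helper name)
# Prints one merged tab-delimited matrix to stdout.
reduce_matrices() {
  awk -F'\t' -v OFS='\t' '
    # First argument: row index. Remember the names and their order.
    FILENAME == ARGV[1] { want_row[$1]; row_order[++nrows] = $1; next }

    # Second argument: column index.
    FILENAME == ARGV[2] { want_col[$1]; next }

    # Header line of each data file: record which field numbers to keep
    # and append their names to the merged header.
    FNR == 1 {
      n = 0
      for (i = 2; i <= NF; i++)
        if ($i in want_col) { keep[++n] = i; header = header OFS $i }
      next
    }

    # Data line for a wanted row: append the kept fields to that row.
    $1 in want_row {
      for (k = 1; k <= n; k++) out[$1] = out[$1] OFS $(keep[k])
    }

    # Emit the header (empty first field), then rows in index order.
    END {
      print header
      for (r = 1; r <= nrows; r++) print row_order[r] out[row_order[r]]
    }
  ' "$@"
}
```

With the example files from the question, it would be called as:

reduce_matrices IndexRows.txt IndexCols.txt InputFile1.txt InputFile2.txt > merged.txt

Each input file is read once, line by line, so the multi-gigabyte inputs themselves are never loaded into memory.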

Tags:

bash

awk

Floris
