
How to read very large files line by line matching patterns in R

I have a file of at least 10 GB that I need to read into R. To limit memory use, I only want to read the lines matching certain patterns. For example, in the text file mytext.tsv below, I would like to start reading from the wanted line, which will be the header, and then keep only the lines whose col2 matches coding or synonymous, i.e. my patterns.

patterns <- c("coding", "synonymous")

mytext.tsv:

## lines unwanted
## lines unwanted1
## lines unwanted2
## lines unwanted3
wanted  col1       col2    
aaa     variant1   coding
jhjh    variant2   non-coding
ggg     variant3   synonymous
fgg     variant4   coding
gdg     variant6   missense  

My expected dataframe should be:

wanted  col1       col2    
aaa     variant1   coding
ggg     variant3   synonymous

I know I can use a connection and read the file in chunks, looping over each pattern, but is there a more efficient method to do this in R?
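For reference, this is the kind of chunked loop I have in mind, using readLines() on a connection rather than scan() (a rough sketch; the chunk size is arbitrary):

con <- file("mytext.tsv", open = "r")
keep <- character()
while (length(chunk <- readLines(con, n = 10000)) > 0) {
  # keep only the lines matching any of the patterns
  keep <- c(keep, grep("coding|synonymous", chunk, value = TRUE))
}
close(con)
df <- read.table(text = keep, col.names = c("wanted", "col1", "col2"))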


asked Jul 03 '20 by MAPK


1 Answer

Using data.table's fread() with the cmd option and grep (not tested):

library(data.table)

# grep pre-filters the file before R parses it, so only matching
# lines are loaded into memory. The header and the "##" comment
# lines are also dropped by grep, hence the explicit col.names.
fread(cmd = "grep -E 'coding|synonymous' mytext.tsv",
      col.names = c("wanted", "col1", "col2"))
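fread runs the command through a shell and parses its standard output, so the 10 GB file is filtered by grep before any of it reaches R.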

Note:

  • This will work on *nix systems; on Windows, findstr is the closest equivalent.
  • The regex is just an example and needs updating to fit your data. As written it will also return "non-coding" rows, because grep matches "coding" anywhere in the line (see the awk sketch below for an exact match on col2).
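If you need an exact match on col2 rather than a substring match, one untested alternative (assuming the file is tab-separated and col2 is the third field) is to let awk compare the field directly:

library(data.table)

# awk compares the third field exactly, so "non-coding" does not match.
# Assumes a tab-separated file with col2 as the third column.
fread(cmd = "awk -F'\\t' '$3 == \"coding\" || $3 == \"synonymous\"' mytext.tsv",
      col.names = c("wanted", "col1", "col2"))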
answered Nov 09 '22 by zx8754