I have a file of at least 10 GB in size that I need to read in R. To limit memory use, I only want to read the lines matching certain patterns. For example, in the text file mytext.tsv
below, I would like to start reading from the wanted line, which is the header, and then read only the lines whose col2 matches coding
or synonymous
, i.e. patterns
.
patterns <- c("coding", "synonymous")
mytext.tsv:
## lines unwanted
## lines unwanted1
## lines unwanted2
## lines unwanted3
wanted col1 col2
aaa variant1 coding
jhjh variant2 non-coding
ggg variant3 synonymous
fgg variant4 coding
gdg variant6 missense
My expected dataframe should be:
wanted col1 col2
aaa variant1 coding
ggg variant3 synonymous
I know I can open a connection, scan the file, and loop over each pattern, but is there a more efficient method to do this in R?
Using data.table's fread() with the cmd option and grep (not tested). The pattern is anchored so that coding or synonymous is matched only as the entire last field; a bare grep 'coding\|synonymous' would also keep the non-coding row:
library(data.table)
fread(cmd = "grep -E '[[:space:]](coding|synonymous)$' mytext.tsv",
      col.names = c("wanted", "col1", "col2"))
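To see what that shell command hands to fread, here is a sketch that recreates the question's sample file (assuming tab separators, as the .tsv extension suggests) and runs the anchored grep on it:

```shell
# Recreate the sample mytext.tsv from the question (tab-separated).
{
  echo '## lines unwanted'
  echo '## lines unwanted1'
  echo '## lines unwanted2'
  echo '## lines unwanted3'
  printf 'wanted\tcol1\tcol2\n'
  printf 'aaa\tvariant1\tcoding\n'
  printf 'jhjh\tvariant2\tnon-coding\n'
  printf 'ggg\tvariant3\tsynonymous\n'
  printf 'fgg\tvariant4\tcoding\n'
  printf 'gdg\tvariant6\tmissense\n'
} > mytext.tsv

# The pattern requires whitespace before coding/synonymous at end of line,
# so "non-coding" (preceded by "-") is excluded.
grep -E '[[:space:]](coding|synonymous)$' mytext.tsv
# prints the three rows for variant1, variant3 and variant4
```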
Note: on Windows, where grep is usually unavailable, the findstr command can serve a similar role.
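If the pattern words can appear as substrings elsewhere in a line, an exact comparison on the third field with awk is more robust than a regular expression, and the command can be passed to fread(cmd = ...) in the same way. A sketch on the question's sample data (assuming tab separators; the file is recreated here so the example is self-contained):

```shell
# Recreate the sample mytext.tsv from the question (tab-separated).
{
  echo '## lines unwanted'
  echo '## lines unwanted1'
  echo '## lines unwanted2'
  echo '## lines unwanted3'
  printf 'wanted\tcol1\tcol2\n'
  printf 'aaa\tvariant1\tcoding\n'
  printf 'jhjh\tvariant2\tnon-coding\n'
  printf 'ggg\tvariant3\tsynonymous\n'
  printf 'fgg\tvariant4\tcoding\n'
  printf 'gdg\tvariant6\tmissense\n'
} > mytext.tsv

# Compare column 3 exactly against the wanted values; "non-coding" and
# the comment/header lines fall through without any regex anchoring.
awk -F'\t' '$3 == "coding" || $3 == "synonymous"' mytext.tsv
# prints the three rows for variant1, variant3 and variant4
```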