I have a file of at least 10 GB in size that I need to read in R. To limit memory use, I only want to read the lines matching certain patterns. For example, in the text file mytext.tsv
below, I would like to start reading from the wanted line, which is the header, and then read only the lines whose col2 matches coding
or synonymous
, i.e. patterns
.
patterns <- c("coding", "synonymous")
mytext.tsv:
## lines unwanted
## lines unwanted1
## lines unwanted2
## lines unwanted3
wanted col1 col2
aaa variant1 coding
jhjh variant2 non-coding
ggg variant3 synonymous
fgg variant4 coding
gdg variant6 missense
My expected dataframe should be:
wanted col1 col2
aaa variant1 coding
ggg variant3 synonymous
I know I can open a connection, scan the file, and loop over each pattern, but is there a more efficient method to do this in R?
Using data.table's fread() with the cmd option and grep (not tested). The pattern is anchored so that coding or synonymous is matched only as the entire last field; a bare grep 'coding\|synonymous' would also keep the non-coding row:
library(data.table)
fread(cmd = "grep -E '[[:space:]](coding|synonymous)$' mytext.tsv",
      col.names = c("wanted", "col1", "col2"))
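To see what that shell command hands to fread, here is a sketch that recreates the question's sample file (assuming tab separators, as the .tsv extension suggests) and runs the anchored grep on it:

```shell
# Recreate the sample mytext.tsv from the question (tab-separated).
{
  echo '## lines unwanted'
  echo '## lines unwanted1'
  echo '## lines unwanted2'
  echo '## lines unwanted3'
  printf 'wanted\tcol1\tcol2\n'
  printf 'aaa\tvariant1\tcoding\n'
  printf 'jhjh\tvariant2\tnon-coding\n'
  printf 'ggg\tvariant3\tsynonymous\n'
  printf 'fgg\tvariant4\tcoding\n'
  printf 'gdg\tvariant6\tmissense\n'
} > mytext.tsv

# The pattern requires whitespace before coding/synonymous at end of line,
# so "non-coding" (preceded by "-") is excluded.
grep -E '[[:space:]](coding|synonymous)$' mytext.tsv
# prints the three rows for variant1, variant3 and variant4
```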
Note: on Windows, where grep is usually unavailable, the findstr command can serve a similar role.
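If the pattern words can appear as substrings elsewhere in a line, an exact comparison on the third field with awk is more robust than a regular expression, and the command can be passed to fread(cmd = ...) in the same way. A sketch on the question's sample data (assuming tab separators; the file is recreated here so the example is self-contained):

```shell
# Recreate the sample mytext.tsv from the question (tab-separated).
{
  echo '## lines unwanted'
  echo '## lines unwanted1'
  echo '## lines unwanted2'
  echo '## lines unwanted3'
  printf 'wanted\tcol1\tcol2\n'
  printf 'aaa\tvariant1\tcoding\n'
  printf 'jhjh\tvariant2\tnon-coding\n'
  printf 'ggg\tvariant3\tsynonymous\n'
  printf 'fgg\tvariant4\tcoding\n'
  printf 'gdg\tvariant6\tmissense\n'
} > mytext.tsv

# Compare column 3 exactly against the wanted values; "non-coding" and
# the comment/header lines fall through without any regex anchoring.
awk -F'\t' '$3 == "coding" || $3 == "synonymous"' mytext.tsv
# prints the three rows for variant1, variant3 and variant4
```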