I am trying to read a large CSV file into R. I only want to read and work with some of the rows that fulfil a particular condition (e.g. Variable2 >= 3). This is a much smaller dataset.
I want to read these lines directly into a dataframe, rather than load the whole dataset into a dataframe and then select according to the condition, since the whole dataset does not easily fit into memory.
You could use the read.csv.sql function in the sqldf package and filter using an SQL select. From the help page of read.csv.sql:
library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv",
sql = "select * from file where `Sepal.Length` > 5", eol = "\n")
By far the easiest (in my book) is to use pre-processing.
R> DF <- data.frame(n=1:26, l=LETTERS)
R> write.csv(DF, file="/tmp/data.csv", row.names=FALSE)
R> read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($1 > 20) print $0}' /tmp/data.csv"),
+ header=FALSE)
V1 V2
1 21 U
2 22 V
3 23 W
4 24 X
5 25 Y
6 26 Z
R>
Here we use awk. We tell awk to use a comma as the field separator, and then use the condition 'if the first field is greater than 20' to decide whether to print (the whole line, via $0).
The output from that command can be read by R via pipe().
This is going to be faster and more memory-efficient than reading everything into R first.
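For the condition in the question, a minimal sketch along the same lines might look like this. The file name bigdata.csv and the position of Variable2 (assumed here to be the second column) are assumptions; the NR == 1 clause passes the header line through so read.csv can keep the column names:

## Keep the header (NR == 1) plus rows whose second field is >= 3.
## Assumes "bigdata.csv" is comma-separated with Variable2 in column 2.
cmd <- "awk 'BEGIN {FS=\",\"} NR == 1 || $2 >= 3 {print}' bigdata.csv"
smaller <- read.csv(pipe(cmd), header = TRUE)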