Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read only lines that fulfil a condition from a csv into R?

I am trying to read a large csv file into R. I only want to read and work with some of the rows that fulfil a particular condition (e.g. Variable2 >= 3). This is a much smaller dataset.

I want to read these lines directly into a dataframe, rather than load the whole dataset into a dataframe and then select according to the condition, since the whole dataset does not easily fit into memory.

like image 753
Hernan Avatar asked Sep 02 '25 18:09

Hernan


2 Answers

You could use the read.csv.sql function in the sqldf package and filter using SQL select. From the help page of read.csv.sql:

library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv", 
    sql = "select * from file where `Sepal.Length` > 5", eol = "\n")
like image 100
Karsten W. Avatar answered Sep 04 '25 06:09

Karsten W.


By far the easiest (in my book) is to use pre-processing.

R> DF <- data.frame(n=1:26, l=LETTERS)
R> write.csv(DF, file="/tmp/data.csv", row.names=FALSE)
R> read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($1 > 20) print $0}' /tmp/data.csv"),
+           header=FALSE)
  V1 V2
1 21  U
2 22  V
3 23  W
4 24  X
5 25  Y
6 26  Z
R> 

Here we use awk. We tell awk to use a comma as a field separator, and then use the conditon 'if first field greater than 20' to decide if we print (the whole line via $0).

The output from that command can be read by R via pipe().

This is going to be faster and more memory-efficient than reading everythinb into R.

like image 33
Dirk Eddelbuettel Avatar answered Sep 04 '25 07:09

Dirk Eddelbuettel