I am trying to read a large CSV file into R. I only want to read and work with some of the rows that fulfil a particular condition (e.g. Variable2 >= 3). This is a much smaller dataset.
I want to read these lines directly into a dataframe, rather than load the whole dataset into a dataframe and then select according to the condition, since the whole dataset does not easily fit into memory.
You could use the read.csv.sql function in the sqldf package and filter using an SQL select. From the help page of read.csv.sql:
library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv",
sql = "select * from file where `Sepal.Length` > 5", eol = "\n")
By far the easiest (in my book) is to use pre-processing.
R> DF <- data.frame(n=1:26, l=LETTERS)
R> write.csv(DF, file="/tmp/data.csv", row.names=FALSE)
R> read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($1 > 20) print $0}' /tmp/data.csv"),
+ header=FALSE)
V1 V2
1 21 U
2 22 V
3 23 W
4 24 X
5 25 Y
6 26 Z
R>
Here we use awk. We tell awk to use a comma as the field separator, and then use the condition 'if the first field is greater than 20' to decide whether to print (the whole line, via $0).
The output from that command can be read by R via pipe().
This is going to be faster and more memory-efficient than reading everything into R first.
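For the condition in the question, a minimal sketch along the same lines might look like this. The file name bigdata.csv and the position of Variable2 (assumed here to be the second column) are assumptions; the NR == 1 clause passes the header line through so read.csv can keep the column names:

## Keep the header (NR == 1) plus rows whose second field is >= 3.
## Assumes "bigdata.csv" is comma-separated with Variable2 in column 2.
cmd <- "awk 'BEGIN {FS=\",\"} NR == 1 || $2 >= 3 {print}' bigdata.csv"
smaller <- read.csv(pipe(cmd), header = TRUE)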