Let's say I have a CSV with dozens or hundreds of columns and I want to pull in just about 2 or 3 columns. I know about the colClasses
solution as described here but the code gets very unreadable.
I want something like usecols
from pandas' read_csv
.
Loading everything and just selecting afterwards is not a solution (the file is super big, it doesn't fit in memory).
One way is to use package sqldf
. If you know SQL, it is possible to read in large files filtering only the parts you want.
I will use built-in dataset iris
to make the example reproducible, saving it to disk first.
write.csv(iris, "iris.csv", row.names = FALSE)
Now the problem.
Argument row.names
is like in the write.csv
instruction.
Note the backticks around Sepal.Length
. This is due to the dot character in the column name.
library(sqldf)
sql <- "select `Sepal.Length`, Species from file"
sub_iris <- read.csv.sql("iris.csv", sql = sql, row.names = FALSE)
head(sub_iris)
# Sepal.Length Species
#1 5.1 "setosa"
#2 4.9 "setosa"
#3 4.7 "setosa"
#4 4.6 "setosa"
#5 5.0 "setosa"
#6 5.4 "setosa"
And final clean up.
unlink("iris.csv")
I will use package data.table
and then with fread()
specify columns to keep/drop by arguments select
or drop
. From ?fread
select Vector of column names or numbers to keep, drop the rest.
drop Vector of column names or numbers to drop, keep the rest.
Best!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With