Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Read CSV in R and filter columns by name

Tags:

r

csv

readr

Let's say I have a CSV with dozens or hundreds of columns and I want to pull in just about 2 or 3 columns. I know about the colClasses solution as described here but the code gets very unreadable.

I want something like usecols from pandas' read_csv.

Loading everything and just selecting afterwards is not a solution (the file is super big, it doesn't fit in memory).

like image 299
gsmafra Avatar asked Jan 01 '23 07:01

gsmafra


2 Answers

One way is to use package sqldf. If you know SQL, it is possible to read in large files filtering only the parts you want.

I will use built-in dataset iris to make the example reproducible, saving it to disk first.

write.csv(iris, "iris.csv", row.names = FALSE)

Now the problem.
Argument row.names is like in the write.csv instruction.
Note the backticks around Sepal.Length. This is due to the dot character in the column name.

library(sqldf)

sql <- "select `Sepal.Length`, Species from file"
sub_iris <- read.csv.sql("iris.csv", sql = sql, row.names = FALSE)

head(sub_iris)
#  Sepal.Length  Species
#1          5.1 "setosa"
#2          4.9 "setosa"
#3          4.7 "setosa"
#4          4.6 "setosa"
#5          5.0 "setosa"
#6          5.4 "setosa"

And final clean up.

unlink("iris.csv")
like image 152
Rui Barradas Avatar answered Jan 14 '23 03:01

Rui Barradas


I will use package data.table and then with fread() specify columns to keep/drop by arguments selector drop. From ?fread

select Vector of column names or numbers to keep, drop the rest.

drop Vector of column names or numbers to drop, keep the rest.

Best!

like image 31
LocoGris Avatar answered Jan 14 '23 04:01

LocoGris