R using fread colClasses or skip arguments to read csv with no column headers

I would like to skip a column when reading a file into R with data.table's fread function (v1.8.9). The CSV I am reading has no column headers, which appears to be a problem for fread. Is there a way to specify that I don't want certain columns?

Would it be better to just pre-allocate a column name and then let it read it in so that it can be skipped?

To give an example, I downloaded the data from the following URL

http://www.truefx.com/dev/data/2013/MAY-2013/AUDUSD-2013-05.zip

unzipped it, and read the resulting CSV into R using fread. The CSV has the same file name as the archive, just with the .csv extension.

system.time(pp <- fread("AUDUSD-2013-05.csv",sep=","))
  user  system elapsed 
16.427   0.257  16.682 

head(pp)
       V1                    V2      V3      V4
1: AUD/USD 20130501 00:00:04.728 1.03693 1.03721
2: AUD/USD 20130501 00:00:21.540 1.03695 1.03721
3: AUD/USD 20130501 00:00:33.789 1.03694 1.03721
4: AUD/USD 20130501 00:00:37.499 1.03692 1.03724
5: AUD/USD 20130501 00:00:37.524 1.03697 1.03719
6: AUD/USD 20130501 00:00:39.789 1.03697 1.03717

str(pp)
Classes ‘data.table’ and 'data.frame':  4060762 obs. of  4 variables:
$ V1: chr  "AUD/USD" "AUD/USD" "AUD/USD" "AUD/USD" ...
$ V2: chr  "20130501 00:00:04.728" "20130501 00:00:21.540" "20130501 00:00:33.789" "20130501 00:00:37.499" ...
$ V3: num  1.04 1.04 1.04 1.04 1.04 ...
$ V4: num  1.04 1.04 1.04 1.04 1.04 ...
- attr(*, ".internal.selfref")=<externalptr> 

I tried using the new(ish) colClasses and skip arguments to avoid reading the first column, which holds the same value in every row and is unnecessary.

But doing:

pp1 <- fread("AUDUSD-2013-05.csv",sep=",",skip=1)

doesn't stop the first column from being read,

and using colClasses leads to the following error:

pp1 <- fread("AUDUSD-2013-05.csv",sep=",",colClasses=list(NULL,"character","numeric","numeric"))

Error in fread("AUDUSD-2013-05.csv", sep = ",", colClasses = list(NULL,  : 
 colClasses is type list but has no names

Other attempts include:

pp1 <- fread("AUDUSD-2013-06.csv",sep=",", colClasses=c(V1=NULL,V2="character",V3="numeric",V4="numeric"))
str(pp1)
Classes ‘data.table’ and 'data.frame':  5524877 obs. of  4 variables:
 $ V1: chr  "AUD/USD" "AUD/USD" "AUD/USD" "AUD/USD" ...
 $ V2: chr  "20130603 00:00:00.290" "20130603 00:00:00.291" "20130603 00:00:00.292" "20130603 00:00:03.014" ...
 $ V3: num  0.962 0.962 0.962 0.962 0.962 ...
 $ V4: num  0.962 0.962 0.962 0.962 0.962 ...
 - attr(*, ".internal.selfref")=<externalptr>

i.e. pretty much exactly the same as if I had not used colClasses.
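A likely reason this attempt changes nothing can be sketched in base R, independent of fread: c() discards NULL elements, so the V1 entry never reaches colClasses, and the remaining classes match what fread infers on its own. The variable name classes below is only illustrative.

```r
# c() silently drops NULL elements, so no entry for V1 survives.
classes <- c(V1 = NULL, V2 = "character", V3 = "numeric", V4 = "numeric")

names(classes)   # only V2, V3 and V4 remain
length(classes)  # three elements, not four
```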

Are there any suggestions to be able to speed up the reading in of data by omitting the first column?

Also perhaps a bit much to ask, but is it possible to directly read a zip file rather than unzipping it first and then reading in the csv?

Oh and if it wasn't clear I'm using data.table v1.8.9

asked Jul 10 '13 09:07 by h.l.m



1 Answer

I think the argument you're looking for is drop. Try:

require(data.table)  # 1.9.2+
pp <- fread("AUDUSD-2013-05.csv", drop = 1)

Note that you can drop by name or position.

fread("AUDUSD-2013-05.csv", drop = c("columnThree","anotherColumnName"))

fread("AUDUSD-2013-05.csv", drop = 10:15)  # read all columns other than 10:15

And you can select by name or position, too.

fread("AUDUSD-2013-05.csv", select = 10:15)  # read only columns 10:15

fread("AUDUSD-2013-05.csv", select = c("columnA","columnName2"))

These arguments were added in v1.9.2 (released to CRAN in Feb 2014) and are documented in ?fread. You'll need to upgrade from v1.8.9 to use them.
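For a file without a header row, like the one in the question, fread names the columns V1, V2, and so on, and those default names can be used with drop and select. The following is a minimal sketch, assuming data.table 1.9.2 or later; the temporary file and its three rows are a made-up sample in the same layout as the TrueFX data, not the real download.

```r
library(data.table)  # assumes data.table >= 1.9.2

# Hypothetical three-row sample with no header row, mimicking the TrueFX layout.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "AUD/USD,20130501 00:00:04.728,1.03693,1.03721",
  "AUD/USD,20130501 00:00:21.540,1.03695,1.03721",
  "AUD/USD,20130501 00:00:33.789,1.03694,1.03721"
), tmp)

# Drop the constant first column by position...
by_position <- fread(tmp, header = FALSE, drop = 1)

# ...or by the default name fread assigns when there is no header.
by_name <- fread(tmp, header = FALSE, drop = "V1")

# select is the complement: list the columns to keep instead.
kept <- fread(tmp, header = FALSE, select = c("V2", "V3", "V4"))

names(by_position)
```

The remaining columns keep their original default names, so dropping V1 leaves V2, V3 and V4 rather than renumbering from V1.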

answered Sep 29 '22 21:09 by SCallan