
R using fread colClasses or skip arguments to read csv with no column headers

Tags:

r

csv

data.table

fread

I would like to skip a column when reading a CSV into R with data.table's fread function (v1.8.9). The CSV has no column headers, which appears to be a problem for fread. Is there a way to specify that I don't want specific columns?

Would it be better to assign column names first, so that a column can then be skipped by name?

To give an example, I downloaded the data from the following URL

http://www.truefx.com/dev/data/2013/MAY-2013/AUDUSD-2013-05.zip

unzipped it, and read the resulting CSV into R using fread. The CSV has the same file name as the zip, just with the .csv extension.

system.time(pp <- fread("AUDUSD-2013-05.csv",sep=","))
  user  system elapsed 
16.427   0.257  16.682 

head(pp)
       V1                    V2      V3      V4
1: AUD/USD 20130501 00:00:04.728 1.03693 1.03721
2: AUD/USD 20130501 00:00:21.540 1.03695 1.03721
3: AUD/USD 20130501 00:00:33.789 1.03694 1.03721
4: AUD/USD 20130501 00:00:37.499 1.03692 1.03724
5: AUD/USD 20130501 00:00:37.524 1.03697 1.03719
6: AUD/USD 20130501 00:00:39.789 1.03697 1.03717

str(pp)
Classes ‘data.table’ and 'data.frame':  4060762 obs. of  4 variables:
$ V1: chr  "AUD/USD" "AUD/USD" "AUD/USD" "AUD/USD" ...
$ V2: chr  "20130501 00:00:04.728" "20130501 00:00:21.540" "20130501 00:00:33.789" "20130501 00:00:37.499" ...
$ V3: num  1.04 1.04 1.04 1.04 1.04 ...
$ V4: num  1.04 1.04 1.04 1.04 1.04 ...
- attr(*, ".internal.selfref")=<externalptr>

I tried using the new(ish) colClasses and skip arguments to avoid reading the first column, which holds the same value in every row and is unnecessary.

But doing:

pp1 <- fread("AUDUSD-2013-05.csv",sep=",",skip=1)

doesn't omit the first column, and using colClasses leads to the following error:

pp1 <- fread("AUDUSD-2013-05.csv",sep=",",colClasses=list(NULL,"character","numeric","numeric"))

Error in fread("AUDUSD-2013-05.csv", sep = ",", colClasses = list(NULL,  : 
 colClasses is type list but has no names

Other attempts include:

pp1 <- fread("AUDUSD-2013-06.csv",sep=",", colClasses=c(V1=NULL,V2="character",V3="numeric",V4="numeric"))
str(pp1)
Classes ‘data.table’ and 'data.frame':  5524877 obs. of  4 variables:
 $ V1: chr  "AUD/USD" "AUD/USD" "AUD/USD" "AUD/USD" ...
 $ V2: chr  "20130603 00:00:00.290" "20130603 00:00:00.291" "20130603 00:00:00.292" "20130603 00:00:03.014" ...
 $ V3: num  0.962 0.962 0.962 0.962 0.962 ...
 $ V4: num  0.962 0.962 0.962 0.962 0.962 ...
 - attr(*, ".internal.selfref")=<externalptr>

i.e. pretty much exactly the same as if I had not used colClasses.
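A likely reason the named-vector attempt changes nothing: base R's c() silently drops NULL elements, so the V1 entry never reaches fread. A minimal sketch in base R (the variable names `cc` and `cc_list` are only illustrative):

```r
# c() discards NULL elements, so the intended "skip V1" entry vanishes
# before fread ever sees the colClasses argument.
cc <- c(V1 = NULL, V2 = "character", V3 = "numeric", V4 = "numeric")

print(names(cc))   # only V2, V3 and V4 remain
print(length(cc))  # 3, not 4

# A list, by contrast, can hold a NULL element explicitly.
cc_list <- list(V1 = NULL, V2 = "character", V3 = "numeric", V4 = "numeric")
print(length(cc_list))  # 4
```

The three entries that survive match the types fread detected anyway, which would explain the identical str() output.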

Are there any suggestions for speeding up the read by omitting the first column?

Also, perhaps a bit much to ask, but is it possible to read a zip file directly rather than unzipping it first and then reading in the CSV?

Oh, and if it wasn't clear, I'm using data.table v1.8.9.

679

asked Jul 10 '13 09:07

h.l.m

1 Answer

I think the argument you're looking for is drop. Try:

require(data.table)  # 1.9.2+
pp <- fread("AUDUSD-2013-05.csv", drop = 1)

Note that you can drop by name or position.

fread("AUDUSD-2013-05.csv", drop = c("columnThree","anotherColumnName"))

fread("AUDUSD-2013-05.csv", drop = 10:15)  # read all columns other than 10:15

And you can select by name or position, too.

fread("AUDUSD-2013-05.csv", select = 10:15)  # read only columns 10:15

fread("AUDUSD-2013-05.csv", select = c("columnA","columnName2"))

These arguments were added in v1.9.2 (released to CRAN in Feb 2014) and are documented in ?fread. You'll need to upgrade to use them.
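As a small self-contained illustration of drop and select on a headerless file, the sketch below writes three rows in the same layout as the question's data to a temporary file. The path `tmp` and the object names are made up for the example, and it assumes data.table 1.9.2 or later is installed.

```r
library(data.table)  # assumes v1.9.2 or later, where drop/select exist

# Hypothetical stand-in for the real file: three headerless rows
# in the same four-column layout as AUDUSD-2013-05.csv.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "AUD/USD,20130501 00:00:04.728,1.03693,1.03721",
  "AUD/USD,20130501 00:00:21.540,1.03695,1.03721",
  "AUD/USD,20130501 00:00:33.789,1.03694,1.03721"
), tmp)

# Drop the constant first column by position.
pp_drop <- fread(tmp, header = FALSE, drop = 1)

# The same request expressed with select: keep columns 2 to 4.
pp_select <- fread(tmp, header = FALSE, select = 2:4)

print(dim(pp_drop))
print(dim(pp_select))

unlink(tmp)  # clean up the temporary file
```

Because the file has no header row, referring to columns by position is the simplest option here.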

180

answered Sep 29 '22 21:09

SCallan
