
"Select" argument in R's data.table::fread

I'm trying to read selected columns from a CSV file using fread(). I've found that I can select with a vector of column numbers, but not with column names. Regarding the "select" argument, the documentation just says "Vector of column names or numbers to keep, drop the rest." It also provides the example

fread(data, select=c("A","D"))

So why does my code throw a "subscript out of bounds" error? Here's the gist of my code, hopefully generalizable to other users:

test = data.frame(matrix(c(1:50),ncol = 5))
names(test) = c("A", "B", "C", "D", "E")
write.table(test, file = "/Users/me/Desktop/test.txt", sep = ",")
fread("/Users/me/Desktop/test.txt", sep = ",", header = TRUE, select = c("A","B"))

Giving

Error in ans[[1]] : subscript out of bounds

However, selecting by column number runs without error, although it returns the row numbers as a column alongside the first column:

fread("/Users/me/Desktop/test.txt", sep = ",", header = TRUE, select = c(1,2))
    1  1
1:  2  2
2:  3  3
3:  4  4
4:  5  5
5:  6  6
6:  7  7
7:  8  8
8:  9  9
9: 10 10

And read.table() reads the whole data set without any trouble:

read.table("/Users/me/Desktop/test.txt", sep = ",", header = TRUE)
    A  B  C  D  E
1   1 11 21 31 41
2   2 12 22 32 42
3   3 13 23 33 43
4   4 14 24 34 44
5   5 15 25 35 45
6   6 16 26 36 46
7   7 17 27 37 47
8   8 18 28 38 48
9   9 19 29 39 49
10 10 20 30 40 50

Something is obviously going on with the row names and the header, but I'm not sure how to resolve it. I've tried with and without headers. The data set I'm using (not the one in this example) already has row names, so re-writing it with row.names = FALSE isn't an option.

Asked by Nancy, May 24 '26 12:05


2 Answers

This answer assumes that your original data was not produced via write.table(), and that you were given a file and are attempting to read it via fread() (as is also stated in the question).


I believe you are having this problem because of the row names in the file. I have yet to come up with a direct way to apply fread() to the data, but I think this work-around will be safe and won't cost you much in terms of efficiency. Here are the steps ...

1) Read the first line of the file with scan() and add an extra "" element at the beginning. This is to offset the header row to account for the row names in the file.

nm <- c("", scan("test.txt", "", nlines = 1, sep = ","))

2) Define the columns you want and find them in nm. Instead of 1 and 4, the offset now gives us 2 and 5 and accounts for the row names.

sel <- nm %in% c("A", "D")

3) Read the file, starting at the second line (i.e. without the header), and use sel in the selection argument.

library(data.table)
dt <- fread("test.txt", skip = 1, select = which(sel))

4) Now that we've read the data that we want, we can reset the column names.

setnames(dt, nm[sel])[]
#      A  D
#  1:  1 31
#  2:  2 32
#  3:  3 33
#  4:  4 34
#  5:  5 35
#  6:  6 36
#  7:  7 37
#  8:  8 38
#  9:  9 39
# 10: 10 40

If the example you give is a good representation of the actual data, I don't see any reason why this wouldn't work. Hope it works for you.
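The four steps above can also be wrapped into one small helper. This is only a sketch: fread_named_cols is a hypothetical name, not part of data.table, and it assumes a delimited file whose header line has one field fewer than the data lines because the row names column is unlabelled. The temporary file and the example data are there only to make the snippet self-contained.

```r
library(data.table)

# Hypothetical helper (not part of data.table): read selected columns by name
# from a file whose header has no field for the row names.
fread_named_cols <- function(path, cols, sep = ",") {
  # Header names, shifted right by one so they line up with the data columns
  nm <- c("", scan(path, what = "", nlines = 1, sep = sep, quiet = TRUE))

  # Fail early if a requested name is not in the header
  missing_cols <- setdiff(cols, nm)
  if (length(missing_cols) > 0) {
    stop("columns not found in header: ", paste(missing_cols, collapse = ", "))
  }

  # Skip the header line and select by position instead of by name
  idx <- which(nm %in% cols)
  dt <- fread(path, sep = sep, skip = 1, header = FALSE, select = idx)

  # Restore the real column names (in file order)
  setnames(dt, nm[idx])
  dt[]
}

# Usage with the example data from the question, written to a temporary file
test <- data.frame(matrix(1:50, ncol = 5))
names(test) <- c("A", "B", "C", "D", "E")
tmp <- tempfile(fileext = ".txt")
write.table(test, file = tmp, sep = ",")
res <- fread_named_cols(tmp, c("A", "D"))
```

Note that the columns come back in file order, not in the order given in cols, because the positions are found with which().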

Answered by Rich Scriven, May 27 '26 04:05


This example shows why you always need to check the format of the file you are producing carefully. There are some differences between read.table and fread; here the issue comes from the row names and how write.table writes them. As always, reading the documentation (?write.table) carefully helps a lot.

By default, write.table writes the row names. Here is how it writes them:

filename<-"somefilename.txt"
write.table(test, file = filename, sep = ",")
readLines(filename,2)
#[1] "\"A\",\"B\",\"C\",\"D\",\"E\""
#[2] "\"1\",1,11,21,31,41"

I read the first two lines of the produced file. Reading them carefully, you can see that this is not a "standard" CSV. Why? Because the header has 4 commas while the data lines have 5. For a standard CSV, the header needs an empty field, followed by a comma, before the first column name. This is achieved by adding col.names=NA in write.table:

write.table(test, file = filename, sep = ",", col.names=NA)
#now works
fread(filename, sep = ",", header = TRUE, select = c("A","B"))

You can check that the header line now begins with an empty (quoted) field followed by a comma. Alternatively, you can avoid writing the row names by setting row.names=FALSE in write.table, but this is not always possible, since sometimes they are meaningful.
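One way to confirm the field-count mismatch described above, without reading the lines by eye, is base R's count.fields(). The following is a minimal sketch using the example data from the question; the temporary files and the variable names (bad, good, bad_counts, good_counts) are illustrative only.

```r
library(data.table)

test <- data.frame(matrix(1:50, ncol = 5))
names(test) <- c("A", "B", "C", "D", "E")

# Default write.table: the header has one field fewer than the data lines
bad <- tempfile(fileext = ".txt")
write.table(test, file = bad, sep = ",")
bad_counts <- count.fields(bad, sep = ",")

# col.names = NA adds an empty header field above the row names
good <- tempfile(fileext = ".txt")
write.table(test, file = good, sep = ",", col.names = NA)
good_counts <- count.fields(good, sep = ",")

# TRUE only when every line has the same number of fields
length(unique(bad_counts)) == 1
length(unique(good_counts)) == 1

# With a consistent field count, selecting by name is possible
dt <- fread(good, sep = ",", header = TRUE, select = c("A", "B"))
```

If the counts for a file you were given differ between the first line and the rest, the header is probably missing a field for the row names.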

Answered by nicola, May 27 '26 03:05


