I want to extract the 4th, 5th, and 6th columns from a data table named dt.
The following method works:
dt[, c(4,5,6)]
But the following doesn't:
a = c(4,5,6)
dt[, a]
In fact, the second method gives me a result of:
4 5 6
Can someone tell me why this is happening? The two methods look equivalent to me.
Using base R df[] notation or the select() function from the dplyr package, you can select a single column or multiple columns by index position (column number) from an R data frame.
To pick out single or multiple columns, use the select() function. select() expects a data frame as its first input (an 'argument', in R terminology), followed by the names of the columns you want to extract, separated by commas.
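For comparison with the data.table case in the question, here is a minimal sketch using only base R (so it runs without dplyr). The data frame df is hypothetical and simply mirrors the column layout of dt; with a plain data frame, literal indices and a stored index vector really are equivalent.

```r
# Hypothetical data frame mirroring the columns used in the question
df <- data.frame(col1 = 1:4, col2 = 2:5, col3 = 3:6,
                 col4 = 4:7, col5 = 5:8, col6 = 6:9)
a <- c(4, 5, 6)

# With a data frame, both forms select columns by position
literal <- df[, c(4, 5, 6)]
stored <- df[, a]

print(identical(literal, stored))
print(names(stored))
```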
We can use the double-dot prefix (..) before the object 'a' to extract the columns:
dt[, ..a]
#    col4 col5 col6
# 1:    4    5    6
# 2:    5    6    7
# 3:    6    7    8
# 4:    7    8    9
Another option is to use with = FALSE:
dt[, a, with = FALSE]
The example data used above:
library(data.table)
dt <- data.table(col1 = 1:4, col2 = 2:5, col3 = 3:6, col4 = 4:7, col5 = 5:8, col6 = 6:9)
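To try all three forms side by side, here is a minimal sketch, assuming data.table v1.9.8 or later is installed. The names as_vector, by_prefix, and by_with are only illustrative.

```r
library(data.table)

# Sample table from the answer
dt <- data.table(col1 = 1:4, col2 = 2:5, col3 = 3:6,
                 col4 = 4:7, col5 = 5:8, col6 = 6:9)
a <- c(4, 5, 6)

# j is evaluated as an expression, so this just returns the vector a
as_vector <- dt[, a]

# The .. prefix tells data.table to look a up in the calling scope
by_prefix <- dt[, ..a]

# with = FALSE makes j be treated as column positions (or names)
by_with <- dt[, a, with = FALSE]

print(as_vector)
print(names(by_prefix))
print(isTRUE(all.equal(by_prefix, by_with)))
```

Both ..a and with = FALSE should give the same three-column table; which one to use is mostly a matter of taste.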
@akrun's answer gives you the correct alternative. If you want to know why you need it, here's the more detailed explanation:
The way the data.table subset operation works, in most cases the j expression in dt[i, j, by] with no i or by is evaluated in the frame of the data table and returned as is, whether or not it has anything to do with the data table outside the brackets. In versions earlier than 1.9.8, your first code snippet, dt[, c(4, 5, 6)], evaluated to the numeric vector c(4, 5, 6), not the 4th, 5th, and 6th columns. This changed as of data.table v1.9.8 (released November 2016; see the potentially breaking changes listed for v1.9.8 in the release notes), because people, unsurprisingly, expected dt[, c(4, 5, 6)] to give the 4th, 5th, and 6th columns. Now, if the j expression consists only of quoted column names or column numbers, with is automatically set to FALSE. This effectively produces behavior similar to subsetting a data frame (not exactly the same, but similar).
So your second code snippet (where dt[, a] evaluates to a, rather than using a to subset the columns) is actually the default, and the first is a special case.
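A short sketch of that default, assuming data.table v1.9.8 or later (the sample dt is rebuilt here so the snippet runs on its own, and the result names are only illustrative): because j is evaluated inside the table, column names behave like variables, and anything that is not a column is looked up outside the table.

```r
library(data.table)

dt <- data.table(col1 = 1:4, col2 = 2:5, col3 = 3:6,
                 col4 = 4:7, col5 = 5:8, col6 = 6:9)

# Column names are visible as variables inside j
col_sum <- dt[, col4 + col5]

# a is not a column, so it is found in the calling environment
# and returned as is
a <- c(4, 5, 6)
plain <- dt[, a]

# Special case: a literal numeric vector in j selects columns by position
literal <- dt[, c(4, 5, 6)]

print(col_sum)
print(plain)
print(names(literal))
```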
To illustrate the odd but standard behavior here, try:
dt[, diag(5)]
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    0    0    0    0
# [2,]    0    1    0    0    0
# [3,]    0    0    1    0    0
# [4,]    0    0    0    1    0
# [5,]    0    0    0    0    1
No matter what your dt is, so long as it is a data.table, it will evaluate to the 5*5 identity matrix.
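One way to check this for yourself is to run the same j expression against two unrelated tables. This is only a sketch; dt1 and dt2 are made-up tables, not anything from the question.

```r
library(data.table)

# Two hypothetical tables with nothing in common
dt1 <- data.table(x = 1:3)
dt2 <- data.table(y = letters, z = seq_along(letters))

# The j expression never touches a column, so the table is irrelevant
m1 <- dt1[, diag(5)]
m2 <- dt2[, diag(5)]

print(dim(m1))
print(all(m1 == m2))
```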