I'm not sure how to get the row indices resulting from the join of two data.tables.
To set up a simplified example, suppose dt is a data.table with a column 'a' holding a letter of the alphabet and a column 'b' holding some other piece of information.
I want to add a column 'c' and set it to either 'vowel' or 'consonant' depending on column 'a'. I have another data table dtv which serves as a table of vowels. Can I use the join capability of a data.table to efficiently perform this operation?
require(data.table)
dt <- data.table(a = sample(letters, 25, replace = TRUE),
                 b = sample(50:100, 25, replace = FALSE))
dtv <- data.table( vowel = c( 'a','e','i','o','u') )
setkey(dt,a)
The next line of code gives me a data.table of rows with vowels
dt[dtv, nomatch=0]
But how do I grab the row indices so I can tag the rows as vowels or consonants?
dt[, c := 'consonant']
dt[{ `a` found in vowel list }, c := 'vowel']
# I want to do this where column 'a' is a vowel
Since v1.9.4, data.table is optimized to use a binary join for %in% when the data set is already keyed, so @Richard's answer should have the same performance on recent data.table versions. (By the way, %in% had a bug when used with datatable.auto.index = TRUE, so please make sure you have data.table v1.9.6+ installed if you are going to use it.)
Below is an illustration of data.table using a binary join when the %in% function is used:
require(data.table)
set.seed(123)
dt <- data.table(a = sample(letters, 25, replace = TRUE),
                 b = sample(50:100, 25, replace = FALSE))
dtv <- data.table( vowel = c( 'a','e','i','o','u') )
setkey(dt, a)
options(datatable.verbose = TRUE)
dt[a %in% dtv$vowel]
# Starting bmerge ...done in 0 secs <~~~ binary join was triggered
# a b
# 1: i 87
# 2: o 84
# 3: o 62
# 4: u 77
Either way, you were almost there, and you can easily modify c while joining:
dt[, c := 'consonant']
dt[dtv, c := 'vowel']
Or, if you want to avoid joining unnecessary columns from dtv (in case they are present), you could join only on the first column of dtv:
dt[dtv$vowel, c := 'vowel']
Notice that I haven't used .() or J(). By default, data.table performs a binary join instead of row indexing when i is not of type integer or numeric. This matters if, for example, you want to perform a binary join over column b (which is of type integer). Compare:
setkey(dt, b)
dt[80:85]
# a b <~~~ binary join wasn't triggered, instead an attempt to subset by rows 80:85 was made
# 1: NA NA
# 2: NA NA
# 3: NA NA
# 4: NA NA
# 5: NA NA
# 6: NA NA
And
dt[.(80:85)] # or dt[J(80:85)]
# Starting bmerge ...done in 0 secs <~~~ binary join was triggered
# a b
# 1: x 80
# 2: x 81
# 3: NA 82
# 4: NA 83
# 5: o 84
# 6: NA 85
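The same contrast can be reproduced on a tiny table where the result is easy to check by eye. This is only a sketch: the table x and its columns id and val are made up for illustration and are not part of the original example.

```r
library(data.table)

# Hypothetical three-row table keyed on an integer column
x <- data.table(id = c(10L, 20L, 30L), val = c("p", "q", "r"))
setkey(x, id)

by_row  <- x[3L]      # bare integer in i: plain row subsetting, returns row 3 (id 30)
by_key  <- x[.(30L)]  # wrapped in .(): join on the key, returns the row whose id is 30
by_miss <- x[.(3L)]   # no id equals 3, so the join returns one row with val = NA
```

Here by_row and by_key happen to return the same row, while by_miss shows what the join does when the key value is absent.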
Another difference between the two methods is that %in% won't return unmatched instances. Compare:
setkey(dt, a)
dt[a %in% dtv$vowel]
# Starting bmerge ...done in 0 secs
# a b
# 1: i 87
# 2: o 84
# 3: o 62
# 4: u 77
And
dt[dtv$vowel]
# Starting bmerge ...done in 0 secs
# a b
# 1: a NA <~~~ unmatched values returned
# 2: e NA <~~~ unmatched values returned
# 3: i 87
# 4: o 84
# 5: o 62
# 6: u 77
For this specific case it doesn't matter, because := won't modify unmatched values, but you can use nomatch = 0L in other cases:
dt[dtv$vowel, nomatch = 0L]
# Starting bmerge ...done in 0 secs
# a b
# 1: i 87
# 2: o 84
# 3: o 62
# 4: u 77
Don't forget to set options(datatable.verbose = FALSE) if you don't want data.table to be so verbose.
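If the row indices themselves are what you are after, as the question literally asks, one possible approach is the which = TRUE argument of the data.table join, which returns the matching row numbers of dt instead of the joined rows. The sketch below rebuilds the example data; the variable name idx is introduced here only for illustration.

```r
library(data.table)

set.seed(123)
dt  <- data.table(a = sample(letters, 25, replace = TRUE),
                  b = sample(50:100, 25, replace = FALSE))
dtv <- data.table(vowel = c("a", "e", "i", "o", "u"))
setkey(dt, a)

# Row numbers of dt whose key matches a vowel;
# nomatch = 0L drops vowels that do not occur in dt at all
idx <- dt[dtv, which = TRUE, nomatch = 0L]

# The integer vector can then be used directly in i
dt[, c := "consonant"]
dt[idx, c := "vowel"]
```

In practice the update-on-join forms shown earlier make the intermediate idx unnecessary; it is mainly useful when the indices are needed for something else as well.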
There's really no need to use a merge/join. We can use %in%.
dt[, c := "consonant"]
dt[a %in% dtv$vowel, c := "vowel"]
Or the same thing in one line:
dt[, c := "consonant"][a %in% dtv$vowel, c := "vowel"]
Alternatively (and better), we can do both of those steps in a single call with the following.
dt[, c := c("consonant", "vowel")[a %in% dtv$vowel + 1L]]
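The single-call version works because %in% binds more tightly than +, so the logical result is converted to 1 or 2 and used as an index into c("consonant", "vowel"). The sketch below spells that out on a small made-up table and compares it with fifelse(), which is an alternative I am suggesting rather than something from the answer above; it assumes data.table 1.12.4 or later. The column names c1 and c2 and the vowels vector are illustrative only.

```r
library(data.table)

# Small hypothetical table so the result is easy to verify by eye
dt <- data.table(a = c("a", "b", "e", "z"), b = 1:4)
vowels <- c("a", "e", "i", "o", "u")

# Index trick: FALSE + 1L = 1 picks "consonant", TRUE + 1L = 2 picks "vowel"
# (parentheses added for readability; precedence gives the same result without them)
dt[, c1 := c("consonant", "vowel")[(a %in% vowels) + 1L]]

# Alternative sketch using fifelse(), data.table's fast ifelse
dt[, c2 := fifelse(a %in% vowels, "vowel", "consonant")]
```

Both columns should end up identical; which form to prefer is mostly a matter of readability.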