
Can I use the R data.table join capability to select rows and perform some operation?

Tags: join, r, data.table

I'm not sure how to get the row indices resulting from the join of two data.tables.

To set up a simplified example, suppose dt is a data.table with a column 'a' holding a letter of the alphabet and a column 'b' holding some other piece of information.

I want to add a column 'c' and set it to either 'vowel' or 'consonant' depending on column 'a'. I have another data table dtv which serves as a table of vowels. Can I use the join capability of a data.table to efficiently perform this operation?

require(data.table)
dt <- data.table ( a = sample(letters, 25, replace = T), 
                   b = sample(50:100,   25, replace = F))
dtv <- data.table( vowel  = c( 'a','e','i','o','u') )
setkey(dt,a)

The next line of code gives me a data.table of rows with vowels

dt[dtv, nomatch=0]  

But how do I grab the row indices so I can tag the rows as vowels or consonants?

dt[, c := 'consonant']
dt[{ `a` found in vowel list }, c := 'vowel']  
# I want to do this where column 'a' is a vowel
Kerry asked Dec 07 '15 01:12



2 Answers

Since v1.9.4, data.table is optimized to use a binary join for %in% when the data set is already keyed. So @Richard's answer should have the same performance in the newest data.table versions. (By the way, %in% had a bug when used while datatable.auto.index = TRUE, so please make sure you have data.table v1.9.6+ installed if you are going to use it.)

Below is an illustration of data.table using a binary join when the %in% function is used:

require(data.table)
set.seed(123)
dt <- data.table ( a = sample(letters, 25, replace = T), 
                   b = sample(50:100,   25, replace = F))
dtv <- data.table( vowel  = c( 'a','e','i','o','u') )
setkey(dt, a)

options(datatable.verbose = TRUE)

dt[a %in% dtv$vowel]
# Starting bmerge ...done in 0 secs <~~~ binary join was triggered
#    a  b
# 1: i 87
# 2: o 84
# 3: o 62
# 4: u 77

Either way, you were almost there: you can easily modify c while joining.

dt[, c := 'consonant']
dt[dtv, c := 'vowel']  

Or, if you want to avoid joining unnecessary columns from dtv (in case any are present), you could join on only the first column of dtv:

dt[dtv$vowel, c := 'vowel']
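
The same update-while-joining idea can also be written without keying dt first, assuming data.table v1.9.6+ where the on= argument is available. This is only a minimal sketch on a small hand-made table (the six rows below are illustrative, not the random sample from the question):

```r
library(data.table)

# Illustrative data: three vowels and three consonants, deliberately unkeyed
dt  <- data.table(a = c("a", "b", "e", "k", "o", "z"), b = 1:6)
dtv <- data.table(vowel = c("a", "e", "i", "o", "u"))

# Default every row to "consonant", then overwrite only the rows matched by the join.
# on = c(a = "vowel") matches dt$a against dtv$vowel, so setkey() is not needed.
dt[, c := "consonant"]
dt[dtv, on = c(a = "vowel"), c := "vowel"]

print(dt)
```

Because no key is set, dt keeps its original row order, and the vowels in dtv that never occur in dt are simply ignored by :=.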

Notice that I haven't used .() or J(). By default, data.table performs a binary join instead of row indexing when i is not of type integer or numeric. This matters if, for example, you want to perform a binary join over column b (which is of type integer). Compare

setkey(dt, b)
dt[80:85]
#     a  b <~~~ binary join wasn't triggered; instead, an attempt to subset rows 80:85 was made
# 1: NA NA
# 2: NA NA
# 3: NA NA
# 4: NA NA
# 5: NA NA
# 6: NA NA

And

dt[.(80:85)] # or dt[J(80:85)]
# Starting bmerge ...done in 0 secs <~~~ binary join was triggered
#     a  b
# 1:  x 80
# 2:  x 81
# 3: NA 82
# 4: NA 83
# 5:  o 84
# 6: NA 85
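
The difference between row indexing and a key lookup can be reproduced deterministically on a tiny table. This is a sketch with made-up values (the three rows and the names by_row and by_key are assumptions for illustration only):

```r
library(data.table)

# Illustrative data with an integer key column
dt <- data.table(a = c("x", "y", "z"), b = c(10L, 20L, 30L))
setkey(dt, b)

by_row <- dt[2L]      # bare integer in i: taken as row number 2
by_key <- dt[.(30L)]  # wrapped in .(): joined against the key, i.e. b == 30

print(by_row)
print(by_key)
```

Here by_row holds the second row of the sorted table, while by_key holds the row whose key value is 30.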

Another difference between the two methods is that %in% won't return unmatched values. Compare

setkey(dt, a)
dt[a %in% dtv$vowel]
# Starting bmerge ...done in 0 secs 
#    a  b
# 1: i 87
# 2: o 84
# 3: o 62
# 4: u 77

And

dt[dtv$vowel]
# Starting bmerge ...done in 0 secs
#    a  b
# 1: a NA <~~~ unmatched values returned
# 2: e NA <~~~ unmatched values returned
# 3: i 87
# 4: o 84
# 5: o 62
# 6: u 77

For this specific case it doesn't matter, because := won't modify unmatched values, but you can use nomatch = 0L in other cases:

dt[dtv$vowel, nomatch = 0L]
# Starting bmerge ...done in 0 secs
#    a  b
# 1: i 87
# 2: o 84
# 3: o 62
# 4: u 77
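
If the row indices themselves are wanted, as the question literally asks, the which argument of `[.data.table` should return them instead of the joined rows. A minimal sketch, again on small illustrative data rather than the random sample above (idx is a name introduced here):

```r
library(data.table)

# Illustrative data, keyed on the letter column
dt  <- data.table(a = c("a", "b", "e", "k", "o", "z"), b = 1:6)
dtv <- data.table(vowel = c("a", "e", "i", "o", "u"))
setkey(dt, a)

# which = TRUE returns the row numbers of dt matched by the join;
# nomatch = 0L drops vowels that do not occur in dt ("i" and "u" here)
idx <- dt[dtv, which = TRUE, nomatch = 0L]

# Use the indices to tag the rows
dt[, c := "consonant"]
dt[idx, c := "vowel"]

print(idx)
print(dt)
```

In practice the direct dt[dtv, c := 'vowel'] form avoids materializing the indices at all, so this is mainly useful when the indices are needed for something else.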

Don't forget to set options(datatable.verbose = FALSE) if you don't want data.table to be so verbose.

David Arenburg answered Oct 05 '22 20:10


There's really no need to use a merge/join. We can use %in%.

dt[, c := "consonant"]
dt[a %in% dtv$vowel, c := "vowel"]

or the same thing in one line:

dt[, c := "consonant"][a %in% dtv$vowel, c := "vowel"]

Alternatively (and better), we can do both of those steps in a single call with the following.

dt[, c := c("consonant", "vowel")[a %in% dtv$vowel + 1L]]
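
To see why the single-call version works, here is the indexing trick taken apart in base R. It is a small sketch with made-up input (the names a, vowels, is_vowel, pos and labels are illustrative). Note that %in% binds more tightly than +, so a %in% vowels + 1L is evaluated as (a %in% vowels) + 1L.

```r
a      <- c("a", "b", "e", "k")
vowels <- c("a", "e", "i", "o", "u")

is_vowel <- a %in% vowels                   # logical vector, TRUE where a is a vowel
pos      <- is_vowel + 1L                   # FALSE becomes 1, TRUE becomes 2
labels   <- c("consonant", "vowel")[pos]    # pick the label at each position

print(labels)
```

Position 1 of the lookup vector is the "not a vowel" label and position 2 is the "vowel" label, which is why the order of c("consonant", "vowel") matters.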
Rich Scriven answered Oct 05 '22 21:10