Can I use the R data.table join capability to select rows and perform some operation?

Tags: join, r, data.table

I'm not sure how to get the row indices resulting from the join of two data.tables.

To set up a simplified example, suppose dt is a data.table with a column 'a' holding a letter of the alphabet and a column 'b' holding some other piece of information.

I want to add a column 'c' and set it to either 'vowel' or 'consonant' depending on column 'a'. I have another data table dtv which serves as a table of vowels. Can I use the join capability of a data.table to efficiently perform this operation?

require(data.table)
dt <- data.table ( a = sample(letters, 25, replace = T), 
                   b = sample(50:100,   25, replace = F))
dtv <- data.table( vowel  = c( 'a','e','i','o','u') )
setkey(dt,a)

The next line of code gives me a data.table of rows with vowels

dt[dtv, nomatch=0]

But how do I grab the row indices so I can tag the rows as vowels or consonants?

dt[, c := 'consonant']
dt[{ `a` found in vowel list }, c := 'vowel']  
# I want to do this where column 'a' is a vowel

asked Dec 07 '15 01:12

Kerry

2 Answers

Since v1.9.4, data.table is optimized to use a binary join for %in% when the data set is already keyed, so @Richard's answer should perform the same on recent data.table versions. (By the way, %in% had a bug when used with datatable.auto.index = TRUE, so please make sure you have data.table v1.9.6+ installed if you are going to use it.)

Below is an illustration of data.table using a binary join when the %in% function is used:

require(data.table)
set.seed(123)
dt <- data.table ( a = sample(letters, 25, replace = T), 
                   b = sample(50:100,   25, replace = F))
dtv <- data.table( vowel  = c( 'a','e','i','o','u') )
setkey(dt, a)

options(datatable.verbose = TRUE)

dt[a %in% dtv$vowel]
# Starting bmerge ...done in 0 secs <~~~ binary join was triggered
#    a  b
# 1: i 87
# 2: o 84
# 3: o 62
# 4: u 77

Either way, you were almost there and you can easily modify c while joining

dt[, c := 'consonant']
dt[dtv, c := 'vowel']

Or if you want to avoid joining unnecessary columns from dtv (in case they are present) you could join only to the first column in dtv

dt[dtv$vowel, c := 'vowel']

Notice that I haven't used .() or J(). By default, data.table performs a binary join instead of row indexing when the i argument is not of type integer or numeric. This matters if, for example, you want to perform a binary join on column b (which is of type integer). Compare

setkey(dt, b)
dt[80:85]
#     a  b <~~~ binary join wasn't triggered, instead an attempt to subset by rows 80:85 was made
# 1: NA NA
# 2: NA NA
# 3: NA NA
# 4: NA NA
# 5: NA NA
# 6: NA NA

And

dt[.(80:85)] # or dt[J(80:85)]
# Starting bmerge ...done in 0 secs <~~~ binary join was triggered
#     a  b
# 1:  x 80
# 2:  x 81
# 3: NA 82
# 4: NA 83
# 5:  o 84
# 6: NA 85
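The same distinction can be checked on a small fixed table, where the result does not depend on the random seed. This is only a sketch: the data and the names by_row and by_key are illustrative assumptions, not part of the original example.

```r
library(data.table)

# Small fixed table keyed on the integer column b (illustrative data)
dt <- data.table(a = c("x", "y", "z"), b = c(80L, 82L, 84L))
setkey(dt, b)

# Integer i is treated as row numbers, regardless of the key
by_row <- dt[2:3]

# Wrapping the values in .() forces a join on the key column b;
# 83 has no match in dt, so its a value comes back as NA
by_key <- dt[.(82:83)]

print(by_row)
print(by_key)
```

Here by_row holds the second and third rows of dt, while by_key holds one row per looked-up value of b.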

Another difference between the two methods is that %in% won't return unmatched instances. Compare

setkey(dt, a)
dt[a %in% dtv$vowel]
# Starting bmerge ...done in 0 secs 
#    a  b
# 1: i 87
# 2: o 84
# 3: o 62
# 4: u 77

And

dt[dtv$vowel]
# Starting bmerge ...done in 0 secs
#    a  b
# 1: a NA <~~~ unmatched values returned
# 2: e NA <~~~ unmatched values returned
# 3: i 87
# 4: o 84
# 5: o 62
# 6: u 77

For this specific case it doesn't matter because the := won't modify unmatched values, but you can use nomatch = 0L in other cases

dt[dtv$vowel, nomatch = 0L]
# Starting bmerge ...done in 0 secs
#    a  b
# 1: i 87
# 2: o 84
# 3: o 62
# 4: u 77

Don't forget to set options(datatable.verbose = FALSE) if you don't want data.table to be so verbose.
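If the row indices themselves are what you need, as the question literally asks, the which argument of [.data.table returns the matching row numbers of dt instead of the rows. A minimal sketch, assuming a small fixed table in place of the random one (the data and the name idx are illustrative):

```r
library(data.table)

# Small fixed table so the result is reproducible (illustrative data)
dt  <- data.table(a = c("b", "a", "k", "o", "z", "o"), b = 1:6)
dtv <- data.table(vowel = c("a", "e", "i", "o", "u"))
setkey(dt, a)   # physically sorts dt by a: a, b, k, o, o, z

# which = TRUE returns row numbers of dt instead of the joined rows;
# nomatch = 0L drops the vowels that do not occur in dt
idx <- dt[dtv, which = TRUE, nomatch = 0L]

dt[, c := "consonant"]
dt[idx, c := "vowel"]   # idx is integer, so this is plain row indexing

print(idx)
print(dt)
```

In practice the direct dt[dtv, c := 'vowel'] form is shorter; the indices are mainly useful when they have to be reused elsewhere.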

answered Oct 05 '22 20:10

David Arenburg

There's really no need to use a merge/join. We can use %in%.

dt[, c := "consonant"]
dt[a %in% dtv$vowel, c := "vowel"]

or the same thing in one line:

dt[, c := "consonant"][a %in% dtv$vowel, c := "vowel"]

Alternatively (and better), we can do both of those steps in a single call with the following.

dt[, c := c("consonant", "vowel")[a %in% dtv$vowel + 1L]]
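The single call relies on operator precedence: %in% binds more tightly than +, so a %in% dtv$vowel + 1L is evaluated as (a %in% dtv$vowel) + 1L, turning FALSE into index 1 and TRUE into index 2. A small sketch on illustrative data, with the parentheses written out and a base R ifelse() form for comparison (the names kinds and c2 are assumptions made for this example):

```r
library(data.table)

dt  <- data.table(a = c("a", "b", "o", "z"))   # illustrative data
dtv <- data.table(vowel = c("a", "e", "i", "o", "u"))

kinds <- c("consonant", "vowel")

# FALSE + 1L is 1 ("consonant"), TRUE + 1L is 2 ("vowel")
dt[, c := kinds[(a %in% dtv$vowel) + 1L]]

# Equivalent result with base R ifelse(), stored in a second column
dt[, c2 := ifelse(a %in% dtv$vowel, "vowel", "consonant")]

print(dt)
```

Both columns should hold the same labels; the indexing form avoids the overhead of ifelse() on large tables.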

answered Oct 05 '22 21:10

Rich Scriven
