Which data.table syntax for left join (one column) to prefer

Tags: r, data.table

I prefer the "update join" idiom for efficiency and maintainability:**

DT[WHERE, v := FROM[.SD, on=, x.v]]

It's an extension of what is shown in vignette("datatable-reference-semantics") under "Update some rows of columns by reference - sub-assign by reference". Once there is a vignette available on joins, that should also be a good reference.

This is efficient since it only uses the rows selected by WHERE and modifies or adds the column in-place, instead of making a new table like the more concise left join FROM[DT, on=].
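As a rough sketch of that difference (the tables and column names below are made up for illustration, not taken from the question), the left join returns a new table and leaves DT alone, while the update join adds the column to DT itself:

```r
library(data.table)

# Hypothetical tables, invented for this illustration
DT   <- data.table(id = c("a", "b", "c"), amount = c(1, 2, 3))
FROM <- data.table(id = c("b", "c"), v = c("x", "y"))

# Left join: builds and returns a new table; DT itself is unchanged
res <- FROM[DT, on = .(id)]
names(DT)

# Update join: adds v to DT by reference, with no reassignment of DT
DT[, v := FROM[.SD, on = .(id), x.v]]
names(DT)
```

With no WHERE condition, .SD is simply all of DT, so every row is looked up in FROM.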

It makes my code more readable since I can easily see that the point of the join is to add column v; and I don't have to think through "left"/"right" jargon from SQL or whether the number of rows is preserved after the join.

It is useful for code maintenance since if I later want to find out how DT got a column named v, I can search my code for v :=, while FROM[DT, on=] obscures which new columns are being added. Also, it allows the WHERE condition, while the left join does not. This may be useful, for example, if using FROM to "fill" NAs in an existing column v.
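A minimal sketch of the NA-filling case, assuming small made-up tables (DT, FROM and their values are hypothetical):

```r
library(data.table)

# Hypothetical tables, invented for this illustration
DT   <- data.table(id = c("a", "b", "c", "d"), v = c(1, NA, 3, NA))
FROM <- data.table(id = c("b", "c", "d"), v = c(20, 30, 40))

# WHERE is is.na(v): only the rows with a missing v are passed to the
# join as .SD, and only those rows of v are overwritten, by reference
DT[is.na(v), v := FROM[.SD, on = .(id), x.v]]

DT
```

The existing value for id "c" is kept even though FROM has a row for it, because that row of DT is never selected by the WHERE condition.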

Compared with the other update join approach DT[FROM, on=, v := i.v], I can think of two advantages. First is the option of using the WHERE clause, and second is transparency through warnings when there are problems with the join, like duplicate matches in FROM conditional on the on= rules. Here's an illustration extending the OP's example:

library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B2 <- data.table(
  id = c("c", "d", "e", "e"), 
  ord = 1:4, 
  comment = c("big", "slow", "nice", "nooice")
)

# left-joiny update
A[B2, on=.(id), comment := i.comment, verbose=TRUE]
# Calculated ad hoc index in 0.000s elapsed (0.000s cpu) 
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu) 
# Detected that j uses these columns: comment,i.comment 
# Assigning to 4 row subset of 10 rows

# my preferred update
A[, comment2 := B2[A, on=.(id), x.comment]]
# Warning message:
# In `[.data.table`(A, , `:=`(comment2, B2[A, on = .(id), x.comment])) :
#   Supplied 11 items to be assigned to 10 items of column 'comment2' (1 unused)

# print the result to compare the two new columns
A[]
    id     amount comment comment2
 1:  a 0.20000990    <NA>     <NA>
 2:  b 1.42146573    <NA>     <NA>
 3:  c 0.73047544     big      big
 4:  d 0.04128676    slow     slow
 5:  e 0.82195377  nooice     nice
 6:  f 0.39013550    <NA>   nooice
 7:  g 0.27019768    <NA>     <NA>
 8:  h 0.36017876    <NA>     <NA>
 9:  i 1.81865721    <NA>     <NA>
10:  j 4.86711754    <NA>     <NA>

In the left-join-flavored update, you silently get the last matching value of comment ("nooice") even though there are two matches for id == "e". In my preferred update, you get a helpful warning message instead (upgraded to an error in a future release), and the shifted values of comment2 on rows 5 and 6 show what went wrong. Even turning on verbose=TRUE with the left-joiny approach is not informative -- it says there are four rows being updated but doesn't say that one row is being updated twice.
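If I want to find such problems before joining, one option (a sketch, not something the join does for you) is to count rows per join key in the lookup table. This reuses B2 from the example above; the name dups is just for illustration:

```r
library(data.table)

B2 <- data.table(
  id = c("c", "d", "e", "e"),
  ord = 1:4,
  comment = c("big", "slow", "nice", "nooice")
)

# Count rows per join key; any id with N > 1 would match a single
# row of A more than once under on = .(id)
dups <- B2[, .N, by = id][N > 1]
dups
```

An empty result here would mean each row of A can match at most one row of B2.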

I find that this approach works best when my data is arranged into a set of tidy/relational tables. A good reference on that is Hadley Wickham's paper on tidy data.

** In this idiom, the on= part should be filled in with the join column names and rules, like on=.(id) or on=.(from_date >= dt_date). Further join rules can be passed with roll=, mult= and nomatch=. See ?data.table for details. Thanks to @RYoda for noting this point in the comments.
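For instance, here is a sketch of passing mult= in this idiom, using a cut-down version of the tables above (the amount and ord columns are dropped for brevity). Whether taking the first match is the right rule depends on the data; it is only one way to handle duplicate matches:

```r
library(data.table)

A <- data.table(id = letters[1:10])
B2 <- data.table(
  id = c("c", "d", "e", "e"),
  comment = c("big", "slow", "nice", "nooice")
)

# mult = "first" keeps only the first matching row of B2 for each row
# of A, so the inner join returns exactly nrow(A) values
A[, comment2 := B2[A, on = .(id), mult = "first", x.comment]]

A[id %in% c("d", "e", "f")]
```

Since the inner join now returns one value per row of A, the assignment lines up row by row and no length mismatch is reported.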

Here is a more complicated example from Matt Dowle explaining roll=: Find time to nearest occurrence of particular value for each row

Another related example: Left join using data.table

answered Oct 11 '22 23:10

Frank
