How should I start thinking about which syntax I prefer?
My criteria are efficiency (this is number one) and also readability/maintainability.
This
A <- B[A, on = .(id)] # very concise!
Or that
A[B, on = .(id), comment := i.comment]
Or even (as PoGibas suggests):
A <- merge(A, B, all.x = TRUE)
For completeness, a more basic way is to use match() (or chmatch() for character vectors):
A[, comment := B[chmatch(A[["id"]], id), comment]]
Example data:
library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B <- data.table(id = c("c", "d", "e"), comment = c("big", "slow", "nice"))
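To compare the candidates on the example data, here is a minimal sketch that runs all four and checks that they produce the same comment values. The set.seed() call and the res_* names are my additions for illustration; copy() is used only so that the original A is left untouched between variants.

```r
library(data.table)

set.seed(1)  # assumption: added only to make the sketch reproducible
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B <- data.table(id = c("c", "d", "e"), comment = c("big", "slow", "nice"))

# 1. Left join that builds a new table; rows follow A, columns start with B's
res_join <- B[A, on = .(id)]

# 2. Update join: adds `comment` by reference (here to a copy of A)
res_update <- copy(A)[B, on = .(id), comment := i.comment]

# 3. merge(): builds a new table, sorted by the join column by default
res_merge <- merge(A, B, by = "id", all.x = TRUE)

# 4. chmatch(): look up each A$id in B$id, then index B's comment column
res_match <- copy(A)[, comment := B[chmatch(A[["id"]], id), comment]]

# All four agree on the comment column for this data
all_same <- identical(res_join$comment, res_update$comment) &&
  identical(res_update$comment, res_merge$comment) &&
  identical(res_merge$comment, res_match$comment)
```

Note that the variants differ in column order and in whether a new table is created, so they are not interchangeable in every pipeline even when the joined values agree.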
If you want to do a left join with merge(), you can use all.x = TRUE. If you want to do a full outer join, you can use all = TRUE.
I prefer the "update join" idiom for efficiency and maintainability:**
DT[WHERE, v := FROM[.SD, on=, x.v]]
It's an extension of what is shown in vignette("datatable-reference-semantics") under "Update some rows of columns by reference - sub-assign by reference". Once there is a vignette available on joins, that should also be a good reference.
This is efficient since it only uses the rows selected by WHERE and modifies or adds the column in-place, instead of making a new table like the more concise left join FROM[DT, on=].
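One way to see the in-place behaviour is to compare object addresses before and after the update. This is only a sketch: the tables DT and FROM and their columns are hypothetical, and address() is the helper exported by data.table.

```r
library(data.table)

# Hypothetical tables for illustration
DT   <- data.table(id = letters[1:5], x = 1:5)
FROM <- data.table(id = c("b", "d"), v = c("bee", "dee"))

addr_before <- address(DT)

# Update join: column v is added to DT by reference
DT[, v := FROM[.SD, on = .(id), x.v]]
same_object <- identical(addr_before, address(DT))

# Plain left join: the result is a separate table
joined <- FROM[DT[, .(id, x)], on = .(id)]
new_object <- !identical(address(joined), address(DT))
```

Here same_object indicates that DT was modified without being copied, while joined is a distinct object that would still have to be assigned back.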
It makes my code more readable since I can easily see that the point of the join is to add column v; and I don't have to think through "left"/"right" jargon from SQL or whether the number of rows is preserved after the join.
It is useful for code maintenance since if I later want to find out how DT got a column named v, I can search my code for v :=, while FROM[DT, on=] obscures which new columns are being added. Also, it allows the WHERE condition, while the left join does not. This may be useful, for example, if using FROM to "fill" NAs in an existing column v.
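The NA-filling case could look like the sketch below. The tables and values are made up for illustration; the WHERE slot is is.na(v), so only the rows with a missing v are looked up in FROM and the existing values are left alone.

```r
library(data.table)

# Hypothetical tables: DT has gaps in v, FROM holds candidate values
DT   <- data.table(id = c("a", "b", "c", "d"),
                   v  = c("keep", NA, "keep too", NA))
FROM <- data.table(id = c("a", "b", "d"),
                   v  = c("ignored", "filled b", "filled d"))

# Only rows where v is NA are joined; x.v is FROM's v, not DT's
DT[is.na(v), v := FROM[.SD, on = .(id), x.v]]
```

The row for id "a" keeps its original value even though FROM has an entry for it, because that row never enters the join.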
Compared with the other update join approach DT[FROM, on=, v := i.v], I can think of two advantages. First is the option of using the WHERE clause, and second is transparency through warnings when there are problems with the join, like duplicate matches in FROM conditional on the on= rules. Here's an illustration extending the OP's example:
library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B2 <- data.table(
id = c("c", "d", "e", "e"),
ord = 1:4,
comment = c("big", "slow", "nice", "nooice")
)
# left-joiny update
A[B2, on=.(id), comment := i.comment, verbose=TRUE]
# Calculated ad hoc index in 0.000s elapsed (0.000s cpu)
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu)
# Detected that j uses these columns: comment,i.comment
# Assigning to 4 row subset of 10 rows
# my preferred update
A[, comment2 := B2[A, on=.(id), x.comment]]
# Warning message:
# In `[.data.table`(A, , `:=`(comment2, B2[A, on = .(id), x.comment])) :
# Supplied 11 items to be assigned to 10 items of column 'comment2' (1 unused)
    id     amount comment comment2
 1:  a 0.20000990    <NA>     <NA>
 2:  b 1.42146573    <NA>     <NA>
 3:  c 0.73047544     big      big
 4:  d 0.04128676    slow     slow
 5:  e 0.82195377  nooice     nice
 6:  f 0.39013550    <NA>   nooice
 7:  g 0.27019768    <NA>     <NA>
 8:  h 0.36017876    <NA>     <NA>
 9:  i 1.81865721    <NA>     <NA>
10:  j 4.86711754    <NA>     <NA>
In the left-join-flavored update, you silently get the final value of comment even though there are two matches for id == "e"; while in the other update, you get a helpful warning message (upgraded to an error in a future release). Even turning on verbose=TRUE with the left-joiny approach is not informative -- it says there are four rows being updated but doesn't say that one row is being updated twice.
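If you would rather catch the problem before joining, one option is to check the lookup table for duplicated join keys. This is a sketch of a pre-check I would consider, not something from the answer above; dup_keys and first_dup are illustrative names.

```r
library(data.table)

B2 <- data.table(
  id = c("c", "d", "e", "e"),
  ord = 1:4,
  comment = c("big", "slow", "nice", "nooice")
)

# Count rows per join key and keep the keys that occur more than once
dup_keys <- B2[, .N, by = id][N > 1]

# Index of the first duplicated row by id, or 0 if the keys are unique
first_dup <- anyDuplicated(B2, by = "id")
```

A non-empty dup_keys (or a non-zero first_dup) signals that either update would be ambiguous for those ids.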
I find that this approach works best when my data is arranged into a set of tidy/relational tables. A good reference on that is Hadley Wickham's paper.
** In this idiom, the on= part should be filled in with the join column names and rules, like on=.(id) or on=.(from_date >= dt_date). Further join rules can be passed with roll=, mult= and nomatch=. See ?data.table for details. Thanks to @RYoda for noting this point in the comments.
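As an example of such a rule, mult= can resolve the duplicate match for id "e" by taking only one matching row per lookup. The sketch below assumes that "first match in B2's row order" is the value you want, which is a choice the data has to justify; A here is a cut-down version of the OP's table.

```r
library(data.table)

A  <- data.table(id = letters[1:10])
B2 <- data.table(
  id = c("c", "d", "e", "e"),
  ord = 1:4,
  comment = c("big", "slow", "nice", "nooice")
)

# mult = "first" returns at most one row of B2 per row of A,
# so the right-hand side has exactly nrow(A) values
A[, comment := B2[.SD, on = .(id), mult = "first", x.comment]]
```

Using mult = "last" instead would pick the other duplicate, which is the value the left-join-style update ends up with.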
Here is a more complicated example from Matt Dowle explaining roll=: Find time to nearest occurrence of particular value for each row
Another related example: Left join using data.table