I am working with data.table
and I want to do a non-equi left join/merge.
I have one table with car prices and another table to identify which car class each car belongs to:
data_priceclass <- data.table()
data_priceclass$price_from <- c(0, 0, 200000, 250000, 300000, 350000, 425000, 500000, 600000, 700000, 800000, 900000, 1000000, 1100000, 1200000, 1300000, 1400000, 1500000, 1600000, 1700000, 1800000)
data_priceclass$price_to <- c(199999, 199999, 249999, 299999, 349999, 424999, 499999, 599999, 699999, 799999, 899999, 999999, 1099999, 1199999, 1299999, 1399999, 1499999, 1599999, 1699999, 1799999, 1899999)
data_priceclass$price_class <- c(1:20, 99)
I use a non-equi join to merge the two tables. But the x[y]-join syntax of data.table
removes duplicates.
cars <- data.table(car_price = c(190000, 500000))
cars[data_priceclass, on = c("car_price >= price_from",
"car_price < price_to"),
price_class := i.price_class,]
cars
Notice that the car with value 190000 is supposed to get matches on two rows in the data_priceclass
table, but since x[y] removes duplicates, I can't see this in the output. Normally when I join I always use the merge
function instead of x[y], because I'm losing control when I use x[y].
But the following does not work with non-equi joins:
merge(cars, data_priceclass,
by = c("car_price >= price_from",
"car_price < price_to"),
all.x = T , all.y = F)
Any tips how I can do a non-equi join with data.table that does not remove duplicates?
The join() functions from dplyr preserve the original order of rows in the data frames while the merge() function automatically sorts the rows alphabetically based on the column you used to perform the join.
The merge() function in R combines two data frames. The most crucial requirement for connecting two data frames is that the column type is the same on which the merging occurs. The merge() function is similar to the join function in a Relational Database Management System (RDMS).
To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order. If data frameA has variables that data frameB does not, then either: Delete the extra variables in data frameA or.
An inner join in R is a merge operation between two data frames where the merge returns all of the rows that match from both tables. You are going to need to specify a common key for R use to use to match the data elements.
The merge () function in base R and the various join () functions from the dplyr package can both be used to join two data frames together. There are two main differences between these two functions: 1. The join () functions from dplyr tend to be much faster than merge () on extremely large data frames. 2.
The data frames must have same column names on which the merging happens. Merge () Function in R is similar to database join operation in SQL. The different arguments to merge () allow you to perform natural joins i.e. inner join, left join, right join,cross join, semi join, anti join and full outer join.
The R merge function allows merging two data frames by common columns or by row names. This function allows you to perform different database (SQL) joins, like left join, inner join, right join or full join, among others.
As not all rows in the first data frame match all the rows in the second, the output is filled with NA values in those cases. The left join in R consist on matching all the rows in the first data frame with the corresponding values on the second. Recall that ‘Jack’ was on the first table but not on the second.
As noted in comments, a left join on cars
is done by using cars
as subsetting condition i
in the DT[i,j,by]
syntax.
This puts cars
on the right, which might be counter-intuitive compared to SQL
, and I found this tutorial useful to compare both syntaxes.
cars <- data.table(car_price = c(190000, 500000))
data_priceclass[cars, .(car_price,x.price_from,x.price_to,price_class),on = .(price_from <= car_price,price_to > car_price)]
car_price x.price_from x.price_to price_class
1: 190000 0e+00 199999 1
2: 190000 0e+00 199999 2
3: 500000 5e+05 599999 8
If you increase car price:
cars <- cars * 10
data_priceclass[cars, .(car_price,x.price_from,x.price_to,price_class),on = .(price_from <= car_price,price_to > car_price)]
car_price x.price_from x.price_to price_class
1: 1900000 NA NA NA
2: 5000000 NA NA NA
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With