
R: non-equi join with merge function

I am working with data.table and I want to do a non-equi left join/merge.

I have one table with car prices and another table to identify which car class each car belongs to:

data_priceclass <- data.table()
data_priceclass$price_from <- c(0, 0, 200000, 250000, 300000, 350000, 425000, 500000, 600000, 700000, 800000, 900000, 1000000, 1100000, 1200000, 1300000, 1400000, 1500000, 1600000, 1700000, 1800000) 
data_priceclass$price_to <- c(199999, 199999, 249999, 299999, 349999, 424999, 499999, 599999, 699999, 799999, 899999, 999999, 1099999, 1199999, 1299999, 1399999, 1499999, 1599999, 1699999, 1799999, 1899999)
data_priceclass$price_class <- c(1:20, 99)

I use a non-equi join to merge the two tables, but the x[y] join syntax of data.table seems to remove duplicates:

cars <- data.table(car_price = c(190000, 500000))
cars[data_priceclass, on = c("car_price >= price_from", 
                             "car_price < price_to"),
     price_class := i.price_class]
cars

Notice that the car with value 190000 should match two rows in the data_priceclass table, but since x[y] removes duplicates, I can't see this in the output. Normally when I join I use the merge function instead of x[y], because I feel I lose control over the result when I use x[y].

But the following does not work with non-equi joins:

merge(cars, data_priceclass,
      by = c("car_price >= price_from", 
             "car_price < price_to"),
      all.x = TRUE, all.y = FALSE)

Any tips on how I can do a non-equi join with data.table that does not remove duplicates?

Helen asked Apr 20 '21 07:04



1 Answer

As noted in the comments, a left join on cars is done by using cars as the subsetting condition i in the DT[i, j, by] syntax.
This puts cars on the right, which might be counter-intuitive compared to SQL; I found this tutorial useful for comparing both syntaxes.

cars <- data.table(car_price = c(190000, 500000))
data_priceclass[cars, .(car_price,x.price_from,x.price_to,price_class),on = .(price_from <= car_price,price_to > car_price)]

   car_price x.price_from x.price_to price_class
1:    190000        0e+00     199999           1
2:    190000        0e+00     199999           2
3:    500000        5e+05     599999           8

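If you increase the car prices so that no price band matches, the left join still keeps every car and fills the joined columns with NA: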

cars <- cars * 10
data_priceclass[cars, .(car_price,x.price_from,x.price_to,price_class),on = .(price_from <= car_price,price_to > car_price)]

   car_price x.price_from x.price_to price_class
1:   1900000           NA         NA          NA
2:   5000000           NA         NA          NA
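
For convenience, the two cases above can be combined into one self-contained sketch. It rebuilds the price table from the question in a single data.table() call and adds a third car (5000000) purely for illustration, so the matched and unmatched cases show up together. The names left, inner and updated are only illustrative, and the nomatch = NULL variant assumes a reasonably recent data.table version. The last step repeats the := join from the question on a copy, which may help explain why the duplicates seemed to disappear there.

```r
library(data.table)

# Price bands from the question, built in one call (seq() covers the regular 100000 steps).
data_priceclass <- data.table(
  price_from  = c(0, 0, 200000, 250000, 300000, 350000, 425000,
                  seq(500000, 1800000, by = 100000)),
  price_to    = c(199999, 199999, 249999, 299999, 349999, 424999, 499999,
                  seq(599999, 1899999, by = 100000)),
  price_class = c(1:20, 99)
)

# Third car is an assumption added for illustration: it falls outside every band.
cars <- data.table(car_price = c(190000, 500000, 5000000))

# Left join on cars: one output row per match, unmatched cars kept with NA.
left <- data_priceclass[cars,
  .(car_price, x.price_from, x.price_to, price_class),
  on = .(price_from <= car_price, price_to > car_price)]

# Same join, but dropping cars without any matching band (inner join).
inner <- data_priceclass[cars,
  .(car_price, x.price_from, x.price_to, price_class),
  on = .(price_from <= car_price, price_to > car_price),
  nomatch = NULL]

# Update join as in the question, done on a copy: := writes into the existing
# rows of cars, so each car can hold only one price_class and no rows are added.
updated <- copy(cars)
updated[data_priceclass,
        on = .(car_price >= price_from, car_price < price_to),
        price_class := i.price_class]

print(left)
print(inner)
print(updated)
```

Here left has both matches for the 190000 car plus an NA row for the 5000000 car, inner drops that NA row, and updated still has exactly one row per car. If all matches are needed, the join has to return a new table rather than update cars by reference.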
Waldi answered Oct 01 '22 01:10