Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive: work around for non equi left join

Hive does not support non equi joins: The common work around is to move the join condition to the where clause, which work fine when you want an inner join. but what about a left join?

Contrived example. Let say we have an orderLineItem table, and we need to join to a ProductPrice table that has a productID, price & a date range for which the price applies. We want to join to this where ProductID=ProductID & OrderDate between start and End date. If a productID or a valid date range do not match, I'd still want to see all orderLineItems.

This SQL fiddle is an example of how we'd do this in MSSQL: http://sqlfiddle.com/#!6/fb877/7

Problem If I apply the typical workaround, and move the non equi filter to the where clause, it becomes an inner join. In the case above, in the sql fiddle & below, I have a product ID that is not in the lookup.

Question: Provided hive does not support non eqi-joins, How can a left non-eqi be achieved ?

[SQLFiddle Content]

Tables:

CREATE TABLE OrderLineItem(
  LineItemIDId int IDENTITY(1,1),
  OrderID int  NOT NULL,
  ProductID int NOT NULL,
  OrderDate Date
);


CREATE TABLE ProductPrice(
  ProductID int,
  Cost float,
  startDate  Date,
  EndDate  Date


);

loading The data & how we'd join in MSSQL:

--Old Price. Should be ignored
INSERT INTO ProductPrice(ProductID, COST,startDate,EndDate) VALUES  (1, 50,'12/1/2012','1/1/2013');
INSERT INTO ProductPrice(ProductID, COST,startDate,EndDate) VALUES (2, 55,'12/1/2012','1/1/2013');

--Price for Order 2. Should be applied to Order 1
INSERT INTO ProductPrice (ProductID, COST,startDate,EndDate) VALUES(1, 20,'12/1/2013','1/1/2014');
INSERT INTO ProductPrice (ProductID, COST,startDate,EndDate) VALUES(2, 25,'12/1/2013','1/1/2014');

--Price for Order 2. Should be applied to Order 2
INSERT INTO ProductPrice (ProductID, COST,startDate,EndDate) VALUES(1, 15,'1/2/2014','3/1/2014');
INSERT INTO ProductPrice (ProductID, COST,startDate,EndDate) VALUES(2, 20,'1/2/2014','3/1/2014');


--January 1st 2014 Order
INSERT INTO OrderLineItem(OrderID,ProductID,OrderDate) VALUES (1, 1,'1/1/2014') ;
INSERT INTO OrderLineItem(OrderID,ProductID,OrderDate) VALUES (1, 2,'1/1/2014');

--Feb 1st 2014 Order
INSERT INTO OrderLineItem(OrderID,ProductID,OrderDate) VALUES (2, 1,'2/1/2014');
INSERT INTO OrderLineItem (OrderID,ProductID,OrderDate) VALUES(2, 2,'2/1/2014');
INSERT INTO OrderLineItem (OrderID,ProductID,OrderDate) VALUES(2, 3,'2/1/2014'); -- no price

SELECT * FROM OrderLineItem;

SELECT * FROM OrderLineItem li LEFT OUTER JOIN  ProductPrice p on
p.ProductID=li.ProductID AND  OrderDate BETWEEN  startDate AND  EndDate;
like image 815
brian chandley Avatar asked Apr 16 '14 14:04

brian chandley


People also ask

Does Hive support non equi join?

Hive only supports EQUI join, so using any other joins types on its source tables will result in inefficient queries.

Does Hive support left anti join?

The key with any data set where we use a LEFT ANTI JOIN is that must have columns that will be used in our join condition where we do the filtering. In other words, we would filter on trait or multiple traits depending on how we broke down our data in the example.

Which join is not supported in hive?

Join Syntax Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job. Also, more than two tables can be joined in Hive.

Is inner join and Non Equi join same?

Non-Equi Join is also a type of INNER Join in which we need to retrieve data from multiple tables. Non-Equi Join matches the column values from different tables based on an inequality based on the operators like <,>,<=,>=,!= , BETWEEN, etc.


1 Answers

  1. Create a copy of the left table with added serial row numbers:

    CREATE TABLE OrderLineItem_serial AS
    SELECT ROW_NUMBER() OVER() AS serial, * FROM OrderLineItem;
    

    Remark: This may work better for some tables formats (must be WITHOUT COMPRESSION):

    CONCAT(INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE) AS serial
    
  2. Do an inner join:

    CREATE TABLE OrderLineItem_inner AS
    SELECT * FROM OrderLineItem_serial li JOIN ProductPrice p
    on p.ProductID = li.ProductID WHERE OrderDate BETWEEN startDate AND EndDate;
    
  3. Left join by serial:

    SELECT * FROM OrderLineItem_serial li
    LEFT OUTER JOIN OrderLineItem_inner i on li.serial = i.serial;
    
like image 195
o17t H1H' S'k Avatar answered Sep 17 '22 18:09

o17t H1H' S'k