Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Join Tables on Date Range in Hive

I need to join tableA to tableB on employee_id and the cal_date from table A need to be between date start and date end from table B. I ran below query and received below error message, Would you please help me to correct and query. Thank you for you help!

Both left and right aliases encountered in JOIN 'date_start'.

select a.*, b.skill_group 
from tableA a 
  left join tableB b 
    on a.employee_id= b.employee_id 
    and a.cal_date >= b.date_start 
    and a.cal_date <= b.date_end
like image 366
Mixer Avatar asked Mar 11 '16 16:03

Mixer


2 Answers

RTFM - quoting LanguageManual Joins

Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job.

You may try to move the BETWEEN filter to a WHERE clause, resulting in a lousy partially-cartesian-join followed by a post-processing cleanup. Yuck. Depending on the actual cardinality of your "skill group" table, it may work fast - or take whole days.

like image 76
Samson Scharfrichter Avatar answered Nov 19 '22 00:11

Samson Scharfrichter


If your situation allows, do it in two queries.

First with the full join, which can have the range; Then with an outer join, matching on all the columns, but include a where clause for where one of the fields is null.

Ex:

create table tableC as
select a.*, b.skill_group 
    from tableA a 
    ,    tableB b 
    where a.employee_id= b.employee_id 
      and a.cal_date >= b.date_start 
      and a.cal_date <= b.date_end;

with c as (select * from TableC)
insert into tableC
select a.*, cast(null as string) as skill_group
from tableA a 
  left join c
    on (a.employee_id= c.employee_id 
    and a.cal_date  = c.cal_date)
where c.employee_id is null ;
like image 2
markwusinich Avatar answered Nov 19 '22 00:11

markwusinich