Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hadoop Hive Query: Multi-join

Tags:

sql

hadoop

hive

How can I do sub-selections in Hive? I think I might be making a really obvious mistake that's not so obvious to me...

Error I'm receiving: FAILED: Parse Error: line 4:8 cannot recognize input 'SELECT' in expression specification

Here are my three source tables:

aaa_hit -> [SESSION_KEY, HIT_KEY, URL]
aaa_event-> [SESSION_KEY,HIT_KEY,EVENT_ID]
aaa_session->[SESSION_KEY,REMOTE_ADDRESS]

...and what I want to do is insert the result into a result table like this:

result -> [url, num_url, event_id, num_event_id, remote_address, num_remote_address]

...where column 1 is the URL, column 3 is the top 1 "event" per URL, and column 5 is the top 1 REMOTE_ADDRESS to visit that URL. (Even columns are "count"s of the previous column.)

Soooooo... what did I do wrong here?

INSERT OVERWRITE TABLE result2
SELECT url, 
       COUNT(url) AS access_url, 
       (SELECT events.event_id as evt, 
               COUNT(events.event_id) as access_evt
        FROM   aaa_event events 
               LEFT OUTER JOIN aaa_hit hits 
                 ON ( events.hit_key = hit_key )
                 ORDER BY access_evt DESC LIMIT 1), 
       (SELECT sessions.remote_address as remote_address, 
               COUNT(sessions.remote_address) as access_addr
        FROM   aaa_session sessions 
               RIGHT OUTER JOIN aaa_hit hits 
                 ON ( sessions.session_key = session_key )
                 ORDER BY access_addr DESC LIMIT 1) 
FROM   aaa_hit
ORDER  BY access_url DESC;

Thank you so much :)

like image 294
batman Avatar asked Jun 17 '11 20:06

batman


1 Answers

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries

Hive supports subqueries only in the FROM clause.

You can't use a subquery as a 'column' in Hive.

To work around this you'll want to use that subquery in a FROM clause and JOIN to it. (the below won't work, but is the idea)

SELECT url, 
       COUNT(url) AS access_url, 
       t2.col1, t2.col2 ...
FROM   aaa_hit
JOIN (SELECT events.event_id as evt, 
               COUNT(events.event_id) as access_evt
        FROM   aaa_event events 
               LEFT OUTER JOIN aaa_hit hits 
                 ON ( events.hit_key = hit_key )
                 ORDER BY access_evt DESC LIMIT 1), 
       (SELECT sessions.remote_address as remote_address, 
               COUNT(sessions.remote_address) as access_addr
        FROM   aaa_session sessions 
               RIGHT OUTER JOIN aaa_hit hits 
                 ON ( sessions.session_key = session_key )
                 ORDER BY access_addr DESC LIMIT 1) t2
ON (aaa_hit.THING = t2.THING)

Check out https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins for more information on using JOINs in Hive.

like image 88
QuinnG Avatar answered Nov 03 '22 01:11

QuinnG