I have two tables in Hive, t1
and t2
>describe t1;
>date_id string
>describe t2;
>messageid string,
createddate string,
userid int
> select * from t1 limit 3;
> 2011-01-01 00:00:00
2011-01-02 00:00:00
2011-01-03 00:00:00
> select * from t2 limit 3;
87211389 2011-01-03 23:57:01 13864753
87211656 2011-01-03 23:57:59 13864769
87211746 2011-01-03 23:58:25 13864785
What I want is to count previous three-day distinct userid for a given date.
For example, for date 2011-01-03
, I want to count distinct userid from 2011-01-01
to 2011-01-03
.
for date 2011-01-04
, I want to count distinct userid from 2011-01-02
to 2011-01-04
I wrote the following query. But it does not return three-day result. It returns distinct userid per day instead.
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP by to_date(t1.date_id);
`to_date()` and `date_sub()` are date function in Hive.
That said, the following part does not take effect.
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
EDIT: One solution can be (but it is super slow):
SELECT to_date(t3.date_id), count(distinct t3.userid) FROM
(
SELECT * FROM t1 LEFT OUTER JOIN t2
WHERE
(date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
)
) t3
GROUP by to_date(t3.date_id);
UPDATE: Thanks for all answers. They are good.
But Hive is a bit different from SQL. Unfortunately, they cannot use in HIVE.
My current solution is to use UNION ALL
.
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1)
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2)
Then, I do group by
and count
. In this way, I can get what I want.
Although it is not elegant, it is much efficient than cross join
.
The following should seem to work in standard SQL...
SELECT
to_date(t1.date_id),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) >= date_sub(to_date(t1.date_id), 2)
AND to_date(t2.createddate) < date_add(to_date(t1.date_id), 1)
GROUP BY
to_date(t1.date_id)
It will, however, be slow. Because you are storing dates as strings, the using to_date() to convert them to dates. What this means is that indexes can't be used, and the SQL engine can't do Anything clever to reduce the effort being expended.
As a result, every possible combination of rows needs to be compared. If you have 100 entries in T1 and 10,000 entries in T2, your SQL engine is processing a million combinations.
If you store these values as dates, you don't need to_date()
. And if you index the dates, the SQL engine can quickly home in on the range of dates being specified.
NOTE: The format of the ON
clause means that you do not need to round t2.createddate
down to a daily value.
EDIT Why your code didn't work...
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP by to_date(t1.date_id);
This joins t1 to t2 with an ON
clause of (to_date(t2.createddate) = to_date(t1.date_id))
. As the join is a LEFT OUTER JOIN, the values in t2.createddate
MUST now either be NULL (no matches) or be the same as t1.date_id
.
The WHERE
clause allows a much wider range (3 days). But the ON
clause of the JOIN
has already restricted you data down to a single day.
The example I gave above simply takes your WHERE
clause and put's it in place of the old ON
clause.
EDIT
Hive doesn't allow <=
and >=
in the ON clause? Are you really fixed in to using HIVE???
If you really are, what about BETWEEN?
SELECT
to_date(t1.date_id),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) BETWEEN date_sub(to_date(t1.date_id), 2) AND date_add(to_date(t1.date_id), 1)
GROUP BY
to_date(t1.date_id)
Alternatively, refactor your table of dates to enumerate the dates you want to include...
TABLE t1 (calendar_date, inclusive_date) =
{ 2011-01-03, 2011-01-01
2011-01-03, 2011-01-02
2011-01-03, 2011-01-03
2011-01-04, 2011-01-02
2011-01-04, 2011-01-03
2011-01-04, 2011-01-04
2011-01-05, 2011-01-03
2011-01-05, 2011-01-04
2011-01-05, 2011-01-05 }
SELECT
to_date(t1.calendar_date),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) = to_date(t1.inclusive_date)
GROUP BY
to_date(t1.calendar_date)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With