I have two tables in Hive, t1 and t2
>describe t1;
>date_id    string
>describe t2;
>messageid string,
 createddate string,
 userid int
> select * from t1 limit 3;        
> 2011-01-01 00:00:00 
  2011-01-02 00:00:00 
  2011-01-03 00:00:00 
> select * from t2 limit 3;
87211389    2011-01-03 23:57:01 13864753
87211656    2011-01-03 23:57:59 13864769
87211746    2011-01-03 23:58:25 13864785
What I want is to count the distinct userid values over the previous three days for a given date.
For example, for date 2011-01-03, I want to count distinct userid from 2011-01-01 to 2011-01-03.
For date 2011-01-04, I want to count distinct userid from 2011-01-02 to 2011-01-04.
I wrote the following query, but it does not return the three-day result. It returns the distinct userid count per day instead.
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 
GROUP by to_date(t1.date_id);
`to_date()` and `date_sub()` are date functions in Hive.
In other words, the following part does not take effect:
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 
EDIT: One solution can be (but it is super slow):
SELECT to_date(t3.date_id), count(distinct t3.userid) FROM
(
 SELECT * FROM t1  LEFT OUTER JOIN t2
 WHERE 
 (date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
  AND to_date(t2.createddate) <= to_date(t1.date_id)
 )
) t3 
GROUP by to_date(t3.date_id);
UPDATE: Thanks for all the answers. They are good.
But Hive is a bit different from SQL. Unfortunately, they cannot be used in Hive.
My current solution is to use UNION ALL.
 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
 UNION ALL
 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
 UNION ALL
 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))
Then I group by and count. This way, I get what I want.
Although it is not elegant, it is much more efficient than a cross join.
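To illustrate the idea behind the UNION ALL approach outside Hive, here is a minimal Python sketch. It is not Hive code, and the function name `three_day_distinct_users` and the sample rows are made up for illustration. Each message created on day D is counted toward report dates D, D+1 and D+2, which is what the three equality joins do.

```python
from collections import defaultdict
from datetime import date, timedelta


def three_day_distinct_users(report_dates, messages, window_days=3):
    """Count distinct users in the window ending on each report date.

    report_dates: iterable of date objects (plays the role of t1.date_id)
    messages: iterable of (created_date, userid) pairs (plays the role of t2)
    """
    wanted = set(report_dates)
    users_by_date = defaultdict(set)
    for created, userid in messages:
        # Same idea as the three UNION ALL branches: a message created on
        # day D contributes to report dates D, D+1 and D+2.
        for offset in range(window_days):
            report_date = created + timedelta(days=offset)
            if report_date in wanted:
                users_by_date[report_date].add(userid)
    return {d: len(users_by_date[d]) for d in sorted(wanted)}


# Hypothetical sample data, not taken from the real tables.
messages = [
    (date(2011, 1, 1), 1),
    (date(2011, 1, 2), 2),
    (date(2011, 1, 3), 1),
    (date(2011, 1, 4), 3),
]
report_dates = [date(2011, 1, 3), date(2011, 1, 4)]
counts = three_day_distinct_users(report_dates, messages)
```

With this sample, 2011-01-03 sees users 1 and 2, and 2011-01-04 sees users 1, 2 and 3.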
The following should work in standard SQL...
SELECT
  to_date(t1.date_id),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON  to_date(t2.createddate) >= date_sub(to_date(t1.date_id), 2)
    AND to_date(t2.createddate) <  date_add(to_date(t1.date_id), 1)
GROUP BY
  to_date(t1.date_id)
It will, however, be slow, because you are storing dates as strings and then using to_date() to convert them to dates. This means that indexes can't be used, and the SQL engine can't do anything clever to reduce the effort being expended.
As a result, every possible combination of rows needs to be compared. If you have 100 entries in t1 and 10,000 entries in t2, your SQL engine is processing a million combinations.
If you store these values as dates, you don't need to_date(). And if you index the dates, the SQL engine can quickly home in on the range of dates being specified.
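The difference can be sketched in Python. This is only an analogy, not how any particular SQL engine works: `nested_loop_matches` compares every pair of rows, while `indexed_matches` uses a sorted list with binary search as a stand-in for an index on a real date column. Both function names and the generated data are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right
from datetime import date, timedelta


def nested_loop_matches(report_dates, created_dates, window_days=3):
    """Compare every report date with every created date."""
    comparisons = 0
    matches = 0
    for d in report_dates:
        start = d - timedelta(days=window_days - 1)
        for c in created_dates:
            comparisons += 1
            if start <= c <= d:
                matches += 1
    return matches, comparisons


def indexed_matches(report_dates, sorted_created_dates, window_days=3):
    """Find each date range with two binary searches on a sorted list."""
    matches = 0
    for d in report_dates:
        start = d - timedelta(days=window_days - 1)
        lo = bisect_left(sorted_created_dates, start)
        hi = bisect_right(sorted_created_dates, d)
        matches += hi - lo
    return matches


base = date(2011, 1, 1)
# 100 report dates and 10,000 created dates, mirroring the sizes in the text.
report_dates = [base + timedelta(days=i) for i in range(100)]
created_dates = sorted(base + timedelta(days=i % 100) for i in range(10_000))

slow_matches, comparisons = nested_loop_matches(report_dates, created_dates)
fast_matches = indexed_matches(report_dates, created_dates)
```

Both functions find the same matching pairs, but the nested loop performs 100 x 10,000 comparisons while the sorted lookup needs only two binary searches per report date.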
NOTE: The format of the ON clause means that you do not need to round t2.createddate down to a daily value.
EDIT Why your code didn't work...
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 
GROUP by to_date(t1.date_id);
This joins t1 to t2 with an ON clause of (to_date(t2.createddate) = to_date(t1.date_id)). As the join is an inner JOIN, every row that survives it MUST have a t2.createddate on the same day as t1.date_id.
The WHERE clause allows a much wider range (3 days). But the ON clause of the JOIN has already restricted your data down to a single day.
The example I gave above simply takes your WHERE clause and puts it in place of the old ON clause.
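The effect can be reproduced outside Hive. The sketch below uses Python's sqlite3 module with two user-defined functions that only approximate Hive's to_date() and date_sub() on 'YYYY-MM-DD' strings. The sample rows are invented, and SQLite is not Hive, so this shows the join logic only, not Hive's behaviour or performance.

```python
import sqlite3
from datetime import date, timedelta


def to_date(ts):
    # Approximates Hive's to_date(): keep the 'YYYY-MM-DD' part of the string.
    return ts[:10]


def date_sub(day, n):
    # Approximates Hive's date_sub(): subtract n days from a 'YYYY-MM-DD' string.
    return (date.fromisoformat(day[:10]) - timedelta(days=n)).isoformat()


conn = sqlite3.connect(":memory:")
conn.create_function("to_date", 1, to_date)
conn.create_function("date_sub", 2, date_sub)
conn.execute("CREATE TABLE t1 (date_id TEXT)")
conn.execute("CREATE TABLE t2 (messageid TEXT, createddate TEXT, userid INTEGER)")
conn.executemany(
    "INSERT INTO t1 VALUES (?)",
    [("2011-01-03 00:00:00",), ("2011-01-04 00:00:00",)],
)
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?, ?)",
    [
        ("m1", "2011-01-01 09:00:00", 1),
        ("m2", "2011-01-02 10:00:00", 2),
        ("m3", "2011-01-03 23:57:01", 1),
        ("m4", "2011-01-04 08:00:00", 3),
    ],
)

# Equality in ON, range in WHERE: the join already keeps same-day rows only.
original = """
SELECT to_date(t1.date_id), count(DISTINCT t2.userid)
FROM t1 JOIN t2 ON to_date(t2.createddate) = to_date(t1.date_id)
WHERE to_date(t2.createddate) > date_sub(to_date(t1.date_id), 3)
  AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP BY to_date(t1.date_id)
ORDER BY 1
"""

# Range condition moved into ON: each date now sees its three-day window.
moved = """
SELECT to_date(t1.date_id), count(DISTINCT t2.userid)
FROM t1 JOIN t2
  ON  to_date(t2.createddate) > date_sub(to_date(t1.date_id), 3)
  AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP BY to_date(t1.date_id)
ORDER BY 1
"""

per_day = conn.execute(original).fetchall()
three_day = conn.execute(moved).fetchall()
```

With this sample data, `per_day` holds a count of 1 for each date, while `three_day` holds 2 for 2011-01-03 and 3 for 2011-01-04.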
EDIT
Hive doesn't allow <= and >= in the ON clause? Are you really tied to using Hive?
If you really are, what about BETWEEN?
SELECT
  to_date(t1.date_id),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON to_date(t2.createddate) BETWEEN date_sub(to_date(t1.date_id), 2) AND to_date(t1.date_id)
GROUP BY
  to_date(t1.date_id)
Alternatively, refactor your table of dates to enumerate the dates you want to include...
TABLE t1 (calendar_date, inclusive_date) =
{ 2011-01-03, 2011-01-01
  2011-01-03, 2011-01-02
  2011-01-03, 2011-01-03
  2011-01-04, 2011-01-02
  2011-01-04, 2011-01-03
  2011-01-04, 2011-01-04
  2011-01-05, 2011-01-03
  2011-01-05, 2011-01-04
  2011-01-05, 2011-01-05 }
SELECT
  to_date(t1.calendar_date),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON to_date(t2.createddate) = to_date(t1.inclusive_date)
GROUP BY
  to_date(t1.calendar_date)
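If the expanded date table has to be generated rather than typed by hand, the rows can be produced with a few lines of Python. This is a sketch only: the helper name `expand_calendar` is hypothetical, and loading the rows into Hive is left out.

```python
from datetime import date, timedelta


def expand_calendar(calendar_dates, window_days=3):
    """Return (calendar_date, inclusive_date) pairs as 'YYYY-MM-DD' strings."""
    rows = []
    for d in calendar_dates:
        # Walk from the oldest day in the window up to the calendar date itself.
        for back in range(window_days - 1, -1, -1):
            inclusive = d - timedelta(days=back)
            rows.append((d.isoformat(), inclusive.isoformat()))
    return rows


rows = expand_calendar([date(2011, 1, 3), date(2011, 1, 4), date(2011, 1, 5)])
for calendar_date, inclusive_date in rows:
    print(calendar_date, inclusive_date)
```

For the three dates shown, this yields the same nine rows as the table above, starting with 2011-01-03 paired with 2011-01-01.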