Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fill the table with data for missing date (postgresql, redshift)

I'm trying to fill daily data for missing dates and can not find an answer, please help.

My daily_table example:

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |          

Expected result: I wand to fill this table with data for every domain and every day which just copy data from previous date:

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-14    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-15    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-16    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |          

I can move a part of the logic into php, but it is undesirable, because my table has billions of missing dates.

SUMMARY:

During a few last days I foud out that:

  1. Amazon-redshift works with 8-th version of PostgreSql, that's why it does not support such a beautiful command like JOIN LATERAL
  2. Redshift also does not support generate_series and CTEs
  3. But it supports simple WITH (thank you @systemjack) but WITH RECURSIVE does not
like image 400
D.Dimitrioglo Avatar asked Mar 12 '23 23:03

D.Dimitrioglo


1 Answers

Finally, I finished my task and I want to share some useful things.

Instead of generate_series I used this hook:

WITH date_range AS (
  SELECT trunc(current_date - (row_number() OVER ())) AS date
  FROM any_table  -- any of your table which has enough data
  LIMIT 365
) SELECT * FROM date_range;

To get list of URLs which I have to fill with the data I used this:

WITH url_list AS (
  SELECT
    url AS gapsed_url,
    MIN(timestamp_gmt) AS min_date,
    MAX(timestamp_gmt) AS max_date
  FROM daily_table
  WHERE url IN (
    SELECT url FROM daily_table GROUP BY url
    HAVING count(url) < (MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1)
  )
  GROUP BY url
) SELECT * FROM url_list;

Then I combinet given data, let's call it url_mapping:

SELECT t1.*, t2.gapsed_url FROM date_range AS t1 CROSS JOIN url_list AS t2
WHERE t1.date <= t2.max_date AND t1.date >= t2.min_date;

And to get data by closest date I did the following:

SELECT sd.*
FROM url_mapping AS um JOIN daily_table AS sd
ON um.gapsed_url = sd.url AND (
  sd.timestamp_gmt = (SELECT max(timestamp_gmt) FROM daily_table WHERE url = sd.url AND timestamp_gmt <= um.date)
)

I hope it will help someone.

like image 164
D.Dimitrioglo Avatar answered Apr 08 '23 16:04

D.Dimitrioglo