Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Redshift PostgreSQL Distinct ON Operator

I have a data set that I want to parse for to see multi-touch attribution. The data set is made up by leads who responded to a marketing campaign and their marketing source.

Each lead can respond to multiple campaigns and I want to get their first marketing source and their last marketing source in the same table.

I was thinking I could create two tables and use a select statement from both. The first table would attempt to create a table with the most recent marketing source from every person (using email as their unique ID).

create table temp.multitouch1 as (
select distinct on (email) email, date, market_source as last_source 
from sf.campaignmember
where date >= '1/1/2016' ORDER BY DATE DESC);

Then I would create a table with deduped emails but this time for the first source.

create table temp.multitouch2 as (
select distinct on (email) email, date, market_source as first_source 
from sf.campaignmember
where date >= '1/1/2016' ORDER BY DATE ASC);

Finally I wanted to simply select the email and join the first and last market sources to it each in their own column.

select a.email, a.last_source, b.first_source, a.date 
from temp.multitouch1 a
left join temp.multitouch b on b.email = a.email

Since distinct on doesn't work on redshift's postgresql version I was hoping someone had an idea to solve this issue in another way.

EDIT 2/22: For more context I'm dealing with people and campaigns they've responded to. Each record is a "campaign response" and every person can have more than one campaign response with multiple sources. I'm trying make a select statement which would dedupe by person and then have columns for the first campaign/marketing source they've responded to and the last campaign/marketing source they've responded to respectively.

EDIT 2/24: Ideal output is a table with 4 columns: email, last_source, first_source, date.

The first and last source columns would be the same for people with only 1 campaign member record and different for everyone who has more than 1 campaign member record.

like image 996
Berra2k Avatar asked Sep 25 '22 07:09

Berra2k


People also ask

What is Postgres distinct on?

PostgreSQL also provides on an expression as DISTINCT ON that is used with the SELECT statement to remove duplicates from a query set result just like the DISTINCT clause.In addition to that it also keeps the “first row” of each row of duplicates in the query set result.

Does Count distinct include null redshift?

COUNT ( * ) counts all the rows in the target table whether they include nulls or not. COUNT ( expression ) computes the number of rows with non-NULL values in a specific column or expression. COUNT ( DISTINCT expression ) computes the number of distinct non-NULL values in a column or expression.

What does Listagg do in redshift?

For each group in a query, the LISTAGG aggregate function orders the rows for that group according to the ORDER BY expression, then concatenates the values into a single string. LISTAGG is a compute-node only function.

What is Ilike in redshift?

ILIKE performs a case-insensitive pattern match for single-byte UTF-8 (ASCII) characters. To perform a case-insensitive pattern match for multibyte characters, use the LOWER function on expression and pattern with a LIKE condition.


1 Answers

I believe you could use row_number() inside case expressions like this:

SELECT
      email
    , MIN(first_source) AS first_source
    , MIN(date) first_date
    , MAX(last_source) AS last_source
    , MAX(date) AS last_date
FROM (
      SELECT
            email
          , date
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                  ELSE NULL
            END AS first_source
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                  ELSE NULL
            END AS last_source
      FROM sf.campaignmember
      WHERE date >= '2016-01-01'
      ) s
WHERE first_source IS NOT NULL
      OR last_source IS NOT NULL
GROUP BY
      email

tested here: SQL Fiddle

PostgreSQL 9.3 Schema Setup:

CREATE TABLE campaignmember
    (email varchar(3), date timestamp, market_source varchar(1))
;

INSERT INTO campaignmember
    (email, date, market_source)
VALUES
    ('a@a', '2016-01-02 00:00:00', 'x'),
    ('a@a', '2016-01-03 00:00:00', 'y'),
    ('a@a', '2016-01-04 00:00:00', 'z'),
    ('b@b', '2016-01-02 00:00:00', 'x')
;

Query 1:

SELECT
      email
    , MIN(first_source) AS first_source
    , MIN(date) first_date
    , MAX(last_source) AS last_source
    , MAX(date) AS last_date
FROM (
      SELECT
            email
          , date
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                  ELSE NULL
            END AS first_source
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                  ELSE NULL
            END AS last_source
      FROM campaignmember
      WHERE date >= '2016-01-01'
      ) s
WHERE first_source IS NOT NULL
      OR last_source IS NOT NULL
GROUP BY
      email

Results:

| email | first_source |                first_date | last_source |                 last_date |
|-------|--------------|---------------------------|-------------|---------------------------|
|   a@a |            x | January, 02 2016 00:00:00 |           z | January, 04 2016 00:00:00 |
|   b@b |            x | January, 02 2016 00:00:00 |           x | January, 02 2016 00:00:00 |

& a small extension to the request, count the number of contact points.

SELECT
      email
    , MIN(first_source) AS first_source
    , MIN(date) first_date
    , MAX(last_source) AS last_source
    , MAX(date) AS last_date
    , MAX(numof) AS Numberof_Contacts 
FROM (
      SELECT
            email
          , date
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                  ELSE NULL
            END AS first_source
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                  ELSE NULL
            END AS last_source
          , COUNT(*) OVER (PARTITION BY email) as numof
      FROM campaignmember
      WHERE date >= '2016-01-01'
      ) s
WHERE first_source IS NOT NULL
      OR last_source IS NOT NULL
GROUP BY
      email
like image 167
Paul Maxwell Avatar answered Sep 29 '22 01:09

Paul Maxwell