First row for each group

Question

I have a table containing a row for each visit to an endpoint. The table looks like this:

user_id STRING
endpoint_id STRING
created_at TIMESTAMP

Example data:

user-1, endpoint-1, 2016-01-01 01:01:01 UTC
user-2, endpoint-1, 2016-01-01 01:01:01 UTC
user-1, endpoint-2, 2016-01-02 01:01:01 UTC
user-1, endpoint-1, 2016-01-02 01:01:01 UTC
user-1, endpoint-1, 2016-01-03 01:01:01 UTC

How can I get the first visit row for each user and resource (endpoint)?

What is the best way to construct such query?

Expected result:

user-1, endpoint-1, 2016-01-01 01:01:01 UTC
user-2, endpoint-1, 2016-01-01 01:01:01 UTC
user-1, endpoint-2, 2016-01-02 01:01:01 UTC

Here's what I came up with, but this query will not work for a large amount of data. I use a window function to group together repeating user/resource rows:

SELECT
    user_id,
    endpoint_id,
    created_at
FROM (
    SELECT 
        user_id,
        endpoint_id, 
        created_at,
        FIRST_VALUE(created_at) OVER (PARTITION BY user_id, endpoint_id ORDER BY created_at DESC) AS first_created_at
    FROM 
        [visits]
    )
WHERE
    created_at = first_created_at

Mikhail Berlyant · Accepted Answer

How to get first visit row for each user and resource?

In the query you presented in the question, you should remove DESC from ORDER BY created_at DESC; otherwise it returns the last visit, not the first.
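
For reference, here is a sketch of the question's query with that fix applied. It assumes standard SQL and a table reachable simply as visits (a placeholder reference, not taken from the question), and it selects user_id in the inner query where the question has poll_id, which looks like a typo:

```sql
-- Assumption: standard SQL, with the table reachable as `visits` (placeholder reference).
SELECT
    user_id,
    endpoint_id,
    created_at
FROM (
    SELECT
        user_id,
        endpoint_id,
        created_at,
        -- Ascending order (the default) makes FIRST_VALUE return the earliest created_at per partition.
        FIRST_VALUE(created_at) OVER (PARTITION BY user_id, endpoint_id ORDER BY created_at) AS first_created_at
    FROM
        visits
    )
WHERE
    created_at = first_created_at
```

Note that if several rows in a partition share the earliest created_at, all of them pass the created_at = first_created_at filter.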

What is the best way to construct such query?

Another option would be to use ROW_NUMBER(), as below:

SELECT
  user_id,
  endpoint_id,
  created_at
FROM (
  SELECT 
      user_id, 
      endpoint_id, 
      created_at,
      ROW_NUMBER() OVER(PARTITION BY user_id, endpoint_id ORDER BY created_at) AS first_created
  FROM [visits]
)
WHERE first_created = 1

... but this query will not work for a large amount of data

That really depends. A Resources Exceeded error can happen if your user_id, endpoint_id partition is big enough (because ORDER BY requires all rows of a partition to be on the same node).

If this is the case for you, you can use the trick below.

Step 1 - using JOIN

SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at 
FROM [visits] AS tab1
INNER JOIN (
  SELECT user_id, endpoint_id, MIN(created_at) AS min_time 
  FROM [visits] 
  GROUP BY user_id, endpoint_id
) AS tab2
ON  tab1.user_id = tab2.user_id 
AND tab1.endpoint_id = tab2.endpoint_id 
AND tab1.created_at = tab2.min_time

Step 2 - There is still something else to take care of here: if you have duplicate entries for the same user / resource (several rows sharing the minimum created_at), the join returns all of them. In that case you still need to extract only one row for each partition. See the final query below:

SELECT user_id, endpoint_id, created_at
FROM (
  SELECT user_id, endpoint_id, created_at, 
    ROW_NUMBER() OVER (PARTITION BY user_id, endpoint_id) AS rn 
  FROM (
    SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at 
    FROM [visits]  AS tab1
    INNER JOIN (
      SELECT user_id, endpoint_id, MIN(created_at) AS min_time 
      FROM [visits]  
      GROUP BY user_id, endpoint_id
    ) AS tab2
    ON  tab1.user_id = tab2.user_id 
    AND tab1.endpoint_id = tab2.endpoint_id 
    AND tab1.created_at = tab2.min_time
  )
)
WHERE rn = 1

And, of course, there is the obvious and simplest case, if those three fields are the ONLY fields in the [visits] table:

SELECT user_id, endpoint_id, MIN(created_at) AS created_at 
FROM [visits]
GROUP BY user_id, endpoint_id

Johnny V · Answer

I have another solution that avoids window functions (which I believe are very slow in BQ) as well as subqueries (which add complexity):

select
   group_column
   ,array_agg(t order by time_column asc limit 1)[safe_offset(0)] AS first_row
from table AS t
group by 1

The array_agg returns an array that holds a single struct: the first row of each group. The struct is extracted from the array with [safe_offset(0)]. You can then pull individual fields out of the struct with first_row.column_1, or wrap a select statement around the query to expand all the columns of the struct:

select first_row.* from (
  select
     group_column
     ,array_agg(t order by time_column asc limit 1)[safe_offset(0)] AS first_row
  from table AS t
  group by 1
)
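
Applied to the table from the question, the pattern might look like the sketch below. It assumes standard SQL and a table reachable as visits (a placeholder reference); the alias v and the two-column grouping are one possible mapping of group_column and time_column onto the question's schema.

```sql
-- Assumption: standard SQL, with the table reachable as `visits` (placeholder reference).
SELECT first_row.*
FROM (
  SELECT
    -- ORDER BY ... LIMIT 1 inside ARRAY_AGG keeps only the earliest row of each group;
    -- SAFE_OFFSET(0) then unwraps that single struct from the array.
    ARRAY_AGG(v ORDER BY created_at ASC LIMIT 1)[SAFE_OFFSET(0)] AS first_row
  FROM visits AS v
  GROUP BY user_id, endpoint_id
)
```

Because of LIMIT 1, each group yields exactly one row even when several visits share the earliest created_at.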

David Masip · Answer

You can now use qualify for a more concise solution:

  select 
      user_id, 
      endpoint_id, 
      created_at,
  from visits
  where true
  qualify ROW_NUMBER() OVER(PARTITION BY user_id, endpoint_id ORDER BY created_at) = 1
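
To try this against the sample rows from the question without creating a table, the data can be inlined in a WITH clause. This is only a sketch: the CTE name visits stands in for the real table, and the timestamp literals are copied from the example data.

```sql
-- Inline copy of the question's example data; the CTE name `visits` is a stand-in for the real table.
WITH visits AS (
  SELECT 'user-1' AS user_id, 'endpoint-1' AS endpoint_id, TIMESTAMP '2016-01-01 01:01:01 UTC' AS created_at
  UNION ALL SELECT 'user-2', 'endpoint-1', TIMESTAMP '2016-01-01 01:01:01 UTC'
  UNION ALL SELECT 'user-1', 'endpoint-2', TIMESTAMP '2016-01-02 01:01:01 UTC'
  UNION ALL SELECT 'user-1', 'endpoint-1', TIMESTAMP '2016-01-02 01:01:01 UTC'
  UNION ALL SELECT 'user-1', 'endpoint-1', TIMESTAMP '2016-01-03 01:01:01 UTC'
)
SELECT user_id, endpoint_id, created_at
FROM visits
WHERE TRUE
-- Keep only the earliest row of each (user_id, endpoint_id) partition.
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id, endpoint_id ORDER BY created_at) = 1
```

This should return the three rows listed under Expected result in the question, in no guaranteed order.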

Tags:

google-bigquery

Piotr Husiatyński
