Aggregate function to detect trend in PostgreSQL

I'm using a PostgreSQL database to store a data structure like so:

datapoint(userId, rank, timestamp)

where timestamp is a Unix epoch timestamp in milliseconds.

In this structure I store the rank of each user each day, so the data looks like this:

UserId   Rank  Timestamp
1        1     1435366459
1        2     1435366458
1        3     1435366457
2        8     1435366456
2        6     1435366455
2        7     1435366454

So, in the sample data above, userId 1 is improving its rank with each measurement, which means it has a positive trend, while userId 2 is dropping in rank, which means it has a negative trend.

What I need to do is detect all users that have a positive trend based on the last N measurements.

maephisto asked Feb 26 '14 10:02


1 Answer

One approach would be to perform a linear regression on each user's rank over time and check whether the slope is positive or negative. Luckily, PostgreSQL has a built-in aggregate function to do that, regr_slope:

SELECT   user_id, regr_slope (rank1, timestamp1) AS slope
FROM     my_table
GROUP BY user_id
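
To try this out without creating a table, the question's sample rows can be inlined as a CTE. This is only a sketch: the name my_table and the columns user_id, rank1 and timestamp1 are taken from the query above and are assumptions about the real schema.

```sql
-- Sample rows from the question, inlined as a CTE so the query runs standalone.
-- Table and column names follow the answer's query (assumed schema).
WITH my_table (user_id, rank1, timestamp1) AS (
    VALUES (1, 1, 1435366459),
           (1, 2, 1435366458),
           (1, 3, 1435366457),
           (2, 8, 1435366456),
           (2, 6, 1435366455),
           (2, 7, 1435366454)
)
SELECT   user_id,
         -- regr_slope(Y, X): rank is the dependent value, time the independent one
         regr_slope(rank1, timestamp1) AS slope
FROM     my_table
GROUP BY user_id
ORDER BY user_id;
```

Worked by hand, the least-squares slope for these rows is -1 for user 1 (the rank number falls by one per time unit) and 0.5 for user 2. The value PostgreSQL prints is floating point, so it is safer to compare the sign than the exact digits.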

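This query gives you the basic functionality. Keep in mind that the slope describes the rank number itself: if a smaller rank number is better, as in the question's sample data, an improving user has a negative slope. Now, you can dress it up a bit with case expressions if you like (the labels below describe the sign of the slope):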

SELECT user_id, 
       CASE WHEN slope > 0 THEN 'positive' 
            WHEN slope < 0 THEN 'negative' 
            ELSE 'steady' END AS trend
FROM   (SELECT   user_id, regr_slope (rank1, timestamp1) AS slope
        FROM     my_table
        GROUP BY user_id) t

Edit:
Unfortunately, regr_slope doesn't have a built-in way to handle "top N" type requirements, so this should be handled separately, e.g., by a subquery with row_number:

-- Decoration outer query
SELECT user_id, 
       CASE WHEN slope > 0 THEN 'positive' 
            WHEN slope < 0 THEN 'negative' 
            ELSE 'steady' END AS trend
FROM   (-- Inner query to calculate the slope
        SELECT   user_id, regr_slope (rank1, timestamp1) AS slope
        FROM     (-- Inner query to get top N
                  SELECT user_id, rank1, timestamp1,
                         ROW_NUMBER() OVER (PARTITION BY user_id 
                                           ORDER BY timestamp1 DESC) AS rn
                  FROM   my_table) t
        WHERE    rn <= N -- Replace N with the number of rows you need
        GROUP BY user_id) t2
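
To return only the users whose rank improved over their latest measurements, which is what the question asks for, the slope can be tested in a HAVING clause. The sketch below assumes that a smaller rank number is better (so improvement means a negative slope), uses N = 2 purely for illustration, and inlines the question's sample rows as a CTE named my_table.

```sql
-- Sample rows from the question; names follow the answer's queries (assumed schema).
WITH my_table (user_id, rank1, timestamp1) AS (
    VALUES (1, 1, 1435366459),
           (1, 2, 1435366458),
           (1, 3, 1435366457),
           (2, 8, 1435366456),
           (2, 6, 1435366455),
           (2, 7, 1435366454)
),
latest AS (
    -- Number each user's rows from newest to oldest
    SELECT user_id, rank1, timestamp1,
           ROW_NUMBER() OVER (PARTITION BY user_id
                              ORDER BY timestamp1 DESC) AS rn
    FROM   my_table
)
SELECT   user_id
FROM     latest
WHERE    rn <= 2                               -- N = 2, an illustrative choice
GROUP BY user_id
HAVING   regr_slope(rank1, timestamp1) < 0;    -- rank number falling = improving (assumption)
```

Worked by hand, the last two rows give a slope of -1 for user 1 and 2 for user 2, so only user 1 should be returned. If a larger rank number is better in your data, flip the comparison to > 0.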
Mureinik answered Sep 21 '22 02:09