I'm using a PostgreSQL database to store a data structure like so:
datapoint(userId, rank, timestamp)
where timestamp is the Unix Epoch milliseconds timestamp.
In this structure I store the rank of each user each day, so it's like:
UserId  Rank  Timestamp
1       1     1435366459
1       2     1435366458
1       3     1435366457
2       8     1435366456
2       6     1435366455
2       7     1435366454
So, in the sample data above, userId 1 is improving its rank with each measurement, which means it has a positive trend, while userId 2 is dropping in rank, which means it has a negative trend.
What I need to do is to detect all users that have a positive trend based on the last N measurements.
Like most other relational database products, PostgreSQL supports aggregate functions. An aggregate function computes a single result from multiple input rows. For example, there are aggregates to compute the count, sum, avg (average), max (maximum) and min (minimum) over a set of rows.
Aggregate functions are not allowed in the WHERE clause, because WHERE filters rows before aggregation takes place. To filter on the result of an aggregate function, use the HAVING clause instead.
The COALESCE function returns the first of its arguments that is not null. Null is returned only if all arguments are null. It is often used to substitute a default value for null values when data is retrieved for display, for example: SELECT COALESCE(description, short_description, '(none)') ...
One approach would be to perform a linear regression on each user's rank and check whether the slope is positive or negative. Luckily, PostgreSQL has a built-in aggregate function for that, regr_slope:
SELECT user_id, regr_slope (rank1, timestamp1) AS slope
FROM my_table
GROUP BY user_id
This query gives you the basic functionality. Now, you can dress it up a bit with a CASE expression if you like:
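SELECT user_id,
       -- A lower rank number is better, so a falling rank (negative slope)
       -- is the "positive" trend described in the question
       CASE WHEN slope < 0 THEN 'positive'
            WHEN slope > 0 THEN 'negative'
            ELSE 'steady' END AS trend
FROM   (SELECT   user_id, regr_slope(rank1, timestamp1) AS slope
        FROM     my_table
        GROUP BY user_id) t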
Edit:
Unfortunately, regr_slope doesn't have a built-in way to handle "top N" type requirements, so this should be handled separately, e.g., by a subquery with row_number:
-- Decoration outer query
-- A lower rank number is better, so a negative slope is the "positive" trend
SELECT user_id,
       CASE WHEN slope < 0 THEN 'positive'
            WHEN slope > 0 THEN 'negative'
            ELSE 'steady' END AS trend
FROM   (-- Inner query to calculate the slope
        SELECT   user_id, regr_slope(rank1, timestamp1) AS slope
        FROM     (-- Inner query to get top N; timestamp1 must be selected
                  -- here because the slope query above uses it
                  SELECT user_id, rank1, timestamp1,
                         ROW_NUMBER() OVER (PARTITION BY user_id
                                            ORDER BY timestamp1 DESC) AS rn
                  FROM   my_table) t
        WHERE    rn <= N -- Replace N with the number of rows you need
        GROUP BY user_id) t2
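As a sanity check outside the database, here is a minimal Python sketch of the same idea: take each user's newest N rows and compute the ordinary least-squares slope of rank over timestamp, which is the quantity regr_slope is documented to return. The helper names (slope, rank_slopes) and the hard-coded ROWS list are assumptions made for illustration; they only mirror the sample data from the question and are not part of any PostgreSQL API.

```python
from collections import defaultdict

# Sample rows from the question: (user_id, rank, timestamp).
ROWS = [
    (1, 1, 1435366459),
    (1, 2, 1435366458),
    (1, 3, 1435366457),
    (2, 8, 1435366456),
    (2, 6, 1435366455),
    (2, 7, 1435366454),
]


def slope(points):
    """Least-squares slope of y over x for (x, y) pairs; None if undefined."""
    n = len(points)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None  # all timestamps identical, so no slope can be fitted
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def rank_slopes(rows, last_n):
    """Slope of rank over time per user, using each user's newest last_n rows."""
    by_user = defaultdict(list)
    for user_id, rank, ts in rows:
        by_user[user_id].append((ts, rank))
    result = {}
    for user_id, points in by_user.items():
        # Sorting (timestamp, rank) pairs in reverse puts the newest first.
        newest = sorted(points, reverse=True)[:last_n]
        result[user_id] = slope(newest)
    return result


slopes = rank_slopes(ROWS, last_n=3)
print(slopes)
```

With the sample data this gives a slope of -1.0 for user 1 and 0.5 for user 2. Because a lower rank number is better, the negative slope for user 1 is the improving ("positive") trend from the question, and the positive slope for user 2 is the dropping one.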