
SQLite: How to SELECT "most recent record for each user" from single table with composite key?

I'm not a database guru and feel like I'm missing some core SQL knowledge to grok a solution to this problem. Here's the situation as briefly as I can explain it.

Context:

I have a SQLite database table that contains timestamped user event records. The records can be uniquely identified by the combination of timestamp and user ID (i.e., when the event took place and who the event is about). I understand this situation is called a "composite primary key." The table looks something like this (with a bunch of other columns removed, of course):

sqlite> select Last_Updated,User_ID from records limit 4;

Last_Updated   User_ID
-------------  --------
1434003858430  1   
1433882146115  3   
1433882837088  3   
1433964103500  2   

Question: How do I SELECT a result set containing only the most recent record for each user?

Given the above example, what I'd like to get back is a table that looks like this:

Last_Updated   User_ID
-------------  --------
1434003858430  1   
1433882837088  3   
1433964103500  2   

(Note that the result set only includes user 3's most recent record.)

In reality, I have approximately 2.5 million rows in this table.

Bonus: I've been reading answers about JOINs, de-dupe procedures, and a bunch more, and I've been googling for tutorials/articles in the hope of finding what I'm missing. I have an extensive programming background, so I could de-dupe this dataset in procedural code like I've done a hundred times before, but I'm tired of writing scripts to do what I believe should be possible in SQL. That's what it's for, right?

So, what do you think is missing from my understanding of SQL, conceptually, that I need in order to understand why the solution you've provided actually works? (A reference to a good article that explains the theory behind the practice would suffice.) I want to know WHY the solution works, not just that it does.

Many thanks for your time!

M12 asked Nov 05 '15 06:11


2 Answers

You could try this:

select user_id, max(last_updated) as latest
from records
group by user_id

This should give you the latest last_updated value for each user, one row per user_id. I assume you have a composite index on (user_id, last_updated).

In the above query, generally speaking, we are asking the database to group the records by user_id. If there is more than one record for user_id 1, they are all grouped together. From that group, the maximum last_updated is picked for output. Then the next group is processed and the same operation is applied there.
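To see the grouping in action, here is a minimal sketch that runs the same query against the four sample rows from the question, using Python's built-in sqlite3 module. The in-memory database and the column types are assumptions made for illustration; the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# In-memory database holding only the two columns shown in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " Last_Updated INTEGER NOT NULL,"
    " User_ID INTEGER NOT NULL,"
    " PRIMARY KEY (Last_Updated, User_ID))"
)
conn.executemany(
    "INSERT INTO records (Last_Updated, User_ID) VALUES (?, ?)",
    [
        (1434003858430, 1),
        (1433882146115, 3),
        (1433882837088, 3),
        (1433964103500, 2),
    ],
)

# One output row per User_ID group, carrying that group's largest timestamp.
latest = conn.execute(
    "SELECT User_ID, MAX(Last_Updated) AS latest "
    "FROM records GROUP BY User_ID ORDER BY User_ID"
).fetchall()

for user_id, ts in latest:
    print(user_id, ts)
```

User 3 has two rows in the input but only one in the result, the one with the larger timestamp.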

If you have a composite index, SQLite will likely answer the query from the index alone, because the index contains both columns referenced in the query (a "covering" index). An index is usually smaller than the table itself, so scanning or seeking it is faster.
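If you want to check whether the index is actually used, EXPLAIN QUERY PLAN will tell you. The sketch below is only an illustration: the index name idx_records_user_updated is made up, the table is reduced to the two columns from the question, and the exact wording of the plan varies between SQLite versions, so it is printed rather than assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (Last_Updated INTEGER NOT NULL, User_ID INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO records (Last_Updated, User_ID) VALUES (?, ?)",
    [
        (1434003858430, 1),
        (1433882146115, 3),
        (1433882837088, 3),
        (1433964103500, 2),
    ],
)

# Hypothetical index name; the grouping column comes first, then the timestamp.
conn.execute(
    "CREATE INDEX idx_records_user_updated ON records (User_ID, Last_Updated)"
)

query = "SELECT User_ID, MAX(Last_Updated) AS latest FROM records GROUP BY User_ID"

# The last column of each plan row is the human-readable description.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for row in plan:
    print(row[-1])

rows = conn.execute(query).fetchall()
```

If the plan mentions the index (typically as a covering index), the query is being served from the index without touching the table rows.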

zedfoxus answered Oct 02 '22 14:10


Well, in true "d'oh!" fashion, right after I ask this question, I find the answer.

For my case, the answer is:

SELECT MAX(Last_Updated),User_ID FROM records GROUP BY User_ID

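I was making this more complicated than it needed to be by thinking I needed to use JOINs and stuff. In SQLite, applying MAX() is all that's needed here: when a query uses a single MAX() or MIN() aggregate, any other ("bare") column in the result takes its value from the row that holds that maximum or minimum. Note that this is SQLite-specific behavior; most other databases reject a selected column that is neither aggregated nor listed in GROUP BY. That means this statement…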

SELECT MAX(Last_Updated),User_ID FROM records

…would therefore return a result set containing only 1 row, the most recent event.

By adding the GROUP BY clause, however, the result set contains a row for each "group" of results, i.e., for each user. My programmer-brain did not understand that GROUP BY is how we say "for each" in SQL. I think I get it now.
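Since the real table has other columns, here is a rough sketch of pulling the whole most-recent record per user. The Event column and its values are made up for illustration. The first query selects Event as a bare column next to MAX(), which SQLite fills from the row holding the maximum; the second gets the same rows by joining the table to the grouped result, a form that should also work on databases without that behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " Last_Updated INTEGER NOT NULL,"
    " User_ID INTEGER NOT NULL,"
    " Event TEXT,"  # hypothetical extra column, not from the question
    " PRIMARY KEY (Last_Updated, User_ID))"
)
conn.executemany(
    "INSERT INTO records (Last_Updated, User_ID, Event) VALUES (?, ?, ?)",
    [
        (1434003858430, 1, "login"),
        (1433882146115, 3, "signup"),
        (1433882837088, 3, "logout"),
        (1433964103500, 2, "purchase"),
    ],
)

# Without GROUP BY: a single row, the newest event in the whole table.
newest = conn.execute(
    "SELECT MAX(Last_Updated), User_ID FROM records"
).fetchone()

# SQLite-specific: Event comes from the row that holds each group's MAX().
bare = conn.execute(
    "SELECT User_ID, MAX(Last_Updated), Event "
    "FROM records GROUP BY User_ID ORDER BY User_ID"
).fetchall()

# Join each user's maximum timestamp back to the table to fetch the full row.
portable = conn.execute(
    """
    SELECT r.User_ID, r.Last_Updated, r.Event
    FROM records AS r
    JOIN (
        SELECT User_ID, MAX(Last_Updated) AS latest
        FROM records
        GROUP BY User_ID
    ) AS m
      ON m.User_ID = r.User_ID AND m.latest = r.Last_Updated
    ORDER BY r.User_ID
    """
).fetchall()

print(newest)
print(bare)
print(portable)
```

Both per-user queries return the same three rows here. The join works because (Last_Updated, User_ID) is the composite key, so each user's maximum timestamp matches exactly one row.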

Note to self: keep it simple, stupid. :)

M12 answered Oct 02 '22 14:10