Optimal way to fetch a sorted subset of database records

Scenario

Suppose I am building a database for a Messenger app. Let there be two tables, a User table and a Conversation table. Each conversation has a list of participating users, and each user has a list of conversations they are in. In short, there is a many-to-many relationship between the User and Conversation tables.

Now suppose I want to load the first 10 conversations of a user's list of conversations in descending chronological order when I open the app. Assuming that # Conversations in table >> # Conversations a user has >> 10, a brute force way is to load every conversation in the user's list, then order them in-memory, and finally return the first 10. I think this is how a normal SQL engine will process such a query.

Concern

My concern is that when the number of conversations a user has becomes very large, this operation consumes too many resources. Is there any faster way to achieve the same result (fetching a sorted sublist of records from a table), possibly with additional database setup?

Example

For instance, imagine a user has 300 conversations, and we want to page through these conversations in order. The above method would either download all 300 conversations to the client and sort them locally, or let the server do the sorting. The first method uses too much bandwidth and the information may not be up-to-date, and the second method requires pulling in all 300 conversations from the database each time we page.

Question

My question is this: is my concern of this particular case valid? If so, how should I modify my database setup to avoid this issue? How are some existing examples like Facebook Messenger handling this? If not, why is this not a performance concern?

Edit

I realised after asking the question that in an RDBMS we would simply create a third table to store the many-to-many relationship, and building an index on this table would solve this problem. However, would NoSQL databases that support storing lists in columns (more specifically, AWS DynamoDB) have an advantage over traditional RDBMS in this case?

Flying_Banana asked Apr 17 '18 21:04


2 Answers

It looks as though the table list you posited is not adequate to represent the data you're trying to extract. Presuming that there can be no more than one creator of a conversation, that user's id can safely be stored in the Conversation table.

But the likely structure of the tables will include a "comment" table, with (at a minimum) the following fields:

 *  Primary key      --  record id for _this_ comment
 *  conversation_id  --  reference to the conversation this comment is part of
 *  user_id          --  the user ID of the person making this comment
 *  parent_id        --  the comment that preceded this one (presuming threaded conversations)
 *  create_dt        --  datetime that the comment was added to the thread
 *  comment_body     --  the actual comment itself

If this is indeed the case, you'd be looking at a query that looks something like this:

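  SELECT conversation_id, MAX(create_dt) AS last_activity
    FROM (
       -- conversations the person created
       SELECT conversation_id, create_dt
         FROM Conversation
        WHERE user_id = {DesiredPerson}

        UNION ALL

       -- conversations the person commented in
       SELECT conversation_id, create_dt
         FROM Comment
        WHERE user_id = {DesiredPerson}
    ) AS activity                 -- MySQL requires an alias on a derived table
   GROUP BY conversation_id       -- one row per conversation
   ORDER BY last_activity DESC
   LIMIT 10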

...will give you the ids of the 10 most recent conversations in which the DesiredPerson has participated.

Contrary to your belief, database optimizers are smart enough that this query should NOT require either table to be read in full to produce the desired result. If there are appropriate indices on the tables, this should be a pretty efficient query (e.g. a compound index on each table that leads with the person column used in the WHERE clause, followed by create_dt and conversation_id). In fact, with such indexes the query can likely be satisfied without referencing the tables at all, because every column it needs is already in the index. Using the MySQL LIMIT clause with both an offset and a row count should allow you to handle paging pretty efficiently.
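To make this concrete, here is a minimal sketch of that kind of query with paging, run against an in-memory SQLite database through Python's sqlite3 module purely so it can be executed as-is. The column names (user_id as the creator column on Conversation, comment_id), the index names, the sample data, and the helper recent_conversations are all assumptions for illustration, not part of the original schema. MySQL accepts the same LIMIT ... OFFSET form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Conversation (
        conversation_id INTEGER PRIMARY KEY,
        user_id         INTEGER NOT NULL,  -- creator; column name is an assumption
        create_dt       TEXT    NOT NULL
    );
    CREATE TABLE Comment (
        comment_id      INTEGER PRIMARY KEY,
        conversation_id INTEGER NOT NULL,
        user_id         INTEGER NOT NULL,
        create_dt       TEXT    NOT NULL
    );
    -- Compound indexes that lead with the filter column (hypothetical names).
    CREATE INDEX idx_conversation_user
        ON Conversation (user_id, create_dt, conversation_id);
    CREATE INDEX idx_comment_user
        ON Comment (user_id, create_dt, conversation_id);
""")

# Made-up sample data: user 1 creates conversations 1..20, user 2 creates 21..25.
conn.executemany(
    "INSERT INTO Conversation VALUES (?, ?, ?)",
    [(i, 1 if i <= 20 else 2, f"2018-01-{i:02d}") for i in range(1, 26)],
)
# User 1 comments in conversations 21..25 in February, and in conversation 1 in March.
comments = [(i, 1, f"2018-02-{i - 20:02d}") for i in range(21, 26)]
comments.append((1, 1, "2018-03-01"))
conn.executemany(
    "INSERT INTO Comment (conversation_id, user_id, create_dt) VALUES (?, ?, ?)",
    comments,
)


def recent_conversations(conn, user_id, page=0, page_size=10):
    """Return one page of conversation ids, most recent activity first."""
    sql = """
        SELECT conversation_id, MAX(create_dt) AS last_activity
          FROM (
                SELECT conversation_id, create_dt
                  FROM Conversation WHERE user_id = ?
                UNION ALL
                SELECT conversation_id, create_dt
                  FROM Comment WHERE user_id = ?
               ) AS activity
         GROUP BY conversation_id
         ORDER BY last_activity DESC, conversation_id DESC
         LIMIT ? OFFSET ?
    """
    params = (user_id, user_id, page_size, page * page_size)
    return [row[0] for row in conn.execute(sql, params)]


first_page = recent_conversations(conn, 1)
third_page = recent_conversations(conn, 1, page=2)
print(first_page)
print(third_page)
```

Note that OFFSET paging still has to step over the skipped rows, so it is fine for the first few pages but gets slower the deeper you page.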

Curt answered Oct 15 '22 21:10


Is there any faster way to achieve the same result (fetching a sorted sublist of records from a table) with possibly additional database setup?

Yes, there is.

This "additional database setup" is called an "index". I think every relational DBMS lets you create indexes.

There can be several types of indexes, but the most common is a b-tree index, where data is stored in a balanced tree. This makes it quick to find the necessary element(s) and to read the data in the order by which the index is sorted.

An index is a supplementary structure stored and maintained by the database engine on disk in addition to the main table data. You can usually create many different indexes on the same table. The engine tries to pick the most suitable index when running a specific query. Different queries may use different indexes.

Since the index structure has to be maintained whenever the underlying data changes, creating an index usually helps SELECT queries but somewhat slows down UPDATE, DELETE and INSERT. This is why it is usually a trade-off, and it takes some skill to identify what set of indexes should exist. It largely depends on what kind of queries run and their relative importance.
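As a rough illustration of what an index changes, the sketch below asks SQLite (via Python's sqlite3) for its query plan before and after creating a b-tree index. The participant table, its columns, the index name and the sample rows are assumptions made up for this example, and the primary key is left out only to keep the plan output simple. Other engines have their own EXPLAIN output, so treat the exact plan text as SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical junction table between users and conversations.
conn.execute("""
    CREATE TABLE participant (
        user_id         INTEGER NOT NULL,
        conversation_id INTEGER NOT NULL,
        last_activity   TEXT    NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO participant VALUES (?, ?, ?)",
    [(u, c, f"2018-01-{c:02d}") for u in range(1, 4) for c in range(1, 29)],
)

QUERY = """
    SELECT conversation_id
      FROM participant
     WHERE user_id = ?
     ORDER BY last_activity DESC
     LIMIT 10
"""


def query_plan(conn, sql, params):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


# Without an index: a full scan plus a separate sort step.
plan_before = query_plan(conn, QUERY, (1,))

# With an index sorted by (user, time), the rows can be read in order.
conn.execute("""
    CREATE INDEX idx_participant_user_activity
        ON participant (user_id, last_activity, conversation_id)
""")
plan_after = query_plan(conn, QUERY, (1,))

top10 = [row[0] for row in conn.execute(QUERY, (1,))]
print(plan_before)
print(plan_after)
print(top10)
```

The price is the one described above: every INSERT, UPDATE or DELETE on participant now also has to update idx_participant_user_activity.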


For a specific example of how to implement efficient pagination with the help of an appropriate index, have a look at Pagination Done the Right Way on the website Use the Index, Luke.

It also has a good intro into Anatomy of an SQL Index and many other useful articles.
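The general idea behind index-friendly pagination is often called keyset (or "seek") pagination: instead of skipping rows with an offset, each page starts right after the last row of the previous page. Below is a minimal sketch of that idea, not code from the linked article. The participant table, the index name, the sample data and the next_page helper are assumptions for illustration, and SQLite through Python's sqlite3 is used only to keep it runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical junction table; names are assumptions.
    CREATE TABLE participant (
        user_id         INTEGER NOT NULL,
        conversation_id INTEGER NOT NULL,
        last_activity   TEXT    NOT NULL
    );
    CREATE INDEX idx_participant_keyset
        ON participant (user_id, last_activity, conversation_id);
""")
conn.executemany(
    "INSERT INTO participant VALUES (?, ?, ?)",
    [(1, c, f"2018-01-{c:02d}") for c in range(1, 26)],
)


def next_page(conn, user_id, after=None, page_size=10):
    """Fetch one page of (last_activity, conversation_id), newest first.

    `after` is the last row of the previous page, or None for the first page.
    """
    if after is None:
        sql = """
            SELECT last_activity, conversation_id
              FROM participant
             WHERE user_id = ?
             ORDER BY last_activity DESC, conversation_id DESC
             LIMIT ?
        """
        params = (user_id, page_size)
    else:
        # Continue strictly after the last row seen; conversation_id breaks ties.
        sql = """
            SELECT last_activity, conversation_id
              FROM participant
             WHERE user_id = ?
               AND (last_activity < ?
                    OR (last_activity = ? AND conversation_id < ?))
             ORDER BY last_activity DESC, conversation_id DESC
             LIMIT ?
        """
        params = (user_id, after[0], after[0], after[1], page_size)
    return conn.execute(sql, params).fetchall()


page1 = next_page(conn, 1)
page2 = next_page(conn, 1, after=page1[-1])
page3 = next_page(conn, 1, after=page2[-1])
print([row[1] for row in page3])
```

The client only has to remember the last row it was shown, and the cost of fetching a page does not grow with how far the user has scrolled.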

Is my concern of this particular case valid?

It is not a real concern for 300 rows, but it matters more and more as your tables grow in size. For 300 million rows it would most likely be rather important.

Vladimir Baranov answered Oct 15 '22 21:10