Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to avoid many database round trips and a lot of irrelevant data?

I have worked with various applications and encountered this situation many times. Until now I have not figured out what is the best approach.

Here's the scenario:

  • I have an application either desktop or web
  • I need to retrieve simple documents from the database. The document has a general details and item details so the database:

GeneralDetails table:

| DocumentID | DateCreated | Owner     |
| 1          | 07/07/07    | Naruto    |
| 2          | 08/08/08    | Goku      |
| 3          | 09/09/09    | Taguro    |

ItemDetails table

| DocumentID | Item        | Quantity  |
| 1          | Marbles     | 20        |
| 1          | Cards       | 56        |
| 2          | Yo-yo       | 1         |
| 2          | Chess board | 3         |
| 2          | GI Joe      | 12        |
| 3          | Rubber Duck | 1         |

As you can see, the tables have a one-to-many relationship. Now, in order to retrieve all the documents and their respective items, I always do either of the two:

Method 1 - Many round trips (pseudo-code):

 Documents = GetFromDB("select DocumentID, Owner " +
                       "from GeneralDetails") 
 For Each Document in Documents
{
    Display(Document["CreatedBy"])
    DocumentItems = GetFromDB("select Item, Quantity " + 
                              "from ItemDetails " + 
                              "where DocumentID = " + Document["DocumentID"] + "")
    For Each DocumentItem in DocumentItems
    {
        Display(DocumentItem["Item"] + " " + DocumentItem["Quantity"])
    }
}

Method 2 - Much irrelevant data (pseudo-code):

DocumentsAndItems = GetFromDB("select g.DocumentID, g.Owner, i.Item, i.Quantity " + 
                              "from GeneralDetails as g " +
                              "inner join ItemDetails as i " +
                              "on g.DocumentID = i.DocumentID")
//Display...

I used the first method when I was in college for desktop applications, the performance was not bad so I realized it was okay.

Until one day, I saw an article "Make the web faster", it says that many round trips to the database is bad; so ever since then I have used the second method.

On the second method, I avoided round trips by using inner join to retrieve the first and the second table at once, but it produces unecessary or redundant data. See the result set.

| DocumentID | Owner     | Item        | Quantity  |
| 1          | Naruto    | Marbles     | 20        |
| 1          | Naruto    | Cards       | 56        |
| 2          | Goku      | Yo-yo       | 1         |
| 2          | Goku      | Chess board | 3         |
| 2          | Goku      | GI Joe      | 12        |
| 3          | Taguro    | Rubber Duck | 1         |

The result set has redundant DocumentID and Owner. It looks like an unnormalized database.

Now, the question is, how do I avoid round trips and at the same time avoid redundant data?

like image 423
dpp Avatar asked Aug 18 '11 06:08

dpp


2 Answers

The inner join is better because the database has more possibilities to optimize.

In general you can not create a query like this which does not produce redundant results. For that, the relational model is too restrictive. I would just live with it: the database is responsible to optimize these cases.

If you really encounter performance problems (mainly because of a network bottleneck) you could write a stored procedure, that makes the query and denormalizes it. In your example you create create a result like:

| DocumentID | Owner     | Items                                   | Quantity    |
| 1          | Naruto    | Marbles, Cards                          | 20, 56      |
| 2          | Goku      | Yo-yo, Chess board, GI Joe, Rubber Duck | 1, 3, 12, 1 |

But this of course does not conform to the first normal form - so you will need to parse it on the client. If you use a database with XML support (like Oracle or MS SQL Server) you could even create an XML file on the server and send this to the client.

But whatever you do, remember: premature optimization is the root of all evil. Don't do this kind of stuff before you are not 100% sure, that you are really facing a problem you can solve like this.

like image 33
Markus Pilman Avatar answered Nov 11 '22 05:11

Markus Pilman


The method used by ActiveRecord and other ORMs is to select the first table, batch together the IDs and then use those IDs in an IN clause for the second select.

SELECT * FROM ItemDetails WHERE DocumentId IN ( [Comma Separated List of IDs here] )

Advantages:

  • No redundant data

Disadvantages:

  • Two queries

Generally speaking, the first method is referred to as the "N+1 query problem" and the solutions are referred to as "eager loading". I tend to see your "Method 2" as preferable as the latency to the database generally trumps the size of the redundant data over the data transfer rate, but YRMV. As with almost everything in software, it's a tradeoff.

like image 103
Stefan Mai Avatar answered Nov 11 '22 06:11

Stefan Mai