
Is BigTable slow or am I dumb?

I basically have the classic many to many model. A user, an award, and a "many-to-many" table mapping between users and awards.

Each user has on the order of 400 awards and each award is given to about 1/2 the users.

I want to iterate over all of the user's awards and sum up their points. In SQL it would be a table join between the many-to-many and then walk through each of the rows. On a decent machine with a MySQL instance, 400 rows should not be a big deal at all.
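
To make the comparison concrete, here is a minimal sketch of what the relational version of this query looks like, using sqlite3 with made-up table and column names (the post doesn't give a schema):

```python
import sqlite3

# Hypothetical schema mirroring the question's model: an award table and a
# many-to-many mapping table. Names and point values are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE award (id INTEGER PRIMARY KEY, points INTEGER);
    CREATE TABLE user_award (user_id INTEGER, award_id INTEGER);
""")
conn.executemany("INSERT INTO award VALUES (?, ?)",
                 [(i, 10) for i in range(400)])
conn.executemany("INSERT INTO user_award VALUES (1, ?)",
                 [(i,) for i in range(400)])

# One join, one scan over ~400 rows -- trivial work for any SQL engine.
total, = conn.execute("""
    SELECT SUM(a.points)
    FROM user_award ua JOIN award a ON a.id = ua.award_id
    WHERE ua.user_id = 1
""").fetchone()
print(total)  # 4000
```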

On App Engine I'm seeing around 10 seconds to do the sum, with most of the time spent in Google's datastore. Here are the first few rows of the cProfile output:

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       462    6.291    0.014    6.868    0.015 {google3.apphosting.runtime._apphosting_runtime___python__apiproxy.Wait}
       913    0.148    0.000    1.437    0.002 datastore.py:524(_FromPb)
      8212    0.130    0.000    0.502    0.000 datastore_types.py:1345(FromPropertyPb)
       462    0.120    0.000    0.458    0.001 {google3.net.proto._net_proto___parse__python.MergeFromString}

Is my data model wrong? Am I doing the lookups wrong? Is this a shortcoming that I have to deal with via caching and bulk updating (which would be a royal pain in the ass)?

asked Jun 05 '09 by Paul Tarjan

People also ask

Is bigtable key value?

BigTable is a collection of (key, value) pairs where the key identifies a row and the value is the set of columns. The data is stored persistently on disk. BigTable's data is distributed among many independent machines. At Google, BigTable is built on top of GFS (Google File System).

Is bigtable column oriented?

Bigtable is a row-oriented database, so all data for a single row are stored together, organized by column family, and then by column.

Is bigtable transactional?

Bigtable does not support transactions that atomically update more than one row. However, Bigtable supports some write operations that would require a transaction in other databases. In effect, Bigtable uses single-row transactions to complete these operations.

What is Hotspotting in bigtable?

To optimize performance and scale, tablets are split and rebalanced across the nodes based on access patterns such as read, write, and scan operations. A hot tablet is a tablet that uses a disproportionately large percentage of a node's CPU compared to other tablets associated with that node.
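
The usual way to avoid hotspotting is row-key design. A hedged sketch (plain Python, not Bigtable client code; the key conventions here are assumptions, not an API):

```python
import hashlib

def hot_key(user_id, ts):
    # Monotonic timestamp prefix: every new write sorts to the end of the
    # keyspace, so a single tablet absorbs all the traffic.
    return "%020d#%s" % (ts, user_id)

def salted_key(user_id, ts):
    # Hashed prefix spreads the same rows across the keyspace, so writes
    # land on different tablets. The "hash#id#timestamp" layout is an
    # illustrative convention, not something Bigtable mandates.
    prefix = hashlib.md5(user_id.encode()).hexdigest()[:4]
    return "%s#%s#%020d" % (prefix, user_id, ts)

# Sequential writes with hot_key all sort together...
hot = [hot_key("user1", t) for t in (1000, 1001, 1002)]
assert hot == sorted(hot)

# ...while salted keys for different users interleave across the keyspace.
keys = [salted_key("user%d" % i, 1000 + i) for i in range(4)]
print(sorted(keys))
```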


1 Answer

Could be a bit of both ;-)

If you're doing 400 queries on the Awards table, one for each result returned for a query on the mapping table, then I would expect that to be painful. The 1000-result limit on queries is there because BigTable thinks that returning 1000 results is at the limit of its ability to operate in a reasonable time. Based on the architecture, I'd expect the 400 queries to be way slower than the one query returning 400 results (400 log N vs. (log M) + 400).
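
A back-of-envelope version of that cost argument, with assumed table sizes (N rows in the awards index, M rows in the mapping index; the values are made up):

```python
import math

N = 10**6  # assumed size of the awards index
M = 10**8  # assumed size of the mapping index

# 400 separate queries: each pays an index lookup of its own.
per_query = 400 * math.log2(N)

# One query returning 400 results: one seek, then a sequential 400-row scan.
one_query = math.log2(M) + 400

print(round(per_query), round(one_query))
assert one_query < per_query
```

Whatever the real constants are, the one-seek-plus-scan shape wins by more than an order of magnitude here.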

The good news is that on GAE, memcaching a single hashtable containing all the awards and their points values is pretty straightforward (well, it looked pretty straightforward when I cast an eye over the memcache docs a while back; I haven't needed to do it yet).
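
The memcache suggestion can be sketched like so, with a plain dict standing in for GAE's memcache (real code would call google.appengine.api.memcache.get/set; everything else here is illustrative):

```python
cache = {}  # stand-in for memcache

def load_award_points():
    # Hypothetical datastore fetch; a fixed dict keeps the sketch runnable.
    return {"award%d" % i: 10 for i in range(400)}

def get_award_points():
    table = cache.get("award_points")
    if table is None:
        # Cache miss: load once, then serve every later request from cache.
        table = load_award_points()
        cache["award_points"] = table  # real code: memcache.set(...)
    return table

# Summing a user's 400 awards now touches the datastore zero times
# after the first request.
points = get_award_points()
total = sum(points.values())
print(total)  # 4000
```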

Also, if you didn't already know, for result in query.fetch(1000) is way faster than for result in query, and you're restricted to 1000 results either way. The advantages of the plain iterator form are (1) it might be faster if you bail out early, and (2) if Google ever increases the limit beyond 1000, it gets the benefit without a code change.

You might also have problems when you delete a user (or an award). I found on one test that I could delete 300 objects inside the time limit. Those objects were more complex than your mapping objects, having 3 properties and 5 indices (including the implicit ones), whereas your mapping table probably only has 2 properties and 2 (implicit) indices. [Edit: just realised that I did this test before I knew that db.delete() can take a list, which is probably much faster].
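
Since db.delete() accepts a list, deletion can go out in batches rather than one RPC per entity. A hedged sketch of the batching pattern (chunk() and delete_batch() are helpers invented here; delete_batch stands in for db.delete()):

```python
def chunk(items, size):
    # Yield successive slices of at most `size` items.
    for i in range(0, len(items), size):
        yield items[i:i + size]

deleted = []
def delete_batch(keys):
    # Stand-in for google.appengine.ext.db.delete(keys): one RPC per batch
    # instead of one per entity.
    deleted.extend(keys)

keys = ["mapping%d" % i for i in range(950)]
batches = list(chunk(keys, 300))
for b in batches:
    delete_batch(b)

print(len(batches), len(deleted))  # 4 950
```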

BigTable does not necessarily do the things that relational databases are designed to do well. Instead, it distributes data well across many nodes. But almost all websites run fine with a bottleneck on a single db server, and hence don't strictly need the thing that BigTable does.

One other thing: if you're doing 400 datastore queries on a single http request, then you will find that you hit your datastore fixed quota well before you hit your request fixed quota. Of course if you're well within quotas, or if you're hitting something else first, then this might be irrelevant for your app. But the ratio between the two quotas is something like 8:1, and I take this as a hint of what Google expects my data model to look like.

answered Oct 02 '22 by Steve Jessop