Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bigtable Practical Example

Tags:

bigtable

Can someone provide a real-world example of how data would be structured within a Bigtable? Please talk from a search engine, social networking or any other familiar point of view which illustrates clearly and pragmatically how the row -> column family -> column combo is superior to traditional normalized relational approaches.

like image 212
S. Valmont Avatar asked Jul 21 '11 10:07

S. Valmont


People also ask

What do you use Bigtable for?

Bigtable is ideal for applications that need high throughput and scalability for key/value data, where each value is typically no larger than 10 MB. Bigtable also excels as a storage engine for batch MapReduce operations, stream processing/analytics, and machine-learning applications.

Is Bigtable an example of NoSQL?

Bigtable is a NoSQL database that is designed to support large, scalable applications.

Does Gmail use Bigtable?

Cloud Bigtable is Google's fully managed NoSQL Big Data database service. It's the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail.


2 Answers

Reading the original Google white paper was helpful:

http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf

As was this comprehensive list of information sources on Google data architecture:

http://highscalability.com/google-architecture


Update: 11/4/14

A new version of the Google white paper PDF can be found here:

http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf

like image 50
S. Valmont Avatar answered Sep 25 '22 18:09

S. Valmont


I believe the difference is more about the way the data are queried rather the way they are stored.

The main difference between relational databases and NoSQL is that there is, um, no SQL in the latter.

This means you (not the query optimizer) write the query plans yourself.

This may increase the query performance if you know how to do that.

Consider a typical search engine query: find top 10 pages with all (or some) words included, say, "wet t-shirt contest", ordered by relevance (we're leaving word proximity aside for simplicity sake).

To do this, you need all words split and kept in a searchable and iterable list ordered by (word, relevance, source). Then you partition this list into (3 * ranks) sets (each starting at the top of the words in your search query at a given rank), where ranks is the possible number or ranks, say, 1 to 10; and join the sets on source, .

In a relational database it would look like this:

SELECT  w1.source
FROM    ranks r1
JOIN    words w1
ON      w1.word = 'wet'
        AND w1.rank = r1.value
CROSS JOIN
        ranks r2
JOIN    words w2
ON      w2.word = 'shirt'
        AND w2.rank = r2.value
        AND w2.source = w1.source
CROSS JOIN
        ranks r3
JOIN    words w3
ON      w3.word = 'contest'
        AND w3.rank = r2.value
        AND w3.source = w1.source
ORDER BY
        relevance_formula (w1.rank, w2.rank, w3.rank)
LIMIT 10

This is best executed using a MERGE JOIN over the three sets partitioned by rank.

However, no optimizer I'm aware of will build this plan (leaving aside the fact that relevance_formula may not distribute over the individual ranks).

To work around this, you should implement your own query plan: start at the top of each word/rank pair and just descend all three sets simultaneously, skipping the missing values and using search rather then next if you feel that there will be too much to skip in one of the sets.

Thus said, relational approach gives you a more convenient way to query data at cost of possible performance penalty.

If you are developing a campus web server, then writing those SELECT * is OK even they are executed one microsecond longer than they possibly could be. But if you're developing a Google, it worth spending some time on optimizing the queries (which pure relational systems only allowing access to their data using SQL just would not let to do).

The such called NoSQL and relational databases sometimes diffuse into each other. For instance, Berkeley DB is a well-known NoSQL storage engine which was used by MySQL as its storage backend to allow SQL queries. And vice versa, HandlerSocket allows pure key-value queries to a relational InnoDB store with a MySQL database built over it.

like image 31
Quassnoi Avatar answered Sep 23 '22 18:09

Quassnoi