Can someone provide a real-world example of how data would be structured within a Bigtable? Please talk from a search engine, social networking or any other familiar point of view which illustrates clearly and pragmatically how the row -> column family -> column combo is superior to traditional normalized relational approaches.
Bigtable is ideal for applications that need high throughput and scalability for key/value data, where each value is typically no larger than 10 MB. Bigtable also excels as a storage engine for batch MapReduce operations, stream processing/analytics, and machine-learning applications.
Bigtable is a NoSQL database that is designed to support large, scalable applications.
Cloud Bigtable is Google's fully managed NoSQL Big Data database service. It's the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail.
Reading the original Google white paper was helpful:
http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf
As was this comprehensive list of information sources on Google data architecture:
http://highscalability.com/google-architecture
Update: 11/4/14
A new version of the Google white paper PDF can be found here:
http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf
I believe the difference is more about the way the data are queried rather the way they are stored.
The main difference between relational databases and NoSQL
is that there is, um, no SQL
in the latter.
This means you (not the query optimizer) write the query plans yourself.
This may increase the query performance if you know how to do that.
Consider a typical search engine query: find top 10
pages with all (or some) words included, say, "wet t-shirt contest", ordered by relevance (we're leaving word proximity aside for simplicity sake).
To do this, you need all words split and kept in a searchable and iterable list ordered by (word, relevance, source)
. Then you partition this list into (3 * ranks)
sets (each starting at the top of the words in your search query at a given rank), where ranks
is the possible number or ranks, say, 1
to 10
; and join the sets on source
, .
In a relational database it would look like this:
SELECT w1.source
FROM ranks r1
JOIN words w1
ON w1.word = 'wet'
AND w1.rank = r1.value
CROSS JOIN
ranks r2
JOIN words w2
ON w2.word = 'shirt'
AND w2.rank = r2.value
AND w2.source = w1.source
CROSS JOIN
ranks r3
JOIN words w3
ON w3.word = 'contest'
AND w3.rank = r2.value
AND w3.source = w1.source
ORDER BY
relevance_formula (w1.rank, w2.rank, w3.rank)
LIMIT 10
This is best executed using a MERGE JOIN
over the three sets partitioned by rank.
However, no optimizer I'm aware of will build this plan (leaving aside the fact that relevance_formula
may not distribute over the individual ranks).
To work around this, you should implement your own query plan: start at the top of each word/rank pair and just descend all three sets simultaneously, skipping the missing values and using search
rather then next
if you feel that there will be too much to skip in one of the sets.
Thus said, relational approach gives you a more convenient way to query data at cost of possible performance penalty.
If you are developing a campus web server, then writing those SELECT *
is OK even they are executed one microsecond longer than they possibly could be. But if you're developing a Google, it worth spending some time on optimizing the queries (which pure relational systems only allowing access to their data using SQL
just would not let to do).
The such called NoSQL
and relational databases sometimes diffuse into each other. For instance, Berkeley DB
is a well-known NoSQL
storage engine which was used by MySQL
as its storage backend to allow SQL
queries. And vice versa, HandlerSocket
allows pure key-value queries to a relational InnoDB
store with a MySQL
database built over it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With