Pros of databases like BigTable, SimpleDB

New-school datastore paradigms like Google BigTable and Amazon SimpleDB are specifically designed for scalability, among other things. Basically, they accomplish this by disallowing joins and relying on denormalization.

In this topic, however, the consensus seems to be that joins on large tables don't necessarily have to be too expensive and that denormalization is "overrated" to some extent. Why, then, do these aforementioned systems disallow joins and force everything together in a single table to achieve scalability? Is it the sheer volume of data that needs to be stored in these systems (many terabytes)?
Do the general rules for databases simply not apply to these scales? Is it because these database types are tailored specifically towards storing many similar objects?
Or am I missing some bigger picture?

asked Oct 06 '08 20:10

Rik

2 Answers

Distributed databases aren't quite as naive as Orion implies; there has been quite a bit of work done on optimizing fully relational queries over distributed datasets. You may want to look at what companies like Teradata, Netezza, Greenplum, Vertica, AsterData, etc. are doing. (Oracle finally got in the game as well with its recent announcement; Microsoft bought its solution by acquiring the company that used to be called DataAllegro.)

That being said, when the data scales up into terabytes, these issues become far from trivial. If you don't need the strict transactionality and consistency guarantees you can get from an RDBMS, it is often far easier to denormalize and not do joins, especially if you don't need to cross-reference much, and especially if you are not doing ad-hoc analysis but require programmatic access with arbitrary transformations.
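To make the trade-off concrete, here is a minimal Python sketch contrasting a normalized layout that needs a join at read time with a denormalized one that answers the same question with a single key lookup. The `users` and `orders` data and all function names are invented for illustration; they are not taken from any real system.

```python
# Hypothetical data; none of these names come from a real system.

# Normalized: two "tables" that must be joined at read time.
users = {1: {"name": "orion"}, 2: {"name": "rik"}}
orders = [
    {"order_id": 10, "user_id": 1, "item": "book"},
    {"order_id": 11, "user_id": 2, "item": "lamp"},
    {"order_id": 12, "user_id": 1, "item": "pen"},
]

def orders_with_names_joined():
    """Join orders to users at read time (one user lookup per order row)."""
    return [
        {"order_id": o["order_id"], "name": users[o["user_id"]]["name"], "item": o["item"]}
        for o in orders
    ]

# Denormalized: the user name is copied into every order row at write time.
orders_denormalized = {
    10: {"name": "orion", "item": "book"},
    11: {"name": "rik", "item": "lamp"},
    12: {"name": "orion", "item": "pen"},
}

def order_denormalized(order_id):
    """Single key lookup, no join; the cost is duplicated data to keep in sync."""
    return {"order_id": order_id, **orders_denormalized[order_id]}
```

Both layouts return the same rows; the denormalized one simply moves the work from read time to write time, which is why it gets painful once a user's name changes and every copy has to be updated.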

Denormalization is overrated. Just because that's what happens when you are dealing with 100 terabytes doesn't mean it should be adopted by every developer who never bothered to learn about databases and has trouble querying a million or two rows due to poor schema planning and query optimization.

But if you are in the 100-terabyte range, by all means...

Oh, the other reason these technologies are getting the buzz -- folks are discovering that some things never belonged in the database in the first place, and are realizing that they aren't dealing with relations in their particular fields, but with basic key-value pairs. For things that shouldn't have been in a DB, it's entirely possible that the Map-Reduce framework, or some persistent, eventually-consistent storage system, is just the thing.

On a less global scale, I highly recommend BerkeleyDB for those sorts of problems.

answered Sep 21 '22 03:09

SquareCog

I'm not too familiar with them (I've only read the same blog/news/examples as everyone else), but my take on it is that they chose to sacrifice a lot of the normal relational DB features in the name of scalability. I'll try to explain.

Imagine you have 200 rows in your data-table.

In Google's datacenter, 50 of these rows are stored on server A, 50 on B, and 100 on server C. Additionally, server D contains redundant copies of data from servers A and B, and server E contains redundant copies of data from server C.

(In real life I have no idea how many servers would be used, but it's set up to deal with many millions of rows, so I imagine quite a few).

To "select * where name = 'orion'", the infrastructure can fire that query to all the servers and aggregate the results that come back. This allows them to scale pretty much linearly across as many servers as they like (FYI, this is pretty much what MapReduce is).
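As a rough illustration of that fan-out, the sketch below sends the same filter to every shard and merges the partial results. It is an in-process simulation under stated assumptions: the `SHARDS` dictionary, `query_shard`, and `scatter_gather` are hypothetical names, and threads stand in for separate servers. It is not how BigTable or SimpleDB is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each "server" holds a slice of the table.
SHARDS = {
    "A": [{"id": 1, "name": "orion"}, {"id": 2, "name": "rik"}],
    "B": [{"id": 3, "name": "squarecog"}],
    "C": [{"id": 4, "name": "orion"}, {"id": 5, "name": "rik"}],
}

def query_shard(rows, name):
    """Runs locally on one server: filter only that server's rows."""
    return [row for row in rows if row["name"] == name]

def scatter_gather(name):
    """Send the same filter to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda rows: query_shard(rows, name), SHARDS.values())
        merged = [row for partial in partials for row in partial]
    return sorted(merged, key=lambda row: row["id"])
```

Calling `scatter_gather("orion")` collects the matching rows from shards A and C. No shard needs to know about any other shard, which is what lets this kind of query spread across more servers without coordination.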

This however means you need some tradeoffs.

If you needed to do a relational join on some data that was spread across, say, 5 servers, each of those servers would need to pull data from each other for each row. Try doing that when you have 2 million rows spread across 10 servers.

This leads to tradeoff #1 - No joins.
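To get a feel for why a naive row-by-row join hurts, the sketch below counts how many join lookups would have to leave the server that holds the row. Everything here is an assumption made for illustration: the modulo-based `shard_of` placement, the five shards, and the synthetic `orders` list. Real distributed databases use smarter join strategies than this.

```python
# Hypothetical layout: a row's shard is chosen by key modulo shard count.
NUM_SHARDS = 5

def shard_of(key):
    return key % NUM_SHARDS

def count_remote_lookups(orders):
    """Count join lookups that must leave the shard holding the order row.

    Each order is (order_id, user_id); the order lives on shard_of(order_id)
    and the user row it joins to lives on shard_of(user_id).
    """
    return sum(1 for order_id, user_id in orders if shard_of(order_id) != shard_of(user_id))

# 1000 orders, each pointing at a user whose id is unrelated to the order id.
orders = [(order_id, (order_id * 7 + 3) % 200) for order_id in range(1000)]
remote = count_remote_lookups(orders)
```

With this placement, 800 of the 1000 lookups cross the network, which is roughly what you would expect when related rows land on five shards independently of each other.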

Also, depending on network latency, server load, etc., some of your data may get saved instantly, but some may take a second or two. Again, when you have dozens of servers, this gets longer and longer, and the normal approach of 'everyone just waits until the slowest guy has finished' is no longer acceptable.

This leads to tradeoff #2 - Your data may not always be immediately visible after it's written.
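A toy model of that behaviour, assuming a single primary and a single replica that catches up later, might look like the following. The `LaggingReplicaStore` class and its methods are hypothetical and only meant to show a stale read followed by a fresh one.

```python
class LaggingReplicaStore:
    """Toy store: writes hit the primary at once and reach the replica later."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))  # replication is deferred

    def read_from_replica(self, key):
        return self.replica.get(key)

    def replicate(self):
        """Apply all queued writes to the replica, in order."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

store = LaggingReplicaStore()
store.write("name", "orion")
stale = store.read_from_replica("name")   # replica has not caught up yet
store.replicate()
fresh = store.read_from_replica("name")   # now the write is visible
```

The writer never waits for the replica, so the write returns quickly, but a reader that hits the replica before `replicate()` runs sees nothing.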

I'm not sure what other tradeoffs there are, but off the top of my head those are the main two.

answered Sep 21 '22 03:09

Orion Edwards

Tags: database, scalability, bigtable, amazon-simpledb, key-value-store
