Sharding vs DFS

As far as I understand, sharding (e.g. in MongoDB) and distributed file systems (e.g. HDFS in HBase or Hypertable) are different mechanisms that databases use to scale out. How do they compare?

201

asked Aug 06 '11 03:08

Ali Shakiba

1 Answer

Traditional sharding involves breaking tables into a small number of pieces and running each piece (or "shard") in a separate database on a separate machine. Because the shards are large, this mechanism is prone to imbalances from hot spots and unequal growth, as the Foursquare incident showed. Also, because each shard runs on a separate machine, these systems can experience availability problems if one of the machines goes down. To mitigate this, most sharding systems, including MongoDB, implement replica groups: each machine is replaced by a set of three machines in a master plus two slaves configuration. This way, if a machine goes down, there are two remaining replicas to serve the data.

There are a couple of problems with this design. First, if a replica fails and the group is left with only two members, bringing the replication count back to three requires cloning the data from one of those two machines. Since only those two machines in the entire cluster can be used to re-create the replica, there will be enormous drag on one of them while re-replication takes place, causing serious performance problems on the shard in question (it takes over two hours to copy 1 TB over a gigabit link).

Second, when one of the replicas goes down, it needs to be replaced with a new machine. Even if there is plenty of spare capacity across the cluster, that spare capacity cannot be used to restore the replication count. The only way to solve the problem is to replace the machine. This becomes very challenging from an operational standpoint as cluster sizes grow into the hundreds or thousands of machines.
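The "over two hours" figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 Gbit/s = 10^9 bits/s) and a fully utilized link, and the function name and `efficiency` parameter are made up for illustration.

```python
def copy_time_hours(data_bytes: float, link_bits_per_sec: float,
                    efficiency: float = 1.0) -> float:
    """Hours needed to push data_bytes over a link of the given speed.

    efficiency is a hypothetical knob for protocol overhead
    (1.0 means the link is used at its full nominal rate).
    """
    seconds = data_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 3600


TB = 10**12    # assumption: decimal terabyte
GBIT = 10**9   # assumption: 1 Gbit/s nominal link speed

print(round(copy_time_hours(TB, GBIT), 2))  # 2.22
```

In practice the copy would take longer, since the source machine is also serving live traffic for the shard while it is being cloned.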

The Bigtable+GFS design solves these problems. First, the table data is broken down into much finer-grained "tablets". A typical machine in a Bigtable cluster will often have 500+ tablets. If an imbalance occurs, resolving it is a simple matter of migrating a small number of tablets from one machine to another. If a TabletServer goes down, the data set is broken down and replicated with such fine granularity that hundreds of machines can participate in the recovery process, which distributes the recovery burden and speeds up recovery. Also, because the data is not tied to a specific machine or machines, the spare capacity on all machines in the cluster can be applied to the failure. There is no operational requirement to replace the machine, since any of the spare capacity throughout the cluster can be used to restore the replication count.
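To see why fine granularity spreads the recovery burden, here is a toy model of reassigning a failed server's tablets to the least-loaded survivors. It is not how Bigtable or Hypertable actually schedule recovery; the function name, the server names, and the "one tablet at a time to the least-loaded machine" policy are all assumptions made for illustration.

```python
import heapq


def reassign_tablets(loads: dict[str, int], failed: str) -> dict[str, int]:
    """Spread the failed server's tablets over the survivors.

    loads maps server name -> number of tablets it hosts.
    Returns how many tablets each surviving server picks up.
    """
    orphaned = loads[failed]
    # Min-heap of (current tablet count, server name) for the survivors.
    heap = [(count, name) for name, count in loads.items() if name != failed]
    if not heap:
        raise ValueError("no surviving servers to take over the tablets")
    heapq.heapify(heap)

    received = {name: 0 for _, name in heap}
    for _ in range(orphaned):
        # Always hand the next tablet to the currently least-loaded server.
        count, name = heapq.heappop(heap)
        received[name] += 1
        heapq.heappush(heap, (count + 1, name))
    return received


# Hypothetical cluster: 101 servers with 500 tablets each, one of which fails.
loads = {f"server{i}": 500 for i in range(101)}
received = reassign_tablets(loads, "server0")
print(max(received.values()))  # 5
```

In this model each of the 100 survivors takes over only 5 tablets, roughly 1% of its existing load, whereas in a three-member replica group a single machine has to supply the entire copy.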

Doug Judd CEO, Hypertable Inc.

131

answered Oct 22 '22 13:10

Doug Judd

Tags: nosql, distributed-computing, sharding, hdfs, distributed-filesystem