From Cassandra's presentation slides (slide 2) link 1, alternate link:
scaling writes to a relational database is virtually impossible
I cannot understand this statement. Because when I shard my database, I am scaling writes isn't it? And they seem to claim against that.. does anyone know why isn't sharding a database scaling writes?
Scalability: Users have to scale relational database on powerful servers that are expensive and difficult to handle. To scale relational database it has to be distributed on to multiple servers. Handling tables across different servers is difficult . Complexity: In SQL server's data has to fit into tables anyhow.
There are three big problems with relational databases that make it difficult to scale: The poor time complexity characteristics of SQL joins; The difficulty in horizontally scaling; and. The unbounded nature of queries.
Vertical scaling refers to increasing the processing power of a single server or cluster. Both relational and non-relational databases can scale up, but eventually, there will be a limit in terms of maximum processing power and throughput.
Disadvantages of horizontal scalingIncreased complexity of maintenance and operation - Multiple servers are harder to maintain than a single server is. Additionally, you will need to add software for load balancing and possibly virtualization.
A sharded database is actually quite different to a normal SQL database. In a lot of ways it is more like a custom NoSQL system that just happens to use a database for storage. Unless your dataset consists of a lot of completely disconnected subsets, most queries more complex than get by ID won't work the same as they do on a single node database.
The other reason is that SQL writes tend to be fairly expensive due to the requirement for immediate consistency - the indexes that are required for decent read performance on a large database get updated as part of the write operation, and various constraints are checked. In systems designed for horizontal scalability these additional operations are usually either skipped entirely or performed separately from the write.
The slowness of physical disk subsystems is usually the single greatest challenge to overcome when trying to scale a database to service a very large number of concurrent writers. But it is not "virtually impossible" to optimize writes to a relational database. It can be done. Yet there is a trade-off: when you optimize writes, selects of large subsets of logically related data usually are slower.
The writes of the primary data to disk and the rebalancing of index trees can be disk-intensive. The maintenance of clustered indexes, whereby rows that belong logically together are stored physically contiguous on disk, is also disk-intensive. Such indexes make selects (reads) quicker while slowing writes. A heavily indexed table does not scale well therefore, and the lower the cardinality of the index, the less well it scales.
One optimization aimed at improving the speed of concurrent writers is to use sparse tables with hashed primary keys and minimal indexing. This approach eliminates the need for an index on the primary key value and permits an immediate seek to the disk location where a row lives, 'immediate' in the sense that the intermediary of an index read is not required. The hashed primary key algorithm returns the physical address of the row using the primary key value itself-- a simple computation that requires no disk access.
The sparse table is exactly the opposite of storing logically related data so they are physically contiguous. In a sparse table, writers do not step on each others toes, so to speak. Writes are like raindrops falling on a large field not like a crowd of people on a subway platform trying to step into the train through a few open doors. The sparse table helps to eliminate write bottlenecks.
However, because logically related data are not physically contiguous, but scattered, the act of gathering all rows in a certain zipcode, say, is expensive. This sparse-table hashed-pk optimization is therefore optimal only when the predominant activity is the insertion of records, the update of individual records, and the lookup of data relating to a single entity at a time rather than to a large set of entities, as in, say, an order-entry system. A company that sold merchandise on TV and had to service tens of thousands of simultaneous callers placing orders would be well served by a system that used sparse tables with hashed primary keys. A national security database that relied upon linked lists would also be well served by this approach. Many social networking applications could also use it to advantage.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With