Recently I read some articles online indicating that relational databases have scaling issues and are not a good fit for big data, especially in cloud computing where data volumes are large. But I could not find solid reasons for why they don't scale well by googling. Can you please explain the limitations of relational databases when it comes to scalability?
Thanks.
The relational data model doesn't fit every domain. Schema evolution is difficult because the data model is inflexible. Distributed availability is weak because horizontal scalability is poor. Joins, ACID transactions, and strict consistency constraints impose a performance hit, especially in distributed environments.
The challenge of scalability
The problem is known to many: the wave of data is overwhelming, and there are only two options, learn to surf it or drown. The International Data Corporation estimates that global data volume will reach 175 zettabytes by 2025.
Most SQL databases are vertically scalable, which means you increase the capacity of a single server by adding components like RAM, SSD, or CPU. In contrast, NoSQL databases are horizontally scalable, which means they can handle increased traffic simply by adding more servers to the database.
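To make the horizontal-scaling idea concrete, here is a minimal sketch in Python of hash-based sharding, where rows are spread across servers by hashing their key. The node names and the pick_shard helper are made up for illustration, not part of any particular database:

import hashlib

# Horizontal scaling in miniature: each key is hashed to decide which
# server ("shard") stores it, so total capacity grows by adding nodes.
def pick_shard(key: str, shards: list) -> str:
    digest = hashlib.md5(key.encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]

shards = ["node-1", "node-2", "node-3"]
for user_id in ["alice", "bob", "carol", "dave"]:
    print(user_id, "->", pick_shard(user_id, shards))

# Adding a fourth node adds capacity, but naive modulo hashing remaps
# most keys; real systems use consistent hashing or range partitioning
# to limit how much data has to move when the cluster grows.
shards.append("node-4")
for user_id in ["alice", "bob", "carol", "dave"]:
    print(user_id, "->", pick_shard(user_id, shards))

Relational databases can be sharded too, but cross-shard joins and transactions are exactly where the performance and consistency pain mentioned above shows up.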
Apache Cassandra is one of the most established massively scalable databases. It is an open-source NoSQL key-value database that provides low latency; it is fault-tolerant (using replicas), scalable, and decentralized, meaning it does not rely on a master-slave pattern to provide high availability.
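As a rough sketch of what that looks like in practice, here is a minimal example using the Python cassandra-driver package, assuming a Cassandra node is running locally; the keyspace and table names are hypothetical:

from uuid import uuid4
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Connect to any node; Cassandra has no master, so every node can
# coordinate reads and writes (assumes a node reachable on localhost).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# A replication factor of 3 keeps three copies of each row, which is
# what provides fault tolerance when a node goes down.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS users (id uuid PRIMARY KEY, name text)")

# Writes and reads are keyed lookups; there are no joins and no
# multi-row ACID transactions, which is part of why it scales out.
session.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (uuid4(), "alice"))

The trade-off is visible in the data model: you get linear scale-out and availability, but you give up the joins and transactional guarantees the question is really about.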
Imagine two different kinds of crossroads.
One has traffic lights or police officers regulating traffic, motion on the crossroad is at limited speed, and there's a watchdog registering precisely which car drove onto the crossroad at what time and in what direction it went.
The other has none of that, and everyone who arrives at the crossroad, at whatever speed they're driving, just dives in and tries to get through as quickly as possible.
The former is any traditional database engine. The crossroad is the data itself. The cars are the transactions that want to access the data. The traffic lights or police officer is the DBMS. The watchdog keeps the logs and journals.
The latter is a NOACID type of engine.
Both have a saturation point at which arriving cars are forced to start queueing up at the entry points. Both have a maximal throughput. That threshold lies at a lower value for the former type of crossroad, and the reason should be obvious.
The advantage of the former type of crossroad should, however, also be obvious: far less opportunity for accidents to happen. On the second type of crossroad, you can expect accidents not to happen only if traffic density stays well below the theoretical maximal throughput of the crossroad. For data management engines, that corresponds to a guarantee of consistent and coherent results, which only the former type of crossroad (the classical database engine, whether relational, network, or hierarchical) can deliver.
The analogy can be stretched further. Imagine what happens if an accident DOES happen. On the second type of crossroad, the primary concern will probably be to clear the road as quickly as possible so traffic can resume, and once that is done, what information is still available to investigate who caused the accident and how? Nothing at all. It won't be known. The crossroad is open again, just waiting for the next accident to happen. On the regulated crossroad, there's the police officer who was directing traffic, saw what happened, and can testify. There are the logs saying which car entered at precisely what time, at which entry point, and at what speed, so plenty of material is available for inspection to determine the root cause of the accident. But of course none of that comes for free.
Colourful enough as an explanation?
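For a code-shaped version of the regulated crossroad, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and amounts are made up for illustration. It shows the transactional guarantee (and the journal behind it) that the analogy is pointing at:

import sqlite3

# A classical engine wraps related writes in a transaction: either both
# happen or neither does, and the journal records enough to recover.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 60 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 60 WHERE name = 'bob'")
        # Simulate a failure before commit: both updates are rolled back.
        raise RuntimeError("power failure")
except RuntimeError:
    pass

print(dict(conn.execute("SELECT name, balance FROM accounts")))  # balances unchanged

All of that coordination and logging is the "police officer and watchdog" overhead, and it is exactly what becomes expensive to preserve once the data is spread across many servers.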