So an RDBMS is a good choice if the query and reporting needs are very critical. Real time analytics for operational data is better suited to a NoSQL setting. Further, in cases where data is brought together from many upstream systems to build an application (not just reporting), NoSQL is a must.
Scalability: Users have to scale relational database on powerful servers that are expensive and difficult to handle. To scale relational database it has to be distributed on to multiple servers. Handling tables across different servers is difficult . Complexity: In SQL server's data has to fit into tables anyhow.
RDBMS systems guarantee consistency. Sharding makes the system tolerant to partitioning. From the theorem follows that the system can therefor not guarantee availability. That's why a standard RDBMS cannot scale very well: it won't be able to guarantee availability.
Apache Cassandra is the most established scalable massive database. It an Open source NoSQL key-value database that provides low latency, it is fault tolerant(using replicas), scalable and decentralized; meaning it does not follow a master-slave pattern to provide high availability.
So I've been trying to figure out the real bottom-line when it comes to NoSQL vs RDBMS myself, and always end up with a response that doesn't quite cut it. In my search there are really 2 primary differences between NoSQL and SQL, with only 1 being a true advantage.
ACID vs BASE - NoSQL typically leaves out some of the ACID features of SQL, sort of 'cheating' it's way to higher performance by leaving this layer of abstraction to the programmer. This has already been covered by previous posters.
Horizontal Scaling - The real advantage of NoSQL is horizontal scaling, aka sharding. Considering NoSQL 'documents' are sort of a 'self-contained' object, objects can be on different servers without worrying about joining rows from multiple servers, as is the case with the relational model.
Let's say we want to return an object like this:
post {
id: 1
title: 'My post'
content: 'The content'
comments: {
comment: {
id: 1
}
comment: {
id: 2
}
...
views: {
view: {
user: 1
}
view: {
user: 2
}
...
}
}
In NoSQL, that object would basically be stored as is, and therefore can reside on a single server as a sort of self-contained object, without any need to join with data from other tables that could reside on other DB servers.
However, with Relational DBs, the post would need to join with comments from the comments
table, as well as views from the views
table. This wouldn't be a problem in SQL ~UNTIL~ the DB is broken into shards, in which case 'comment 1' could be on one DB server, while 'comment 2' yet on another DB server. This makes it much more difficult to create the very same object in a RDBMS that has been scaled horizontally than in a NoSQL DB.
Would any DB experts out there confirm or argue these points?
RDBMS have ACID ( http://en.wikipedia.org/wiki/ACID ) and supports transactions. Scaling "out" with RDBMS is harder to implement due to these concepts.
NoSQL solutions usually offer record-level atomicity, but cannot guarantee a series of operations will succeed (transaction).
It comes down to: to keep data integrity and support transactions, a multi-server RDBMS would need to have a fast backend communication channel to synchronize all possible transactions and writes, while preventing/handling deadlock.
This is why you usually only see 1 master (writer) and multiple slaves (readers).
Typical RDBMSs make strong guaranties about consistency. This requires to some extend communication between nodes for every transaction. This limites the ability to scale out, because more nodes means more communication
NoSql systems make different trade offs. For example they don't guarantee that a second session will see immediately data commited by a first session. Thereby decoupling the transaction of storing some data from the process of making that data available for every user. Google "eventually consistent". So a single transaction doesn't need to wait for any (or for much less) inter node communication. Therefore they are able to utilize a large amount of nodes much more easily.
Why NoSQL databases can be easily horizontally scaled than SQL ones? I have been trying to figure out why people keep saying this. I came across many articles which only confused me with their not-industry familiar terminologies and vague assumptions. I will suggest you read Designing Data-intensive applications by Martin Kleppman. Also, I will share some of my understanding of this subject.
JOINS - in the case of many-to-one or many-to-many relationships there is no way that any database invented till now can keep the data together in one table or document so if the data is sharded(or partitioned), either it is SQL or NoSQL, the latency will be same, the database has to look for both the documents. NoSQL seems to dominate only in the case of one to many relationships. For example:
NoSql
Student
{
"name": "manvendra",
"education": [
{
"id": 1,
"Degree": "High School"
},
{
"id": 2,
"Degree": "B.Tech"
}
]
}
Eduction Institute collection
[
{
"id": "1",
"name": "army public school"
},
{
"id": "2",
"name": "ABES Engineering College"
}
]
Sql
Student Table
id | name
1 | Manvendra
Education Institute
id | Name
1 | Army public school
2 | ABES Engineering college
Studies Table
student | education institute | degree
1 | 1 | high school
1 | 2 | B.tech
Now suppose in the case of NoSql if both collection's data is on different nodes there will some extra time required to resolve the ids of the education institute and this situation is similar in the case of SQL databases so where is the benefit? I can't think of any.
Also, you must be thinking why can't we store the education institute info also in the same student collection, then it will be like:
{
"name": "manvendra",
"education": [
{
"name": "Army public school",
"Degree": "High School"
},
{
"name": "ABES Engineering College",
"Degree": "B.Tech"
}
]
}
which is really a bad design because there is a many-to-many relationship between student and education institute, many students might have studied from the same institute so tomorrow if there is a change in name or any information of the institute it will be really a very difficult challenge to change at all places.
However, in the case of a one-to-many relationship, we can club all the info together for example: Consider a customer and an order relationship
{
"name": "manvendra",
"order": [
{
"item": "kindle",
"price": "7999"
},
{
"item":"iphone 12",
"price":"too much"
}
]
}
Since an order only belongs to one customer it makes sense to store order info in one place however storing item id or name is another choice anyway, if we use SQL database here, there will be two tables with orders and customers which will not give good results to queries if data is not stored in the same node.
So saying joins in an argument as to why the NoSql database is easier to scale horizontally does not make sense.
TRANSACTIONS
Both SQL(Postgres, MySQL, etc) and NoSQL(MongoDB, Amazon's DynamoDB, etc) support transactions so there is nothing left to discuss on that.
ACID
ACID is overused just like CAP actually it is all about showing a single copy of data to the client instead actually there might be multiple copies of data(to enhance availability, fault-tolerance, etc) and what strategies the database uses to do that. For example in Postgres in the case of a master-slave distributed system, one can opt for synchronous or asynchronous replication and the replication is made possible with WAL(Write ahead logs) and same is the case in MongoDB, only in place of WAL it has oplog(Operations Log), both support streaming replication and failovers. Then where is the difference? Actually, I can't find a very strong reason that why NoSql databases can be scaled easily. What I can say is NoSql is the latest so databases come with ready-made support for horizontal scaling for example consider Mongos in MongoDB, they do all the dirty work of sharding documents, routing requests to the specific shard, etc. So tomorrow if Postgres or MySQL come up with some mechanism of intelligently sharding tables so all the related data is mostly kept in one node then it may put an end to this debate because there is nothing intrinsic in a relational database that prevents it from horizontal scaling.
On an optimistic note, I believe in the near future it will all be about the strategies. How you are planning to scale and those strategies will be independent of how you are storing data either in tables or documents. For example in Amazon's DocumentDB, there is a concept of auto-scaling in and out but if you want to achieve this with sharding it will be a burden to copy data each time you are scaling in and out. In DocumentDB this is taken care of as a shared cluster volume(data storage is separated from computing) which is nothing but a shared disk to all the instances(primary or secondary) and to escape from the risk of the shared disk failure DocumentDB replicates data of the shared disk to six other disks in different availability zones. So point to be noted here is DocumentDB mixed the concept of the shared disk and standard replication strategy to achieve its goal. So it is all about the strategy you are using in your database which is what matters
In RDBMS when the data becomes huge then it may happen that tables are spread across multiple systems and in that case performing operations like JOIN are very slow.
In case of NoSQL in general related data are stored together on same machine (either in single document - in document oriented databases or in case of Wide column datastore the related columns are on same machine). Hence its easy to scale out on a number of low end machines, obviously in this case there will be duplicate data in multiple places which is not the case in RDBMS
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With