We are working on a project that should collect journal and audit data and store it in a datastore for archival purposes and some views. We are not quite sure which datastore would work for us.
"audit:{timestamp: '86346512',host':'foo',username:'bar',task:'foo',result:0}"
or "journal:{timestamp:'86346512',host':'foo',terminalid:1,type='bar',rc=0}"
'get audit for user and time period'
or 'get journal for terminalid and time period'
Currently we are evaluating NoSQL databases like Hadoop/HBase, CouchDB, MongoDB and Cassandra. Are these databases the right datastore for us? Which of them would fit best? Are there better options?
One million inserts per day is roughly 10 inserts per second. Most databases can deal with this, and it's well below the maximum insertion rate we get from Cassandra on reasonable hardware (around 50k inserts per second).
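For completeness, a quick back-of-the-envelope check of that figure (the exact number is closer to 11-12 per second):

```python
# Sanity check of the insert rate quoted above.
inserts_per_day = 1_000_000
seconds_per_day = 24 * 60 * 60            # 86,400
print(inserts_per_day / seconds_per_day)  # ~11.6 inserts/second
```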
Your requirement "after a retention time data may be deleted" fits Cassandra's column TTLs nicely - when you insert data you can specify how long to keep it for, then background merge processes will drop that data when it reaches that timeout.
"data should stored in an efficient way, e.g. binary format used by Apache Avro" - Cassandra (like many other NOSQL stores) treats values as opaque byte sequences, so you can encode you values how ever you like. You could also consider decomposing the value into a series of columns, which would allow you to do more complicated queries.
custom queries, such as 'get audit for user and time period' - in Cassandra, you would model this by making the row key the user id and the column key the time of the event (most likely a timeuuid). You would then use a get_slice call (or, even better, CQL) to satisfy this query.
or 'get journal for terminalid and time period' - as above, make the row key the terminalid and the column key the timestamp. One thing to note is that in Cassandra (like many join-less stores), it is typical to insert the data more than once (in different arrangements) to optimise for different queries; a sketch of both layouts follows below.
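A minimal sketch of those two layouts and the time-range queries in CQL, via the Python driver. Table, column and keyspace names are my own, chosen to match the sample records above, and the dates are placeholders:

```python
from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("archive")   # hypothetical keyspace

# One table per query pattern: the partition key selects the row, and the
# clustering column (a timeuuid) keeps events ordered by time within it.
session.execute("""
    CREATE TABLE IF NOT EXISTS audit_by_user (
        username   text,
        event_time timeuuid,
        host       text,
        task       text,
        result     int,
        PRIMARY KEY (username, event_time)
    )""")
session.execute("""
    CREATE TABLE IF NOT EXISTS journal_by_terminal (
        terminalid int,
        event_time timeuuid,
        host       text,
        type       text,
        rc         int,
        PRIMARY KEY (terminalid, event_time)
    )""")

# 'get audit for user and time period'
audit_rows = session.execute(
    "SELECT * FROM audit_by_user WHERE username = %s "
    "AND event_time >= minTimeuuid(%s) AND event_time < maxTimeuuid(%s)",
    ("bar", datetime(2023, 1, 1), datetime(2023, 2, 1)),
)

# 'get journal for terminalid and time period'
journal_rows = session.execute(
    "SELECT * FROM journal_by_terminal WHERE terminalid = %s "
    "AND event_time >= minTimeuuid(%s) AND event_time < maxTimeuuid(%s)",
    (1, datetime(2023, 1, 1), datetime(2023, 2, 1)),
)
```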
Cassandra has a very sophisticated replication model, where you can specify different consistency levels per operation. Cassandra is also a very scalable system with no single point of failure or bottleneck. This is really the main difference between Cassandra and things like MongoDB or HBase (not that I want to start a flame war!)
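To illustrate the per-operation consistency levels with the Python driver (a sketch only, reusing the hypothetical table from above; the specific levels are just examples, not a recommendation):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("archive")   # hypothetical keyspace

# Writes can be acknowledged by a single replica for throughput...
write = SimpleStatement(
    "INSERT INTO audit_by_user (username, event_time, host, task, result) "
    "VALUES (%s, now(), %s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(write, ("bar", "foo", "foo", 0))

# ...while reads for reporting can demand a quorum of replicas.
read = SimpleStatement(
    "SELECT * FROM audit_by_user WHERE username = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
rows = session.execute(read, ("bar",))
```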
Having said all of this, your requirements could easily be satisfied by a more traditional database and simple master-slave replication; nothing here is too onerous.
Avro supports schema evolution and is a good fit for this kind of problem.
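A small sketch of what schema evolution means in practice, using the fastavro package. The schemas here are invented to match the audit record above; the point is that a newer reader schema can add a field with a default and still decode records written with the old schema:

```python
import io
import fastavro

# Version 1 of the audit schema, used when the records were written.
writer_schema = fastavro.parse_schema({
    "type": "record", "name": "Audit",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "host",      "type": "string"},
        {"name": "username",  "type": "string"},
        {"name": "task",      "type": "string"},
        {"name": "result",    "type": "int"},
    ],
})

# Version 2 adds a field with a default, so old records remain readable.
reader_schema = fastavro.parse_schema({
    "type": "record", "name": "Audit",
    "fields": [
        {"name": "timestamp",  "type": "long"},
        {"name": "host",       "type": "string"},
        {"name": "username",   "type": "string"},
        {"name": "task",       "type": "string"},
        {"name": "result",     "type": "int"},
        {"name": "terminalid", "type": "int", "default": -1},
    ],
})

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema,
                           {"timestamp": 86346512, "host": "foo",
                            "username": "bar", "task": "foo", "result": 0})
buf.seek(0)
# The old binary record decoded with the newer schema; 'terminalid' gets its default.
record = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
print(record)
```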
If your system does not require low-latency data loads, consider writing the data to files in a reliable file system rather than loading it directly into a live database system. Keeping a reliable file system (such as HDFS) running is simpler and less likely to have outages than a live database system. Also, separating the responsibilities ensures that your query traffic won't ever impact the data collection system.
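For example, the collectors could simply append line-delimited records to a daily spool file and ship the closed files to HDFS (or any reliable file system) afterwards. A minimal stdlib-only sketch, with made-up paths:

```python
import json
import time
from pathlib import Path

SPOOL_DIR = Path("/var/spool/audit")   # hypothetical local spool directory

def append_event(event: dict) -> None:
    """Append one audit/journal event as a JSON line to today's spool file."""
    day = time.strftime("%Y-%m-%d")
    path = SPOOL_DIR / f"audit-{day}.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")

append_event({"timestamp": 86346512, "host": "foo",
              "username": "bar", "task": "foo", "result": 0})

# A separate cron job can later copy the closed daily files into HDFS,
# e.g. with `hdfs dfs -put audit-2023-01-01.jsonl /data/audit/`.
```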
If you will only have a handful of queries to run, you could leave the files in their native format and write custom MapReduce jobs to generate the reports you need. If you want a higher-level interface, consider running Hive over the native data files. Hive lets you run friendly SQL-like queries over your raw data files. Or, since you only have 150 MB/day, you could just batch-load it into MySQL read-only compressed tables.
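If you go the MySQL route, the daily batch load could be as simple as the sketch below (using the mysql-connector-python package; the file name, table name and credentials are placeholders, and the target table is assumed to already exist with matching columns):

```python
import json
import mysql.connector   # mysql-connector-python

conn = mysql.connector.connect(host="localhost", user="archive",
                               password="secret", database="archive")
cur = conn.cursor()

# Load one day's spool file (see the sketch above) in a single batch.
with open("audit-2023-01-01.jsonl", encoding="utf-8") as fh:
    rows = [(e["timestamp"], e["host"], e["username"], e["task"], e["result"])
            for e in map(json.loads, fh)]

cur.executemany(
    "INSERT INTO audit (ts, host, username, task, result) "
    "VALUES (%s, %s, %s, %s, %s)",
    rows,
)
conn.commit()
```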
If for some reason you need the complexity of an interactive system, HBase or Cassandra might be good fits, but beware that you'll spend a significant amount of time playing "DBA", and 150 MB/day is so little data that you probably don't need the complexity.