
Efficient and scalable storage for JSON data with NoSQL databases

We are working on a project that collects journal and audit data and stores it in a datastore for archival purposes and a few views. We are not sure which datastore would work best for us.

  • we need to store small JSON documents, about 150 bytes each, e.g. `audit: {timestamp: '86346512', host: 'foo', username: 'bar', task: 'foo', result: 0}` or `journal: {timestamp: '86346512', host: 'foo', terminalid: 1, type: 'bar', rc: 0}`
  • we are expecting about one million entries per day, about 150 MB data
  • data will be stored and read but never modified
  • data should be stored in an efficient way, e.g. the binary format used by Apache Avro (see the sketch after this list)
  • after a retention time data may be deleted
  • custom queries, such as 'get audit for user and time period' or 'get journal for terminalid and time period'
  • a replicated database for failover
  • scalable
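
For the Avro requirement above, here is a minimal sketch of what the binary encoding could look like, using the `fastavro` Python library; the schema mirrors the example audit document, and everything else (library choice, field types) is an assumption:

```python
# Sketch: encode one audit record in Avro's compact binary format.
# Assumes the `fastavro` package; field names mirror the example above.
from io import BytesIO
import fastavro

audit_schema = fastavro.parse_schema({
    "type": "record",
    "name": "Audit",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "host", "type": "string"},
        {"name": "username", "type": "string"},
        {"name": "task", "type": "string"},
        {"name": "result", "type": "int"},
    ],
})

record = {"timestamp": 86346512, "host": "foo",
          "username": "bar", "task": "foo", "result": 0}

buf = BytesIO()
fastavro.schemaless_writer(buf, audit_schema, record)  # no per-record schema overhead
print(len(buf.getvalue()), "bytes")  # a fraction of the ~150-byte JSON text
```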

Currently we are evaluating NoSQL databases like Hadoop/HBase, CouchDB, MongoDB and Cassandra. Are these databases the right datastores for us? Which of them would fit best? Are there better options?

Ismail asked Aug 18 '11 15:08



2 Answers

  • One million inserts per day is about 10 inserts per second. Most databases can handle this; it's well below the maximum insertion rate we get from Cassandra on reasonable hardware (50k inserts per second).

  • Your requirement "after a retention time data may be deleted" fits Cassandra's column TTLs nicely: when you insert data you can specify how long to keep it, and background merge processes will drop it once that timeout is reached (see the CQL sketch after this list).

  • "data should stored in an efficient way, e.g. binary format used by Apache Avro" - Cassandra (like many other NOSQL stores) treats values as opaque byte sequences, so you can encode you values how ever you like. You could also consider decomposing the value into a series of columns, which would allow you to do more complicated queries.

  • custom queries, such as 'get audit for user and time period': in Cassandra, you would model this by making the row key the user id and the column key the time of the event (most likely a timeuuid). You would then use a get_slice call (or, even better, CQL) to satisfy this query, as in the sketch below.

  • or 'get journal for terminalid and time period': as above, make the row key the terminalid and the column key the timestamp. Note that in Cassandra (as in many join-less stores) it is typical to insert the data more than once, in different arrangements, to optimise for different queries.

  • Cassandra has a very sophisticated replication model in which you can specify a different consistency level per operation. Cassandra is also a very scalable system with no single point of failure or bottleneck. This is really the main difference between Cassandra and things like MongoDB or HBase (not that I want to start a flame war!).
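
Pulling the points above together, here is a minimal sketch of that model using CQL via the DataStax Python driver. The keyspace (`logs`), table and column names, and the 90-day TTL are assumptions for illustration, not anything from the question:

```python
# Sketch: time-series audit table partitioned by user, with a write-time TTL.
# Assumes a running Cassandra cluster and the `cassandra-driver` package;
# the keyspace "logs" and table "audit" are made-up names.
from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("logs")

session.execute("""
    CREATE TABLE IF NOT EXISTS audit (
        username text,
        ts timestamp,
        host text,
        task text,
        result int,
        PRIMARY KEY (username, ts)  -- partition by user, cluster by time
    )
""")

# 90-day retention: Cassandra drops the row during compaction once the TTL expires.
session.execute(
    "INSERT INTO audit (username, ts, host, task, result) "
    "VALUES (%s, %s, %s, %s, %s) USING TTL 7776000",
    ("bar", datetime.utcnow(), "foo", "foo", 0),
)

# 'get audit for user and time period' becomes a single-partition slice:
rows = session.execute(
    "SELECT * FROM audit WHERE username = %s AND ts >= %s AND ts < %s",
    ("bar", datetime(2011, 8, 1), datetime(2011, 9, 1)),
)
for row in rows:
    print(row.ts, row.host, row.task, row.result)
```

The journal query would follow the same pattern with `terminalid` as the partition key.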

Having said all of this, your requirements could easily be satisfied by a more traditional database with simple master-slave replication; nothing here is too onerous.

tom.wilkie answered Sep 29 '22 12:09


Avro supports schema evolution and is a good fit for this kind of problem.
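
As a sketch of what schema evolution buys you here (again assuming `fastavro`; treating `terminalid` as a later addition is purely illustrative), a record written with an old schema stays readable under a newer one, provided new fields carry defaults:

```python
# Sketch: resolve an old Avro record against a newer schema (fastavro assumed).
from io import BytesIO
import fastavro

v1 = fastavro.parse_schema({
    "type": "record", "name": "Journal",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "host", "type": "string"},
    ],
})
v2 = fastavro.parse_schema({
    "type": "record", "name": "Journal",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "host", "type": "string"},
        {"name": "terminalid", "type": "int", "default": -1},  # added later
    ],
})

buf = BytesIO()
fastavro.schemaless_writer(buf, v1, {"timestamp": 86346512, "host": "foo"})
buf.seek(0)
# The reader schema's default fills the missing field in old records.
print(fastavro.schemaless_reader(buf, v1, v2))
# -> {'timestamp': 86346512, 'host': 'foo', 'terminalid': -1}
```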

If your system does not require low-latency data loads, consider writing the data to files in a reliable file system rather than loading it directly into a live database system. Keeping a reliable file system (such as HDFS) running is simpler and less likely to have outages than a live database system. Also, separating the responsibilities ensures that your query traffic can never impact the data collection system.

If you will only have a handful of queries to run, you could leave the files in their native format and write custom MapReduce jobs to generate the reports you need. If you want a higher-level interface, consider running Hive over the native data files: Hive lets you run arbitrary SQL-like queries over your raw data files (see the sketch below). Or, since you only have 150 MB/day, you could simply batch-load it into MySQL read-only compressed tables.
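
As a sketch of the Hive route (issued here through the PyHive client; the HiveServer2 host, table name, column names and HDFS paths are all assumptions):

```python
# Sketch: query raw Avro files in HDFS through Hive. Assumes a running
# HiveServer2 and the `pyhive` package; all names and paths are made up.
from pyhive import hive

cursor = hive.connect(host="hive-server", port=10000).cursor()

# Register the raw files as an external table; Hive reads them in place.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS journal
    STORED AS AVRO
    LOCATION '/data/journal'
    TBLPROPERTIES ('avro.schema.url' = '/schemas/journal.avsc')
""")

# 'get journal for terminalid and time period' as a plain SQL-like query
# (`timestamp` is a reserved word in Hive, hence the backticks):
cursor.execute(
    "SELECT * FROM journal WHERE terminalid = %(tid)s "
    "AND `timestamp` BETWEEN %(lo)s AND %(hi)s",
    {"tid": 1, "lo": 86300000, "hi": 86400000},
)
for row in cursor.fetchall():
    print(row)
```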

If for some reason you need the complexity of an interactive system, HBase or Cassandra might be good fits, but beware that you'll spend a significant amount of time playing "DBA", and 150 MB/day is so little data that you probably don't need the complexity.

pawstrong answered Sep 29 '22 12:09