Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best NoSQL approach to handle 100+ million records

I am working on a project were we are batch loading and storing huge volume of data in Oracle database which is constantly getting queried via Hibernate against this 100+ million records table (the reads are much more frequent than writes). To speed things up we are using Lucene for some of queries (especially geo bounding box queries) and Hibernate second level cache but thats still not enough. We still have bottleneck in Hibernate queries against Oracle (we dont cache 100+ million table entities in Hibernate second level cache due to lack of that much memory).

What additional NoSQL solutions (apart from Lucene) I can leverage in this situation?

Some options I am thinking of are:

  1. Use distributed ehcache (Terracotta) for Hibernate second level to leverage more memory across machines and reduce duplicate caches (right now each VM has its own cache).

  2. To completely use in memory SQL database like H2 but unfortunately those solutions require loading 100+ mln tables into single VM.

  3. Use Lucene for querying and BigTable (or distributed hashmap) for entity lookup by id. What BigTable implementation will be suitable for this? I was considering HBase.

  4. Use MongoDB for storing data and for querying and lookup by id.

like image 509
tsolakp Avatar asked Jun 23 '11 17:06

tsolakp


2 Answers

recommending Cassandra with ElasticSearch for a scalable system (100 million is nothing for them). Use cassandra for all your data and ES for ad hoc and geo queries. Then you can kill your entire legacy stack. You may need a MQ system like rabbitmq for data sync between Cass. and ES.

like image 186
sirmak Avatar answered Oct 15 '22 05:10

sirmak


It really depends on your data sets. The number one rule to NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data then you can look into the various NoSQL solutions out there. The default unit of distribution is key. Therefore you need to remember that you need to be able to split your data between your node machines effectively otherwise you will end up with a horizontally scalable system with all the work still being done on one node (albeit better queries depending on the case).

You also need to think back to CAP theorem, most NoSQL databases are eventually consistent (CP or AP) while traditional Relational DBMS are CA. This will impact the way you handle data and creation of certain things, for example key generation can be come trickery.

Also remember than in some systems such as HBase there is no indexing concept. All your indexes will need to be built by your application logic and any updates and deletes will need to be managed as such. With Mongo you can actually create indexes on fields and query them relatively quickly, there is also the possibility to integrate Solr with Mongo. You don’t just need to query by ID in Mongo like you do in HBase which is a column family (aka Google BigTable style database) where you essentially have nested key-value pairs.

So once again it comes to your data, what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising. THe work I am involved with we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it, stream it, update it etc etc. We dont just use one system but many which are best suited to the job at hand. For this process we use different systems at different stages as it gives us fast access where we need it, provides the ability to stream and analyse data in real-time and importantly, keep track of everything as we go (as data loss in a prod system is a big deal) . I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that to productionize a system using these technogies is a bit harder than installing Oracle on a server, some releases are not as stable and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.

Another path that no one thus far has mentioned is NewSQL - i.e. Horizontally scalable RDBMSs... There are a few out there like MySQL cluster (i think) and VoltDB which may suit your cause.

Again it comes to understanding your data and the access patterns, NoSQL systems are also Non-Rel i.e. non-relational and are there for better suit to non-relational data sets. If your data is inherently relational and you need some SQL query features that really need to do things like Cartesian products (aka joins) then you may well be better of sticking with Oracle and investing some time in indexing, sharding and performance tuning.

My advice would be to actually play around with a few different systems. Look at;

MongoDB - Document - CP

CouchDB - Document - AP

Redis - In memory key-value (not column family) - CP

Cassandra - Column Family - Available & Partition Tolerant (AP)

HBase - Column Family - Consistent & Partition Tolerant (CP)

Hadoop/Hive

VoltDB - A really good looking product, a relation database that is distributed and might work for your case (may be an easier move). They also seem to provide enterprise support which may be more suited for a prod env (i.e. give business users a sense of security).

Any way thats my 2c. Playing around with the systems is really the only way your going to find out what really works for your case.

like image 41
NightWolf Avatar answered Oct 15 '22 06:10

NightWolf