Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Which NoSQL database is the best for statistical report application

Tags:

database

nosql

We are creating an application that receives JSON message like shown below,

{ orderId: "00e8da9b", created: 12-22-2016, lineItems: [ { itemName: Item 1 qty: 1 price: 20.0 } { itemName: Item 2 qty: 3 price: 80.0 } ] pricing: { retail: 110, actual: 100, savings: 10, }, }

Mainly the application is used to generate statistical reports. Major operation that the system will perform are,
INSERT order details
UPDATE order status
SUM the total price of particular item across all orders by month and year
SEARCH by item name (also like partial text search) in all orders and display all the record with order number, item and price

We have reviewed some NoSQL DB's MongoDB, Cassandra and Elastic Search. On reviewing the below URL, it looks Elastic Search is the better choice than MongoDB for search and summation operation. But there is possibility of data loss in Elastic Search.
http://blog.quarkslab.com/mongodb-vs-elasticsearch-the-quest-of-the-holy-performances.html

Kindly suggest what NoSQL DB will be best suite for the requirement.

Thanks.

like image 323
ugendrang Avatar asked Mar 10 '23 08:03

ugendrang


2 Answers

I don't think it's possible to properly suggest a NoSQL solution based on the given information.

Instead I'll try to give you an indication where the sweet spot of the particular solution is. And then based on some assumptions if it could fit to you or not.

  • Cassandra: for data/requests of massive scale, some thousand requests/s plus, millions of inserts each day; good for list oriented data models, time series data, e.g. events from IoT devices, users etc.. The scalability comes with some drawbacks e.g. analytics and search capabilities are very limited. You can only access data by partion/clustering key fields. If you need analysis and search capabilities you need to explore/learn additional tools like Apache Spark, SOLR, Elasticsearch. --> Probably not the best choice for you as long as you don't have to handle data in massive scale.

  • Elasticsearch: An index for real-time full-text search and analysis solution. If you don't have full-text search requirements and you don't need real time analysis capabilities over millions of data points probably also not your preferred choice. Also in most use cases only used as additional index to a primary database. --> Your search requirements sounds more like something that can be done with a normal SQL "LIKE" operator. Your search and reporting requirements sounds more like a traditional report done at the end of the month/year. Therefore an additional search engine seems like an overhead you don't need.

  • MongoDB: for document-oriented data, for use cases were you would use a RDBMS system but you need a flexible schema etc. --> From a data modeling perspective probably not the best choice for you as you would have to model the relation between products and orders and you might need to discover and add data on both sides of the relation.

Conclusion: If you don't have to handle massive amounts of data, a database for statistical reports sound pretty much like RDBMS to me (even if I am not an RDBMS guy ;) ).

like image 109
Philip Avatar answered Apr 29 '23 06:04

Philip


I'm most familiar with Cassandra, so I can answer that part: Cassandra is excellent for INSERT, UPDATE and lookups (lookups having the caveat that you must query with the entire partition key for good performance. Partition key is effectively what the data is indexed by internally, and can loosely be thought of as a primary key or compound primary key).

Cassandra requires a bit of help via third party indexing engines (such as SOLR, which is integrated in the Datastax Enterprise Platform for Cassandra, or you can replicate to ElasticSearch for your indexing) for queries that do not include the entirety of your partition keys. Aggregate operations are not really all that possible without some help; An indexing engine (SOLR, Elasticsearch again) can make this possible via query, but if you have the full partition key for everything you need to sum, I prefer to fetch the records and process the aggregate client side (we use SOLR, and have found that a misconfigured SOLR or an overly complex SOLR query can do bad things to your cluster).

So, to sum things up in terms of your use case, Cassandra can provide: INSERT = good UPDATE = good SUM = possible with external indexing engine or performed client side as long as partitioned by month/year. SEARCH = only possible with external indexing engine in the manner you describe.

For external indexing engines, my project uses SOLR, but I have heard glowing recommendations for replicating Cassandra to ElasticSearch for search and aggregation, and using both in tandem.

I know little about Mongo, so I cannot really say much about it.

like image 28
Brandon McKenzie Avatar answered Apr 29 '23 06:04

Brandon McKenzie