
ElasticSearch or Couchbase or something else

Background: I have a huge stream of data, up to 1,000,000 records per hour, with a TTL of 3 hours. Each "document" contains roughly 20 properties, and I need to search on up to 15 properties at the same time using "==", "IN" and "BETWEEN" comparisons.
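
For reference, here is a rough sketch of how those three comparisons could map onto Elasticsearch filters, using the Python elasticsearch client (the index name, field names and values are made up for illustration, not my real schema):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    # "filtered" query (ES 1.x style) combining the three comparison types
    # on several properties at once; field names are hypothetical.
    query = {
        "query": {
            "filtered": {
                "filter": {
                    "bool": {
                        "must": [
                            {"term":  {"status": "active"}},               # "=="
                            {"terms": {"region": ["eu", "us", "apac"]}},   # "IN"
                            {"range": {"price": {"gte": 10, "lte": 99}}},  # "BETWEEN"
                        ]
                    }
                }
            }
        }
    }

    result = es.search(index="events", body=query)
    print(result["hits"]["total"])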

Since almost every property is searchable, there is no reason to store each document twice (in Couchbase AND in an ElasticSearch index), so I think it's a good idea to store it only in ElasticSearch. Am I right?

Or maybe someone can recommend a better database for such a task? I need easy horizontal scaling in the future (custom sharding of MySQL is not an option). This data is a kind of cache, so eventual consistency and weak durability are OK.

According to the CAP theorem, I mostly need A and P (availability and partition tolerance).

asked Jul 22 '14 by dimzon


People also ask

Is Elasticsearch considered NoSQL?

Completely open source and built with Java, Elasticsearch is a NoSQL database. That means it stores data in an unstructured way and that you cannot use SQL to query it. This Elasticsearch tutorial could also be considered a NoSQL tutorial.

Is couchbase SQL or NoSQL?

Couchbase is an award-winning distributed NoSQL cloud database that delivers unmatched versatility, performance, scalability, and financial value for all of your cloud, mobile, on-premises, hybrid, distributed cloud, and edge computing applications.

What is difference between couchbase and MongoDB?

Couchbase is multi-dimensionally distributed because the individual services such as indexing, querying, and data storage can be scaled depending on which service has the increased demand. MongoDB is uniformly distributed, with data distributed evenly across shards and using the mongod and mongos services.

What type of NoSQL is couchbase?

Couchbase Server, originally known as Membase, is an open-source, distributed (shared-nothing architecture) multi-model NoSQL document-oriented database software package optimized for interactive applications.


2 Answers

Regarding performance, provided you use appropriately sized hardware, you should not have issues indexing 1M documents per hour. I've run Elasticsearch well above that rate with no issues. There is a detailed writeup below on benchmarking and sizing a large Elasticsearch cluster that you may find useful:

ElasticSearch setup for a large cluster with heavy aggregations

For an ephemeral caching system with a TTL of only 3 hours, I agree it would be a waste to store the data in more than one repository. You could store the data in Couchbase and replicate it into Elasticsearch in real time or near real time, but why bother? I'm not certain what benefit you would get from having the data in both places.

For performance issues concerning your specific use case I'd strongly suggest benchmarking. One strength of Elasticsearch (and Solr too) that I've found is their (to me) surprisingly strong performance when searching on multiple non-text fields. You tend to think of ES for text search purposes (where it does excel), but it's also a decent general-purpose database. I've found that, in particular, it has strong performance when searching on multiple parameters compared to some other NoSQL solutions.

Personally, when benchmarking ES in this use case, I'd look at a number of different indexing options. ES supports TTL for documents, so automatically purging the cache is easy:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-ttl-field.html

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html
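
As a rough sketch (the _ttl field belongs to the 1.x-era API linked above and has since been removed; the index, type and field names here are placeholders), the mapping could enable a 3 hour default TTL like this, using the Python client:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    # Create the index with _ttl enabled so documents expire automatically
    # after 3 hours (ES 1.x feature; removed in later versions).
    es.indices.create(
        index="events",
        body={
            "mappings": {
                "event": {
                    "_ttl": {"enabled": True, "default": "3h"},
                    "properties": {
                        "status": {"type": "string", "index": "not_analyzed"},
                        "price":  {"type": "long"},
                    },
                }
            }
        },
    )

    # Documents indexed without an explicit TTL pick up the 3 hour default.
    es.index(index="events", doc_type="event", body={"status": "active", "price": 42})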

However, you may want to play around with having a different index for each hour. One thing about ES (due to its use of Lucene underneath for indexing and file storage) is that deletes work differently than in most databases. Documents are marked as deleted but not removed, and then periodically the files underneath (called segments) are merged, at which time new segments are created without the deleted documents. This can cause a fair amount of disk activity for high-volume, delete-heavy use cases in a single index. The way around this is to create a new index for each hour and then delete the index in its entirety once the data in it is over 3 hours old.
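
A minimal sketch of that index-per-hour approach with the Python client (the naming scheme and retention job are hypothetical):

    from datetime import datetime, timedelta
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    def hourly_index(ts):
        # e.g. "events-20140722-22" for 22:00 UTC on July 22nd 2014
        return "events-" + ts.strftime("%Y%m%d-%H")

    now = datetime.utcnow()

    # Writers always target the current hour's index (created on demand).
    es.index(index=hourly_index(now), doc_type="event", body={"status": "active"})

    # A small cron-style job drops whole indexes once they fall outside the
    # 3 hour window - far cheaper than deleting millions of documents.
    stale = hourly_index(now - timedelta(hours=4))
    es.indices.delete(index=stale, ignore=[404])  # ignore if already gone

Searches can then cover the whole retention window at once by querying the events-* wildcard.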

You may find this previous discussion about TTL vs. time series indexes in Elasticsearch useful: Performance issues using Elasticsearch as a time window storage

Finally, regarding easy horizontal scaling, Elasticsearch is pretty good here - you add a new node with the correct cluster name and ES takes care of the rest, automatically migrating shards to the new node. In your use case you may also want to play with the replication factor, as more replicas across more nodes are an easy way to boost query performance.
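
For example (a sketch, assuming hypothetical index names), a new node only needs the matching cluster.name in its elasticsearch.yml to join and start receiving shards, and raising the replica count is a single live settings change with the Python client:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    # Raise the replica count on all cache indexes; ES allocates the new
    # replica shards across whatever nodes are currently in the cluster.
    es.indices.put_settings(index="events-*", body={"number_of_replicas": 2})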

answered Oct 06 '22 by John Petrone


For the use case of a cache (or cache-like system), I think Elasticsearch will only give you problems in the future. I assume you don't need full-text indexing at all, as you are not looking for search-like features.

I haven't used Couchbase, but I have heard good things about it. I have heard of use cases that pair Couchbase for filtering-style lookups with Elasticsearch for more search-like purposes (and things that Couchbase can't do).

For scalability, as far as I can tell both look similar from a very high-level point of view. Both support easy sharding and replication, with rebalancing of shards and promotion of secondary replicas to primary when a node in the cluster goes down. The specifics may differ.

But in all honesty, you will have to try it out yourself and test with production-like traffic. I have worked with Elasticsearch, and I know that you can't always say whether it is the right choice for a use case, because how it behaves for one application in production may differ, in performance terms, from how it behaves for another.

But I think you are on the right track.

answered Oct 06 '22 by vaidik