Scaling of ElasticSearch

I'm searching for information on how ElasticSearch would scale with the amount of data in its indexes and am surprised how little I can find on that topic. Maybe some experience from the crowd here can help me.

We are currently using CloudSearch to index ≈ 7 million documents; in CloudSearch this results in 2 instances of type m2.xlarge. We are considering switching to ElasticSearch to reduce the cost. But all I can find about ElasticSearch's scaling is that it scales well and can be distributed over several instances, etc.

But what kind of machine (memory, disc) would I need for this kind of data?

How would that change if I increased the amount of data by a factor of 12 (≈ 80 million documents)?

asked Apr 30 '13 by Alfe



2 Answers

As Javanna said, it depends. Mostly on: (1) rate of indexing; (2) size of documents; (3) rate and latency requirements for searches; and (4) type of searches.

Considering this, the best way we can help is by giving examples. On our site (news monitoring) we:

  1. We index more than 100 docs per minute and currently have nearly 50 million documents. I've also heard of ES indexes with hundreds of millions of documents.
  2. Documents are news articles with some metadata, not short but not that large.
  3. Our search latency varies from ~50 ms (for normal and rare terms) up to ~800 ms for common terms (stopwords, which we index). This variation is largely due to our custom scoring (thanks to Lucene/ES support for customizing it) and to the fact that the dataset (the inverted lists) does not fit entirely in memory (the OS cache); when a query hits a cached inverted list, it is faster.
  4. We do OR queries with many terms, which are among the hardest. We also facet on two single-valued fields, and we have experimented with a date facet (to show the rate of publication over time); a rough sketch of such a query follows this list.
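
For illustration only, here is a rough sketch (not our actual query) of what a many-term OR search plus two term breakdowns could look like with the official Python client. The index name ("articles") and field names ("text", "source", "language", "published_at") are made up, and the sketch uses the later aggregations syntax rather than the facets API of 0.x-era ES.

```python
# Hypothetical sketch: a large OR query plus facet-style breakdowns.
# Index and field names are illustrative, not taken from the answer above.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

terms = ["economy", "inflation", "interest", "rates"]  # usually many more

body = {
    "query": {
        "bool": {
            # OR semantics: a document matches if any "should" clause matches
            "should": [{"match": {"text": t}} for t in terms],
            "minimum_should_match": 1,
        }
    },
    # breakdowns on two single-valued fields plus publication rate over time
    # (facets in 0.x ElasticSearch, aggregations in later versions)
    "aggs": {
        "by_source": {"terms": {"field": "source"}},
        "by_language": {"terms": {"field": "language"}},
        # newer ES versions call this parameter "calendar_interval"
        "per_month": {"date_histogram": {"field": "published_at", "interval": "month"}},
    },
    "size": 10,
}

response = es.search(index="articles", body=body)
print(response["hits"]["total"])
```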

We do all this with 4 EC2 m1.large instances. We're now planning to move to the just-released ES 0.90 to get all the goodies and performance improvements of Lucene 4.0.

Now, leaving examples aside: ElasticSearch is pretty scalable. It is very simple to create an index with N shards and M replicas, and then spin up X machines running ES. It will distribute all shards and replicas accordingly. You can change the number of replicas at any time (for each index).
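
For example, with the official Python client (a minimal sketch: the index name and the shard/replica counts are arbitrary, and newer client versions accept settings= instead of body=):

```python
# Minimal sketch: shard count is fixed at creation, replicas are adjustable later.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# N shards and M replicas are set in the index settings at creation time
es.indices.create(
    index="articles",
    body={"settings": {"number_of_shards": 4, "number_of_replicas": 1}},
)

# The replica count can be changed at any time on a live index
es.indices.put_settings(
    index="articles",
    body={"index": {"number_of_replicas": 2}},
)
```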

One downside is that the number of shards can't be changed after the index is created. But you can still "overshard" beforehand to leave room for scaling when needed, or create a new index with the right number of shards and reindex everything (we do this).
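
A hedged sketch of that reindexing approach using the Python client's helpers (index names and shard counts are examples; later ElasticSearch versions also offer a server-side _reindex API):

```python
# Sketch: create a new index with the desired shard count, copy everything
# over with a client-side scan-and-bulk reindex, then switch an alias.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

es.indices.create(
    index="articles_v2",
    body={"settings": {"number_of_shards": 8, "number_of_replicas": 1}},
)

# helpers.reindex streams documents out of the old index and bulk-indexes
# them into the new one (it does not delete or modify the source index)
helpers.reindex(es, source_index="articles", target_index="articles_v2")

# Point an alias at the new index so search code doesn't have to change
es.indices.put_alias(index="articles_v2", name="articles_current")
```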

Finally, ElasticSearch (like Solr) uses the Lucene search library under the hood, which is a very mature and well-known library.

answered by Felipe Hummel


I've actually recently switched from using CloudSearch to a hosted ElasticSearch service at the company I work for. Our specific application has a little over 100 million documents and is growing daily. So far, our experience with ElasticSearch has been absolutely wonderful. Search performance averages ~250 ms, even with all the sorting, filtering, and faceting. Indexing documents is also relatively fast, despite the several-megabyte payloads we push over HTTP with the bulk API every couple of hours. Refreshes seem to be near-instant as well.
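For reference, a minimal sketch of that kind of bulk indexing with the official Python client (the index name and document fields are made up, not our actual schema):

```python
# Sketch: send a batch of documents in one bulk request instead of one
# HTTP round trip per document. Names and fields are illustrative.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

docs = [
    {"title": "First article", "body": "full text of the first article"},
    {"title": "Second article", "body": "full text of the second article"},
]

actions = ({"_index": "articles", "_source": doc} for doc in docs)
ok, errors = helpers.bulk(es, actions)
print(f"indexed {ok} docs, {len(errors)} errors")
```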

For our ~100M doc / 12GB index, we used 4 shards / 2 replicas (we'll bump to 3 replicas if performance degrades) spread across 4 nodes. Prior to setting up the index, our team spent a couple of days researching ElasticSearch cluster deployment/maintenance, and opted to use http://qbox.io to save money and time. We were initially quite worried about performance and scale issues when choosing to host our index on a dedicated cluster like Qbox, but so far the experience has been seriously fantastic.

Since our index lives on a dedicated cluster, we don't have access to the nuts-and-bolts node-level configuration settings, so my technical expertise with ES deployment is still pretty limited. That being said, I can't say exactly which performance tweaks account for what we've experienced on our index. However, I do know Qbox's cluster uses SSDs... so that could definitely have a significant impact.

Case in point: ElasticSearch has scaled seamlessly. I highly, highly recommend the switch (even if it's just to save $$; CloudSearch is crazy expensive). Hope this information helps!

answered by Ben at Qbox.io