I am planning on using ElasticSearch to index my Cassandra database. I am wondering if anyone has seen the practical limits of ElasticSearch. Do things get slow in the petabyte range? Also, has anyone has any problems using ElasticSearch to index Cassandra?
Though there is technically no limit to how much data you can store on a single shard, Elasticsearch recommends a soft upper limit of 50 GB per shard, which you can use as a general guideline that signals when it's time to start a new index.
There are no hard limits on shard size, but experience shows that shards between 10GB and 50GB typically work well for logs and time series data. You may be able to use larger shards depending on your network and use case. Smaller shards may be appropriate for Enterprise Search and similar use cases.
A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.
High throughput: Some clusters have up to 5TB data ingested per day, and some clusters take more than 400 million search requests per day. Requests would accumulate at upstream if Elasticsearch could not handle them in time.
See this thread from 2011, which mentions ElasticSearch configurations with 1700 shards each of 200GB, which would be in the 1/3 petabyte range. I would expect that the architecture of ElasticSearch would support almost limitless horizontal scalability, because each shard index works separately from all other shards.
The practical limits (which would apply to any other solution as well) include the time needed to actually load that much data in the first place. Managing a Cassandra cluster (or any other distributed datastore) of that size will also involve significant workload just for maintenance, load balancing etc.
Sonian is the company kimchy alludes to in that thread. We have over a petabyte on AWS across multiple ES clusters. There isn't a technical limitation to how far horizontally you can scale ES, but as DNA mentioned there are practical problems. The biggest by far is network. It applies to every distributed data storage. You can only move so much across the wire at a time. When ES has to recover from a failure, it has to move data. The best option is to use smaller shards across more nodes (more concurrent transfer), but you risk a higher rate of failure and exhorbitant cost per byte.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With