
ElasticSearch vs. ElasticSearch+Cassandra

My main question is what is the benefit of integrating Cassandra and Elasticsearch versus using only Elasticsearch?

In fact, there are answers to similar questions on StackOverflow (e.g., here and here). But there are some points:

  • A lot of the answers are old. Much may have changed since then.
  • One point that is mentioned is that "Sometimes ElasticSearch loses writes". However, those alleged losses may have been caused by bugs that have since been fixed. Cassandra may presumably also have bugs that cause data loss. Is there any fundamental difference between Cassandra and Elasticsearch that causes Elasticsearch to lose data but does not affect Cassandra?
  • It is mentioned that "Schema changes are difficult to do in ElasticSearch without blowing everything away and reloading." This may not be a major problem for us, assuming that our data model is relatively stable or at least backward-compatible. Also, thanks to dynamic mapping, Elasticsearch may adapt to new requirements (e.g., extra fields).
  • With respect to the indexing delay in Elasticsearch, Cassandra does not guarantee immediate consistency either. So, in Cassandra you may also face delays when reading data that was just written.

Overall, what extra features does Cassandra offer when used in conjunction with Elasticsearch?

P.S. A general answer would be preferable. But, if necessary, assume that we only append rows to the database and never delete or update anything. We want to be able to do full-text search on the data.

Shayan asked Apr 15 '20 08:04



1 Answer

So as the author of one of the linked answers (Elasticsearch vs Cassandra vs Elasticsearch with Cassandra), I suppose that I should weigh in here.

those alleged losses may have been caused by bugs that have since been fixed.

This is an absolutely true statement. The answer I wrote is almost six years old, and ElasticSearch has grown to be a much more reliable product in that time. That being said, there are some things which Cassandra can do that ElasticSearch just wasn't designed to do (and vice-versa).

what extra features does Cassandra offer...

I can think of a few, which I'll summarize here:

  • Write throughput/performance/latency

ElasticSearch is a search engine based on the Lucene project. Handling large amounts of write throughput at low latencies is just not something that it was designed to do; at least not "out of the box." There are ways to configure ElasticSearch to be better at this, as described here: Techniques to Achieve High Write Throughput With ElasticSearch. But in terms of building a new cluster with minimal config, you'll spend less time engineering Cassandra to accomplish this.

"Sometimes ElasticSearch loses writes"

Yes, I wrote that. Again, ElasticSearch has improved. A lot. But I still see this happen under high write throughput conditions. When a cluster is engineered for a certain level of throughput, and an application exceeds those tolerances causing a node to become overwhelmed from the write back-pressure, writes will be lost.

Cassandra is not immune to this problem, either. It just has a higher tolerance for it. If you were to use them both together, putting something like Kafka in front of them to "throttle" the write throughput to each would be a good approach.
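
As a rough illustration of the throttling idea, here is a minimal Python sketch. It uses a bounded in-process queue per sink as a stand-in for Kafka, which is an assumption made purely for illustration: a real Kafka deployment keeps a durable log and lets each consumer read at its own pace, whereas this toy simply blocks the producer when a sink falls behind. The function name run_throttled and the sink callables are hypothetical and are not part of any Cassandra, Elasticsearch, or Kafka API.

```python
import queue
import threading


def run_throttled(records, sinks, max_buffered=100):
    """Fan every record out to every sink through bounded buffers.

    Each sink is a plain callable (hypothetical stand-ins for a Cassandra
    writer and an Elasticsearch indexer). A full buffer blocks the producer,
    so no sink is ever handed more work than it can absorb.
    """
    buffers = [queue.Queue(maxsize=max_buffered) for _ in sinks]
    stop = object()  # sentinel that marks the end of the stream

    def drain(buf, sink):
        # One worker per sink: pull records in order until the sentinel.
        while True:
            item = buf.get()
            if item is stop:
                break
            sink(item)

    workers = [
        threading.Thread(target=drain, args=(buf, sink))
        for buf, sink in zip(buffers, sinks)
    ]
    for worker in workers:
        worker.start()

    for record in records:
        for buf in buffers:
            buf.put(record)  # blocks while this sink's buffer is full

    for buf in buffers:
        buf.put(stop)
    for worker in workers:
        worker.join()
```

Calling run_throttled(rows, [write_to_cassandra, index_in_elasticsearch]) with two writer functions of your own would deliver each row to both, in order, while capping how far ahead of either store the producer can run.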

  • Multi Data center High Availability (MDHA)

With the ability to define logical data centers and availability zones (racks), Cassandra has always been good at replicating a data set over multiple regions. This is problematic for ElasticSearch, as it does not have a concept of a logical data center, and its "master" nodes are not active/active.

  • Peer nodes vs. role-based nodes

As a follow-up to my MDHA point, ElasticSearch now allows for nodes to be designated with a "role" in the cluster. You can specify multiple nodes to act in the "master" role, in charge of adding and updating indexes. Any node can direct search traffic to the nodes which work under the "data" role. In fact, one way to improve write throughput (my first talking point) is to designate a node or two with the "ingest" role, which can prevent read and write traffic from interfering with each other.

This deviates from Cassandra's approach, where every node is a peer and can handle reads and writes. Being able to treat all nodes the same simplifies maintenance and administration. And "no," despite popular misconception, a "seed" node is not anything special.

  • Query vs. Search

To me, this is the fundamental difference between the two. Querying is not the same as searching. They may seem similar, but they are quite different.

Retrieving data by matching a pattern on one or multiple columns/properties is searching. Also with searching, the number of results is more of an unknown beforehand. Sure, Cassandra has added some features in the last few years to allow for pattern matching based on LIKE queries (I don't recommend its use). But when the ability to "search" a data set is required, Cassandra can't compete with ElasticSearch.

Retrieving data by providing a specific value on a specific key (column) is querying. With querying, it is also easier to have accurate expectations of the number of results to be returned. If I were building an app and I knew that I'd only ever have to retrieve data based on a static, pre-defined query with a specific key, I'd choose Cassandra every time.
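
To make the distinction concrete, here is a toy Python sketch. It is not how either product is implemented: KeyValueTable and InvertedIndex are hypothetical classes invented for this illustration. The first answers a query (exact lookup by a known key, at most one row back), the second answers a search (match terms anywhere in the text, with a result count that is unknown beforehand).

```python
from collections import defaultdict


class KeyValueTable:
    """Query: exact lookup by a known key, so the result size is predictable."""

    def __init__(self):
        self._rows = {}

    def insert(self, key, row):
        self._rows[key] = row

    def get(self, key):
        # Returns the single matching row, or None if the key is absent.
        return self._rows.get(key)


class InvertedIndex:
    """Search: find every document that contains all of the given terms."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> ids of docs containing it

    def index(self, doc_id, text):
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, text):
        terms = text.lower().split()
        if not terms:
            return set()
        # Intersect the posting sets of all terms (AND semantics).
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

The table can only answer "give me the row for this key", while the index can answer "which documents mention these words", which is roughly the split between the two systems described above.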

With Cassandra, I can also tune query consistency, requiring operational acknowledgement from more or fewer replicas. Likewise, I can also direct those operations to a specific geographic region, based on the locality of the application.
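
The trade-off behind tunable consistency can be written as simple arithmetic. The sketch below assumes the commonly documented rule for Cassandra: with replication factor RF, a read is guaranteed to overlap the latest acknowledged write when the number of replicas acknowledging the write (W) plus the number consulted on the read (R) exceeds RF. The function names are hypothetical helpers, not driver APIs.

```python
def quorum(replication_factor):
    """Replica count required by a QUORUM operation: a strict majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(write_replicas, read_replicas, replication_factor):
    """True when every read set must overlap every write set (R + W > RF)."""
    return write_replicas + read_replicas > replication_factor
```

For example, with RF = 3, writing and reading at QUORUM (2 + 2 > 3) gives overlapping replica sets, while writing and reading at ONE (1 + 1 > 3 is false) does not, which is where the read-after-write delays raised in the question come from.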

...when used in conjunction with Elasticsearch?

They complement each other well. Cassandra is good at some things (detailed above) that ElasticSearch is not (and vice-versa...saying that a lot). An application may require both searching and querying. Sometimes you've got an app that needs that high-speed key lookup "oh, and we also want search."

Summary, tl;dr;

So while I've written quite a bit here, the main point that I'll keep coming back to is picking the right tool for the job. When I need to search, I'll pick ElasticSearch. When I need to query in a highly-available, geographically-aware scenario, I'll pick Cassandra. I still see applications use both (in tandem), so both have their merits.

Aaron answered Oct 21 '22 08:10