ElasticSearch Analytical queries

Question

I am evaluating a few different options for powering an analytics application using an open-source technology. One of the options is using ElasticSearch, though I haven't been able to find any examples of companies using it for large-scale implementations of analytics, thus my question here.

For datasets of 1B-10B points, what limitations (if any, or would it be possible?) would ElasticSearch have? For example, in having a feature-set like Google Analytics, with it.

Andrei Stefan · Accepted Answer

Here's one user who seems to do analytics on largeish amounts of data - https://digitalgov.gov/2015/01/07/elk - plus description of what they do including downsides.

With Elasticsearch there is no black-white answer to a question as open-ended as yours. The amount of records is not everything: how much disk space are we talking about, how many nodes, how many indices, the number of shards for each, what kind of analytics you need, hardware specs etc etc. Two things are certain from the data you mentioned: you need dedicated master nodes and more importantly good client nodes and depending on queries and the concurrent searches count you will need more or less of them.

In Elasticsearch 5 the client node is called coordinating node but it has the same role. One limitation I can think of is the heap/RAM memory of such coordinating node. The heap of an Elasticsearch node shouldn't be set to values larger than ~30GB due to the longer garbage collection cycles of the JVM (larger memory to clean, more time it takes, more unusable the node is). During GC nothing else runs on that JVM. So you could be limited by the size of the memory.

I said that you most likely will need coordinating nodes because heavy aggregations (what will probably be the most used feature in an analytics platform) will use cpu and memory in the final phase of a query where it gathers the results from all shards involved and performs a final sorting and aggregation. Thus it will need more memory than a normal data node would only for aggregations.

I doubt though that a single aggregation will use so many GBs of memory but it could theoretically use it if the query/aggregation being used is built in a reckless way. Depending on how many concurrent searches there are and how much memory they use you might need more or less coordinating nodes so that the GC cycles are not very frequent.

Bottom line: I think this is possible but some common sense is needed (see my comment about reckless aggregations) and some as close to reality as possible estimations regarding the load.

Jake Henningsgaard · Answer

Google Analytics Pros:

Easy to Install
Can be used in multiple environments (e.g. web, mobile, other)
Customized data collection

Google Analytics Cons:

Custom reporting is limited
Upgrading to Premium is expensive
Requires continual traning
Slices data into smaller samples to deal with large sampling issues

ElasticSearch Pros:

Distributed by design
Easier to scale horizontally
Good at full text search
Fast indexing & querying

ElasticSearch Cons:

Not a relational database therefore does not benefit from things like foreign-key constaints
Data consistency can be affected
No built-in authentication or authorization system

ElasticSearch Analytical queries

Tags:

scalability

elasticsearch

analytics

David542

2 Answers

Andrei Stefan

Jake Henningsgaard

Recent Activity

Donate For Us

ElasticSearch Analytical queries

Tags:

scalability

elasticsearch

analytics

David542

2 Answers

Andrei Stefan

Jake Henningsgaard

Related questions

Recent Activity

Donate For Us