Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ElasticSearch Analytical queries

I am evaluating a few different options for powering an analytics application using an open-source technology. One of the options is using ElasticSearch, though I haven't been able to find any examples of companies using it for large-scale implementations of analytics, thus my question here.

For datasets of 1B-10B points, what limitations (if any, or would it be possible?) would ElasticSearch have? For example, in having a feature-set like Google Analytics, with it.

like image 578
David542 Avatar asked Dec 02 '16 01:12

David542


2 Answers

Here's one user who seems to do analytics on largeish amounts of data - https://digitalgov.gov/2015/01/07/elk - plus description of what they do including downsides.

With Elasticsearch there is no black-white answer to a question as open-ended as yours. The amount of records is not everything: how much disk space are we talking about, how many nodes, how many indices, the number of shards for each, what kind of analytics you need, hardware specs etc etc. Two things are certain from the data you mentioned: you need dedicated master nodes and more importantly good client nodes and depending on queries and the concurrent searches count you will need more or less of them.

In Elasticsearch 5 the client node is called coordinating node but it has the same role. One limitation I can think of is the heap/RAM memory of such coordinating node. The heap of an Elasticsearch node shouldn't be set to values larger than ~30GB due to the longer garbage collection cycles of the JVM (larger memory to clean, more time it takes, more unusable the node is). During GC nothing else runs on that JVM. So you could be limited by the size of the memory.

I said that you most likely will need coordinating nodes because heavy aggregations (what will probably be the most used feature in an analytics platform) will use cpu and memory in the final phase of a query where it gathers the results from all shards involved and performs a final sorting and aggregation. Thus it will need more memory than a normal data node would only for aggregations.

I doubt though that a single aggregation will use so many GBs of memory but it could theoretically use it if the query/aggregation being used is built in a reckless way. Depending on how many concurrent searches there are and how much memory they use you might need more or less coordinating nodes so that the GC cycles are not very frequent.

Bottom line: I think this is possible but some common sense is needed (see my comment about reckless aggregations) and some as close to reality as possible estimations regarding the load.

like image 89
Andrei Stefan Avatar answered Nov 17 '22 06:11

Andrei Stefan


Google Analytics Pros:

  • Easy to Install
  • Can be used in multiple environments (e.g. web, mobile, other)
  • Customized data collection

Google Analytics Cons:

  • Custom reporting is limited
  • Upgrading to Premium is expensive
  • Requires continual traning
  • Slices data into smaller samples to deal with large sampling issues

ElasticSearch Pros:

  • Distributed by design
  • Easier to scale horizontally
  • Good at full text search
  • Fast indexing & querying

ElasticSearch Cons:

  • Not a relational database therefore does not benefit from things like foreign-key constaints
  • Data consistency can be affected
  • No built-in authentication or authorization system
like image 32
Jake Henningsgaard Avatar answered Nov 17 '22 05:11

Jake Henningsgaard