I am working on event analytics. I use Hadoop to process the logs and store some results in MySQL. This no longer works due to scalability issues, as new logs keep coming in daily.
We need to show stats per year, month, week, day, and hour, along with filtering capability.
Our sample can grow to 100k users, each using 20 websites per hour:
100,000(users) * 20 (unique website) * 2 (locations) * 24 (hours) = 96,000,000 (96 Million max records per day)
Our table looks like this:
event_src_id, Time, User, Website, location, some stats
Some example queries are:
1) select website, sum(stats), count(distinct(user_id)) from table group by website;
2) select website, sum(stats), count(distinct(user_id)) from table where YEAR(Time) = 2009 group by website, MONTH(Time);
3) select website, sum(stats), count(distinct(user_id)) from table where event_src_id=XXXXXXXXXXX group by website;
4) select website, sum(stats), count(distinct(user_id)) from table where Time > '2014-01-01' and Time <= '2014-01-31' group by website;
5) select website, location, sum(stats), count(distinct(user_id)) from table group by website, location;
6) select website, sum(stats) as stats_val from table group by website order by stats_val desc limit 10;
select location, sum(stats) as stats_val from table group by location order by stats_val desc limit 10;
7) delete from table where event_src_id=XXXXXXXXXXX; (may delete all 96M records)
I tried Elasticsearch for Hadoop, and it seems the insertion part can be handled that way; I am more worried about the reading part. The aggregation framework seems to give some hope, but I could not make it work for query one: how to group and sum and distinct at the same time? How can I best use Elasticsearch with Hadoop for these OLAP-style queries, given the scalability and performance requirements? Any help will be appreciated.
Firstly, I do not think that using ElasticSearch for OLAP-like queries is a good idea. I would recommend using Dremel-like technologies (Impala, TEZ, Storm, etc.) which support the SQL you specified, and which have several advantages for this kind of workload.
Do not get me wrong, I love ElasticSearch/Logstash/Kibana, but for log collection and visualization. Of course it is possible to make some advanced queries, but it has some limitations which I found out in my personal projects.
Also consider using Kibana, it is a great tool for data statistics in ElasticSearch and you can do a lot of things with it.
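One thing to keep in mind: the fields you group on (website, location, user_id, event_src_id) need to be indexed as exact values rather than analyzed text for the terms and cardinality aggregations below to behave as expected. A minimal mapping sketch, assuming a recent Elasticsearch version and a hypothetical index named events (the field types are my guesses from your table, not something you specified):

PUT /events
{
  "mappings": {
    "properties": {
      "event_src_id": { "type": "keyword" },
      "Time":         { "type": "date" },
      "user_id":      { "type": "keyword" },
      "website":      { "type": "keyword" },
      "location":     { "type": "keyword" },
      "stats":        { "type": "long" }
    }
  }
}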
Here are some example queries as you requested (I have not tested them):
1)
{
  "aggs": {
    "website": {
      "terms": {
        "field": "website"
      },
      "aggs": {
        "sum_stats": {
          "sum": {
            "field": "stats"
          }
        },
        "distinct_user": {
          "cardinality": {
            "field": "user_id",
            "precision_threshold": 100
          }
        }
      }
    }
  }
}
2-6 are similar; reuse the aggregations from 1) with different filters, like this:
{
  "aggs": {
    "your_filter": {
      "filter": {
        "term": { "event_src_id": "XXXXXXXXXXX" }
      },
      "aggs": {
        "website": {
          "terms": {
            "field": "website"
          },
          "aggs": {
            "sum_stats": {
              "sum": {
                "field": "stats"
              }
            },
            "distinct_user": {
              "cardinality": {
                "field": "user_id",
                "precision_threshold": 100
              }
            }
          }
        }
      }
    }
  }
}
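For query 6 (the top 10 websites by stats), the terms aggregation can also be ordered by a sub-aggregation and limited with size. A minimal sketch along the same lines, not from the original answer, assuming the same field names:

{
  "size": 0,
  "aggs": {
    "website": {
      "terms": {
        "field": "website",
        "size": 10,
        "order": { "stats_val": "desc" }
      },
      "aggs": {
        "stats_val": {
          "sum": { "field": "stats" }
        }
      }
    }
  }
}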
7) DELETE is quite easy with the delete by query API:
{
  "query" : {
    "term" : { "event_src_id" : "XXXXXXXXXXX" }
  }
}
how to group and sum and distinct at the same time
Aggregations can have sub-aggregations.
First, the group functionality corresponds to the terms aggregation and (sometimes) the top_hits aggregation. Second, there is a sum aggregation, a simple metric aggregation. Finally, your use of distinct in this case is to perform a count(distinct), which corresponds to a cardinality aggregation, which can be approximate or exact depending on your needs.
7) delete from table where event_src_id=XXXXXXXXXXX; (may delete all 96M records)
There is a delete by query API which you can use, but be careful about high percentages of deleted documents; Lucene and Elasticsearch are not optimized for this, and you will incur overhead from delete markers in the data.
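A sketch of what the request could look like, assuming a recent Elasticsearch version and a hypothetical index named events:

POST /events/_delete_by_query
{
  "query": {
    "term": { "event_src_id": "XXXXXXXXXXX" }
  }
}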
Examples
select website, sum(stats), count(distinct(user_id)) from table group by website
GET /_search
{
  "aggs": {
    "website_stats": {
      "terms": {
        "field": "website"
      },
      "aggs": {
        "sum_stats": {
          "sum": {
            "field": "stats"
          }
        },
        "count_users": {
          "cardinality": {
            "field": "user_id"
          }
        }
      }
    }
  }
}
select website, sum(stats), count(distinct(user_id)) from table where YEAR(Time) = 2009 group by website, MONTH(Time)
GET /_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "Time": {
            "gte": "2009-01-01 00:00:00",
            "lt": "2010-01-01 00:00:00"
          }
        }
      }
    }
  },
  "aggs": {
    "monthly_stats": {
      "terms": {
        "field": "website"
      },
      "aggs": {
        "months": {
          "date_histogram": {
            "field": "Time",
            "interval": "month"
          },
          "aggs": {
            "sum_stats": {
              "sum": {
                "field": "stats"
              }
            },
            "count_users": {
              "cardinality": {
                "field": "user_id"
              }
            }
          }
        }
      }
    }
  }
}
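For query 5 (group by website and location), the same pattern extends to nested terms aggregations; a minimal sketch along the same lines, not part of the original answer:

GET /_search
{
  "aggs": {
    "website_stats": {
      "terms": {
        "field": "website"
      },
      "aggs": {
        "per_location": {
          "terms": {
            "field": "location"
          },
          "aggs": {
            "sum_stats": {
              "sum": { "field": "stats" }
            },
            "count_users": {
              "cardinality": { "field": "user_id" }
            }
          }
        }
      }
    }
  }
}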