I am working on event analytics. I use Hadoop to process the logs and store some results in MySQL. This no longer works due to scalability issues, as new logs keep coming in daily.
We need to show stats per year, month, week, day, and hour, along with filtering capability.
Our sample can grow to 100k users, each using 20 websites per hour:
100,000(users) * 20 (unique website) * 2 (locations) * 24 (hours) = 96,000,000 (96 Million max records per day)
Our table looks like this:
event_src_id, Time, User, Website, location, some stats
Some example queries are:
1) select website, sum(stats), count(distinct(user_id)) from table group by website;
2) select website, sum(stats), count(distinct(user_id)) from table where YEAR(Time) = 2009 group by website, MONTH(Time);
3) select website, sum(stats), count(distinct(user_id)) from table where event_src_id=XXXXXXXXXXX group by website;
4) select website, sum(stats), count(distinct(user_id)) from table where Time > '2014-01-01' and Time <= '2014-01-31' group by website;
5) select website, location, sum(stats), count(distinct(user_id)) from table group by website, location;
6) select website, sum(stats) as stats_val from table group by website order by stats_val desc limit 10;
select location, sum(stats) as stats_val from table group by location order by stats_val desc limit 10;
7) delete from table where event_src_id=XXXXXXXXXXX; (may delete all 96M records)
I tried Elasticsearch for Hadoop, and it seems the insertion part can be handled that way; I am more worried about the reading part. The aggregation framework seems to give some hope, but I could not make it work for query one: how to group and sum and distinct at the same time? How can I best use Elasticsearch with Hadoop for these OLAP-style queries, given the scalability and performance requirements? Any help will be appreciated.
Firstly, I do not think that using ElasticSearch for OLAP-like queries is a good idea. I would recommend using Dremel-like technologies (Impala, TEZ, Storm, etc.) which support the SQL you specified, and which have several advantages for this kind of workload.
Do not get me wrong, I love ElasticSearch/Logstash/Kibana, but for log collection and visualization. Of course it is possible to make some advanced queries, but it has some limitations which I found out in my personal projects.
Also consider using Kibana, it is a great tool for data statistics in ElasticSearch and you can do a lot of things with it.
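One thing to keep in mind: the fields you group on (website, location, user_id, event_src_id) need to be indexed as exact values rather than analyzed text for the terms and cardinality aggregations below to behave as expected. A minimal mapping sketch, assuming a recent Elasticsearch version and a hypothetical index named events (the field types are my guesses from your table, not something you specified):

PUT /events
{
  "mappings": {
    "properties": {
      "event_src_id": { "type": "keyword" },
      "Time":         { "type": "date" },
      "user_id":      { "type": "keyword" },
      "website":      { "type": "keyword" },
      "location":     { "type": "keyword" },
      "stats":        { "type": "long" }
    }
  }
}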
Here are some example queries as you requested (I have not tested them):
1)
{
  "aggs": {
    "website": {
      "terms": {
        "field": "website"
      },
      "aggs": {
        "sum_stats": {
          "sum": {
            "field": "stats"
          }
        },
        "distinct_user": {
          "cardinality": {
            "field": "user_id",
            "precision_threshold": 100
          }
        }
      }
    }
  }
}
2-6 are similar; reuse the aggregations from 1) with different filters, like this:
{
  "aggs": {
    "your_filter": {
      "filter": {
        "term": { "event_src_id": "XXXXXXXXXXX" }
      },
      "aggs": {
        "website": {
          "terms": {
            "field": "website"
          },
          "aggs": {
            "sum_stats": {
              "sum": {
                "field": "stats"
              }
            },
            "distinct_user": {
              "cardinality": {
                "field": "user_id",
                "precision_threshold": 100
              }
            }
          }
        }
      }
    }
  }
}
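For query 6 (the top 10 websites by stats), the terms aggregation can also be ordered by a sub-aggregation and limited with size. A minimal sketch along the same lines, not from the original answer, assuming the same field names:

{
  "size": 0,
  "aggs": {
    "website": {
      "terms": {
        "field": "website",
        "size": 10,
        "order": { "stats_val": "desc" }
      },
      "aggs": {
        "stats_val": {
          "sum": { "field": "stats" }
        }
      }
    }
  }
}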
7) DELETE is quite easy with the delete by query API:
{
  "query" : {
    "term" : { "event_src_id" : "XXXXXXXXXXX" }
  }
}
how to group and sum and distinct at the same time
Aggregations can have sub-aggregations.
First, the group functionality corresponds to the terms aggregation and (sometimes) the top_hits aggregation. Second, there is a sum aggregation, a simple metric aggregation. Finally, your use of distinct in this case is to perform a count(distinct), which corresponds to a cardinality aggregation, which can be approximate or exact depending on your needs.
7) delete from table where event_src_id=XXXXXXXXXXX; (may delete all 96M records)
There is a delete by query API which you can use, but be careful about high percentages of deleted documents; Lucene and Elasticsearch are not optimized for this, and you will incur overhead from delete markers in the data.
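A sketch of what the request could look like, assuming a recent Elasticsearch version and a hypothetical index named events:

POST /events/_delete_by_query
{
  "query": {
    "term": { "event_src_id": "XXXXXXXXXXX" }
  }
}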
Examples
select website, sum(stats), count(distinct(user_id)) from table group by website
GET /_search
{
  "aggs": {
    "website_stats": {
      "terms": {
        "field": "website"
      },
      "aggs": {
        "sum_stats": {
          "sum": {
            "field": "stats"
          }
        },
        "count_users": {
          "cardinality": {
            "field": "user_id"
          }
        }
      }
    }
  }
}
select website, sum(stats), count(distinct(user_id)) from table where YEAR(Time) = 2009 group by website, MONTH(Time)
GET /_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "Time": {
            "gte": "2009-01-01 00:00:00",
            "lt": "2010-01-01 00:00:00"
          }
        }
      }
    }
  },
  "aggs": {
    "monthly_stats": {
      "terms": {
        "field": "website"
      },
      "aggs": {
        "months": {
          "date_histogram": {
            "field": "Time",
            "interval": "month"
          },
          "aggs": {
            "sum_stats": {
              "sum": {
                "field": "stats"
              }
            },
            "count_users": {
              "cardinality": {
                "field": "user_id"
              }
            }
          }
        }
      }
    }
  }
}
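For query 5 (group by website and location), the same pattern extends to nested terms aggregations; a minimal sketch along the same lines, not part of the original answer:

GET /_search
{
  "aggs": {
    "website_stats": {
      "terms": {
        "field": "website"
      },
      "aggs": {
        "per_location": {
          "terms": {
            "field": "location"
          },
          "aggs": {
            "sum_stats": {
              "sum": { "field": "stats" }
            },
            "count_users": {
              "cardinality": { "field": "user_id" }
            }
          }
        }
      }
    }
  }
}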