
How to calculate the distribution (histogram) of a large amount of data in a distributed system?

I am building a metrics reporting system on an instance fleet containing more than 100,000 front-end instances. For any request, every single instance will record a response time. What I need is the distribution of the response times of every kind of request over the whole fleet, for example the [Percentile 50, Percentile 90, Percentile 99, Percentile 99.9...] of [requestType1, requestType2...requestType1000].

Every instance collects the response times of the requests it handles. So over a minute, what one instance holds in memory is a list of response times for each requestType, for example requestType1 - [1, 2, 3, 4, 1, 2], requestType2 - [2, 2, 3, 2, 1]...... What I need to do is process this data and produce the final result.
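To make the setup concrete, here is a minimal sketch of the per-instance in-memory buffer I mean (the class and method names are just illustrative, not from any library; a real implementation would swap buffers atomically instead of copy-and-clear):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ResponseTimeBuffer {
        // requestType -> raw response times (ms) observed during the current minute
        private final Map<String, List<Long>> samples = new HashMap<>();

        public synchronized void record(String requestType, long responseTimeMs) {
            samples.computeIfAbsent(requestType, k -> new ArrayList<>()).add(responseTimeMs);
        }

        // Called once per minute to hand the buffered data to the reporting pipeline.
        public synchronized Map<String, List<Long>> drain() {
            Map<String, List<Long>> snapshot = new HashMap<>(samples);
            samples.clear();
            return snapshot;
        }
    }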

I have tried a lot of designs; my major pain points are the huge number of datapoints collected for every single requestType, and the expense of communication between instances. I will explain my current design below, but I also want to know whether there are better designs or some fancy algorithms that can aggregate histograms.

Currently the most promising design is this: every front-end instance sends its data to a random instance of a mid-layer instance fleet. In this mid-layer fleet, every instance aggregates all the datapoints it gets over a short period of time, e.g. 5 seconds (it doesn't have enough memory to hold data for longer). The mid-layer instance then distributes the aggregated data to back-end instances by the hash value of the requestType, which means all mid-layer instances send the datapoints of the same requestType to the same back-end instance. On the back-end instance I can then use a third-party histogram container (CodaHale's histogram or HdrHistogram) to calculate the P50, P90, P99 of the incoming datapoints... The reason I need the mid-layer fleet is that sending data from front-end instances is expensive, so I want each front-end instance to send all of its data in one call rather than making 100 calls to 100 different back-end instances.
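For illustration, the routing step on the mid-layer could be as simple as hashing the requestType to pick a back-end host (the names here are made up; a real system might use consistent hashing so that adding or removing a back-end instance doesn't reshuffle every requestType):

    import java.util.List;

    public class BackendRouter {
        private final List<String> backendHosts; // e.g. ["backend-0", "backend-1", ...]

        public BackendRouter(List<String> backendHosts) {
            this.backendHosts = backendHosts;
        }

        // Deterministic: every mid-layer instance maps "requestType1" to the same host.
        public String backendFor(String requestType) {
            int bucket = Math.floorMod(requestType.hashCode(), backendHosts.size());
            return backendHosts.get(bucket);
        }
    }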

The main problems I can think of with this design are the relatively high complexity, and that if one back-end instance goes down, I may lose all the data of some requestTypes. So for the system design part, does anyone have better ideas?

The other direction I am thinking about is to find a fancy algorithm to aggregate existing histograms. With the design above, the data I get will be 100% accurate, but actually I can tolerate some error. For example, in CodaHale's histogram and HdrHistogram, I am sure they don't actually save all the data points, but apply some advanced math to get a relatively high-precision result at very low cost. And I could use such a histogram library on the front-end or mid-layer instances. But the problem is that although I can get the [P50, P90, P99...] of every front-end instance or mid-layer instance at low cost, I couldn't find a way to aggregate them. Because different front-end instances may handle different types of requests, and the distribution of requests across front-end instances is unknown, simply calculating the average of ALL the P50, P90, P99 values would be very inaccurate. So does anyone have an idea how I can aggregate multiple CodaHale histograms or HdrHistograms together? Or are there any algorithms that can help aggregate histograms into one?
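For reference, the cheap per-instance part would look roughly like this with HdrHistogram, one histogram per requestType (the value range and precision below are assumptions I would tune to my latency range); the open question is how to combine many of these across the fleet:

    import org.HdrHistogram.Histogram;

    public class PerTypeLatency {
        // Track values up to 1 hour in microseconds with 3 significant digits.
        private final Histogram histogram = new Histogram(3_600_000_000L, 3);

        public void record(long responseTimeMicros) {
            histogram.recordValue(responseTimeMicros);
        }

        public void printLocalPercentiles() {
            System.out.println("P50 = " + histogram.getValueAtPercentile(50.0));
            System.out.println("P90 = " + histogram.getValueAtPercentile(90.0));
            System.out.println("P99 = " + histogram.getValueAtPercentile(99.0));
        }
    }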

========================================================================

I had some new ideas last night. Since P50 and P90 measure the "average" of all the data, I think simply applying a weighted average to the P50 and P90 values calculated on every mid-layer instance should be good enough. But P99, P99.9 and P99.99 measure the outlying data, so an average of the P99 of each subset may not be accurate.

But assuming the data on each mid-layer instance is relatively randomly distributed, I can take the top 5% of the datapoints on every mid-layer instance and send them to the back-end. The top 5% of every mid-layer instance's datapoints together form the top 5% of the overall datapoints. And I have more confidence that the P80 of these 5% of the data is close to the P99 of the overall data, the P98 of these 5% is close to the P99.9 of the overall data, and the P99.8 of the 5% is close to the P99.99 of the overall data.

I hope that in this way I only have to transfer 5% of the overall data but still get a highly accurate result. What do you think of this approach?
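A rough sketch of what I mean, with the percentile remapping spelled out (the class and method names are just illustrative):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class TailSampler {
        // Returns the largest 5% of the datapoints collected on one mid-layer instance.
        public static List<Long> top5Percent(List<Long> responseTimes) {
            List<Long> sorted = new ArrayList<>(responseTimes);
            Collections.sort(sorted);
            int cutoff = (int) Math.floor(sorted.size() * 0.95);
            return sorted.subList(cutoff, sorted.size());
        }

        // On the back-end, a percentile of the merged top-5% data maps back to an
        // overall percentile: overall = 100 - 0.05 * (100 - subsetPercentile).
        // e.g. subset P80 -> overall P99, subset P98 -> overall P99.9,
        //      subset P99.8 -> overall P99.99.
        public static double overallPercentile(double subsetPercentile) {
            return 100.0 - 0.05 * (100.0 - subsetPercentile);
        }
    }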

Liu Yunao asked May 27 '15



1 Answer

System design:

If calls are expensive then maybe you could stream the data? I don't see the real benefit of this mid-tier in your description - why would a frontend -> mid-tier call be cheaper than a frontend -> backend call?

If you are concerned about losing data you have two options:

  • send events to multiple nodes. But you will need to somehow avoid duplication when processing them.
  • write everything to a persistent log (Kafka could do the work here)

It all depends on the volume of events (1/min/frontend or 10k/s/frontend) and distance between the frontend and the backend (same datacenter or mobile devices -> datacenter?).

If it's the same datacenter you could communicate with the backend via a persistent log - this solves the data-loss problem. If there are lots of events you could aggregate them on the frontends and push the aggregates downstream.
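As a rough illustration, pushing per-type datapoints (or pre-aggregated histograms) into Kafka could look like this - the topic name, broker address and string encoding are just placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MetricsLogProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-broker:9092"); // placeholder address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Keying by requestType keeps all events of one type in one partition,
                // so a single consumer can aggregate them in order.
                producer.send(new ProducerRecord<>("response-times", "requestType1", "1,2,3,4,1,2"));
            }
        }
    }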

Aggregation:

There are various algorithms, e.g. q-digest, t-digest. See "Quantiles over Data Streams: An Experimental Study".
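For example, with the com.tdunning t-digest library you can build a digest per instance and merge them on the backend by re-inserting each digest's centroids - a sketch, with the compression parameter (100) chosen arbitrarily:

    import com.tdunning.math.stats.Centroid;
    import com.tdunning.math.stats.TDigest;

    public class DigestMerge {
        public static void main(String[] args) {
            TDigest instanceA = TDigest.createMergingDigest(100);
            TDigest instanceB = TDigest.createMergingDigest(100);
            for (long t : new long[]{1, 2, 3, 4, 1, 2}) instanceA.add(t);
            for (long t : new long[]{2, 2, 3, 2, 1}) instanceB.add(t);

            // Merge by re-inserting each digest's weighted centroids into a combined digest.
            TDigest merged = TDigest.createMergingDigest(100);
            for (TDigest d : new TDigest[]{instanceA, instanceB}) {
                for (Centroid c : d.centroids()) {
                    merged.add(c.mean(), c.count());
                }
            }
            System.out.println("P99 ~= " + merged.quantile(0.99));
        }
    }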

It's also worth noting that HdrHistograms can be combined.
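For example (the value range and precision below are arbitrary):

    import org.HdrHistogram.Histogram;

    public class HistogramMerge {
        public static void main(String[] args) {
            Histogram frontend1 = new Histogram(3_600_000_000L, 3);
            Histogram frontend2 = new Histogram(3_600_000_000L, 3);
            frontend1.recordValue(1200);
            frontend2.recordValue(950);

            // add() merges another histogram's counts into this one.
            Histogram combined = new Histogram(3_600_000_000L, 3);
            combined.add(frontend1);
            combined.add(frontend2);
            System.out.println("Fleet-wide P99 = " + combined.getValueAtPercentile(99.0));
        }
    }

The library also supports a compressed encoding of histograms, which is convenient for shipping them from the frontends to the backend instead of raw datapoints.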

mabn answered Sep 27 '22
