
Calculating a histogram on streaming data - Online histogram calculation

I am looking for an algorithm to build a histogram over a large volume of streaming data. The minimum and maximum are not known in advance, but the mean and standard deviation are known to fall within a particular range.

I appreciate your ideas.

Cheers,

Ali Salehi asked Jun 17 '11 12:06




1 Answer

I just found one solution: Sec. 2.2, "On-line histogram building", of the paper "A streaming parallel decision tree algorithm". The algorithm is implemented by the NumericHistogram class in the Hive project:

A generic, re-usable histogram class that supports partial aggregations. The algorithm is a heuristic adapted from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number of histogram bins.
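For readers who want to see the idea in code, here is a minimal Python sketch of the bin-merging heuristic from the Ben-Haim and Tom-Tov paper as I understand it: each bin is a (centroid, count) pair, every new value enters as its own bin, and whenever the bin budget is exceeded the two closest adjacent bins are merged into their weighted average. The class name StreamingHistogram, the max_bins parameter, and the method names are my own and hypothetical; this is not the Hive NumericHistogram code, and it is not tuned for speed.

```python
import bisect


class StreamingHistogram:
    """Illustrative sketch of a fixed-size streaming histogram.

    Names and structure are assumptions for this example, not Hive's API.
    """

    def __init__(self, max_bins=40):
        if max_bins < 1:
            raise ValueError("max_bins must be at least 1")
        self.max_bins = max_bins
        self.centroids = []  # bin centers, kept sorted
        self.counts = []     # number of points represented by each bin

    def update(self, x):
        """Add one value from the stream."""
        i = bisect.bisect_left(self.centroids, x)
        if i < len(self.centroids) and self.centroids[i] == x:
            # Exact match with an existing centroid: just bump its count.
            self.counts[i] += 1
            return
        self.centroids.insert(i, x)
        self.counts.insert(i, 1)
        self._shrink()

    def merge(self, other):
        """Fold another histogram into this one (partial aggregation)."""
        pairs = sorted(zip(self.centroids + other.centroids,
                           self.counts + other.counts))
        self.centroids = [c for c, _ in pairs]
        self.counts = [n for _, n in pairs]
        self._shrink()

    def _shrink(self):
        """Merge the closest adjacent bins until the budget is met."""
        while len(self.centroids) > self.max_bins:
            gaps = [self.centroids[i + 1] - self.centroids[i]
                    for i in range(len(self.centroids) - 1)]
            i = gaps.index(min(gaps))
            n = self.counts[i] + self.counts[i + 1]
            # The merged centroid is the count-weighted average of the pair.
            c = (self.centroids[i] * self.counts[i]
                 + self.centroids[i + 1] * self.counts[i + 1]) / n
            self.centroids[i:i + 2] = [c]
            self.counts[i:i + 2] = [n]

    def bins(self):
        """Return the current bins as a list of (centroid, count) pairs."""
        return list(zip(self.centroids, self.counts))


h = StreamingHistogram(max_bins=3)
for value in [1, 2, 3, 10, 11]:
    h.update(value)
print(h.bins())
```

Because the bins adapt to the data, no minimum or maximum has to be known up front, which matches the constraint in the question. The linear scan in _shrink keeps the sketch short; a real implementation would likely avoid recomputing every gap on each insert.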

Ali Salehi answered Sep 20 '22 15:09