I am looking for an algorithm to generate a histogram over a large amount of streaming data. The max and min are not known in advance, but the standard deviation and mean are within a particular range.
I appreciate your ideas.
Cheers,
Derive the frequency density for each interval by dividing its frequency by the corresponding class width. The total area of the histogram is then found by summing, over all intervals, the product of the frequency density and its class width.
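For concreteness, here is a small worked example in Python; the bin edges and frequencies are made up purely for illustration:

```python
# Hypothetical unequal-width bins: (lower edge, upper edge, frequency).
bins = [(0, 5, 10), (5, 10, 20), (10, 20, 15)]

densities = []
total_area = 0.0
for lower, upper, freq in bins:
    width = upper - lower
    density = freq / width          # frequency density = frequency / class width
    densities.append(density)
    total_area += density * width   # area of this bar = density * width = frequency

print(densities)   # [2.0, 4.0, 1.5]
print(total_area)  # 45.0, which equals the total frequency 10 + 20 + 15
```

As the last line shows, summing density times width just recovers the total frequency, so the histogram's total area equals the number of observations.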
A histogram displays numerical data by grouping data into "bins" of equal width. Each bin is plotted as a bar whose height corresponds to how many data points are in that bin. Bins are also sometimes called "intervals", "classes", or "buckets".
A histogram is a graph that shows the frequency of numerical data using rectangles. The height of a rectangle (read off the vertical axis) represents the frequency of a variable, i.e. how often values in that interval appear.
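As a quick batch illustration (not a streaming solution), equal-width binning can be done with NumPy; the sample values below are invented:

```python
import numpy as np

# Hypothetical sample data; in practice these would be your measurements.
data = [1.2, 3.4, 2.2, 5.1, 4.8, 2.9, 3.3, 0.7, 4.1, 3.8]

# Five equal-width bins spanning min(data)..max(data).
counts, edges = np.histogram(data, bins=5)

for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
    # Bar height = number of data points falling in [lo, hi).
    print(f"[{lo:.2f}, {hi:.2f}): {count}")
```

Note that this batch approach needs the data's min and max up front, which is exactly what the streaming setting in the question rules out.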
I just found one solution: Sec. 2.2 ("On-line Histogram Building") of the paper "A Streaming Parallel Decision Tree Algorithm". The algorithm is implemented by the NumericHistogram class in the Hive project:
A generic, re-usable histogram class that supports partial aggregations. The algorithm is a heuristic adapted from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number of histogram bins.
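For anyone who wants the gist without reading the paper or the Hive source, below is a minimal Python sketch of the bin-merging heuristic from Sec. 2.2. The class and method names are my own and are not Hive's API; this is an illustration of the idea, not the actual NumericHistogram implementation.

```python
import bisect

class StreamingHistogram:
    """Sketch of the bin-merging heuristic from Ben-Haim & Tom-Tov, Sec. 2.2."""

    def __init__(self, max_bins=64):
        self.max_bins = max_bins
        self.bins = []  # list of [centroid, count], kept sorted by centroid

    def add(self, value):
        # Insert the new point as its own bin (count 1), keeping bins sorted.
        idx = bisect.bisect_left(self.bins, [value, 0])
        if idx < len(self.bins) and self.bins[idx][0] == value:
            self.bins[idx][1] += 1
        else:
            self.bins.insert(idx, [value, 1])

        # If the cap is exceeded, merge the two bins with the closest centroids
        # into one bin whose centroid is their count-weighted average.
        if len(self.bins) > self.max_bins:
            gaps = [self.bins[i + 1][0] - self.bins[i][0]
                    for i in range(len(self.bins) - 1)]
            i = gaps.index(min(gaps))
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
            self.bins[i:i + 2] = [merged]

# Usage: feed values one at a time; memory stays bounded by max_bins,
# and no prior knowledge of the data's min or max is required.
h = StreamingHistogram(max_bins=20)
for x in (0.5, 1.7, 2.2, 9.9, 3.1):
    h.add(x)
print(h.bins)
```

Because only a fixed number of (centroid, count) pairs is kept, the structure works on unbounded streams, and two such histograms can be merged by concatenating their bins and repeating the same closest-pair merging, which is what makes partial aggregation possible.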