I have a dataset with 100 000 datapoints which I have to plot on a graph. The resulting graph will be about 500px wide, so for every pixel there will be about 200 datapoints, which seems quite unnecessary.
I need to find a way to get rid of the excess datapoints without losing the shape of the graph to speed up the rendering. Currently the rendering of all 100 000 points can take 10+ seconds as I'm also using anti-aliasing and other "effects".
I tried to approach this problem by just taking every 200th datapoint and plotting them, but this results in some of the more significant points missing out (think about spikes in the graph that I want to be able to show). I also thought of splitting the dataset in chunks of 200 datapoints, then taking the maximum value from every chunk but that wont work either.
Is anyone aware of a method that would suit my needs here? The language I'm using is PHP, graph is created by GD and data is coming from MySQL, so optimizations to some of those are welcome.
The data is in this format:
Datetime Value
2005-01-30 00:00:00 35.30
2005-01-30 01:00:00 35.65
2005-01-30 02:00:00 36.15
2005-01-30 03:00:00 35.95
...
And the resulting graph currently looks like this:
alt text http://www.ulmanen.fi/stuff/graph-sample.png
I know this question is quite old but I had a problem almost similar.
To reduce the number of points to display without affecting the shape of the graph, We use the Ramer-Douglas-Peucker algoritm. The difference of shape between the uncompressed graph and the one with this algorithm is unnoticeable.
It seems to me that 1 in 200 is pretty serious data loss, and if those 200 values that should be represented with one value on the graph aren't close enough to be meaningfully substituted with an average, you have yourself a problem. If average isn't good enough, you must find a criterium to tell what data is more significant and should be included, and we can't help you with it because we don't know what kind of data it is, its statistical properties, or why any value would be more significant than the other. With those additional info, maybe a more specific answer could be given.
EDIT: After looking at the graph, it seems that you need both minimum and maximum in a given interval, because the dark blue area are values between those two, correct? Maybe you can take 100 values and make a graph from minimum, maximum, and average, so that every point in graph is made with 6 instead of 200 values, or something like that.
Another approach that might work is splitting the graph up into 200 point bins, and discard all but the maximum, minimum, and median points in each interval. Each of the three points in the interval gets plotted at its original location, so the locations of the extreme values won't change. Using the median instead of the mean will probably work better for your data set because the maxima are much more extreme than the minima, which would cause the filtered graph to shift upwards if you used the mean.
One approach to your problem is max-min decimation; I suggest you Google for a definition and algorithm I don't have either to hand or I would share with you.
Beyond that I think you might use a low-pass (anti-aliasing) filter followed by simple decimation (ie throwing away excess points).
I think that ordinary average from each 200 bunch of points would be just enough.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With