Summary of Question:
Are there any easy to implement algorithms for reducing the number of points needed to represent a time series without altering how it appears in a plot?
Motivating Problem:
I'm trying to interactively visualize 10 to 15 channels of data logged from an embedded system at ~20 kHz. Logs can cover upwards of an hour of time which means that I'm dealing with between 1e8 and 1e9 points. Further, I care about potentially small anomalies that last for very short periods of time (i.e. less than 1 ms) such that simple decimation isn't an option.
Not surprisingly, most plotting libraries get a little sad if you do the naive thing and try to hand them arrays of data larger than the dedicated GPU memory. It's actually a bit worse than this on my system; using a vector of random floats as a test case, I'm only getting about 5e7 points out of the stock Matlab plotting function and Python + matplotlib before my refresh rate drops below 1 FPS.
Existing Questions and Solutions:
This problem is somewhat similar to a number of existing questions such as:
How to plot large data vectors accurately at all zoom levels in real time?
How to plot large time series (thousands of administration times/doses of a medication)?
[Several Cross Validated questions]
but deals with larger data sets and/or is more stringent about fidelity at the cost of interactivity (it would be great to get 60 FPS silky smooth panning and zooming, but realistically, I would be happy with 1 FPS).
Clearly, some form of data reduction is needed. There are two paradigms that I have found while searching for existing tools that solve my problem:
Decimate but track outliers: A good example of this is Matlab + dsplot (i.e. the tool suggested in the accepted answer of the first question I linked above). dsplot decimates down to a fixed number of evenly spaced points, but then adds back in outliers identified using the standard deviation of a high pass FIR filter. While this is probably a viable solution for several classes of data, it potentially has difficulties if there is substantial frequency content past the filter cutoff frequency and may require tuning.
Plot min and max: With this approach, you divide the time series up in to intervals corresponding to each horizontal pixel and plot just the minimum and maximum values in each interval. Matlab + Plot (Big) is a good example of this, but uses an O(n) calculation of min and max making it a bit slow by the time you get to 1e8 or 1e9 points. A binary search tree in a mex function or python would solve this problem, but is complicated to implemented.
Are there any simpler solutions that do what I want?
Edit (2018-02-18): Question refactored to focus on algorithms instead of tools implementing algorithms.
I had the very same problem displaying pressure timeseries of hundreds of sensors, with samples every minute for several years. In some cases (like when cleaning the data), I wanted to see all the outliers, others I was more interested in the trend. So I wrote a function that can reduce the number of data points using two methods: visvalingam and Douglas-Peucker. The first tend to remove outliers, and the second keeps them. I've optimized the function to work over large datasets. I did that after realizing that all the plotting methods weren't capable to handle that many points, and the ones that did, were decimating the dataset in a way that I couldn't control. The function is the following:
function [X, Y, indices, relevance] = lineSimplificationI(X,Y,N,method,option)
%lineSimplification Reduce the number of points of the line described by X
%and Y to N. Preserving the most relevant ones.
% Using an adapted method of visvalingam and Douglas-Peucker algorithms.
% The number of points of the line is reduced iteratively until reaching
% N non-NaN points. Repeated NaN points in original data are deleted but
% non-repeated NaNs are preserved to keep line breaks.
% The two available methods are
%
% Visvalingam: The relevance of a point is proportional to the area of
% the triangle defined by the point and its two neighbors.
%
% Douglas-Peucker: The relevance of a point is proportional to the
% distance between it and the straight line defined by its two neighbors.
% Note that the implementation here is iterative but NOT recursive as in
% the original algorithm. This allows to better handle large data sets.
%
% DIFFERENCES: Visvalingam tend to remove outliers while Douglas-Peucker
% keeps them.
%
% INPUTS:
% X: X coordinates of the line points
% Y: Y coordinates of the line points
% method: Either 'Visvalingam' or 'DouglasPeucker' (default)
% option: Either 'silent' (default) or 'verbose' if additional outputs
% of the calculations are desired.
%
% OUTPUTS:
% X: X coordinates of the simplified line points
% Y: Y coordinates of the simplified line points
% indices: Indices to the positions of the points preserved in the
% original X and Y. Therefore Output X is equal to the input
% X(indices).
% relevance: Relevance of the returned points. It can be used to furder
% simplify the line dinamically by keeping only points with
% higher relevance. But this will produce bigger distortions of
% the line shape than calling again lineSimplification with a
% smaller value for N, as removing a point changes the relevance
% of its neighbors.
%
% Implementation by Camilo Rada - [email protected]
%
if nargin < 3
error('Line points positions X, Y and target point count N MUST be specified');
end
if nargin < 4
method='DouglasPeucker';
end
if nargin < 5
option='silent';
end
doDisplay=strcmp(option,'verbose');
X=double(X(:));
Y=double(Y(:));
indices=1:length(Y);
if length(X)~=length(Y)
error('Vectors X and Y MUST have the same number of elements');
end
if N>=length(Y)
relevance=ones(length(Y),1);
if doDisplay
disp('N is greater or equal than the number of points in the line. Original X,Y were returned. Relevances were not computed.')
end
return
end
% Removing repeated NaN from Y
% We find all the NaNs with another NaN to the left
repeatedNaNs= isnan(Y(2:end)) & isnan(Y(1:end-1));
%We also consider a repeated NaN the first element if NaN
repeatedNaNs=[isnan(Y(1)); repeatedNaNs(:)];
Y=Y(~repeatedNaNs);
X=X(~repeatedNaNs);
indices=indices(~repeatedNaNs);
%Removing trailing NaN if any
if isnan(Y(end))
Y=Y(1:end-1);
X=X(1:end-1);
indices=indices(1:end-1);
end
pCount=length(X);
if doDisplay
disp(['Initial point count = ' num2str(pCount)])
disp(['Non repeated NaN count in data = ' num2str(sum(isnan(Y)))])
end
iterCount=0;
while pCount>N
iterCount=iterCount+1;
% If the vertices of a triangle are at the points (x1,y1) , (x2, y2) and
% (x3,y3) the are uf such triangle is
% area = abs((x1*(y2-y3)+x2*(y3-y1)+x3*(y1-y2))/2)
% now the areas of the triangles defined by each point of X,Y and its two
% neighbors are
twiceTriangleArea =abs((X(1:end-2).*(Y(2:end-1)-Y(3:end))+X(2:end-1).*(Y(3:end)-Y(1:end-2))+X(3:end).*(Y(1:end-2)-Y(2:end-1))));
switch method
case 'Visvalingam'
% In this case the relevance is given by the area of the
% triangle formed by each point end the two points besides
relevance=twiceTriangleArea/2;
case 'DouglasPeucker'
% In this case the relevance is given by the minimum distance
% from the point to the line formed by its two neighbors
neighborDistances=ppDistance([X(1:end-2) Y(1:end-2)],[X(3:end) Y(3:end)]);
relevance=twiceTriangleArea./neighborDistances;
otherwise
error(['Unknown method: ' method]);
end
relevance=[Inf; relevance; Inf];
%We remove the pCount-N least relevant points as long as they are not contiguous
[srelevance, sortorder]= sort(relevance,'descend');
firstFinite=find(isfinite(srelevance),1,'first');
startPos=uint32(firstFinite+N+1);
toRemove=sort(sortorder(startPos:end));
if isempty(toRemove)
break;
end
%Now we have to deal with contigous elements, as removing one will
%change the relevance of the neighbors. Therefore we have to
%identify pairs of contigous points and only remove the one with
%leeser relevance
%Contigous will be true for an element if the next or the previous
%element is also flagged for removal
contiguousToKeep=[diff(toRemove(:))==1; false] | [false; (toRemove(1:end-1)-toRemove(2:end))==-1];
notContiguous=~contiguousToKeep;
%And the relevances asoociated to the elements flagged for removal
contRel=relevance(toRemove);
% Now we rearrange contigous so it is sorted in two rows, therefore
% if both rows are true in a given column, we have a case of two
% contigous points that are both flagged for removal
% this process is demenden of the rearrangement, as contigous
% elements can end up in different colums, so it has to be done
% twice to make sure no contigous elements are removed
nContiguous=length(contiguousToKeep);
for paddingMode=1:2
%The rearragngement is only possible if we have an even number of
%elements, so we add one dummy zero at the end if needed
if paddingMode==1
if mod(nContiguous,2)
pcontiguous=[contiguousToKeep; false];
pcontRel=[contRel; -Inf];
else
pcontiguous=contiguousToKeep;
pcontRel=contRel;
end
else
if mod(nContiguous,2)
pcontiguous=[false; contiguousToKeep];
pcontRel=[-Inf; contRel];
else
pcontiguous=[false; contiguousToKeep(1:end-1)];
pcontRel=[-Inf; contRel(1:end-1)];
end
end
contiguousPairs=reshape(pcontiguous,2,[]);
pcontRel=reshape(pcontRel,2,[]);
%finding colums with contigous element
contCols=all(contiguousPairs);
if ~any(contCols) && paddingMode==2
break;
end
%finding the row of the least relevant element of each column
[~, lesserElementRow]=max(pcontRel);
%The index in contigous of the first element of each pair is
if paddingMode==1
firstElementIdx=((1:size(contiguousPairs,2))*2)-1;
else
firstElementIdx=((1:size(contiguousPairs,2))*2)-2;
end
% and the index in contigous of the most relevant element of each
% pair is
lesserElementIdx=firstElementIdx+lesserElementRow-1;
%now we set the least relevant element as NOT continous, so it is
%removed
contiguousToKeep(lesserElementIdx(contCols))=false;
end
%and now we delete the relevant continous points from the toRemove
%list
toRemove=toRemove(contiguousToKeep | notContiguous);
if any(diff(toRemove(:))==1) && doDisplay
warning([num2str(sum(diff(toRemove(:))==1)) ' continous elements removed in one iteration.'])
end
toRemoveLogical=false(pCount,1);
toRemoveLogical(toRemove)=true;
X=X(~toRemoveLogical);
Y=Y(~toRemoveLogical);
indices=indices(~toRemoveLogical);
pCount=length(X);
nRemoved=sum(toRemoveLogical);
if doDisplay
disp(['Iteration ' num2str(iterCount) ', Point count = ' num2str(pCount) ' (' num2str(nRemoved) ' removed)'])
end
if nRemoved==0
break;
end
end
end
function d = ppDistance(p1,p2)
d=sqrt((p1(:,1)-p2(:,1)).^2+(p1(:,2)-p2(:,2)).^2);
end
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With