Improve min/max downsampling

I have some large arrays (~100 million points) that I need to interactively plot. I am currently using Matplotlib. Plotting the arrays as-is gets very slow and is wasteful since you can't visualize that many points anyway.

So I made a min/max decimation function that I tied to the axis's 'xlim_changed' callback. I went with a min/max approach because the data contains fast spikes that I do not want to miss by just stepping through the data. There are more wrappers that crop to the x-limits and skip processing under certain conditions, but the relevant part is below:

import numpy as np

def min_max_downsample(x, y, num_bins):
    """Break the data into num_bins and return the min/max of each bin."""
    pts_per_bin = x.size // num_bins

    # Create a view holding the reshaped & slightly cropped y.
    y_temp = y[:num_bins*pts_per_bin].reshape((num_bins, pts_per_bin))
    y_out = np.empty((num_bins, 2))
    # Take the min/max along the rows (one row per bin).
    y_out[:, 0] = y_temp.max(axis=1)
    y_out[:, 1] = y_temp.min(axis=1)
    y_out = y_out.ravel()

    # This duplicates each bin's left-edge x-value for its min/max y-pair.
    x_out = np.empty((num_bins, 2))
    x_out[:] = x[:num_bins*pts_per_bin:pts_per_bin, np.newaxis]
    x_out = x_out.ravel()
    return x_out, y_out
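
To make the output concrete, here is a tiny worked example of my own (8 points, 2 bins):

x = np.arange(8.0)
y = np.array([0., 3., 1., 2., 5., 4., 7., 6.])
x_out, y_out = min_max_downsample(x, y, 2)
# y_out -> [3., 0., 7., 4.]   (max then min of each 4-point bin)
# x_out -> [0., 0., 4., 4.]   (each bin's left-edge x, duplicated;
#                              note bin 0's max actually occurs at x=1)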

This works pretty well and is sufficiently fast (~80 ms on 1e8 points with 2,000 bins). There is very little lag as it periodically recalculates and updates the line's x- and y-data.
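
For context, the 'xlim_changed' wiring is conceptually like the sketch below. The figure setup, the handler, and the searchsorted cropping are my own illustrative assumptions standing in for the wrappers mentioned above, and the edge-case guards (e.g. a slice too short for the bin count) are omitted:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
line, = ax.plot(*min_max_downsample(x_big, y_big, 2000))

def on_xlim_changed(ax):
    # Illustrative handler: crop to the new x-limits, then re-decimate.
    lo, hi = ax.get_xlim()
    i0, i1 = np.searchsorted(x_big, [lo, hi])
    xs, ys = min_max_downsample(x_big[i0:i1], y_big[i0:i1], 2000)
    line.set_data(xs, ys)
    ax.figure.canvas.draw_idle()

ax.callbacks.connect('xlim_changed', on_xlim_changed)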

However, my one complaint concerns the x-data. This code duplicates the x-value of each bin's left edge and doesn't return the true x-locations of the min/max y-pairs. I typically set the number of bins to double the axis's pixel width, so the bins are small enough that you can't really see the difference... but I know it's there... and it bugs me.

So, attempt number 2, which does return the actual x-values for every min/max pair. However, it is about 5× slower.

def min_max_downsample_v2(x, y, num_bins):
    pts_per_bin = x.size // num_bins
    # Create a view holding the reshaped & slightly cropped y.
    y_temp = y[:num_bins*pts_per_bin].reshape((num_bins, pts_per_bin))
    # Use argmax/argmin to get the column locations.
    cc_max = y_temp.argmax(axis=1)
    cc_min = y_temp.argmin(axis=1)
    rr = np.arange(0, num_bins)
    # Compute the flat indices of these locations.
    flat_max = cc_max + rr*pts_per_bin
    flat_min = cc_min + rr*pts_per_bin
    # Build a boolean mask of these locations.
    mm_mask = np.full((x.size,), False)
    mm_mask[flat_max] = True
    mm_mask[flat_min] = True
    x_out = x[mm_mask]
    y_out = y[mm_mask]
    return x_out, y_out

This takes roughly 400+ ms on my machine, which becomes pretty noticeable. So my question is: is there a way to go faster while producing the same results? The bottleneck is mostly in numpy.argmin and numpy.argmax, which are a good bit slower than numpy.min and numpy.max.
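
To see that gap in isolation, here is a small timing harness of my own; the exact numbers vary by machine, but argmax is typically several times slower than max on the same view:

import timeit

y_t = np.random.rand(2000 * 5000)        # 10 million points -> 2000 bins x 5000 pts
y_view = y_t.reshape(2000, 5000)
t_max  = timeit.timeit(lambda: y_view.max(axis=1),    number=20) / 20
t_amax = timeit.timeit(lambda: y_view.argmax(axis=1), number=20) / 20
print(f"max: {t_max*1e3:.1f} ms    argmax: {t_amax*1e3:.1f} ms")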

The answer might be to just live with version #1, since visually it doesn't really matter. Or maybe try to speed it up with something like Cython (which I have never used).

FYI, I'm using Python 3.6.4 on Windows. Example usage would be something like this:

x_big = np.linspace(0, 10, 100000000)
y_big = np.cos(x_big)
x_small, y_small = min_max_downsample(x_big, y_big, 2000)     # Fast but not exactly correct.
x_small, y_small = min_max_downsample_v2(x_big, y_big, 2000)  # Correct but not exactly fast.
Asked by Aero Engy on Jan 30 '19.



1 Answer

I managed to get improved performance by using the output of arg(min|max) directly to index the data arrays. This comes at the cost of an extra call to np.sort, but the axis to be sorted has only two elements (the min/max indices) and the overall array is rather small (the number of bins):

def min_max_downsample_v3(x, y, num_bins):
    pts_per_bin = x.size // num_bins

    x_view = x[:pts_per_bin*num_bins].reshape(num_bins, pts_per_bin)
    y_view = y[:pts_per_bin*num_bins].reshape(num_bins, pts_per_bin)
    # Column locations of each bin's min and max.
    i_min = np.argmin(y_view, axis=1)
    i_max = np.argmax(y_view, axis=1)

    # Repeat each row index for the two points per bin; sort the column
    # pairs so the points within each bin come out in x-order.
    r_index = np.repeat(np.arange(num_bins), 2)
    c_index = np.sort(np.stack((i_min, i_max), axis=1)).ravel()

    return x_view[r_index, c_index], y_view[r_index, c_index]
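
As a quick sanity check of my own (not part of the original answer): on data without constant bins, v2 and v3 return identical arrays. In a constant bin argmin and argmax coincide, so v2's boolean mask keeps one point where v3 keeps two.

x_t = np.linspace(0, 1, 1000000)
y_t = np.random.RandomState(0).rand(1000000)   # random data: constant bins are vanishingly unlikely
out_v2 = min_max_downsample_v2(x_t, y_t, 1000)
out_v3 = min_max_downsample_v3(x_t, y_t, 1000)
assert all(np.array_equal(a, b) for a, b in zip(out_v2, out_v3))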

I checked the timings for your example and I obtained:

  • min_max_downsample_v1: 110 ms ± 5 ms
  • min_max_downsample_v2: 240 ms ± 8.01 ms
  • min_max_downsample_v3: 164 ms ± 1.23 ms

I also checked returning directly after the calls to arg(min|max), and the result was likewise 164 ms; i.e., there is no real overhead after that point.

Answered by a_guest on Nov 06 '22.