 

Pandas dataframe resample at every nth row

Tags:

pandas

I have a script that reads system log files into pandas dataframes and produces charts from those. The charts are fine for small data sets, but when I face larger data sets due to a longer data-gathering timeframe, the charts become too crowded to discern.

I am planning to resample the dataframe: if the dataset exceeds a certain size, I will resample it so there are ultimately only SIZE_LIMIT rows. This means I need to filter the dataframe so that every n = actual_size/SIZE_LIMIT rows are aggregated to a single row in the new dataframe. The aggregation can be either the average value or just the nth row taken as is.

I am not fully versed in pandas, so I may have missed some obvious means.

asked Dec 01 '22 20:12 by nom-mon-ir


2 Answers

Actually, I think you should not modify the data itself, but rather take a view of the data at the desired interval to plot. This view would hold the actual datapoints to be plotted.

A naive approach would be, for a computer screen for example, to calculate how many points are in your interval and how many pixels you have available. Thus, for plotting a dataframe with 10000 points in a window 1000 pixels wide, you take a slice with a STEP of 10, using this syntax (whole_data would be a 1D array, just for the example):

data_to_plot = whole_data[::10]
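Concretely, the step can be derived from the data length and the pixel budget. A minimal sketch (the sizes and the `pixels` name are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sizes: 10,000 samples to draw in a 1,000-pixel-wide window.
df = pd.DataFrame({"value": np.arange(10_000)})
pixels = 1_000

step = max(1, len(df) // pixels)  # 10 in this example
data_to_plot = df.iloc[::step]    # keep every 10th row as a view for plotting
```

`iloc[::step]` is the DataFrame equivalent of the 1D slice above; the original frame is left untouched.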

This might have undesired effects, specifically masking short peaks that might slip through the slicing operation invisibly. An alternative would be to split your data into bins, then calculate one datapoint (the maximum value, for example) for each bin. These operations should be fast thanks to numpy/pandas efficient array operations.
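The bin-maximum idea can be sketched like this (the spike position and sizes are invented for the demonstration): every run of 10 consecutive rows becomes one bin, and taking the max per bin guarantees a short spike survives, whereas `[::10]` slicing could easily skip it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(size=10_000)})
df.loc[5_003, "value"] = 50.0  # a one-sample spike slicing might miss

n_bins = 1_000
bins = np.arange(len(df)) // (len(df) // n_bins)  # bin label for each row
maxima = df["value"].groupby(bins).max()          # one datapoint per bin
# maxima has 1,000 points and still contains the 50.0 spike
```

Swapping `.max()` for `.mean()` gives the averaging variant mentioned in the question.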

Hope this helps!

answered Jan 06 '23 13:01 by heltonbiker


You could use the pandas.qcut method on the index to divide the index into equal quantiles. The value you pass to qcut is the number of bins, so to end up with SIZE_LIMIT rows you would pass SIZE_LIMIT itself.

In [1]: from pandas import *

In [2]: df = DataFrame({'a':range(10000)})

In [3]: df.head()

Out[3]:
   a
0  0
1  1
2  2
3  3
4  4

Here, grouping the index by qcut(df.index,5) results in 5 equally binned groups. I then take the mean of each group.

In [4]: df.groupby(qcut(df.index,5)).mean()

Out[4]:
                       a
[0, 1999.8]        999.5
(1999.8, 3999.6]  2999.5
(3999.6, 5999.4]  4999.5
(5999.4, 7999.2]  6999.5
(7999.2, 9999]    8999.5
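Applied to the question's setup, the same pattern with SIZE_LIMIT as the bin count yields exactly SIZE_LIMIT rows (a sketch; `SIZE_LIMIT=5` and the `observed=True` flag for newer pandas versions are my additions):

```python
import pandas as pd

df = pd.DataFrame({"a": range(10_000)})
SIZE_LIMIT = 5

# qcut splits the index into SIZE_LIMIT equal-sized quantile bins;
# grouping by those bins and averaging gives one row per bin.
groups = pd.qcut(df.index, SIZE_LIMIT)
reduced = df.groupby(groups, observed=True).mean()
```

`reduced` then has SIZE_LIMIT rows and can be plotted directly in place of the full dataframe.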
answered Jan 06 '23 12:01 by Zelazny7