
Pandas resample data frame with fixed number of rows

Tags: python, pandas

With pandas.DataFrame.resample I can downsample a DataFrame into a certain time duration:

df.resample("3s").mean()

However, I do not want to specify a certain time, but rather a fixed number of rows in the original data frame, e.g. "resample such that three rows previously are now aggregated into one". How's that possible in pandas?

knub asked Jun 01 '17 11:06



1 Answer

It might be a bit late, but here is my answer for everyone searching for a solution to this problem.

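One solution is to use pandas' rolling(n) sliding-window functionality and then keep every nth value, e.g. for n=3:

df_sub = df.rolling(3).mean()[2::3]

The slice starts at position 2 because that is where the first complete window ends; starting at 0 would produce a NaN first row and shift every window by one row. This is a bit wasteful computationally, since you compute the rolling mean for every row and then keep only 1/n of the results.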
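A less wasteful variant, sketched here as an assumption rather than something taken from the answer itself, is to group rows by their integer position divided by n, so each mean is computed only once. The sample DataFrame and the names `group_ids`, `df_grouped` and `df_rolled` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 7 rows, so the last group is incomplete.
df = pd.DataFrame({
    "a": np.arange(7, dtype=float),
    "b": np.arange(7, dtype=float) * 10,
})

n = 3

# Integer-divide each row position by n:
# rows 0-2 -> group 0, rows 3-5 -> group 1, row 6 -> group 2.
group_ids = np.arange(len(df)) // n
df_grouped = df.groupby(group_ids).mean()

# Rolling-window version for comparison, offset so that each kept
# row is the mean of one complete group of n rows.
df_rolled = df.rolling(n).mean()[n - 1::n]

print(df_grouped)
print(df_rolled)
```

Unlike the rolling version, the groupby version also keeps the trailing incomplete group (here the single last row), which may or may not be what you want.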

Another, similar approach, which does not compute the mean but instead interpolates the whole DataFrame column-wise, is to use NumPy's interp function (np.interp).

For example, assuming you have a DataFrame whose index consists of monotonically increasing numeric values (as is usual with time series data; timestamps would first have to be converted to numbers), and you want to adjust every column individually, you could do it like this:

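import numpy as np
import pandas as pd

def resample_fixed(df, n_new):
    # Assumes a numeric, evenly spaced, monotonically increasing index.
    n_old, m = df.values.shape
    mat_old = df.values
    mat_new = np.zeros((n_new, m))
    # Positions of the old and new rows over the same index range
    x_old = np.linspace(df.index.min(), df.index.max(), n_old)
    x_new = np.linspace(df.index.min(), df.index.max(), n_new)

    # Linearly interpolate each column onto the new positions
    for j in range(m):
        y_old = mat_old[:, j]
        y_new = np.interp(x_new, x_old, y_old)
        mat_new[:, j] = y_new

    return pd.DataFrame(mat_new, index=x_new, columns=df.columns)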

Be careful, though: np.interp does alter your data, since it linearly interpolates between your data points. I would recommend inspecting the result after interpolation.
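To illustrate why inspecting the result matters, here is a minimal sketch with made-up data: a single spike in an otherwise flat signal. The function is a compact restatement of resample_fixed from above, repeated only so that the snippet runs on its own; the data and the name `df_small` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def resample_fixed(df, n_new):
    # Compact restatement of the function above (numeric, evenly spaced index assumed).
    x_old = np.linspace(df.index.min(), df.index.max(), len(df))
    x_new = np.linspace(df.index.min(), df.index.max(), n_new)
    data = {col: np.interp(x_new, x_old, df[col].to_numpy()) for col in df.columns}
    return pd.DataFrame(data, index=x_new, columns=df.columns)

# Hypothetical data: one spike at index 2, zeros everywhere else.
df = pd.DataFrame({"y": [0.0, 0.0, 9.0, 0.0, 0.0, 0.0, 0.0]}, index=np.arange(7.0))

# Reduce 7 rows to 3; the new sample points are at index 0, 3 and 6.
df_small = resample_fixed(df, 3)

print(df["y"].max())        # maximum of the original signal
print(df_small["y"].max())  # maximum after interpolation
```

In this case the spike lies between the new sample points, so it disappears from the resampled data entirely, whereas a mean over groups of rows would still reflect it.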

You can find a full example of the interpolation in a gist I made for that here.

Tobi answered Sep 30 '22 15:09