With pandas.DataFrame.resample I can downsample a DataFrame into a certain time duration:
df.resample("3s").mean()
However, I do not want to specify a certain time, but rather a fixed number of rows in the original data frame, e.g. "resample such that three rows previously are now aggregated into one". How's that possible in pandas?
The resample() method is a convenience method for frequency conversion and resampling of time-series data. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or datetime-like values must be passed via the on or level keyword.
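For reference, here is a minimal sketch of time-based resampling on a DatetimeIndex. The sample data and the names idx, df and df_3s are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: nine readings taken one second apart.
idx = pd.date_range("2024-01-01", periods=9, freq="s")
df = pd.DataFrame({"value": range(9)}, index=idx)

# Aggregate into 3-second bins. With this regular spacing each bin
# happens to contain exactly three rows, but that is not guaranteed
# for irregularly spaced data.
df_3s = df.resample("3s").mean()
print(df_3s)
```

With evenly spaced timestamps the bins line up with groups of three rows; with gaps or irregular spacing they do not, which is why a row-count-based approach is needed.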
It might be a bit late, but here is my answer for everyone searching for a solution to this problem.
One solution would be to use pandas' rolling(n) sliding-window functionality and then select every nth value, starting at row n-1 so that each selected window covers n complete, non-overlapping rows. E.g. for n=3:
df_sub = df.rolling(3).mean()[2::3]
This is a bit wasteful computationally, since you calculate the rolling mean for every row and then keep only 1/n of the results.
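As a possible alternative that avoids the wasted computation mentioned above, the rows can be grouped by their integer position divided by n. This is only a sketch, not part of the original answer; the sample data and the names group_ids and df_sub are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a plain integer index.
df = pd.DataFrame({
    "a": np.arange(7, dtype=float),
    "b": np.arange(7, dtype=float) * 10,
})

n = 3
# Rows 0..2 get label 0, rows 3..5 get label 1, and so on.
group_ids = np.arange(len(df)) // n

# Each group of (up to) n consecutive rows is aggregated into one row.
df_sub = df.groupby(group_ids).mean()
print(df_sub)
```

Note that when the number of rows is not a multiple of n, the last group contains fewer than n rows and is still averaged; drop it beforehand if that is not wanted. The result is indexed by group number rather than by the original index.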
Another, similar approach, which does not calculate the mean but instead interpolates the whole DataFrame column-wise, is to use NumPy's interp function (np.interp).
E.g., assuming you have a DataFrame whose index consists of monotonically increasing numerical/timestamped values (as is usual with time series data), and you want to adjust every column individually, you could do it like this:
import numpy as np
import pandas as pd

def resample_fixed(df, n_new):
    # Linearly interpolate every column of df onto n_new evenly spaced index values.
    n_old, m = df.values.shape
    mat_old = df.values
    mat_new = np.zeros((n_new, m))
    # Note: x_old assumes the original index is numeric and evenly spaced.
    x_old = np.linspace(df.index.min(), df.index.max(), n_old)
    x_new = np.linspace(df.index.min(), df.index.max(), n_new)
    for j in range(m):
        y_old = mat_old[:, j]
        y_new = np.interp(x_new, x_old, y_old)
        mat_new[:, j] = y_new
    return pd.DataFrame(mat_new, index=x_new, columns=df.columns)
Be careful, though: np.interp does alter your data, since it linearly interpolates between your data points. I would recommend inspecting the result after interpolation.
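To illustrate how linear interpolation can change the data, here is a small sketch on a single made-up series containing one spike. The values and the names s, x_new and s_new are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a numeric, evenly spaced index and one spike.
s = pd.Series([0.0, 0.0, 10.0, 0.0, 0.0], index=[0.0, 1.0, 2.0, 3.0, 4.0])

# Resample from 5 points down to 4 using linear interpolation.
x_new = np.linspace(s.index.min(), s.index.max(), 4)
y_new = np.interp(x_new, s.index.to_numpy(), s.to_numpy())
s_new = pd.Series(y_new, index=x_new)

# Compare the extremes before and after: none of the new sample
# positions falls on the spike, so the original peak is lost.
print("max before:", s.max())
print("max after: ", s_new.max())
```

Comparing simple statistics such as the minimum and maximum before and after is one quick way to inspect the result; plotting both series is another.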
You can find a full example on the interpolation in a gist file I did for that here.