
How to avoid poor performance of pandas mean() with datetime columns

I have a pandas (version 0.25.3) DataFrame containing a datetime64 column. I'd like to calculate the mean of each column.

import numpy as np
import pandas as pd

n = 1000000
df = pd.DataFrame({
    "x": np.random.normal(0.0, 1.0, n),
    "d": pd.date_range(pd.Timestamp.today(), periods=n, freq="1H").tolist()
})

Calculating the mean of individual columns is pretty much instantaneous.

%timeit df["x"].mean()
## 1000 loops, best of 3: 1.35 ms per loop
%timeit df["d"].mean()
## 100 loops, best of 3: 2.91 ms per loop

However, when I use the DataFrame's .mean() method, it takes a really long time.

%timeit df.mean()
## 1 loop, best of 3: 9.23 s per loop

It isn't clear to me where the performance penalty comes from.

What is the best way to avoid the slowdown? Should I convert the datetime64 column to a different type? Is using the DataFrame-level .mean() method considered bad form?

Richie Cotton asked Jan 15 '20 20:01


2 Answers

You could restrict it to the numeric values: df.mean(numeric_only=True)

Then it runs very fast as well.
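As a minimal sketch of this workaround, assuming a smaller stand-in for the question's DataFrame (the fixed start date and the row count are illustrative choices, not taken from the question):

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the question's DataFrame; start date and size are arbitrary.
n = 1000
df = pd.DataFrame({
    "x": np.random.normal(0.0, 1.0, n),
    "d": pd.date_range("2020-01-15", periods=n, freq=pd.Timedelta(hours=1)),
})

# Only numeric columns are considered, so the datetime column "d" is skipped.
numeric_means = df.mean(numeric_only=True)
print(numeric_means)
```

The datetime column is absent from the result, so its mean still has to be computed separately with df["d"].mean().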

Here is the text from the documentation:

numeric_only : bool, default None
    Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

-- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html

mpaepper answered Sep 30 '22 00:09


This is a pandas bug.

On the current master (so probably with Pandas >= 1.3), df.mean() is also fast on the minimal example from the question, but the column d is missing from the result. You still have to call df["d"].mean() to get its mean. I guess this is done to avoid breaking changes, but I am not sure.

Passing the parameter numeric_only=True to .mean() or calling .mean() on columns and not on the dataframe are good workarounds.
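A rough sketch of the per-column workaround, which also keeps the datetime mean. The tiny DataFrame and the name means are made up for illustration:

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for this example.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],
    "d": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
})

# Take the mean of each column separately instead of calling df.mean().
means = {col: df[col].mean() for col in df.columns}
print(means)
```

Each column goes through the Series-level mean(), which the question reports as fast for both the float and the datetime column.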

Note: Things are not very intuitive if your DataFrame contains a column with a non-numeric data type, such as strings or dates. Pandas then tries to compute a sum (whatever that means for the data type), convert the result to a number, and divide it by the number of rows. For strings this leads to weird results: "42" + "42" + "42" gives "424242", which is converted to 424242.0 and then divided by 3. For non-numeric values this can be pretty slow. If the concatenated string cannot be converted to a number, df.mean() omits that column from the result; with Pandas >= 1.3, or if you call mean() on the column itself, an error is raised instead.
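The arithmetic in the string example can be reproduced in plain Python. This is only an imitation of the steps described above, not a call into pandas, and the variable names are hypothetical:

```python
# Plain-Python imitation of the described steps; this does not call pandas.
values = ["42", "42", "42"]

concatenated = "".join(values)        # "summing" strings concatenates them
as_number = float(concatenated)       # the concatenation is parsed as a number
mean_like = as_number / len(values)   # then divided by the number of rows

print(concatenated, as_number, mean_like)
```

The result is 141414.0 rather than 42, which is why a "mean" over string columns should not be trusted.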

lumbric answered Sep 30 '22 01:09