I have a pandas (version 0.25.3) DataFrame containing a datetime64 column. I'd like to calculate the mean of each column.
import numpy as np
import pandas as pd
n = 1000000
df = pd.DataFrame({
    "x": np.random.normal(0.0, 1.0, n),
    # pd.Timestamp.today() replaces the deprecated pd.datetime.today()
    "d": pd.date_range(pd.Timestamp.today(), periods=n, freq="1H").tolist()
})
Calculating the mean of individual columns is pretty much instantaneous.
%timeit df["x"].mean()
## 1000 loops, best of 3: 1.35 ms per loop
%timeit df["d"].mean()
## 100 loops, best of 3: 2.91 ms per loop
However, when I use the DataFrame's .mean() method, it takes a really long time.
%timeit df.mean()
## 1 loop, best of 3: 9.23 s per loop
It isn't clear to me where the performance penalty comes from.
What is the best way to avoid the slowdown? Should I convert the datetime64 column to a different type? Is using the DataFrame-level .mean() method considered bad form?
You could restrict it to the numeric values:
df.mean(numeric_only=True)
Then it runs very fast as well.
Here is the text from the documentation:
numeric_only : bool, default None Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
-- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html
This is a pandas bug.
On the current master (so probably with Pandas >= 1.3), the minimal example in the question is also fast when running df.mean(), but the column d is not in the result. You still have to call df["d"].mean() to get a result. I guess this is done to avoid breaking changes, but I am not sure.
Passing the parameter numeric_only=True to .mean(), or calling .mean() on individual columns rather than on the DataFrame, are both good workarounds.
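Both workarounds can be sketched as follows. This is a minimal illustration rather than a benchmark: it uses a smaller frame with a fixed start date, and the names numeric_means and per_column_means are only illustrative.

```python
import numpy as np
import pandas as pd

# Smaller frame than in the question, with a fixed start date (assumption
# made only to keep the example reproducible).
n = 1000
df = pd.DataFrame({
    "x": np.random.normal(0.0, 1.0, n),
    "d": pd.date_range("2020-01-01", periods=n, freq="D"),
})

# Workaround 1: restrict the DataFrame-level mean to numeric columns.
# The datetime column "d" is left out of the result.
numeric_means = df.mean(numeric_only=True)

# Workaround 2: compute the mean column by column, which also covers "d".
per_column_means = {name: df[name].mean() for name in df.columns}

print(numeric_means)
print(per_column_means["d"])
```

The second form returns a Timestamp for the datetime column, so it is the one to use when the mean date is actually needed.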
Note: Things are not very intuitive if your DataFrame contains columns with a non-numeric data type, such as strings or dates. Pandas then tries to compute a sum (whatever that means for the data type), convert it to a number, and divide by the number of rows. For strings this leads to odd results: "42" + "42" + "42" is "424242", which is then converted to 424242.0 and divided by 3. For non-numeric values this can be pretty slow. If the concatenated string cannot be converted to a number, the column is either omitted from the result of df.mean(), or an error is raised (for Pandas >= 1.3, or if you call mean() on the column itself).
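The arithmetic described in the note can be reproduced by hand. The sketch below only retraces the steps (sum, convert, divide) on an object-dtype Series; it is not a claim about how any particular pandas version implements mean() internally, and the variable names are illustrative.

```python
import pandas as pd

# Three strings stored with object dtype (set explicitly as an assumption,
# so the example does not depend on the default string dtype).
s = pd.Series(["42", "42", "42"], dtype=object)

# Summing strings applies "+" element by element, i.e. concatenation.
concatenated = s.sum()

# Converting the concatenated string to a number and dividing by the row
# count gives the surprising "mean" from the note.
mean_value = float(concatenated) / len(s)

print(concatenated)  # 424242
print(mean_value)    # 141414.0
```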