Consider the following dataframe:
b c d e f g h
0 6.25 2018-04-01 True NaN 7 54.0 64.0
1 32.50 2018-04-01 True NaN 7 54.0 64.0
2 16.75 2018-04-01 True NaN 7 54.0 64.0
3 29.25 2018-04-01 True NaN 7 54.0 64.0
4 21.75 2018-04-01 True NaN 7 54.0 64.0
5 21.75 2018-04-01 True True 7 54.0 64.0
6 7.75 2018-04-01 True True 7 54.0 64.0
7 23.25 2018-04-01 True True 7 54.0 64.0
8 12.25 2018-04-01 True True 7 54.0 64.0
9 30.50 2018-04-01 True NaN 7 54.0 64.0
(Copy the table above and run df = pd.read_clipboard() to create the dataframe.)
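If the clipboard route is not available, the same dataframe can be built by hand. This is only a sketch: it assumes column c holds plain strings (what read_clipboard produces without date parsing) and that column e stores True/NaN with object dtype, as described further down.

```python
import numpy as np
import pandas as pd

# Assumption: c is kept as text and e is an object column mixing True and NaN.
df = pd.DataFrame({
    "b": [6.25, 32.50, 16.75, 29.25, 21.75, 21.75, 7.75, 23.25, 12.25, 30.50],
    "c": ["2018-04-01"] * 10,
    "d": [True] * 10,
    "e": pd.Series([np.nan] * 5 + [True] * 4 + [np.nan], dtype=object),
    "f": [7] * 10,
    "g": [54.0] * 10,
    "h": [64.0] * 10,
})

# Inspect the dtypes before calling median().
print(df.dtypes)
```

Checking df.dtypes first makes it easier to see which columns are numeric and which are object.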
Finding the medians initially works with no problem:
df.median()
b 21.75
d 1.00
e 1.00
f 7.00
g 54.00
h 64.00
dtype: float64
However, if a column is dropped and the median is then computed, the median for column e disappears:
new_df = df.drop(columns=['b'])
new_df.median()
d 1.0
f 7.0
g 54.0
h 64.0
dtype: float64
This behavior is a little unexpected, since finding the median for column e by itself still works:
new_df['e'].median()
1.0
Using skipna=False does not make a difference:
new_df.median(skipna=False)
d 1.0
f 7.0
g 54.0
h 64.0
dtype: float64
(It does make a difference for the original dataframe:)
df.median(skipna=False)
b 21.75
d 1.00
e NaN
f 7.00
g 54.00
h 64.00
dtype: float64
The datatype of column e is object in both df and new_df, and the only difference between the two dataframes is that new_df does not have column b. Adding the column back into new_df does not resolve the issue. This only occurs when the first column, b, is dropped. It does not occur if column e is a float or integer datatype.
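To illustrate the dtype point, here is a small sketch with a stand-in for column e (the Series is hypothetical, not the original data). A column mixing True and NaN has object dtype, and casting it to float gives a numeric column whose median is still 1.0.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for column e: True mixed with NaN, stored as object.
e = pd.Series([np.nan, True, True, np.nan], dtype=object, name="e")
print(e.dtype)           # object

# Casting to float turns True into 1.0 and keeps NaN as NaN.
e_float = e.astype(float)
print(e_float.dtype)     # float64
print(e_float.median())  # 1.0
```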
This behavior is present in both pandas==0.22.0 and pandas==0.24.1.
There is now an open GitHub issue for anyone to try and solve this!
This appears to be a bug. Calling median on any dataframe maps to the internal _reduce function. With numeric_only set to None, this computes the median column by column, ignores failures (for column c, for example, the median computation fails) and accumulates the results (see _reduce in pandas source core/frame.py). So far this is fine. But while stitching the results together, it checks whether the results are scalars or series (for median they are scalars, of course). For this check it always uses the first column (see wrap_results in pandas source core/apply.py). So if the computation for the first column failed and was skipped, the check fails and raises an exception. This triggers the fallback within _reduce, which restricts the dataframe to numeric columns only (dropping object-dtype columns such as e, where True is mixed with NaN) and recomputes the medians.
So in your case, if column c (or any other column whose dtype makes the median computation fail, such as text) is the first column, then columns like e, whose NaN values leave them with object dtype, will also be dropped from the median results. Setting skipna does not change this, because the bug lies in how a non-numeric column in the first position triggers a forced numeric-only computation. I do not see any fix that is possible without changing the pandas codebase, other than ensuring that the first column always succeeds in the median computation.
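One way to sidestep the problem, rather than fix it, is to avoid DataFrame.median altogether and compute the median per column, skipping the columns that cannot be cast to float. This is a minimal sketch and not part of pandas: columnwise_median is a hypothetical helper name, and the small new_df below is an illustrative stand-in for the dataframe in the question.

```python
import numpy as np
import pandas as pd

def columnwise_median(frame, skipna=True):
    """Median of every column that can be cast to float (hypothetical helper)."""
    medians = {}
    for name in frame.columns:
        try:
            # True becomes 1.0; text such as '2018-04-01' raises ValueError.
            as_float = frame[name].astype(float)
        except (TypeError, ValueError):
            continue  # skip columns with no meaningful numeric median
        medians[name] = as_float.median(skipna=skipna)
    return pd.Series(medians, dtype=float)

# Illustrative stand-in: the first column is non-numeric, as in new_df.
new_df = pd.DataFrame({
    "c": ["2018-04-01"] * 4,
    "d": [True] * 4,
    "e": pd.Series([np.nan, True, True, np.nan], dtype=object),
    "f": [7] * 4,
})

result = columnwise_median(new_df)
print(result)
```

Because each column is handled on its own, the position of the non-numeric column no longer matters, and column e is kept regardless of column order.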