
Strange behavior with Pandas median

Consider the following dataframe:

       b           c     d     e  f     g     h
0   6.25  2018-04-01  True   NaN  7  54.0  64.0
1  32.50  2018-04-01  True   NaN  7  54.0  64.0
2  16.75  2018-04-01  True   NaN  7  54.0  64.0
3  29.25  2018-04-01  True   NaN  7  54.0  64.0
4  21.75  2018-04-01  True   NaN  7  54.0  64.0
5  21.75  2018-04-01  True  True  7  54.0  64.0
6   7.75  2018-04-01  True  True  7  54.0  64.0
7  23.25  2018-04-01  True  True  7  54.0  64.0
8  12.25  2018-04-01  True  True  7  54.0  64.0
9  30.50  2018-04-01  True   NaN  7  54.0  64.0

(copy and paste and use df = pd.read_clipboard() to create the dataframe)
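If the clipboard is not available, the same frame can be built by hand. This is a minimal sketch, not part of the original question. It assumes column c is stored as plain strings and that mixing True with NaN leaves column e as object dtype, which matches the dtype reported further down.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the frame shown above (values copied from the table).
df = pd.DataFrame({
    "b": [6.25, 32.50, 16.75, 29.25, 21.75, 21.75, 7.75, 23.25, 12.25, 30.50],
    "c": ["2018-04-01"] * 10,                    # dates kept as text
    "d": [True] * 10,
    "e": [np.nan] * 5 + [True] * 4 + [np.nan],   # True mixed with NaN
    "f": [7] * 10,
    "g": [54.0] * 10,
    "h": [64.0] * 10,
})

# Inspect the dtypes; column e is the one that matters for this question.
print(df.dtypes)
```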

Finding the medians initially works with no problem:

df.median()

b    21.75
d     1.00
e     1.00
f     7.00
g    54.00
h    64.00
dtype: float64

However, if a column is dropped and then the median is found, the median for column e disappears:

new_df = df.drop(columns=['b'])
new_df.median()

d     1.0
f     7.0
g    54.0
h    64.0
dtype: float64

This behavior is a little unexpected, and finding the median for column e by itself still works:

new_df['e'].median()
1.0

Using skipna=False does not make a difference:

new_df.median(skipna=False)

d     1.0
f     7.0
g    54.0
h    64.0
dtype: float64

It does make a difference for the original dataframe:

df.median(skipna=False)

b    21.75
d     1.00
e      NaN
f     7.00
g    54.00
h    64.00
dtype: float64

The datatype of column e is object in both df and new_df, and the only difference between the two dataframes is that new_df does not have column b. Adding the column back into new_df does not resolve the issue. The problem only occurs when the first column, b, is dropped. It does not occur if column e has a float or integer datatype.

This behavior is present in both pandas==0.22.0 and pandas==0.24.1.

There is now an open GitHub issue for anyone to try and solve this!

willk asked Feb 18 '19 21:02



1 Answer

This appears to be a bug. Calling median on any DataFrame dispatches to the internal _reduce function. With numeric_only set to None, it computes the median series by series, ignores failures (for column c, for example, the median computation will fail), and accumulates the results (see _reduce in the pandas source, core/frame.py). So far this is fine. But while stitching the results together, it checks whether the results are scalars or Series (for median they will of course be scalars). For this check it always uses the first column (see wrap_results in the pandas source, core/apply.py). So if the calculation for the first column failed and that column was skipped, the check fails and raises an exception. This triggers the fallback within _reduce, which forces the dataframe to numeric only (dropping non-numeric columns, including the object-dtype column e that mixes True and NaN) and recomputes the medians.
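The control flow described above can be imitated with a toy model. This is not the pandas source: toy_median is a hypothetical function written only to mirror the description (per-column attempt, a check tied to the first column, then a numeric-only fallback), and the frame is a cut-down copy of the one in the question.

```python
import numpy as np
import pandas as pd

def toy_median(frame):
    """Toy imitation of the described flow; NOT the real pandas internals."""
    results = {}
    for name in frame.columns:
        try:
            # Per-column attempt; text columns fail here and are skipped.
            results[name] = frame[name].astype(float).median()
        except (TypeError, ValueError):
            continue
    # The check depends on the first column having produced a result.
    if frame.columns[0] not in results:
        # Fallback: keep only numeric and bool dtypes, which drops object columns.
        numeric = frame.select_dtypes(include=["number", "bool"])
        return numeric.astype(float).median()
    return pd.Series(results)

df = pd.DataFrame({
    "b": [6.25, 32.50, 16.75, 29.25, 21.75, 21.75, 7.75, 23.25, 12.25, 30.50],
    "c": ["2018-04-01"] * 10,
    "d": [True] * 10,
    "e": [np.nan] * 5 + [True] * 4 + [np.nan],
    "f": [7] * 10,
})

with_b = toy_median(df)                         # first column b succeeds
without_b = toy_median(df.drop(columns=["b"]))  # first column c fails
print(with_b)
print(without_b)
```

In this toy version, column e survives only while a column that succeeds sits in the first position, which is the pattern reported in the question.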

So in your case, if column c (or any other column whose dtype makes the median computation fail, such as text) comes first, then the object-dtype columns such as e are also dropped from the median results. Setting skipna changes nothing, because the bug lies in how a non-numeric column in the first position triggers a forced numeric-only computation. I do not see any fix short of fixing it in the pandas codebase, other than ensuring that the first column always succeeds for the median computation.
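Until that is fixed upstream, one way to sidestep the column-order dependence is to compute each column's median separately. The sketch below is an assumption-laden workaround, not a pandas feature: median_per_column is a hypothetical helper that casts each column to float and skips any column where the cast fails.

```python
import numpy as np
import pandas as pd

def median_per_column(frame):
    """Median of every column that can be cast to float; others are skipped."""
    medians = {}
    for name in frame.columns:
        try:
            as_float = frame[name].astype(float)  # True -> 1.0, NaN stays NaN
        except (TypeError, ValueError):
            continue                              # e.g. the date strings in c
        medians[name] = as_float.median()
    return pd.Series(medians, dtype=float)

# new_df from the question: the original frame without column b.
new_df = pd.DataFrame({
    "c": ["2018-04-01"] * 10,
    "d": [True] * 10,
    "e": [np.nan] * 5 + [True] * 4 + [np.nan],
    "f": [7] * 10,
    "g": [54.0] * 10,
    "h": [64.0] * 10,
})

result = median_per_column(new_df)
print(result)
```

Because each column is handled on its own, the result does not depend on which column happens to come first.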

Prodipta Ghosh answered Sep 22 '22 15:09