Strange behavior with Pandas median

Tags: python, pandas, dataframe

Consider the following dataframe:

       b           c     d     e  f     g     h
0   6.25  2018-04-01  True   NaN  7  54.0  64.0
1  32.50  2018-04-01  True   NaN  7  54.0  64.0
2  16.75  2018-04-01  True   NaN  7  54.0  64.0
3  29.25  2018-04-01  True   NaN  7  54.0  64.0
4  21.75  2018-04-01  True   NaN  7  54.0  64.0
5  21.75  2018-04-01  True  True  7  54.0  64.0
6   7.75  2018-04-01  True  True  7  54.0  64.0
7  23.25  2018-04-01  True  True  7  54.0  64.0
8  12.25  2018-04-01  True  True  7  54.0  64.0
9  30.50  2018-04-01  True   NaN  7  54.0  64.0

(Copy the table above and run df = pd.read_clipboard() to create the dataframe.)
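For anyone who cannot use the clipboard, the same frame can probably be built by hand. This is only a sketch: it assumes column c is read as plain strings and that column e ends up as object dtype because it mixes True with NaN, which matches the dtypes described further down.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table above (assumption: 'c' is kept as plain strings).
df = pd.DataFrame({
    "b": [6.25, 32.50, 16.75, 29.25, 21.75, 21.75, 7.75, 23.25, 12.25, 30.50],
    "c": ["2018-04-01"] * 10,
    "d": [True] * 10,
    # Mixing NaN (a float) with True leaves this column as object dtype.
    "e": [np.nan] * 5 + [True] * 4 + [np.nan],
    "f": [7] * 10,
    "g": [54.0] * 10,
    "h": [64.0] * 10,
})

print(df.dtypes)
```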

Finding the medians initially works with no problem:

df.median()

b    21.75
d     1.00
e     1.00
f     7.00
g    54.00
h    64.00
dtype: float64

However, if a column is dropped and then the median is found, the median for column e disappears:

new_df = df.drop(columns=['b'])
new_df.median()

d     1.0
f     7.0
g    54.0
h    64.0
dtype: float64

This behavior is a little unexpected, and finding the median for column e by itself still works:

new_df['e'].median()
1.0

Using skipna=False does not make a difference:

new_df.median(skipna=False)

d     1.0
f     7.0
g    54.0
h    64.0
dtype: float64

(It does make a difference for the original dataframe):

df.median(skipna=False)

b    21.75
d     1.00
e      NaN
f     7.00
g    54.00
h    64.00
dtype: float64

The datatype of column e is object in both df and new_df, and the only difference between the two dataframes is that new_df does not have column b. Adding the column back into new_df does not resolve the issue. This only occurs when the first column, b, is dropped. It does not occur if column e has a float or integer datatype.

This behavior is present in both pandas==0.22.0 and pandas==0.24.1.

There is now an open GitHub issue for anyone to try and solve this!

715

asked Feb 18 '19 21:02

willk

1 Answer

This appears to be a bug. Calling median on any dataframe dispatches to the internal _reduce function. With numeric_only set to None, it computes the median one series at a time, ignores failures (the median of column c, for example, cannot be computed) and accumulates the results (see _reduce in the pandas source, core/frame.py). So far, so good. But while stitching the results together, it checks whether the results are scalars or series (for median they are scalars, of course). For this check it always uses the first column (see wrap_results in the pandas source, core/apply.py). So if the computation for the first column failed and that column was skipped, the check fails and raises an exception. This triggers the fallback path within _reduce, which forces the dataframe to numeric-only data (dropping non-numeric columns, including object-dtype columns such as e that mix booleans with NaN) and recomputes the medians.
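To make the sequence of steps easier to follow, here is a toy model in plain Python. It is not pandas code: toy_reduce, column_median and is_numeric_column are hypothetical names, and the "numeric dtype" test is a rough stand-in for what pandas does. It only mimics the flow described above: a per-column pass that skips failures, a check tied to the first column, and a numeric-only fallback.

```python
import math
from statistics import median


def column_median(values):
    """Median of the non-missing values; raises TypeError for text."""
    kept = []
    for v in values:
        if isinstance(v, str):
            raise TypeError("cannot take the median of text")
        if isinstance(v, float) and math.isnan(v):
            continue  # skip missing values
        kept.append(float(v))
    return median(kept)


def is_numeric_column(values):
    """Rough stand-in for a numeric dtype: bool mixed with NaN counts as object."""
    kinds = {type(v) for v in values}
    return kinds <= {int, float} or kinds == {bool}


def toy_reduce(columns):
    # First pass: per-column medians, silently skipping columns that fail.
    results = {}
    for name, values in columns.items():
        try:
            results[name] = column_median(values)
        except TypeError:
            pass
    # The check is tied to the first column; if it was skipped, fall back.
    first = next(iter(columns))
    if first in results:
        return results
    # Fallback: keep only the columns that look numeric, then recompute.
    return {name: column_median(values)
            for name, values in columns.items()
            if is_numeric_column(values)}


nan = float("nan")
with_b = {"b": [6.25, 32.5], "c": ["x", "x"], "d": [True, True],
          "e": [nan, True], "f": [7, 7]}
without_b = {k: v for k, v in with_b.items() if k != "b"}

print(sorted(toy_reduce(with_b)))     # e survives when b comes first
print(sorted(toy_reduce(without_b)))  # e is lost when c comes first
```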

So in your case, if column c (or any column of a dtype for which the median computation fails, such as text) is the first column, then object-dtype columns such as e are also dropped from the median results. Setting skipna makes no difference, because the bug lies in how a non-numeric column in the first position triggers a forced numeric-only computation. I do not see any fix short of fixing it in the pandas codebase, other than ensuring that the first column is always one for which the median computation succeeds.
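One way to sidestep the problem, sketched under the assumption that column e only ever holds True/False and NaN, is to cast it to float before asking for the medians. The question already notes that the issue does not occur when e is a float column. The small frame and the name fixed below are illustrative, not taken from the original data.

```python
import numpy as np
import pandas as pd

# Small stand-in for new_df: the text column 'c' comes first.
new_df = pd.DataFrame({
    "c": ["2018-04-01"] * 4,
    "d": [True] * 4,
    "e": [np.nan, True, True, np.nan],  # object dtype (bool mixed with NaN)
    "f": [7] * 4,
})

fixed = new_df.copy()
# Cast the object column to float so it counts as numeric data.
fixed["e"] = fixed["e"].astype("float64")

# Ask for numeric columns explicitly rather than relying on the fallback.
medians = fixed.median(numeric_only=True)
print(medians)
```

Passing numeric_only=True makes the column selection explicit, so the result should not depend on which column happens to come first.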

172

answered Sep 22 '22 15:09

Prodipta Ghosh

