I have a DataFrame, dfAB:
import numpy as np
import pandas as pd
import random
A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]
dfAB = pd.DataFrame({ 'A': A, 'B': B })
dfAB
I want the 75th percentile of each column, so I use the quantile function:
dfAB.quantile(0.75)
Now say I put some NaNs into dfAB and call the function again. The result is, of course, different:
dfAB.loc[5:8] = np.nan
dfAB.quantile(0.75)
When I calculated the mean of dfAB, I passed skipna so the NaNs would be ignored, since I don't want them affecting my stats (I have quite a few in my data, on purpose, and replacing them with zero would obviously skew the results):
dfAB.mean(skipna=True)
So my question is: does the quantile function handle NaNs, and if so, how?
Pandas DataFrame quantile() method: the quantile() method calculates the quantile of the values along a given axis. The default is axis=0, which gives one quantile per column. By specifying the column axis (axis='columns'), the quantile() method works across the columns instead and returns one quantile value for each row.
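To make the axis behaviour concrete, here is a minimal sketch on a small made-up frame (the name df and its values are purely illustrative, not from the question):

```python
import pandas as pd

# Hypothetical toy data, chosen so the medians are easy to check by hand
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})

# Default axis=0: one quantile per column
per_column = df.quantile(0.5)

# axis='columns': one quantile per row, computed across the columns
per_row = df.quantile(0.5, axis="columns")

print(per_column)
print(per_row)
```

The first result is indexed by column name and the second by the original row index.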
For reference, this is what the pandas documentation says about the na_values parameter (which applies when reading data, e.g. with read_csv, not to quantile): na_values : scalar, str, list-like, or dict, optional. Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To support this convention, it provides several functions for detecting, removing, and replacing null values in a DataFrame, such as isnull() and notnull().
Yes, DataFrame.quantile appears to skip NaN values when computing the result. To illustrate, you can compare its output to np.nanpercentile, which explicitly "computes the qth percentile of the data along the specified axis, while ignoring nan values" (quoted from the docs, my emphasis):
>>> dfAB
      A     B
0   5.0  10.0
1  43.0  67.0
2  86.0   2.0
3  61.0  83.0
4   2.0  27.0
5   NaN   NaN
6   NaN   NaN
7   NaN   NaN
8   NaN   NaN
9  27.0  70.0
>>> dfAB.quantile(0.75)
A    56.50
B    69.25
Name: 0.75, dtype: float64
>>> np.nanpercentile(dfAB, 75, axis=0)
array([56.5 , 69.25])
As you can see, the two results are identical, so the NaN rows are being left out of the calculation.
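If you want to double-check this yourself, here is a rough sketch that rebuilds the frame printed above by hand and compares three approaches. The variable names (q_default, q_dropna, q_numpy, q_plain) are just for illustration, and the explicit dropna() step is my own addition to show what "ignoring NaN" amounts to:

```python
import numpy as np
import pandas as pd

# Same values as the frame printed above, typed in by hand
dfAB = pd.DataFrame({
    "A": [5, 43, 86, 61, 2, np.nan, np.nan, np.nan, np.nan, 27],
    "B": [10, 67, 2, 83, 27, np.nan, np.nan, np.nan, np.nan, 70],
})

# 1) pandas quantile as called in the question
q_default = dfAB.quantile(0.75)

# 2) drop the NaNs explicitly, column by column, then take the quantile
q_dropna = dfAB.apply(lambda col: col.dropna().quantile(0.75))

# 3) NumPy's NaN-aware percentile
q_numpy = np.nanpercentile(dfAB.to_numpy(), 75, axis=0)

# For contrast: plain np.percentile does not skip NaN, so NaN propagates
q_plain = np.percentile(dfAB.to_numpy(), 75, axis=0)

print(q_default.to_numpy(), q_dropna.to_numpy(), q_numpy, q_plain)
```

The first three agree, while plain np.percentile returns NaN for both columns. As far as I can tell, quantile has no skipna argument like mean does, so if you actually want NaN to propagate you would need to handle that yourself.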