
How to prevent zero values from messing up a pandas boxplot?

Tags:

python

pandas

I have a pandas DataFrame that, after pivoting, prints as follows:

country   CHINA    USA
0        119.02    0.0
1        121.20    0.0
3        112.49    0.0
4        113.94    0.0
5        114.67    0.0
6        111.77    0.0
7        117.57    0.0
......................

......................
6648       0.00  420.0
6649       0.00  420.0
6650       0.00  420.0
6651       0.00  420.0
6652       0.00  420.0
6653       0.00  420.0
6654       0.00  500.0
6655       0.00  500.0
6656       0.00  390.0
6657       0.00  450.0
6658       0.00  420.0
6659       0.00  420.0
6660       0.00  450.0 

Here is the method:

import matplotlib.pyplot as plt
import pandas as pd

def visualize_box_plot(df):

    df = df[df.outlier != 1]
    df = pd.pivot_table(df, 
                     index=df.index, 
                     columns = df['country'],
                     values='value', 
                     fill_value = 0)

    df.CHINA = df.CHINA.round(2)
    df.USA = df.USA.round(2)

    # this is the output 
    # provided earlier 
    print(df)

    # keep only the rows where the country actually has a value
    df_usa = df[(df['USA'] != 0)]
    df_china = df[(df['CHINA'] != 0)]

    # to_numpy() replaces as_matrix(), which was removed in pandas 1.0
    usa = df_usa.to_numpy()[:, -1]
    china = df_china.to_numpy()[:, 0]

    print("USA:", len(usa), " ", "CHINA: ", len(china))

    # unequal length 
    # USA: 1673   CHINA:  4384

    x =  [china, usa]
    plt.boxplot(x)
    plt.show()

The zero values come from the NaNs produced during pivoting (filled in by fill_value = 0), and I would like to omit them when making the box plot. So I use this code:

    df_usa = df[(df['USA'] != 0)]
    df_china = df[(df['CHINA'] != 0)]

That code creates separate DataFrames, which are converted to NumPy arrays, and lastly I visualize them together with matplotlib. Note that the NumPy arrays do not have the same length, hence I can't just call the boxplot function directly on df.

Here is my visualization, where 1 and 2 need to be replaced with CHINA and USA respectively:

[image: box plot with default x-axis labels 1 and 2]
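For reference, one way to put the country names on the axis while staying with plain matplotlib is to set the tick labels after drawing. This is only a minimal sketch: the two arrays are hypothetical stand-ins for the filtered china and usa arrays, and the Agg backend line is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-ins for the filtered arrays (note the unequal lengths).
china = np.array([119.02, 121.20, 112.49, 113.94, 114.67, 111.77, 117.57])
usa = np.array([420.0, 420.0, 500.0, 390.0, 450.0])

fig, ax = plt.subplots()
ax.boxplot([china, usa])               # a list of 1-D arrays may differ in length
ax.set_xticklabels(["CHINA", "USA"])   # replace the default 1, 2 tick labels
ax.set_ylabel("value")

# Read the labels back to confirm what will be drawn.
tick_labels = [t.get_text() for t in ax.get_xticklabels()]
# Call plt.show() (with an interactive backend) to display the figure.
```

Because the two groups sit on very different scales, one box will still look squashed next to the other; that is a property of the data, not of the labeling.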

The visualization is not good, and I get the feeling there might be a better way to get the job done. Any suggestions? Some sample code would help a lot. You may round the df to 2 digits after the decimal point. The main goal is to make the code more elegant and improve the visualization.

Heisenberg asked Oct 28 '25 10:10


1 Answer

I think the code can be simpler: replace 0 with NaN and then call DataFrame.boxplot:

print (df.mask(df == 0))
#alternative solution
#print (df.replace(0,np.nan))
          CHINA    USA
country               
0        119.02    NaN
1        121.20    NaN
3        112.49    NaN
4        113.94    NaN
5        114.67    NaN
6        111.77    NaN
7        117.57    NaN
6648        NaN  420.0
6649        NaN  420.0
6650        NaN  420.0
6651        NaN  420.0
6652        NaN  420.0
6653        NaN  420.0
6654        NaN  500.0
6655        NaN  500.0
6656        NaN  390.0
6657        NaN  450.0
6658        NaN  420.0
6659        NaN  420.0
6660        NaN  450.0

df.mask(df == 0).boxplot()

[image: resulting box plot]
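To see why masking works, here is a small self-contained sketch. The tiny frame is a hypothetical stand-in for the pivoted df, and the Agg backend line is an assumption so it runs without a display. Each column keeps only its own non-NaN values, so the unequal lengths are no longer a problem.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove for interactive use
import pandas as pd

# Hypothetical miniature of the pivoted frame, zeros standing in for missing data.
df = pd.DataFrame({
    "CHINA": [119.02, 121.20, 112.49, 0.0, 0.0],
    "USA":   [0.0, 0.0, 0.0, 420.0, 500.0],
})

masked = df.mask(df == 0)   # cells equal to 0 become NaN
counts = masked.count()     # number of non-NaN values per column

# boxplot ignores NaN, so each box is built from that column's real values only.
ax = masked.boxplot()
```

Here counts shows 3 values for CHINA and 2 for USA, which are the values each box is drawn from.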

Another possible solution is to use DataFrame.plot.box:

df.mask(df == 0).plot.box()

[image: resulting box plot]
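Since the zeros only exist because of fill_value = 0, another option is to leave that argument out so the NaNs are never replaced in the first place. This is a sketch under assumptions, not the original method: the raw frame is hypothetical, and the pivot is done on a reset index column rather than on df.index directly.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove for interactive use
import pandas as pd

# Hypothetical long-format data, one row per observation.
raw = pd.DataFrame({
    "country": ["CHINA", "CHINA", "USA", "USA"],
    "value":   [119.02, 121.20, 420.0, 500.0],
})

# No fill_value here: missing country/row combinations stay NaN.
wide = raw.reset_index().pivot_table(index="index",
                                     columns="country",
                                     values="value")

has_zero = bool((wide == 0).any().any())  # no artificial zeros were introduced
ax = wide.round(2).plot.box()             # NaN entries are skipped when plotting
```

With this variant no masking step is needed, at the cost of carrying NaN through any later computation on the pivoted frame.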

See the Box Plots section of the pandas visualization docs.

jezrael answered Oct 29 '25 23:10