Plot graph with multiple attributes similar to "hue" in Seaborn

I have the following sample DataFrame, df, where each stage time is the number of days it took to reach that stage:

id  stage_1_time  stage_1_to_2_time  stage_2_time  stage_2_to_3_time  stage_3_time
a   10            30                 40            30                 70
b   30
c   15            30                 45
d

I wrote the following script to get a scatter plot of stage_1_time against its empirical CDF:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

#Sample data; None marks a stage that was never reached
data = {'id': ['a', 'b', 'c', 'd'], 'stage_1_time': [10, 30, 15, None], 'stage_1_to_2_time': [30, None, 30, None], 'stage_2_time': [40, None, 45, None], 'stage_2_to_3_time': [30, None, None, None], 'stage_3_time': [70, None, None, None]}
df = pd.DataFrame(data)

#create eCDF function for a 1-D array of values
def ecdf(values):
    n = len(values)
    x = np.sort(values)
    y = np.arange(1.0, n+1) / n
    return x, y

def generate_scatter_plot(values):

    x, y = ecdf(values)

    plt.plot(x, y, marker='.', linestyle='none')
    plt.axvline(x.mean(), color='gray', linestyle='dashed', linewidth=2) #Add mean

    x_m = int(x.mean())
    y_m = stats.percentileofscore(x, x.mean())/100.0

    plt.annotate('(%s,%s)' % (x_m,int(y_m*100)) , xy=(x_m,y_m), xytext=(10,-5), textcoords='offset points')

    percentiles = np.array([0,25,50,75,100])
    x_p = np.percentile(x, percentiles)
    y_p = percentiles/100.0

    plt.plot(x_p, y_p, marker='D', color='red', linestyle='none') # Overlay quartiles

    #Label each quartile marker with its x value
    for xq, yq in zip(x_p, y_p):
        plt.annotate('%s' % int(xq), xy=(xq,yq), xytext=(10,-5), textcoords='offset points')

#Data to plot
stage1_time = df['stage_1_time'].dropna().sort_values()

#Scatter Plot
generate_scatter_plot(stage1_time.values)
plt.title('Scatter Plot of Days to Stage1')
plt.xlabel('Days to Stage1')
plt.ylabel('Cumulative Probability')
plt.legend(('Days to Stage1', "Mean", 'Quartiles'), loc='lower right')
plt.margins(0.02)

plt.show()

Currently I have the number of days it took everyone who reached stage1 plotted against its cumulative probability. What I am trying to achieve is a scatter plot with three colors: those who reached stage1 and stayed there, those who moved on to stage2, and those who moved on to stage3. I would also like the counts shown in the graph: the number in stage1, in stage2 and in stage3.

Can anyone assist with getting there please?

FYI, the intention is to use this as a base so that I can also create a graph for stage_2_time, where those reaching stage 3 are highlighted in a different color.

asked May 30 '18 18:05

user8834780

1 Answer

You can create a new column that stores the final stage each id reached, then use that column to color your plot.
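
Taken on its own, the final-stage column could be derived roughly as follows. This is a minimal sketch, not part of the original script: the helper name final_stage is hypothetical, and it assumes stages are always reached in order, so the number of non-null stage times equals the last stage reached (0 means no stage at all).

```python
import pandas as pd

# Sample data from the question; None marks a stage that was never reached
df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'stage_1_time': [10, 30, 15, None],
    'stage_2_time': [40, None, 45, None],
    'stage_3_time': [70, None, None, None],
})

def final_stage(frame, stage_cols):
    """Hypothetical helper: number of the last stage with a recorded time."""
    reached = frame[stage_cols].notna()
    # Assumes stages are reached in order, so counting the non-null
    # stage times per row gives the final stage (0 if none was reached)
    return reached.sum(axis=1)

stage_cols = ['stage_1_time', 'stage_2_time', 'stage_3_time']
df['stage'] = final_stage(df, stage_cols)
print(df[['id', 'stage']])
```

The full script that follows applies the same idea inside generate_scatter_plot, using column positions instead of column names.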

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

#Sample data; None marks a stage that was never reached
data = {'id': ['a', 'b', 'c', 'd'], 'stage_1_time': [10, 30, 15, None], 'stage_1_to_2_time': [30, None, 30, None], 'stage_2_time': [40, None, 45, None], 'stage_2_to_3_time': [30, None, None, None], 'stage_3_time': [70, None, None, None]}
df = pd.DataFrame(data)

#create eCDF function; df must already be sorted by serie so x and y line up with the rows
def ecdf(df, serie):
    n = len(df)
    df['x'] = np.sort(df[serie])
    df['y'] = np.arange(1.0, n+1) / n
    return df

def generate_scatter_plot(df,serie,nb_stage):
    df=df.dropna(subset=[serie]).sort_values(by=[serie])
    #Stage-time columns sit at positions 1, 3, 5, ...; the last non-null one gives the final stage
    st=1
    for i in range(1,nb_stage*2,2):
        df.loc[df.iloc[:,i].notnull(),'stage']=st
        st=st+1

    df= ecdf(df, serie)
    #One plot call per final stage, each with its own color
    plt.plot(df.loc[df['stage'] == 1, 'x'], df.loc[df['stage'] == 1, 'y'], marker='.', linestyle='none',c='blue')
    plt.plot(df.loc[df['stage'] == 2, 'x'], df.loc[df['stage'] == 2, 'y'], marker='.', linestyle='none',c='red')
    plt.plot(df.loc[df['stage'] == 3, 'x'], df.loc[df['stage'] == 3, 'y'], marker='.', linestyle='none',c='green')
    plt.axvline(df['x'].mean(), color='gray', linestyle='dashed', linewidth=2) #Add mean

    x_m = int(df['x'].mean())
    y_m = stats.percentileofscore(df[serie], df['x'].mean())/100.0

    plt.annotate('(%s,%s)' % (x_m,int(y_m*100)) , xy=(x_m,y_m), xytext=(10,-5), textcoords='offset points')

    percentiles= np.array([0,25,50,75,100])
    x_p = np.percentile(df[serie], percentiles)
    y_p = percentiles/100.0

    plt.plot(x_p, y_p, marker='D', color='red', linestyle='none') # Overlay quartiles

    for x,y in zip(x_p, y_p):
        plt.annotate('%s' % int(x), xy=(x,y), xytext=(10,-5), textcoords='offset points')

#Scatter Plot
generate_scatter_plot(df,'stage_1_time',3)
plt.title('Scatter Plot of Days to Stage1')
plt.xlabel('Days to Stage1')
plt.ylabel('Cumulative Probability')
#One label per plotted artist, in plotting order: three stages, mean line, quartiles
plt.legend(('Stage1 only','Reached Stage2','Reached Stage3', "Mean", 'Quartiles'), loc='lower right')
plt.margins(0.02)

plt.show()
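
The question also asks for the counts per stage, which the script above does not show. One possible way to get them, sketched here under assumptions, is to count the rows per final stage with value_counts and put the result into the legend labels. The label wording, the colors mapping and the variable names are illustrative choices, and the final stage is again derived by assuming stages are reached in order.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch; remove for interactive use
import matplotlib.pyplot as plt

# Sample data from the question; None marks a stage that was never reached
df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'stage_1_time': [10, 30, 15, None],
    'stage_2_time': [40, None, 45, None],
    'stage_3_time': [70, None, None, None],
})
stage_cols = ['stage_1_time', 'stage_2_time', 'stage_3_time']
serie = 'stage_1_time'

# Keep rows that reached stage 1, sorted so the eCDF heights line up with the rows
sub = df.dropna(subset=[serie]).sort_values(by=serie).copy()
# Final stage per row, assuming stages are reached in order
sub['stage'] = sub[stage_cols].notna().sum(axis=1)
sub['y'] = np.arange(1.0, len(sub) + 1) / len(sub)

# Number of ids whose final stage is 1, 2 or 3
counts = sub['stage'].value_counts().to_dict()

fig, ax = plt.subplots()
colors = {1: 'blue', 2: 'red', 3: 'green'}
for stage, color in colors.items():
    part = sub[sub['stage'] == stage]
    # The count goes into the legend label of each stage
    ax.plot(part[serie], part['y'], marker='.', linestyle='none', c=color,
            label='Final stage %d (n=%d)' % (stage, counts.get(stage, 0)))
ax.legend(loc='lower right')
```

Passing label= to each plot call keeps every count attached to its own color, so the legend stays correct even when a stage has no rows. For interactive use, drop the Agg line and finish with plt.show().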

answered Oct 19 '22 10:10

Aurelia_B

Tags: python, pandas, matplotlib, python-2.7
