 

How to build a lift chart (a.k.a gains chart) in Python?

I just created a model using scikit-learn which estimates how likely a client is to respond to some offer. Now I'm trying to evaluate my model, and for that I want to plot the lift chart. I understand the concept of lift, but I'm struggling to understand how to actually implement it in Python.

Asked Mar 09 '17 by Abhishek Arora


People also ask

How does Python calculate lift?

Lift is calculated by taking the difference between the treatment conversion rate and the control conversion rate, and dividing it by the control conversion rate. The result is the relative percentage difference of treatment compared to control.
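
As a quick sketch of that formula (the rates below are made-up example numbers):

# Relative lift of the treatment group over the control group (illustrative values only)
treatment_conversion_rate = 0.12   # responders / contacts in the treated group
control_conversion_rate = 0.08     # responders / contacts in the control group

lift = (treatment_conversion_rate - control_conversion_rate) / control_conversion_rate
print(f"Lift: {lift:.0%}")         # -> Lift: 50%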

How is a lift chart constructed?

The graph is constructed with the cumulative number of cases (in descending order of probability) on the x-axis and the cumulative number of true positives on the y-axis. True positives are those observations from the important class (here, class 1) that are classified correctly.
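
A minimal sketch of that construction, assuming y_true holds the 0/1 labels and y_score the predicted probabilities for class 1 (both NumPy arrays):

import numpy as np

# Sort cases from most to least likely, then accumulate the hits
order = np.argsort(y_score)[::-1]
cumulative_cases = np.arange(1, len(y_true) + 1)                   # x-axis: cumulative number of cases
cumulative_true_positives = np.cumsum(np.asarray(y_true)[order])   # y-axis: cumulative true positives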

What is gain and lift chart?

The gain chart and lift chart are two measures used to quantify the benefit of using a model, typically in business contexts such as target marketing. They are not restricted to marketing analysis; they can also be used in other domains such as risk modelling and supply chain analytics.

What is the difference between lift and gain chart?

The gain chart plots gain on the vertical axis against decile on the horizontal axis. Lift is the ratio of the number of positive observations up to decile i using the model to the expected number of positives up to decile i under a random model.
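
A rough sketch of that decile calculation with pandas, assuming a DataFrame df with hypothetical columns 'actual' (0/1 labels) and 'proba' (predicted probabilities):

import pandas as pd

# 'actual' and 'proba' are placeholder column names -- adapt them to your own data
df = df.sort_values('proba', ascending=False).reset_index(drop=True)
df['decile'] = pd.qcut(df.index, 10, labels=False) + 1    # decile 1 = highest predicted probabilities

overall_rate = df['actual'].mean()                         # expected positive rate of a random model
cumulative_positives = df.groupby('decile')['actual'].sum().cumsum()
cumulative_cases = df.groupby('decile').size().cumsum()
lift_at_decile = cumulative_positives / (cumulative_cases * overall_rate)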

Is it possible to add multiple models to a lift chart?

In the Microsoft resource you provided, it is said: "You can add multiple models to a lift chart, as long as the models all have the same predictable attribute". I suppose you could use the AUC (area under the curve) to compare the different curves, as with the ROC or P-R curve, or am I wrong?

How do you construct a lift curve?

Constructing a lift curve follows a similar process to forming the cumulative gain curve; indeed, it is derived from the gain chart. First, we order observations along the horizontal axis, with the highest probability of being a target on the left and the lowest on the right.
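
A short sketch of that derivation, again assuming y_true (0/1 labels) and y_score (predicted probabilities) as NumPy arrays:

import numpy as np

order = np.argsort(y_score)[::-1]                                   # highest probability first
fraction_of_population = np.arange(1, len(y_true) + 1) / len(y_true)
cumulative_gain = np.cumsum(np.asarray(y_true)[order]) / np.sum(y_true)
lift_curve = cumulative_gain / fraction_of_population               # gain relative to a random model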


2 Answers

You can use the scikit-plot package to do the heavy lifting.

skplt.metrics.plot_cumulative_gain(y_test, predicted_probas) 

Example

# The usual train-test split mumbo-jumbo
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
nb = GaussianNB()
nb.fit(X_train, y_train)
predicted_probas = nb.predict_proba(X_test)

# The magic happens here
import matplotlib.pyplot as plt
import scikitplot as skplt
skplt.metrics.plot_cumulative_gain(y_test, predicted_probas)
plt.show()

This should result in a cumulative gains plot.
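
If you want the lift chart itself rather than the cumulative gains, scikit-plot also provides plot_lift_curve, which takes the same arguments (reusing y_test and predicted_probas from the snippet above):

import matplotlib.pyplot as plt
import scikitplot as skplt

skplt.metrics.plot_lift_curve(y_test, predicted_probas)
plt.show()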

Answered Sep 17 '22 by Jonny Brooks


Lift/cumulative gains charts aren't a good way to evaluate a model (they cannot be used to compare models); instead, they are a means of evaluating the results when your resources are finite, either because there's a cost to actioning each result (as in a marketing scenario) or because you want to ignore a certain number of guaranteed voters and only action those who are on the fence. Where your model is very good and has high classification accuracy for all results, you won't get much lift from ordering your results by confidence.

import numpy as np
import pandas as pd
import sklearn.metrics

def calc_cumulative_gains(df: pd.DataFrame, actual_col: str, predicted_col: str, probability_col: str):

    df.sort_values(by=probability_col, ascending=False, inplace=True)

    subset = df[df[predicted_col] == True]

    rows = []
    for group in np.array_split(subset, 10):
        score = sklearn.metrics.accuracy_score(group[actual_col].tolist(),
                                                   group[predicted_col].tolist(),
                                                   normalize=False)

        rows.append({'NumCases': len(group), 'NumCorrectPredictions': score})

    lift = pd.DataFrame(rows)

    #Cumulative Gains Calculation
    lift['RunningCorrect'] = lift['NumCorrectPredictions'].cumsum()
    lift['PercentCorrect'] = lift.apply(
        lambda x: (100 / lift['NumCorrectPredictions'].sum()) * x['RunningCorrect'], axis=1)
    lift['CumulativeCorrectBestCase'] = lift['NumCases'].cumsum()
    lift['PercentCorrectBestCase'] = lift['CumulativeCorrectBestCase'].apply(
        lambda x: 100 if (100 / lift['NumCorrectPredictions'].sum()) * x > 100 else (100 / lift[
            'NumCorrectPredictions'].sum()) * x)
    lift['AvgCase'] = lift['NumCorrectPredictions'].sum() / len(lift)
    lift['CumulativeAvgCase'] = lift['AvgCase'].cumsum()
    lift['PercentAvgCase'] = lift['CumulativeAvgCase'].apply(
        lambda x: (100 / lift['NumCorrectPredictions'].sum()) * x)

    #Lift Chart
    lift['NormalisedPercentAvg'] = 1
    lift['NormalisedPercentWithModel'] = lift['PercentCorrect'] / lift['PercentAvgCase']

    return lift

To plot the cumulative gains chart, you can use the code below.

    import matplotlib.pyplot as plt
    def plot_cumulative_gains(lift: pd.DataFrame):
        fig, ax = plt.subplots()
        fig.canvas.draw()
    
        # ax.plot returns a list of Line2D objects; keep the line itself for the legend
        handles = []
        handles.append(ax.plot(lift['PercentCorrect'], 'r-', label='Percent Correct Predictions')[0])
        handles.append(ax.plot(lift['PercentCorrectBestCase'], 'g-', label='Best Case (for current model)')[0])
        handles.append(ax.plot(lift['PercentAvgCase'], 'b-', label='Average Case (for current model)')[0])
        ax.set_xlabel('Total Population (%)')
        ax.set_ylabel('Number of Respondents (%)')

        ax.set_xlim([0, 9])
        ax.set_ylim([10, 100])

        # Relabel the decile positions (0-9) as cumulative population percentages (10-100)
        labels = [int((label + 1) * 10) for label in [float(item.get_text()) for item in ax.get_xticklabels()]]
        ax.set_xticklabels(labels)

        fig.legend(handles, labels=[h.get_label() for h in handles])
        fig.show()

And to visualise lift:

    def plot_lift_chart(lift: pd.DataFrame):
        plt.figure()
        plt.plot(lift['NormalisedPercentAvg'], 'r-', label='Normalised \'response rate\' with no model')
        plt.plot(lift['NormalisedPercentWithModel'], 'g-', label='Normalised \'response rate\' with using model')
        plt.legend()
        plt.show()
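
Putting the two functions together, a hypothetical usage sketch (the column names are placeholders, and y_test / predicted_probas could come from the scikit-learn example in the other answer):

# Hypothetical wiring-up of the functions above -- adapt the column names to your data
df = pd.DataFrame({'Actual': y_test,
                   'Predicted': predicted_probas[:, 1] > 0.5,
                   'Probability': predicted_probas[:, 1]})

lift = calc_cumulative_gains(df, 'Actual', 'Predicted', 'Probability')
plot_cumulative_gains(lift)
plot_lift_chart(lift)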

Result looks like:

[Cumulative Gains Chart]

I found these websites useful for reference:

  • https://docs.microsoft.com/en-us/sql/analysis-services/data-mining/lift-chart-analysis-services-data-mining
  • https://paultebraak.wordpress.com/2013/10/31/understanding-the-lift-chart/
  • http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html

Edit:

I found the MS link somewhat misleading in its descriptions, but the Paul Te Braak link very informative. To answer the comment:

For the cumulative gains chart above, all the calculations are based upon the accuracy for that specific model. As the Paul Te Braak link explains, the red line asks: how quickly can my model's predictions reach 100% of the correct classifications? The best case scenario (the green line) is how quickly we could reach the same accuracy that the red line achieves over the course of the whole population (i.e. our optimum cumulative gains scenario). The blue line is what we get if we just randomly pick the classification for each sample in the population. So the cumulative gains and lift charts are purely for understanding how that model (and that model only) will give me more impact in a scenario where I'm not going to interact with the entire population.

One scenario in which I have used the cumulative gains chart is fraud, where I want to know how many applications in the top X percent we can essentially ignore or prioritise (because I know that the model predicts them as well as it can). In that case, for the 'average model' I instead selected the classification from the real, unordered dataset (to show how existing applications were being processed, and how, using the model, we could instead prioritise types of application).

So, for comparing models, just stick with ROC/AUC, and once you're happy with the selected model, use the cumulative gains/lift chart to see how it responds to the data.
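
For that comparison step, a minimal sketch with scikit-learn (again assuming y_test and predicted_probas as in the other answer):

from sklearn.metrics import roc_auc_score

# Compare candidate models on AUC first, then inspect gains/lift for the chosen one
auc = roc_auc_score(y_test, predicted_probas[:, 1])
print(f"ROC AUC: {auc:.3f}")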

Answered Sep 20 '22 by morganics