How to measure the accuracy of predictions using Python/Pandas?

I have used the Elo and Glicko rating systems, along with match results, to generate ratings for players. Prior to each match, I can generate an expectation (a float between 0 and 1) for each player based on their respective ratings. I would like to test how accurate this expectation is, for two reasons:

  • To compare the different rating systems
  • To tune variables (such as kfactor in Elo) used to calculate ratings

There are a few differences from chess worth being aware of:

  • Possible results are wins (which I am treating as 1.0), losses (0.0), and the very occasional (<5%) draw (0.5 each). Each individual match is rated, not a series like in chess.
  • Players have fewer matches -- many have fewer than 10, few go over 25, and the maximum is 75.
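For reference, the expectation mentioned above typically comes from the standard Elo logistic formula (the exact variant used here is an assumption; Glicko uses a related but different formula):

```python
def elo_expectation(r_a, r_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Equal ratings give an expectation of 0.5; a 200-point edge
# gives roughly 0.76. The two players' expectations sum to 1.
print(elo_expectation(1500, 1500))
print(elo_expectation(1700, 1500))
```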

Thinking the appropriate function is "correlation", I attempted creating a DataFrame containing the prediction in one column (a float between 0 and 1) and the result in the other (1 | 0.5 | 0) and calling corr(), but based on the output, I am not sure this is correct.

If I create a DataFrame containing expectations and results for only the first player in each match (the results will always be 1.0 or 0.5, because in my data source losers are never displayed first), corr() returns a very low value: < 0.05. However, if I create a DataFrame with two rows per match, containing the expectation and result for each player (or, alternatively, randomly choose which player to append, so results will be 0, 0.5, or 1), corr() is much higher: ~0.15 to 0.30. I don't understand why this would make a difference, which makes me wonder whether I am misusing the function or using the wrong function entirely.
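The setup above can be reproduced with made-up numbers (the expectations and results below are hypothetical, not from my data). Because the first-player-only column contains only 1.0 and 0.5, its mean and variance differ from those of the mirrored two-rows-per-match version, so Pearson correlation comes out differently on the two framings even though they describe the same matches:

```python
import pandas as pd

# Hypothetical data: one row per match, first player only
# (winner listed first, so results are never 0.0)
first_only = pd.DataFrame({
    'expectation': [0.55, 0.70, 0.48, 0.62, 0.51],
    'result':      [1.0,  1.0,  0.5,  1.0,  1.0],
})

# Same matches, two rows each: every (p, r) pair plus its
# mirror (1 - p, 1 - r) for the opponent
both = pd.concat([
    first_only,
    pd.DataFrame({'expectation': 1 - first_only.expectation,
                  'result':      1 - first_only.result}),
], ignore_index=True)

c_first = first_only.expectation.corr(first_only.result)
c_both = both.expectation.corr(both.result)
print(c_first, c_both)  # the two correlations differ noticeably
```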

If it helps, here is some real (not random) sample data: http://pastebin.com/eUzAdNij

profesor_tortuga asked Mar 18 '17




1 Answer

An industry-standard way to judge the accuracy of predictions is the Receiver Operating Characteristic (ROC) curve. You can create it from your data using sklearn and matplotlib with the code below.

ROC is a 2-D plot of the true positive rate against the false positive rate. You want the curve to lie above the diagonal; the higher, the better. The Area Under the Curve (AUC) is a standard measure of accuracy: the larger it is, the more accurate your classifier.

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.metrics import roc_curve, auc
# add this magic if you're in a notebook
# %matplotlib inline

# read data
df = pd.read_csv('sample_data.csv', header=None, names=['classifier', 'category'])

# keep only decisive results: drop the two draws (0.5),
# since ROC needs binary labels
df = df.loc[(df.category == 1.0) | (df.category == 0.0), :]

# examine the data frame
df.head()

# matplotlib figure
figure, ax1 = plt.subplots(figsize=(8, 8))

# create the ROC curve itself
fpr, tpr, _ = roc_curve(df.category, df.classifier)

# compute AUC
roc_auc = auc(fpr, tpr)

# plotting bells and whistles
ax1.plot(fpr, tpr, label='%s (area = %0.2f)' % ('Classifier', roc_auc))
ax1.plot([0, 1], [0, 1], 'k--')
ax1.set_xlim([0.0, 1.0])
ax1.set_ylim([0.0, 1.0])
ax1.set_xlabel('False Positive Rate', fontsize=18)
ax1.set_ylabel('True Positive Rate', fontsize=18)
ax1.set_title('Receiver Operating Characteristic', fontsize=18)
plt.tick_params(axis='both', labelsize=18)
ax1.legend(loc='lower right', fontsize=14)
plt.grid(True)
plt.show()

From your data, you should get a plot like this one: [ROC curve plot]
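If you only want the AUC number (for example, to compare Elo against Glicko or to tune the k-factor) without plotting, it can be computed from its probabilistic interpretation: the probability that a randomly chosen win got a higher expectation than a randomly chosen loss, with ties counting half. This is a minimal pure-Python sketch of that statistic, not sklearn's implementation:

```python
def auc_score(labels, scores):
    """AUC as P(score of a random positive > score of a random negative),
    counting ties as 0.5. O(n^2), fine for small datasets."""
    pos = [s for y, s in zip(labels, scores) if y == 1.0]
    neg = [s for y, s in zip(labels, scores) if y == 0.0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect classifier scores 1.0; a useless one scores about 0.5
print(auc_score([1.0, 1.0, 0.0, 0.0], [0.9, 0.8, 0.3, 0.4]))  # 1.0
```

A higher AUC for one rating system (or one k-factor setting) than another means its expectations order the wins above the losses more reliably.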

Gena Kukartsev answered Sep 17 '22