How to measure the accuracy of predictions using Python/Pandas?

Q: How do you check accuracy with pandas?

You can take the intersection of the columns to find out which columns are common between the baseline and the forecast, and apply accuracy_score to just those columns. Write a function that takes a baseline and a forecast and returns the accuracy. For regression, try mean absolute error: the lower the error, the better the prediction.
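A minimal sketch of that idea follows. The function name `column_accuracy`, the toy frames, and the assumption that rows are already aligned by position are all illustrative, not taken from any particular dataset.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, mean_absolute_error


def column_accuracy(baseline: pd.DataFrame, forecast: pd.DataFrame) -> pd.Series:
    """Accuracy per column, using only the columns present in both frames."""
    common = baseline.columns.intersection(forecast.columns)
    return pd.Series(
        {col: accuracy_score(baseline[col], forecast[col]) for col in common}
    )


# Hypothetical data: rows are matched by position, labels are 0/1.
baseline = pd.DataFrame(
    {"a": [1, 0, 1, 1], "b": [0, 0, 1, 1], "only_base": [1, 1, 1, 1]}
)
forecast = pd.DataFrame(
    {"a": [1, 0, 0, 1], "b": [0, 0, 1, 1], "only_fcst": [0, 0, 0, 0]}
)

scores = column_accuracy(baseline, forecast)
print(scores)

# For continuous forecasts, use mean absolute error instead (lower is better).
mae = mean_absolute_error([1.0, 0.0, 0.5], [0.8, 0.3, 0.5])
print(mae)
```

Columns that appear in only one frame are ignored rather than raising an error.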

Q: How do you measure prediction accuracy?

Accuracy is a classification metric that gives the fraction of predictions that are correct. It is calculated by dividing the number of correct predictions by the total number of predictions.
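As a worked example of that division, here is a small pure-Python sketch. The helper name `accuracy` and the label lists are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

acc = accuracy(y_true, y_pred)
print(acc)  # 3 correct predictions out of 5
```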

Q: How do you compute accuracy in Python?

You can also get the accuracy score in Python using the accuracy_score() function from sklearn.metrics, which takes the true labels and the predicted labels as arguments and returns the accuracy as a float. sklearn.metrics also provides functions for many other common evaluation metrics.
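A short usage sketch, with made-up labels. `confusion_matrix` is shown as one example of the other metrics in the same module.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels only.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Fraction of correct predictions.
acc = accuracy_score(y_true, y_pred)

# With normalize=False the function returns the count of correct predictions.
n_correct = accuracy_score(y_true, y_pred, normalize=False)

# Rows are true labels, columns are predicted labels (sorted label order).
cm = confusion_matrix(y_true, y_pred)

print(acc, n_correct)
print(cm)
```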

Tags: python, python-3.x, pandas, statistics

I have used the Elo and Glicko rating systems along with match results to generate ratings for players. Prior to each match, I can generate an expectation (a float between 0 and 1) for each player based on their respective ratings. I would like to test how accurate this expectation is, for two reasons:

To compare the different rating systems
To tune variables (such as kfactor in Elo) used to calculate ratings
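For reference, the textbook Elo expectation and update look like the sketch below. This is the standard formulation on a 400-point scale, not necessarily the exact code behind the ratings in this question; the function names and the `k_factor` default are placeholders.

```python
def elo_expectation(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B (standard Elo, 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating: float, expected: float, score: float,
               k_factor: float = 32.0) -> float:
    """New rating after one match; score is 1.0 (win), 0.5 (draw) or 0.0 (loss)."""
    return rating + k_factor * (score - expected)


# Illustrative ratings only.
e_a = elo_expectation(1600, 1500)
e_b = elo_expectation(1500, 1600)
new_a = elo_update(1600, e_a, score=1.0)

print(e_a, e_b, new_a)
```

The two expectations for a match sum to 1, and `k_factor` controls how far a single result moves a rating, which is why it is a natural variable to tune.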

There are a few differences from chess worth being aware of:

Possible results are wins (which I am treating as 1.0), losses (0.0), and the very occasional (<5%) draw (0.5 each). Each individual match is rated, not a series like in chess.
Players have fewer matches: many have fewer than 10, few go over 25, and the maximum is 75

Thinking that the appropriate measure is correlation, I tried creating a DataFrame containing the prediction in one column (a float between 0 and 1) and the result in the other (1, 0.5, or 0) and calling corr(), but based on the output, I am not sure this is correct.

If I create a DataFrame containing expectations and results for only the first player in a match (the results will always be 1.0 or 0.5 since, due to my data source, losers are never listed first), corr() returns a very low value: < 0.05. However, if I create a series which has two rows for each match and contains both the expectation and result for each player (or, alternatively, randomly choose which player to append, so results will be 0, 0.5, or 1), the corr() value is much higher: ~0.15 to 0.30. I don't understand why this would make a difference, which makes me wonder if I am either misusing the function or using the wrong function entirely.
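The two constructions can be sketched like this. The numbers are made up for illustration and are not the pastebin data, and the sketch assumes the two players' expectations in a match sum to 1 (true for standard Elo, only approximately so for Glicko).

```python
import pandas as pd

# Made-up values for illustration only; the first-listed player never loses.
first = pd.DataFrame({
    "expectation": [0.60, 0.40, 0.70, 0.55, 0.30, 0.80],
    "result":      [1.0,  1.0,  1.0,  0.5,  1.0,  1.0],
})

# Construction 1: first-listed player only.
corr_first_only = first["expectation"].corr(first["result"])

# Construction 2: two rows per match. The opponent's row is the complement
# of the first player's row (assumes expectations and results each sum to 1).
second = 1.0 - first
both = pd.concat([first, second], ignore_index=True)
corr_both = both["expectation"].corr(both["result"])

print(corr_first_only, corr_both)
```

With these toy numbers the first value comes out much smaller than the second, which mirrors the pattern described above.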

If it helps, here is some real (not random) sample data: http://pastebin.com/eUzAdNij

290

asked Mar 18 '17 05:03

profesor_tortuga

1 Answer

An industry-standard way to judge the accuracy of a prediction is the Receiver Operating Characteristic (ROC) curve. You can create it from your data using sklearn and matplotlib with the code below.

The ROC curve is a 2-D plot of the true positive rate against the false positive rate as the decision threshold varies. You want the curve to lie above the diagonal, the higher the better. The Area Under the Curve (AUC) is a standard summary measure: the larger it is, the better your classifier separates the two classes.

import pandas as pd

# read data
df = pd.read_csv('sample_data.csv', header=None, names=['classifier','category'])

# keep only rows whose result is exactly 0 or 1 (the sample has two other rows)
df = df[df.category.isin([0.0, 1.0])]

# examine data frame
df.head()

from matplotlib import pyplot as plt
# add this magic if you're in a notebook
# %matplotlib inline

from sklearn.metrics import roc_curve, auc
# matplot figure
figure, ax1 = plt.subplots(figsize=(8,8))

# create ROC itself
fpr,tpr,_ = roc_curve(df.category,df.classifier)

# compute AUC
roc_auc = auc(fpr,tpr)

# plotting bells and whistles
ax1.plot(fpr,tpr, label='%s (area = %0.2f)' % ('Classifier',roc_auc))
ax1.plot([0, 1], [0, 1], 'k--')
ax1.set_xlim([0.0, 1.0])
ax1.set_ylim([0.0, 1.0])
ax1.set_xlabel('False Positive Rate', fontsize=18)
ax1.set_ylabel('True Positive Rate', fontsize=18)
ax1.set_title("Receiver Operating Characteristic", fontsize=18)
ax1.tick_params(axis='both', labelsize=18)
ax1.legend(loc="lower right", fontsize=14)
ax1.grid(True)
# plt.show() blocks until the window is closed when run as a script
plt.show()

From your data, you should get a plot like this one: [image: ROC curve for the sample data]
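The AUC value in the legend has a direct interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. A small pure-Python sketch of that definition follows; the helper name `pairwise_auc` and the data are hypothetical.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative labels and scores only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.6]

auc_value = pairwise_auc(labels, scores)
print(auc_value)  # 7 of the 9 positive/negative pairs are ordered correctly
```

The double loop is quadratic in the number of rows, so for real data sklearn's `roc_auc_score` or the `roc_curve` plus `auc` combination above is the practical choice.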

142

answered Sep 17 '22 12:09

Gena Kukartsev
