I am trying to add a sklearn prediction to a pandas dataframe, so that I can make a thorough evaluation of the prediction. The relevant piece of code is the following:
clf = linear_model.LinearRegression()
clf.fit(Xtrain,ytrain)
ypred = pd.DataFrame({'pred_lin_regr': pd.Series(clf.predict(Xtest))})
The dataframes look like this:
Xtest
axial_MET cos_theta_r1 deltaE_abs lep1_eta lep1_pT lep2_eta
8000 1.383026 0.332365 1.061852 0.184027 0.621598 -0.316297
8001 -1.054412 0.046317 1.461788 -1.141486 0.488133 1.011445
8002 0.259077 0.429920 0.769219 0.631206 0.353469 1.027781
8003 -0.096647 0.066200 0.411222 -0.867441 0.856115 -1.357888
8004 0.145412 0.371409 1.111035 1.374081 0.485231 0.900024
ytest
8000 1
8001 0
8002 0
8003 0
8004 0
ypred
pred_lin_regr
0 0.461636
1 0.314448
2 0.363751
3 0.291858
4 0.416056
Concatenating Xtest and ytest works fine:
df_total = pd.concat([Xtest, ytest], axis=1)
but the event information is lost on ypred.
What would be the most python/pandas/numpy-like way to do this?
I am using the following versions:
argparse==1.2.1
cycler==0.9.0
decorator==4.0.4
ipython==4.0.0
ipython-genutils==0.1.0
matplotlib==1.5.0
nose==1.3.7
numpy==1.10.1
pandas==0.17.0
path.py==8.1.2
pexpect==4.0.1
pickleshare==0.5
ptyprocess==0.5
py==1.4.30
pyparsing==2.0.5
pytest==2.8.2
python-dateutil==2.4.2
pytz==2015.7
scikit-learn==0.16.1
scipy==0.16.1
simplegeneric==0.8.1
six==1.10.0
sklearn==0.0
traitlets==4.0.0
wsgiref==0.1.2
I tried the following:
df_total["pred_lin_regr"] = clf.predict(Xtest)
This seems to do the job, but I am not sure that the events are matched correctly.
Generally, scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that are convertible to numeric arrays such as pandas DataFrame are also acceptable.
The sklearn predict method predicts an output. scikit-learn provides a set of tools for training and evaluating machine learning models, and it also has tools to predict an output value once the model is trained (for ML techniques that actually make predictions).
Sklearn Pandas, part of the scikit-learn-contrib project, adds some syntactic sugar to use dataframes in sklearn pipelines and back again. The first thing to note is that the output of predict is a numpy array.
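You're correct with your second line, df_total["pred_lin_regr"] = clf.predict(Xtest), and it's more efficient.
In that one you're taking the output of clf.predict(), which is a numpy array, and adding it to a dataframe as a new column. The array holds one prediction per row of Xtest, in the same order as those rows. A numpy array carries no index of its own, so pandas assigns it by position, and adding it to the dataframe will not change or alter that order. The events therefore match as long as df_total has the same row order as Xtest, which is the case here because df_total was built from Xtest.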
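To make the difference between positional assignment and index alignment concrete, here is a minimal sketch. The small Xtest frame and the pred array are made-up stand-ins (pred plays the role of clf.predict(Xtest)); only the column name pred_lin_regr and the index starting at 8000 are taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical test set whose index starts at 8000, as in the question.
Xtest = pd.DataFrame(
    {"lep1_pT": [0.62, 0.49, 0.35], "lep2_eta": [-0.32, 1.01, 1.03]},
    index=[8000, 8001, 8002],
)

# Stand-in for clf.predict(Xtest): a plain array with one value per row.
pred = np.array([0.46, 0.31, 0.36])

# Assigning an array is positional: row i of the frame gets pred[i].
by_position = Xtest.copy()
by_position["pred_lin_regr"] = pred

# A Series with the default 0..n-1 index is aligned by label instead,
# so none of the labels 8000..8002 find a match and the column is all NaN.
by_label = Xtest.copy()
by_label["pred_lin_regr"] = pd.Series(pred)

print(by_position)
print(by_label)
```

The second case is what happens to ypred in the question: wrapping the prediction in pd.Series without an index gives it fresh labels 0, 1, 2, ..., which no longer correspond to the event numbers.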
Here's a little proof, taking the following portion of this example:
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
print(regr.predict(diabetes_X_test))
df = pd.DataFrame(regr.predict(diabetes_X_test))
print(df)
The first print() function will give us a numpy array as expected:
[ 225.9732401 115.74763374 163.27610621 114.73638965 120.80385422
158.21988574 236.08568105 121.81509832 99.56772822 123.83758651
204.73711411 96.53399594 154.17490936 130.91629517 83.3878227
171.36605897 137.99500384 137.99500384 189.56845268 84.3990668 ]
That order is identical to the second print() function, in which we add the results to a dataframe:
0
0 225.973240
1 115.747634
2 163.276106
3 114.736390
4 120.803854
5 158.219886
6 236.085681
7 121.815098
8 99.567728
9 123.837587
10 204.737114
11 96.533996
12 154.174909
13 130.916295
14 83.387823
15 171.366059
16 137.995004
17 137.995004
18 189.568453
19 84.399067
Rerunning the code on a portion of the test set gives the same results in the same order:
print(regr.predict(diabetes_X_test[0:5]))
df = pd.DataFrame(regr.predict(diabetes_X_test[0:5]))
print(df)
[ 225.9732401 115.74763374 163.27610621 114.73638965 120.80385422]
0
0 225.973240
1 115.747634
2 163.276106
3 114.736390
4 120.803854
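If you would rather keep a separate ypred object and still use pd.concat as in the question, one option is to give the prediction the index of Xtest explicitly. The sketch below assumes synthetic data: the random features, the target and the 7/3 split are invented for illustration, and only the names clf, Xtest, ytest, ypred and pred_lin_regr follow the question.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic stand-in data with event numbers 8000..8009 as the index.
rng = np.random.RandomState(0)
index = pd.RangeIndex(8000, 8010)
X = pd.DataFrame(rng.normal(size=(10, 2)),
                 columns=["lep1_pT", "lep2_eta"], index=index)
y = pd.Series(rng.randint(0, 2, size=10), index=index, name="target")

# Simple split: first seven events for training, last three for testing.
Xtrain, Xtest = X.iloc[:7], X.iloc[7:]
ytrain, ytest = y.iloc[:7], y.iloc[7:]

clf = linear_model.LinearRegression()
clf.fit(Xtrain, ytrain)

# Reuse the index of Xtest so the prediction keeps the event labels.
ypred = pd.Series(clf.predict(Xtest), index=Xtest.index, name="pred_lin_regr")

# All three objects share the same index, so concat lines them up by event.
df_total = pd.concat([Xtest, ytest, ypred], axis=1)
print(df_total)
```

This gives the same column as the direct assignment, but the event labels travel with ypred, so it can be inspected or joined on its own.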