Adding scikit-learn (sklearn) prediction to pandas data frame

Tags: python, pandas, numpy, scikit-learn

I am trying to add a sklearn prediction to a pandas dataframe so that I can evaluate the prediction thoroughly. The relevant piece of code is the following:

clf = linear_model.LinearRegression()
clf.fit(Xtrain,ytrain)
ypred = pd.DataFrame({'pred_lin_regr': pd.Series(clf.predict(Xtest))})

The dataframes look like this:

Xtest

       axial_MET  cos_theta_r1  deltaE_abs  lep1_eta   lep1_pT  lep2_eta  
8000   1.383026      0.332365    1.061852  0.184027  0.621598 -0.316297   
8001  -1.054412      0.046317    1.461788 -1.141486  0.488133  1.011445   
8002   0.259077      0.429920    0.769219  0.631206  0.353469  1.027781   
8003  -0.096647      0.066200    0.411222 -0.867441  0.856115 -1.357888   
8004   0.145412      0.371409    1.111035  1.374081  0.485231  0.900024

ytest

ypred

        pred_lin_regr
0       0.461636
1       0.314448
2       0.363751
3       0.291858
4       0.416056

Concatenating Xtest and ytest works fine:

df_total = pd.concat([Xtest, ytest], axis=1)

but the event information is lost on ypred.

What would be the most python/pandas/numpy-like way to do this?

I am using the following versions:

argparse==1.2.1
cycler==0.9.0
decorator==4.0.4
ipython==4.0.0
ipython-genutils==0.1.0
matplotlib==1.5.0
nose==1.3.7
numpy==1.10.1
pandas==0.17.0
path.py==8.1.2
pexpect==4.0.1
pickleshare==0.5
ptyprocess==0.5
py==1.4.30
pyparsing==2.0.5
pytest==2.8.2
python-dateutil==2.4.2
pytz==2015.7
scikit-learn==0.16.1
scipy==0.16.1
simplegeneric==0.8.1
six==1.10.0
sklearn==0.0
traitlets==4.0.0
wsgiref==0.1.2

I tried the following:

df_total["pred_lin_regr"] = clf.predict(Xtest)

This seems to do the job, but I am not sure that the events are matched correctly.

asked Nov 08 '15 14:11

bolla

1 Answer

Your second approach, df_total["pred_lin_regr"] = clf.predict(Xtest), is correct, and it is also more efficient.

Here you take the output of clf.predict(), which is a NumPy array, and assign it as a new column of the dataframe. The array holds one prediction per row of Xtest, in the same order as those rows. Assigning an array to a dataframe column is positional and does not reorder anything, so each prediction lands on the row it was computed from, provided df_total still has the same rows in the same order as Xtest.
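The positional behaviour can be checked with a minimal sketch. The data below is made up for illustration: the column names a, b, c and target, the random values, and the index range are assumptions, and only the final assignment line is taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the question's data, indexed by event number
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(10, 3)),
                 columns=["a", "b", "c"],
                 index=range(7995, 8005))
y = (2.0 * X["a"] + 1.0).rename("target")

# Simple ordered split, as in the question's setup
Xtrain, Xtest = X.iloc[:5], X.iloc[5:]
ytrain, ytest = y.iloc[:5], y.iloc[5:]

clf = LinearRegression()
clf.fit(Xtrain, ytrain)

# Xtest and ytest share the same index, so concat keeps the event numbers
df_total = pd.concat([Xtest, ytest], axis=1)

# Assigning a NumPy array fills the column row by row; the index is untouched
df_total["pred_lin_regr"] = clf.predict(Xtest)

print(df_total)
```

The event numbers stay in the index of df_total, and the new column holds the predictions in the row order of Xtest.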

Here's a small demonstration, based on the following portion of an example:

import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

print(regr.predict(diabetes_X_test))

df = pd.DataFrame(regr.predict(diabetes_X_test))

print(df)

The first print() call gives us a NumPy array, as expected:

[ 225.9732401   115.74763374  163.27610621  114.73638965  120.80385422
  158.21988574  236.08568105  121.81509832   99.56772822  123.83758651
  204.73711411   96.53399594  154.17490936  130.91629517   83.3878227
  171.36605897  137.99500384  137.99500384  189.56845268   84.3990668 ]

The order is identical in the output of the second print() call, where the results are wrapped in a dataframe:

             0
0   225.973240
1   115.747634
2   163.276106
3   114.736390
4   120.803854
5   158.219886
6   236.085681
7   121.815098
8    99.567728
9   123.837587
10  204.737114
11   96.533996
12  154.174909
13  130.916295
14   83.387823
15  171.366059
16  137.995004
17  137.995004
18  189.568453
19   84.399067

Rerunning the code on a portion of the test set gives the results in the same order:

print(regr.predict(diabetes_X_test[0:5]))

df = pd.DataFrame(regr.predict(diabetes_X_test[0:5]))

print(df)

[ 225.9732401   115.74763374  163.27610621  114.73638965  120.80385422]
            0
0  225.973240
1  115.747634
2  163.276106
3  114.736390
4  120.803854
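If a separate ypred frame is preferred, as in the question's first attempt, one option is to reuse the index of Xtest when building it. The sketch below uses a small hypothetical Xtest and a hard-coded array standing in for clf.predict(Xtest); the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical test features indexed by event number, as in the question
Xtest = pd.DataFrame({"lep1_pT": [0.62, 0.49, 0.35]},
                     index=[8000, 8001, 8002])
pred = np.array([0.46, 0.31, 0.36])  # stand-in for clf.predict(Xtest)

# Without an explicit index the frame is labelled 0, 1, 2. concat aligns on
# labels, so no row matches and the result has six rows padded with NaN.
misaligned = pd.concat([Xtest, pd.DataFrame({"pred_lin_regr": pred})], axis=1)

# Reusing Xtest.index keeps each prediction on its event's row
ypred = pd.DataFrame({"pred_lin_regr": pred}, index=Xtest.index)
aligned = pd.concat([Xtest, ypred], axis=1)

print(misaligned)
print(aligned)
```

This is why the event information appeared to be lost in the question: wrapping the predictions in pd.Series or pd.DataFrame without an index gives them fresh labels starting at 0, unrelated to the event numbers.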

answered Nov 12 '22 15:11

Leb
