
Merging results from model.predict() with original pandas DataFrame?

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation (removed in scikit-learn 0.20)
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df['class'] = data.target

X = np.asarray(df.loc[:, [0, 1, 2, 3]])  # plain ndarray instead of np.matrix, which NumPy discourages
y = np.array(df['class'])

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

To merge these predictions back with the original df, I try this:

df['y_hats'] = y_hats 

But that raises:

ValueError: Length of values does not match length of index

I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.

blacksite asked Nov 21 '16 20:11


1 Answer

Your y_hats array is only as long as the test data (20%) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by comparing the predictions on X_test with the true values in y_test), you should rerun predict on the full dataset (X). Add these two lines to the bottom:

y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
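For reference, the validate-then-predict-on-everything flow described above might look like the sketch below. It assumes the iris setup from the question; `accuracy_score`, the `random_state` values, and the `test_accuracy` name are illustrative choices, not part of the original code.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

# Plain NumPy arrays, as in the question (row order matches df)
X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# random_state is an assumption added here only to make the run repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 1: judge the model on held-out rows only
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# Step 2: predict on the full matrix so the result has one value per df row
df['y_hats'] = model.predict(X)
```

Note that the values in `y_hats` for the training rows are predictions on data the model has already seen, so only `test_accuracy` says anything about generalization.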

EDIT: per your comment, here is an updated version that returns the dataset with the predictions appended only on the rows that were in the test dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation (removed in scikit-learn 0.20)
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable (kept as its own dataframe so it shares df's index)
df_class = pd.DataFrame(data = data.target)

# finally, split into train-test; splitting dataframes keeps the original row labels
X_train, X_test, y_train, y_test = train_test_split(df, df_class, train_size = 0.8)

model = DecisionTreeClassifier()

# pass a 1-D target to avoid scikit-learn's column-vector warning
model.fit(X_train, y_train.values.ravel())

# I've got my predictions now
y_hats = model.predict(X_test)

# work on a copy so adding a column does not trigger SettingWithCopyWarning
y_test = y_test.copy()
y_test['preds'] = y_hats

# left join on the index: rows that were in the training set get NaN in 'preds'
df_out = pd.merge(df, y_test[['preds']], how = 'left', left_index = True, right_index = True)
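If the feature matrix has to be built before splitting, as the question describes, one alternative is to pass the row labels through train_test_split as a third array, since it splits every array it receives in the same way. The sketch below is only an illustration under that assumption; `idx_train`, `idx_test`, and the `random_state` values are names and settings introduced here, not part of the original code.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

# X can be any preprocessed matrix, provided its row order still matches df
X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# The row labels are split alongside X and y, so idx_test records which
# df rows ended up in X_test (random_state is an illustrative assumption)
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
y_hats = model.predict(X_test)

# Assigning a Series aligns on the index: test rows get their prediction,
# every other row is filled with NaN
df['y_hats'] = pd.Series(y_hats, index=idx_test)
```

This leaves the training rows as NaN in `df['y_hats']`, which is the layout the question asks for, without splitting the dataframe first.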
flyingmeatball answered Sep 24 '22 00:09