
ValueError: Number of features of the model must match the input

I'm getting this error when trying to predict with a model I built in scikit-learn. I know there are a bunch of questions about this, but mine seems different because the feature counts of my input and my model are wildly off. Here is my code for training the model (FYI, the .csv file has 45 columns, one of which is the known value):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
from sklearn.externals import joblib


df = pd.read_csv("Cinderella.csv")


features_df = pd.get_dummies(df, columns=['Overall_Sentiment', 'Word_1','Word_2','Word_3','Word_4','Word_5','Word_6','Word_7','Word_8','Word_9','Word_10','Word_11','Word_1','Word_12','Word_13','Word_14','Word_15','Word_16','Word_17','Word_18','Word_19','Word_20','Word_21','Word_22','Word_23','Word_24','Word_25','Word_26','Word_27','Word_28','Word_29','Word_30','Word_31','Word_32','Word_33','Word_34','Word_35','Word_36','Word_37','Word_38','Word_39','Word_40','Word_41', 'Word_42', 'Word_43'], dummy_na=True)

del features_df['Slope']

X = features_df.as_matrix()
y = df['Slope'].as_matrix()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = ensemble.GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.01,
    max_depth=5,
    min_samples_leaf=3,
    max_features=0.1,
    loss='lad'
)

model.fit(X_train, y_train)

joblib.dump(model, 'slope_from_sentiment_model.pkl')

mse = mean_absolute_error(y_train, model.predict(X_train))

print("Training Set Mean Absolute Error: %.4f" % mse)

mse = mean_absolute_error(y_test, model.predict(X_test))
print("Test Set Mean Absolute Error: %.4f" % mse)

Here is my code for the actual prediction using a different .csv file (this one has 44 columns because it doesn't have the known values):

from sklearn.externals import joblib
import pandas


model = joblib.load('slope_from_sentiment_model.pkl')

df = pandas.read_csv("Slaughterhouse_copy.csv")


features_df = pandas.get_dummies(df, columns=['Overall_Sentiment','Word_1', 'Word_2', 'Word_3', 'Word_4', 'Word_5', 'Word_6', 'Word_7', 'Word_8', 'Word_9', 'Word_10', 'Word_11', 'Word_12', 'Word_13', 'Word_14', 'Word_15', 'Word_16', 'Word_17','Word_18','Word_19','Word_20','Word_21','Word_22','Word_23','Word_24','Word_25','Word_26','Word_27','Word_28','Word_29','Word_30','Word_31','Word_32','Word_33','Word_34','Word_35','Word_36','Word_37','Word_38','Word_39','Word_40','Word_41','Word_42','Word_43'], dummy_na=True)

predicted_slopes = model.predict(features_df)

When I run the prediction file I get:

ValueError: Number of features of the model must match the input. Model n_features is 146 and input n_features is 226.

If anyone could help me it would be greatly appreciated! Thanks in advance!

jack_f Avatar asked May 17 '17 13:05



3 Answers

You're getting the error because the columns you pass to get_dummies contain different sets of distinct values in your two datasets, so each dataset ends up with a different set of dummy columns.

Let's suppose the Word_1 column in your training set has the following distinct words: the, dog, jumps, roof, off. That's 5 distinct words, so pandas will generate 5 features for Word_1 (plus a NaN indicator column, since you pass dummy_na=True). Now, if your scoring dataset has a different number of distinct words in the Word_1 column, you're going to get a different number of features.
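The effect is easy to reproduce with two tiny frames. This is only a minimal sketch with made-up words, not the asker's data; the names `train` and `score` and their contents are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: the two files see different words in Word_1.
train = pd.DataFrame({"Word_1": ["the", "dog", "jumps", "roof", "off"]})
score = pd.DataFrame({"Word_1": ["the", "cat", "sleeps"]})

train_dummies = pd.get_dummies(train, columns=["Word_1"], dummy_na=True)
score_dummies = pd.get_dummies(score, columns=["Word_1"], dummy_na=True)

# One column per distinct word, plus the NaN indicator from dummy_na=True.
print(train_dummies.shape[1])  # 6
print(score_dummies.shape[1])  # 4
```

A model fitted on the first frame would then reject the second one, because the column counts (and the columns themselves) differ.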

How to fix:

You'll want to concatenate your training and scoring datasets using concat, apply get_dummies, and then split the result back into two datasets. That ensures both datasets share the dummy columns for every distinct value. Given that you're using two different CSVs, you'll probably want to add a column that marks each row as training or scoring data.

Example solution:

import pandas as pd

train_df = pd.read_csv("Cinderella.csv")
train_df['label'] = 'train'

score_df = pd.read_csv("Slaughterhouse_copy.csv")
score_df['label'] = 'score'

# Concat
concat_df = pd.concat([train_df , score_df])

# Create your dummies
# 'Overall_Sentiment' plus 'Word_1' ... 'Word_43', each listed exactly once
categorical_cols = ['Overall_Sentiment'] + ['Word_%d' % i for i in range(1, 44)]
features_df = pd.get_dummies(concat_df, columns=categorical_cols, dummy_na=True)

# Split your data
train_df = features_df[features_df['label'] == 'train']
score_df = features_df[features_df['label'] == 'score']

# Drop your labels
train_df = train_df.drop('label', axis=1)
score_df = score_df.drop('label', axis=1)

# Now take y from train_df['Slope'], delete the 'Slope' column from both frames (it is NaN for the scoring rows), create your feature matrices, and create your model as you have already shown in your example
...
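The steps above can be tried end to end on toy data. This is a hedged sketch rather than the asker's actual pipeline: the two small frames stand in for the CSV files, and the names `train_part`, `score_part`, `X_train`, `X_score` and `y_train` are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files.
train_df = pd.DataFrame({"Word_1": ["the", "dog", "roof"], "Slope": [0.1, 0.4, 0.2]})
score_df = pd.DataFrame({"Word_1": ["the", "cat"]})

train_df["label"] = "train"
score_df["label"] = "score"

# Encode both frames together so they share one set of dummy columns.
concat_df = pd.concat([train_df, score_df], ignore_index=True)
features_df = pd.get_dummies(concat_df, columns=["Word_1"], dummy_na=True)

train_part = features_df[features_df["label"] == "train"].drop("label", axis=1)
score_part = features_df[features_df["label"] == "score"].drop("label", axis=1)

# The target exists only for training rows; scoring rows hold NaN there.
y_train = train_part["Slope"]
X_train = train_part.drop("Slope", axis=1)
X_score = score_part.drop("Slope", axis=1)

print(list(X_train.columns) == list(X_score.columns))  # True
```

Because both parts come from the same encoded frame, a model fitted on `X_train` sees the same number and order of features when predicting on `X_score`.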
Scratch'N'Purr Avatar answered Nov 08 '22 14:11



I tried the method suggested here and ended up one-hot encoding the label column as well (get_dummies encodes every string column when no columns list is passed). In the dataframe it shows up as 'label_test' and 'label_train', so as a heads-up, try this after get_dummies:

train_df = feature_df[feature_df['label_train'] == 1]
test_df = feature_df[feature_df['label_test'] == 1]
train_df = train_df.drop(['label_train', 'label_test'], axis=1)
test_df = test_df.drop(['label_train', 'label_test'], axis=1)
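For context, here is a small sketch of how the label column can end up encoded in the first place. It assumes get_dummies is called without a `columns` argument and that the label values are the strings 'train' and 'test'; the toy frame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with a string-valued label column.
df = pd.DataFrame({
    "Word_1": ["the", "dog", "the"],
    "label": ["train", "train", "test"],
})

# No `columns` argument: every string column is encoded, 'label' included.
feature_df = pd.get_dummies(df)
print(sorted(feature_df.columns))

# Select rows by the indicator columns, then drop both indicators.
train_df = feature_df[feature_df["label_train"] == 1]
test_df = feature_df[feature_df["label_test"] == 1]
train_df = train_df.drop(["label_train", "label_test"], axis=1)
test_df = test_df.drop(["label_train", "label_test"], axis=1)
```

Passing an explicit `columns` list that leaves out 'label' avoids this, since only the listed columns are encoded.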
Akson Avatar answered Nov 08 '22 15:11



The correction below to the original answer from Scratch'N'Purr should help with issues you might face when using a string as the value of the newly inserted 'label' column:

import pandas as pd

train_df = pd.read_csv("Cinderella.csv")
train_df['label'] = 1

score_df = pd.read_csv("Slaughterhouse_copy.csv")
score_df['label'] = 2

# Concat
concat_df = pd.concat([train_df, score_df])

# Create your dummies (numeric columns such as 'label' are left as-is)
features_df = pd.get_dummies(concat_df)

# Split your data
train_df = features_df[features_df['label'] == 1]
score_df = features_df[features_df['label'] == 2]
...
code-on-treehouse Avatar answered Nov 08 '22 15:11
