I have a DataFrame (`data`) whose head looks like the following:
          status      datetime  country  amount    city
601766  received  1.453916e+09   France     4.5   Paris
669244  received  1.454109e+09    Italy     6.9  Naples
I would like to predict status given datetime, country, amount and city. Since status, country and city are strings, I one-hot-encoded them:
one_hot = pd.get_dummies(data['country'])
data = data.drop('country', axis=1)  # drop the column as it is now one-hot-encoded
data = data.join(one_hot)
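The snippet above only encodes country; the same pattern extends to every string feature column. A minimal, self-contained sketch (the frame values are made up to mirror the question's head):

```python
import pandas as pd

# Toy frame in the shape of the question's data
data = pd.DataFrame({
    'status':   ['received', 'received'],
    'datetime': [1.453916e+09, 1.454109e+09],
    'country':  ['France', 'Italy'],
    'amount':   [4.5, 6.9],
    'city':     ['Paris', 'Naples'],
})

# One-hot-encode each string feature column (status is the target, so leave it)
for col in ['country', 'city']:
    one_hot = pd.get_dummies(data[col], prefix=col)
    data = data.drop(col, axis=1).join(one_hot)

print(data.columns.tolist())
```

The `prefix=col` argument keeps the new dummy columns traceable to their source column (e.g. `country_France` vs. `city_Paris`).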
I then create a simple LinearRegression model and fit my data:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

y_data = data['status']
classifier = LinearRegression(n_jobs=-1)
X_train, X_test, y_train, y_test = train_test_split(data, y_data, test_size=0.2)
columns = X_train.columns.tolist()
classifier.fit(X_train[columns], y_train)
But I got the following error:
could not convert string to float: 'received'
I have the feeling I'm missing something here and would like some input on how to proceed. Thank you for reading this far!
One-hot encoding is a great tool for turning some of these categorical features into multiple binary features; the presence or absence of the individual categorical unit can then be fit into the linear regression.
Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model.
Because this procedure generates several new variables, it can explode the number of predictors if the original column has many unique values. Another disadvantage of one-hot encoding is that the resulting dummy columns are multicollinear (each category is perfectly predictable from the others), which can hurt the model's accuracy.
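A common mitigation for that collinearity (the "dummy variable trap") is to drop one level per category, which pandas supports directly via `drop_first`. A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series(['France', 'Italy', 'France', 'Spain'], name='country')

full = pd.get_dummies(s)                      # one column per category
reduced = pd.get_dummies(s, drop_first=True)  # first level dropped, acts as baseline

print(full.shape[1], reduced.shape[1])
```

With `drop_first=True`, the dropped level becomes the implicit baseline: a row of all zeros means "France" here.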
Similar to linear regression models, logistic regression models can accommodate continuous and/or categorical explanatory variables, as well as interaction terms to investigate potential combined effects of the explanatory variables.
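This matters here because `status` is itself categorical, so a classifier such as `LogisticRegression` is a better fit than `LinearRegression`. A minimal sketch, assuming a second made-up label `'refused'` for illustration:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up sample in the shape of the question's frame
data = pd.DataFrame({
    'status':   ['received', 'refused', 'received', 'refused'],
    'datetime': [1.45e9, 1.46e9, 1.47e9, 1.48e9],
    'country':  ['France', 'Italy', 'France', 'Spain'],
    'amount':   [4.5, 6.9, 12.0, 3.3],
    'city':     ['Paris', 'Naples', 'Lyon', 'Madrid'],
})

# One-hot-encode the string features; the string target can stay as-is
X = pd.get_dummies(data[['datetime', 'country', 'amount', 'city']])
y = data['status']  # classifiers accept string labels directly

clf = LogisticRegression().fit(X, y)
pred = clf.predict(X[:1])
print(pred)
```

Unlike `LinearRegression`, the classifier returns one of the original string labels rather than a float, which sidesteps the "could not convert string to float" error entirely.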
Consider the following approach:
first, let's label-encode all non-numeric columns (note this is label encoding, not one-hot encoding: each category becomes a single integer):
In [220]: from sklearn.preprocessing import LabelEncoder
In [221]: x = df.select_dtypes(exclude=['number']) \
.apply(LabelEncoder().fit_transform) \
.join(df.select_dtypes(include=['number']))
In [228]: x
Out[228]:
status country city datetime amount
601766 0 0 1 1.453916e+09 4.5
669244 0 1 0 1.454109e+09 6.9
now we can fit the LinearRegression classifier:
In [230]: classifier.fit(x.drop('status', axis=1), x['status'])
Out[230]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
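The transcript above can be reproduced end-to-end as a sketch; the frame values are the two rows shown in the question:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'status':   ['received', 'received'],
    'datetime': [1.453916e+09, 1.454109e+09],
    'country':  ['France', 'Italy'],
    'amount':   [4.5, 6.9],
    'city':     ['Paris', 'Naples'],
}, index=[601766, 669244])

# Label-encode every non-numeric column, then glue the numeric columns back on
x = (df.select_dtypes(exclude=['number'])
       .apply(LabelEncoder().fit_transform)
       .join(df.select_dtypes(include=['number'])))

classifier = LinearRegression(n_jobs=-1)
classifier.fit(x.drop('status', axis=1), x['status'])
print(x)
```

One caveat: `LinearRegression` on a label-encoded target predicts continuous values, so you would still need to round or, better, switch to a classifier to recover discrete status labels.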
To do one-hot encoding in a scikit-learn project, you may find it cleaner to use the scikit-learn-contrib project category_encoders: https://github.com/scikit-learn-contrib/categorical-encoding, which implements many common categorical encoding methods, including one-hot.