Logistic regression on One-hot encoding

I have a DataFrame (data) whose head looks like the following:

          status      datetime    country    amount    city  
601766  received  1.453916e+09    France       4.5     Paris
669244  received  1.454109e+09    Italy        6.9     Naples

I would like to predict status given datetime, country, amount and city.

Since status, country and city are strings, I one-hot encoded country and city:

for item in ['country', 'city']:
    one_hot = pd.get_dummies(data[item])
    data = data.drop(item, axis=1)  # drop the original column now that it is one-hot encoded
    data = data.join(one_hot)

I then create a simple LinearRegression model and fit my data:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

y_data = data['status']
classifier = LinearRegression(n_jobs=-1)
X_train, X_test, y_train, y_test = train_test_split(data, y_data, test_size=0.2)
columns = X_train.columns.tolist()
classifier.fit(X_train[columns], y_train)

But I got the following error:

could not convert string to float: 'received'

I have the feeling I'm missing something here and would like some input on how to proceed. Thank you for reading this far!

Mornor asked Jun 01 '17


People also ask

Can you use one-hot encoding linear regression?

One-hot encoding is a great tool for turning categorical features into multiple binary features; the presence or absence of each individual category can then be fit into the linear regression.
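For instance, a minimal sketch with made-up data (pd.get_dummies and LinearRegression as in the question):

import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical toy data: one categorical feature, one numeric target
df = pd.DataFrame({'country': ['France', 'Italy', 'France'],
                   'amount': [4.5, 6.9, 3.2]})

X = pd.get_dummies(df['country'])  # one binary column per country
model = LinearRegression().fit(X, df['amount'])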

Do you encode categorical variables for logistic regression?

Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model.
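As a sketch of that recoding (hypothetical data; the target is assumed to already be a 0/1 label), pd.get_dummies produces exactly such a series of variables:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({'city': ['Paris', 'Naples', 'Paris', 'Rome'],
                   'amount': [4.5, 6.9, 3.2, 8.1],
                   'status': [1, 0, 1, 0]})  # hypothetical binary target

# recode 'city' into dummy variables, keep 'amount' numeric
X = pd.get_dummies(df[['city', 'amount']], columns=['city'])
clf = LogisticRegression().fit(X, df['status'])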

What are two limitations of using one-hot encoding?

Because this procedure generates several new variables, it can produce too many predictors if the original column has a large number of unique values. Another disadvantage of one-hot encoding is that it introduces multicollinearity among the resulting variables, which can lower the model's accuracy.
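A quick illustration of the multicollinearity point (hypothetical data): the full set of dummy columns always sums to one, so together with an intercept they are perfectly collinear; pd.get_dummies(..., drop_first=True) drops one level to avoid this:

import pandas as pd

s = pd.Series(['Paris', 'Naples', 'Rome', 'Paris'])

full = pd.get_dummies(s)                      # 3 columns, one per city
print(full.sum(axis=1).unique())              # [1] -> rows always sum to 1
reduced = pd.get_dummies(s, drop_first=True)  # 2 columns, trap avoided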

Can you do logistic regression with categorical predictors?

Similar to linear regression models, logistic regression models can accommodate continuous and/or categorical explanatory variables as well as interaction terms to investigate potential combined effects of the explanatory variables (see our recent blog on Key Driver Analysis for more information).
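Applied to the question's data, a hedged sketch in current scikit-learn (the preprocessing choices here are an assumption, not taken from the question): one-hot encode country and city, pass datetime and amount through unchanged, and fit a LogisticRegression on status:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

pre = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), ['country', 'city'])],
    remainder='passthrough')  # keep datetime and amount as-is

clf = make_pipeline(pre, LogisticRegression(max_iter=1000))
# clf.fit(data.drop('status', axis=1), data['status'])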


2 Answers

Consider the following approach:

first, let's label encode all non-numeric columns (note that LabelEncoder maps each category to a single integer column, which is not the same as one-hot encoding):

In [220]: from sklearn.preprocessing import LabelEncoder

In [221]: x = df.select_dtypes(exclude=['number']) \
                .apply(LabelEncoder().fit_transform) \
                .join(df.select_dtypes(include=['number']))

In [228]: x
Out[228]:
        status  country  city      datetime  amount
601766       0        0     1  1.453916e+09     4.5
669244       0        1     0  1.454109e+09     6.9

now we can fit a LinearRegression model:

In [230]: classifier.fit(x.drop('status', axis=1), x['status'])
Out[230]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
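
Since the title asks about logistic regression: the same encoded frame can be fed to LogisticRegression, which treats status as a class label rather than a number to regress (a sketch; fitting needs at least two distinct status values, which the two example rows above don't have):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
# clf.fit(x.drop('status', axis=1), x['status'])  # needs >= 2 distinct status values
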
MaxU - stop WAR against UA answered Sep 22 '22


To do a one-hot encoding in a scikit-learn project, you may find it cleaner to use the scikit-learn-contrib project category_encoders: https://github.com/scikit-learn-contrib/categorical-encoding, which includes many common categorical variable encoding methods, including one-hot.
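For example, a minimal sketch with the question's column names (assuming the package is installed via pip install category_encoders):

import category_encoders as ce

encoder = ce.OneHotEncoder(cols=['country', 'city'], use_cat_names=True)
X = encoder.fit_transform(data.drop('status', axis=1))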

Will McGinnis answered Sep 19 '22