
Scikit-learn KNN does not work with string values

I'm using scikit-learn to do a K-Nearest Neighbour classification:

from sklearn.neighbors import KNeighborsClassifier

model=KNeighborsClassifier() 
model.fit(train_input,train_labels)

If I print my data:

print("train_input:")
print(train_input.iloc[0])
print("\n")
print("train_labels:")
print(train_labels.iloc[0]) 

I get this:

train_input:
PassengerId                          1
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                                 22
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object


train_labels:
0

The code fails with this error:

ValueError                                Traceback (most recent call last)
<ipython-input-21-1f18eec1e602> in <module>()
     63 
     64 model=KNeighborsClassifier()
---> 65 model.fit(train_input,train_labels)
ValueError: could not convert string to float: 'Q'

So, does the KNN algorithm not work with string values?

How can I modify my data so that it fits the KNN implementation in scikit-learn?

asked Dec 02 '17 at 16:12 by octavian




1 Answer

For nominal string features (categories with no natural order, such as Sex or Embarked), consider one-hot encoding: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
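As a rough illustration of the one-hot approach, the sketch below places OneHotEncoder inside a ColumnTransformer ahead of KNeighborsClassifier. The small frame is made-up stand-in data that reuses a few of the question's column names, the choice of which columns count as nominal is an assumption, and scaling the numeric columns is an addition that the answer does not mention (KNN is distance based, so it is commonly done). Missing values, such as those in Cabin, would need separate handling and are left out here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up stand-in for the question's data (illustrative values only).
train_input = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 30.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 13.00],
    "Embarked": ["S", "C", "S", "S", "Q", "S"],
})
train_labels = pd.Series([0, 1, 1, 1, 0, 0])

# Assumption: these are the nominal (unordered) string columns.
nominal_cols = ["Sex", "Embarked"]
numeric_cols = ["Pclass", "Age", "Fare"]

preprocess = ColumnTransformer([
    # Each string category becomes its own 0/1 column.
    ("nominal", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
    # Put numeric columns on a comparable scale for the distance metric.
    ("numeric", StandardScaler(), numeric_cols),
])

model = make_pipeline(preprocess, KNeighborsClassifier(n_neighbors=3))
model.fit(train_input, train_labels)
predictions = model.predict(train_input)
```

Free-text columns such as Name or Ticket have nearly one distinct value per row, so one-hot encoding them adds little; they are usually dropped or turned into derived features first.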

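For ordinal string features, consider label encoding (with a sensible ordering based on your understanding of the feature): http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html. Note that LabelEncoder is intended for target labels and assigns integer codes in sorted order, so a custom ordering has to be supplied another way, for example with OrdinalEncoder and an explicit categories list.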
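A minimal sketch of the difference, assuming a hypothetical ordinal column named deck_level (it is not part of the question's data): LabelEncoder numbers the classes in sorted order, while OrdinalEncoder accepts an explicit ordering through its categories parameter.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature, for illustration only.
df = pd.DataFrame({"deck_level": ["low", "high", "medium", "low"]})

# LabelEncoder sorts the classes: high=0, low=1, medium=2.
alphabetical = LabelEncoder().fit_transform(df["deck_level"])

# OrdinalEncoder with an explicit order: low=0, medium=1, high=2.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered = encoder.fit_transform(df[["deck_level"]]).ravel()

print(alphabetical)  # [1 0 2 1]
print(ordered)       # [0. 2. 1. 0.]
```

Only the second encoding preserves low < medium < high, which matters for KNN because distances are computed directly on the encoded values.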

answered Sep 16 '22 at 14:09 by Z3D__