I'm using scikit-learn to do a K-Nearest Neighbour classification:
from sklearn.neighbors import KNeighborsClassifier
model=KNeighborsClassifier()
model.fit(train_input,train_labels)
If I print my data:
print("train_input:")
print(train_input.iloc[0])
print("\n")
print("train_labels:")
print(train_labels.iloc[0])
I get this:
train_input:
PassengerId 1
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object
train_labels:
0
The code fails with this error:
ValueError Traceback (most recent call last)
<ipython-input-21-1f18eec1e602> in <module>()
63
64 model=KNeighborsClassifier()
---> 65 model.fit(train_input,train_labels)
ValueError: could not convert string to float: 'Q'
So, does the KNN algorithm not work with String values?
How can I modify my data such that it fits the KNN implementation in Scikit-Learn?
You can use string values as your target variable. The documentation describes the target as "{array-like, sparse matrix}, target values of shape = [n_samples] or [n_samples, n_outputs]" and does not say it has to be numeric. The input features, however, must be numeric, which is why fit fails when it reaches a string such as 'Q'.
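As a minimal sketch of the point about string targets, the snippet below fits a classifier on numeric features with string labels. The toy data and the label names "died" and "survived" are made up for illustration and are not taken from the question's dataset.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy numeric features (made up for illustration) with string class labels.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
     [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
y = ["died", "died", "died", "survived", "survived", "survived"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)  # string targets are accepted; only the features must be numeric

pred = model.predict([[0.1, 0.1], [5.0, 5.0]])
print(pred)
```

The predictions come back as the same string labels that were passed to fit.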
KNN is an algorithm that matches a point with its k closest neighbors in a multi-dimensional space. It can be applied to continuous, discrete, ordinal and categorical data, provided the values are encoded as numbers so that distances can be computed, which makes it useful for handling many kinds of missing data.
In KNN, finding a good value of k is not easy. A small k means that noise has a higher influence on the result, while a large k makes prediction computationally more expensive. Data scientists usually choose an odd k when the number of classes is 2, which avoids tied votes. Another simple approach is to set k = sqrt(n), where n is the number of training samples.
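The two rules of thumb above can be combined in a small helper. This is only a sketch: the function name heuristic_k is hypothetical, and rounding sqrt(n) up to the next odd number is one possible way to combine them, not an established convention.

```python
import math

def heuristic_k(n_samples):
    """Return an odd k close to sqrt(n_samples) (rule-of-thumb starting point)."""
    k = max(1, int(round(math.sqrt(n_samples))))
    if k % 2 == 0:
        k += 1  # bump to the next odd number to avoid tied votes with 2 classes
    return k

print(heuristic_k(891))
```

A value picked this way is best treated as a starting point and checked against held-out data.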
Summary: the k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. It's easy to implement and understand, but has the major drawback of becoming significantly slower as the size of the data grows.
For nominal String features, consider one hot encoding: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
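A minimal sketch of one hot encoding before fitting KNN might look like this. The column names Sex, Embarked and Fare come from the question's printout, but the rows and labels below are made up for illustration, and n_neighbors=1 is chosen only because the toy set is tiny.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; only the column names follow the question's data.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
    "Fare": [7.25, 71.28, 7.75, 8.05],
})
labels = [0, 1, 1, 0]

# One 0/1 column per category; categories unseen at fit time encode as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["Sex", "Embarked"]]).toarray()

# Put the encoded columns next to the feature that was already numeric.
features = np.hstack([encoded, df[["Fare"]].to_numpy()])

model = KNeighborsClassifier(n_neighbors=1)
model.fit(features, labels)
print(features.shape)
```

Since KNN is distance based, a numeric column with a large range (Fare here) can dominate the 0/1 columns, so scaling the numeric features is usually worth considering.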
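For ordinal String features, consider label encoding (with a sensible ordering based on your understanding of the feature): http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html. Note that LabelEncoder is intended for target labels and assigns codes in sorted order, so if you need a specific ordering for a feature, map the values yourself or use sklearn.preprocessing.OrdinalEncoder with an explicit categories list.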
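For an ordinal feature, one way to keep control of the ordering is OrdinalEncoder with an explicit categories list, sketched below. The feature values "third", "second" and "first" are hypothetical and only stand in for any ordered string column.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature, one column, four rows.
values = [["third"], ["first"], ["second"], ["third"]]

# The categories list fixes the order: third -> 0, second -> 1, first -> 2.
encoder = OrdinalEncoder(categories=[["third", "second", "first"]])
codes = encoder.fit_transform(values)
print(codes.ravel())
```

Without the categories argument the codes would follow sorted order of the strings, which rarely matches the meaning of the feature.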