
Could not convert string to float error from the Titanic competition

I'm trying to solve the Titanic survival problem from Kaggle as my first real step in learning machine learning. The gender column causes an error: the stack trace says could not convert string to float: 'female'. How have you dealt with this issue? I don't want a full solution, just a practical approach to the problem, because I do need the gender column to build my model.

This is my code:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)

x = filtered_titanic_data[columns_of_interest]
y = filtered_titanic_data.Survived

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)

titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)

val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)
oo92 asked Jun 22 '18 23:06


1 Answer

There are a couple of ways to deal with this, and which one fits depends on what you're looking for:

  1. You could encode your categories to numeric values, i.e. transform each level of your category to a distinct number,

or

  2. dummy code your category, i.e. turn each level of your category into a separate column, which gets a value of 0 or 1.

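In many machine learning applications, categorical variables (factors) are better handled as dummy codes, because integer codes imply an ordering between levels that usually does not exist.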

Note that in the case of a 2-level category, encoding to numeric according to the methods outlined below is essentially equivalent to dummy coding: all the values that are not level 0 are necessarily level 1. In fact, in the dummy code example I've given below, there is redundant information, as I've given each of the 2 classes its own column. It's just to illustrate the concept. Typically, one would only create n-1 columns, where n is the number of levels, and the omitted level is implied (i.e. make a column for Female, and all the 0 values are implied to be Male).

Encoding Categories to numeric:

Method 1: pd.factorize

pd.factorize is a simple, fast way of encoding to numeric:

For example, if your column gender looks like this:

>>> df
   gender
0  Female
1    Male
2    Male
3    Male
4  Female
5  Female
6    Male
7  Female
8  Female
9  Female

df['gender_factor'] = pd.factorize(df.gender)[0]

>>> df
   gender  gender_factor
0  Female              0
1    Male              1
2    Male              1
3    Male              1
4  Female              0
5  Female              0
6    Male              1
7  Female              0
8  Female              0
9  Female              0
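
pd.factorize returns two things: the integer codes and the unique values those codes index, so the mapping can be recovered later. A minimal sketch on the same toy column (the names codes, uniques, mapping and restored are my own, chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Female", "Male", "Male", "Male", "Female",
                              "Female", "Male", "Female", "Female", "Female"]})

# factorize returns (codes, uniques); codes follow order of first appearance
codes, uniques = pd.factorize(df["gender"])
df["gender_factor"] = codes

# uniques[i] is the label that code i stands for
mapping = dict(enumerate(uniques))
print(mapping)

# Indexing uniques with the codes recovers the original labels
restored = uniques.take(codes)
```

Because the codes follow order of appearance, the same label can get a different number on a differently ordered dataset, so it is worth keeping uniques around.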

Method 2: categorical dtype

Another way would be to use category dtype:

df['gender_factor'] = df['gender'].astype('category').cat.codes

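This gives the same output here. Note that cat.codes numbers the categories in sorted order (Female before Male), whereas pd.factorize numbers them in order of first appearance, so the two can differ on other data.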
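
If the numbering matters, the category order can be pinned explicitly. The sketch below is only an illustration on a made-up three-row column; the variable names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Male"]})

# Default categories for string data are sorted: Female -> 0, Male -> 1
as_cat = df["gender"].astype("category")
print(list(as_cat.cat.categories))
default_codes = as_cat.cat.codes.tolist()

# Explicit category order: Male -> 0, Female -> 1
gender_dtype = pd.CategoricalDtype(categories=["Male", "Female"])
explicit_codes = df["gender"].astype(gender_dtype).cat.codes.tolist()
print(default_codes, explicit_codes)
```

Values that are not listed in the explicit categories become missing and receive the code -1.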

Method 3: sklearn.preprocessing.LabelEncoder()

This method comes with some bonuses, such as easy back transforming:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# Transform the gender column
df['gender_factor'] = le.fit_transform(df.gender)

>>> df
   gender  gender_factor
0  Female              0
1    Male              1
2    Male              1
3    Male              1
4  Female              0
5  Female              0
6    Male              1
7  Female              0
8  Female              0
9  Female              0

# Easy to back transform:

df['gender_factor'] = le.inverse_transform(df.gender_factor)

>>> df
   gender gender_factor
0  Female        Female
1    Male          Male
2    Male          Male
3    Male          Male
4  Female        Female
5  Female        Female
6    Male          Male
7  Female        Female
8  Female        Female
9  Female        Female
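
The fitted encoder also stores its mapping and can be reused on new data. A small sketch, assuming a throwaway list of labels rather than the df above (codes, new_codes and unseen_error are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Female", "Male", "Male", "Female"])

# classes_ holds the labels in sorted order; a label's position is its code
print(le.classes_)

# The fitted encoder can be reused on new data containing the same labels
new_codes = le.transform(["Male", "Female"])

# A label that was not seen during fit raises ValueError
try:
    le.transform(["Unknown"])
    unseen_error = None
except ValueError as err:
    unseen_error = str(err)
```

Reusing the same fitted encoder for training and test data keeps the numbering consistent between the two.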

Dummy Coding:

Method 1: pd.get_dummies

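# dtype=int keeps the 0/1 output shown here; pandas >= 2.0 returns True/False by default
df.join(pd.get_dummies(df.gender, dtype=int))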

   gender  Female  Male
0  Female       1     0
1    Male       0     1
2    Male       0     1
3    Male       0     1
4  Female       1     0
5  Female       1     0
6    Male       0     1
7  Female       1     0
8  Female       1     0
9  Female       1     0
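
The redundancy in the two-column version can be checked directly: every row has exactly one 1, so either column can be rebuilt from the other. A quick sketch on a made-up four-row column (row_sums and rebuilt_female are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Female", "Male", "Male", "Female"]})
dummies = pd.get_dummies(df["gender"], dtype=int)

# Each row belongs to exactly one level, so the dummy columns sum to 1
row_sums = dummies.sum(axis=1)

# Hence the Female column carries no information beyond the Male column
rebuilt_female = 1 - dummies["Male"]
print(rebuilt_female.equals(dummies["Female"]))
```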

Note, if you want to omit one column to get a non-redundant dummy code (see my note at the beginning of this answer), you can use:

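# dtype=int keeps the 0/1 output shown here; pandas >= 2.0 returns True/False by default
df.join(pd.get_dummies(df.gender, drop_first=True, dtype=int))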

   gender  Male
0  Female     0
1    Male     1
2    Male     1
3    Male     1
4  Female     0
5  Female     0
6    Male     1
7  Female     0
8  Female     0
9  Female     0
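
Applied to the columns from the question, the dummy-coding step might look roughly like the sketch below. The rows are a hypothetical toy frame that only borrows the question's column names (it is not the Kaggle data), and the model is the same DecisionTreeRegressor the question uses:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy rows using the question's column names (not real Kaggle data)
toy = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "Sex":      ["male", "female", "female", "female", "male", "male"],
    "Age":      [20.0, 40.0, 25.0, 30.0, 50.0, 5.0],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Dummy-code Sex; drop_first leaves a single 0/1 column named Sex_male
x = pd.get_dummies(toy[["Pclass", "Sex", "Age"]], columns=["Sex"],
                   drop_first=True, dtype=int)
y = toy["Survived"]

model = DecisionTreeRegressor(random_state=0)
model.fit(x, y)  # every feature is numeric now, so no string-to-float error
predictions = model.predict(x)
```

Passing columns=["Sex"] encodes only that column and leaves Pclass and Age untouched; any of the numeric encodings above would work in its place.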
sacuL answered Nov 15 '22 07:11