
Unable to transform string column to categorical matrix using Keras and Sklearn

I am trying to build a simple Keras model, with Python 3.6 on macOS, to predict which price range a house falls into, but I cannot transform the output into a categorical matrix. I am using this dataset from Kaggle.

I've created a new column in the dataframe that holds the price ranges as strings, to serve as the target output of my model. I then tried to use keras.utils and the Sklearn LabelEncoder to create the binary output matrix, but I keep getting the error:

ValueError: invalid literal for int() with base 10: '0 - 50000'

Here is my code:

import pandas as pd
import numpy as np
from keras.layers import Dense
from keras.models import Sequential, load_model
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical, np_utils
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

seed = 7
np.random.seed(seed)

data = pd.read_csv("Melbourne_housing_FULL.csv")
data.fillna(0, inplace=True)

price_range = 50000
bins = np.arange(0, 12000000, price_range)
labels = ['{} - {}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])] 

#correct first value 
labels[0] = '0 - 50000'

for item in labels:
    str(item)

print (labels[:10])
['0 - 50000', '50001 - 100000', '100001 - 150000', '150001 - 200000', 
 '200001 - 250000', '250001 - 300000', '300001 - 350000', '350001 - 400000', 
 '400001 - 450000', '450001 - 500000']

data['PriceRange'] = pd.cut(data.Price, 
                            bins=bins, 
                            labels=labels, 
                            right=True, 
                            include_lowest=True)

#print(data.PriceRange.value_counts())
output_len = len(labels)
print(output_len)

Everything is correct here until I run the next piece:

predictors = data.drop(['Suburb', 'Address', 'SellerG', 'CouncilArea', 
                        'Propertycount', 'Date', 'Type', 'Price', 'PriceRange'], axis=1).as_matrix()

target = data['PriceRange']


# encode class values as integers
encoder = LabelEncoder()
encoder.fit(target)
encoded_Y = encoder.transform(target)

target = np_utils.to_categorical(data.PriceRange)

n_cols = predictors.shape[1]

And I get the ValueError: invalid literal for int() with base 10: '0 - 50000'

Can someone help me here? I don't really understand what I am doing wrong.

Many thanks

Hugo Sanchez asked Nov 30 '17 12:11


2 Answers

It's because np_utils.to_categorical expects y to be of integer type, but you are passing strings. Either convert them to integers by giving each category a key, i.e.:

cats = data.PriceRange.values.categories
di = dict(zip(cats,np.arange(len(cats))))
#{'0 - 50000': 0,
# '10000001 - 10050000': 200,
# '1000001 - 1050000': 20,
# '100001 - 150000': 2,
# '10050001 - 10100000': 201,
# '10100001 - 10150000': 202,

target = np_utils.to_categorical(data.PriceRange.map(di))

Or, since you are using pandas, you can use pd.get_dummies to get the one-hot encoding:

onehot = pd.get_dummies(data.PriceRange)
target_labels = onehot.columns
target = onehot.values  # as_matrix() was removed in pandas 1.0

array([[ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
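
As a minimal sketch of both approaches on toy data: the prices, bin edges, and labels below are made up for illustration and do not come from the Kaggle dataset, and np.eye stands in for np_utils.to_categorical so the snippet runs without Keras. Because pd.cut returns a categorical column, its integer codes are also available directly through .cat.codes, which should agree with the dictionary mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices (not taken from the Kaggle dataset)
prices = pd.Series([30000, 120000, 75000, 50000, 149999])
bins = np.arange(0, 200000, 50000)  # edges: 0, 50000, 100000, 150000
labels = ['0 - 50000', '50001 - 100000', '100001 - 150000']

price_range = pd.cut(prices, bins=bins, labels=labels,
                     right=True, include_lowest=True)

# Approach 1: map each string label to an integer key
cats = price_range.cat.categories
di = dict(zip(cats, np.arange(len(cats))))
mapped = price_range.map(di).astype(int).to_numpy()

# Shortcut: a categorical column already stores the same integer codes
codes = price_range.cat.codes.to_numpy()

# Stand-in for np_utils.to_categorical (an assumption made so the
# sketch needs no Keras): row i of the identity matrix is the
# one-hot vector for class i
one_hot = np.eye(len(cats))[codes]

# Approach 2: pd.get_dummies builds the same matrix, with the labels
# kept as column names
dummies = pd.get_dummies(price_range)

print(codes)
print(one_hot)
print(list(dummies.columns))
```

With real data, passing the integer codes to np_utils.to_categorical instead of the string column should avoid the ValueError, since the strings never reach the integer conversion.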
Bharath answered Nov 23 '22 15:11


With only one line of code (assuming tensorflow is imported as tf):

tf.keras.utils.to_categorical(data.PriceRange.factorize()[0])
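
One caveat worth checking, sketched here on made-up data: factorize() numbers the classes in order of first appearance, so its codes need not follow the bin order, and it marks missing values with -1. The second value it returns records which label each code stands for. The column and labels below are hypothetical, and no Keras call is made, so the snippet runs on pandas alone.

```python
import pandas as pd

# Hypothetical price-range column; None marks a price outside every bin
labels = ['0 - 50000', '50001 - 100000', '100001 - 150000']
price_range = pd.Series(
    pd.Categorical(['100001 - 150000', '0 - 50000', None, '100001 - 150000'],
                   categories=labels, ordered=True))

# factorize() assigns codes by first appearance; missing values get -1
codes, uniques = price_range.factorize()

# .cat.codes follows the category (bin) order instead, also with -1
# for missing values
ordered_codes = price_range.cat.codes

# Rows with code -1 have no class, so filter or handle them before
# one-hot encoding
valid = codes >= 0

print(codes, list(uniques))
print(ordered_codes.tolist())
print(valid)
```

If the column order of the one-hot matrix matters, for example when decoding predictions back to price ranges, keep uniques around or use .cat.codes so the columns line up with the bins.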
Marco Cerliani answered Nov 23 '22 15:11