Unable to transform string column to categorical matrix using Keras and Sklearn

Question

I am trying to build a simple Keras model, with Python 3.6 on macOS, to predict which price range a house falls into, but I cannot transform the output into a categorical matrix. I am using a dataset from Kaggle (Melbourne_housing_FULL.csv).

I created a new column in the dataframe that holds the price ranges as strings, to serve as the target output of my model. I then tried to build the binary output matrix with keras.utils and the Sklearn LabelEncoder, but I keep getting this error:

ValueError: invalid literal for int() with base 10: '0 - 50000'

Here is my code:

import pandas as pd
import numpy as np
from keras.layers import Dense
from keras.models import Sequential, load_model
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical, np_utils
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

seed = 7
np.random.seed(seed)

data = pd.read_csv("Melbourne_housing_FULL.csv")
data.fillna(0, inplace=True)

price_range = 50000
bins = np.arange(0, 12000000, price_range)
labels = ['{} - {}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])] 

#correct first value 
labels[0] = '0 - 50000'

for item in labels:
    str(item)

print (labels[:10])
['0 - 50000', '50001 - 100000', '100001 - 150000', '150001 - 200000', 
 '200001 - 250000', '250001 - 300000', '300001 - 350000', '350001 - 400000', 
 '400001 - 450000', '450001 - 500000']

data['PriceRange'] = pd.cut(data.Price, 
                            bins=bins, 
                            labels=labels, 
                            right=True, 
                            include_lowest=True)

#print(data.PriceRange.value_counts())
output_len = len(labels)
print(output_len)

Everything is correct here until I run the next piece:

predictors = data.drop(['Suburb', 'Address', 'SellerG', 'CouncilArea', 
                        'Propertycount', 'Date', 'Type', 'Price', 'PriceRange'], axis=1).as_matrix()

target = data['PriceRange']


# encode class values as integers
encoder = LabelEncoder()
encoder.fit(target)
encoded_Y = encoder.transform(target)

target = np_utils.to_categorical(data.PriceRange)

n_cols = predictors.shape[1]

And I get the ValueError: invalid literal for int() with base 10: '0 - 50000'

Can someone help me here? I don't really understand what I am doing wrong.

Many thanks

Bharath · Accepted Answer

It's because np_utils.to_categorical expects y to contain integers, but you are passing strings. Either convert them to integers by mapping each category to a key, e.g.:

cats = data.PriceRange.values.categories
di = dict(zip(cats,np.arange(len(cats))))
#{'0 - 50000': 0,
# '10000001 - 10050000': 200,
# '1000001 - 1050000': 20,
# '100001 - 150000': 2,
# '10050001 - 10100000': 201,
# '10100001 - 10150000': 202,

target = np_utils.to_categorical(data.PriceRange.map(di))

Or, since you are already using pandas, you can use pd.get_dummies to get the one-hot encoding directly:

onehot = pd.get_dummies(data.PriceRange)
target_labels = onehot.columns
# .values works across pandas versions; as_matrix() was removed in pandas 1.0
target = onehot.values

array([[ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
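
For reference, the two options above can be compared on a small toy frame. This is only a sketch under assumptions: the prices are made up, the bins are shortened to three ranges, `.cat.codes` is used in place of the dictionary lookup, and `np.eye` stands in for `to_categorical` so that the snippet runs without Keras installed.

```python
import numpy as np
import pandas as pd

# Toy prices standing in for data.Price (hypothetical values).
prices = pd.Series([30000, 120000, 75000, 149000, 50000])

# Same binning scheme as in the question, shortened to three ranges.
bins = np.arange(0, 200000, 50000)  # [0, 50000, 100000, 150000]
labels = ['{} - {}'.format(lo + 1, hi) for lo, hi in zip(bins[:-1], bins[1:])]
labels[0] = '0 - 50000'

price_range = pd.cut(prices, bins=bins, labels=labels,
                     right=True, include_lowest=True)

# Option 1: one integer code per row, in category order
# (0 = '0 - 50000', 1 = '50001 - 100000', ...).
codes = price_range.cat.codes.to_numpy()

# One-hot matrix built from the codes; np.eye is an assumption used here
# as a stand-in for to_categorical.
n_classes = len(price_range.cat.categories)
one_hot = np.eye(n_classes, dtype='float32')[codes]

# Option 2: pd.get_dummies yields the same matrix, with labels as columns.
dummies = pd.get_dummies(price_range)

# Map a class index (e.g. the argmax of a prediction) back to its label.
decoded = price_range.cat.categories[one_hot.argmax(axis=1)]

print(codes)
print(one_hot)
print(list(decoded))
```

Keeping the category order (or `target_labels` in the get_dummies variant) is what lets you translate a predicted column index back into a readable price range. Note that rows falling outside the bins get the code -1, so they should be filtered out before indexing.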

Marco Cerliani · Answer

This can be done in one line of code (assuming tensorflow is imported as tf):

tf.keras.utils.to_categorical(data.PriceRange.factorize()[0])
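
One thing worth checking with this approach is how `factorize` numbers the classes. The sketch below uses a made-up two-category column (an assumption, not the real data) to show that the codes follow order of first appearance rather than category order, and that missing values get the sentinel -1. `np.eye` is again a stand-in for `to_categorical`.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value, standing in for data.PriceRange.
ranges = pd.Series(pd.Categorical(
    ['50001 - 100000', '0 - 50000', None, '50001 - 100000'],
    categories=['0 - 50000', '50001 - 100000']))

# factorize numbers the values in order of first appearance; NaN becomes -1.
codes, uniques = ranges.factorize()

# .cat.codes numbers them in category order instead.
cat_codes = ranges.cat.codes.to_numpy()

# Drop the -1 rows before building the one-hot matrix; with NumPy-style
# indexing a -1 would silently select the last class.
valid = codes >= 0
one_hot = np.eye(len(uniques), dtype='float32')[codes[valid]]

print(codes, list(uniques))
print(cat_codes)
print(one_hot)
```

Keep `uniques` around: it is the lookup table for turning a predicted column index back into a label when the codes come from `factorize`.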

Tags: python-3.x, pandas, tensorflow, keras, scikit-learn

Hugo Sanchez
