I am trying to build a simple Keras model, with Python3.6 on MacOS, to predict house prices in a given range but I fail to transform the output into a category matrix. I am using this dataset from Kaggle.
I've created a new column in the dataframe with different price ranges as strings to serve as target output in my model, then use keras.utils and Sklearn LabelEncoder to try to create the output binary matrix but I keep getting the error:
ValueError: invalid literal for int() with base 10: '0 - 50000'
Here is my code:
import pandas as pd
import numpy as np
from keras.layers import Dense
from keras.models import Sequential, load_model
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical, np_utils
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
seed = 7
np.random.seed(seed)
data = pd.read_csv("Melbourne_housing_FULL.csv")
data.fillna(0, inplace=True)
price_range = 50000
bins = np.arange(0, 12000000, price_range)
labels = ['{} - {}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
#correct first value
labels[0] = '0 - 50000'
for item in labels:
str(item)
print (labels[:10])
['0 - 50000', '50001 - 100000', '100001 - 150000', '150001 - 200000',
'200001 - 250000', '250001 - 300000', '300001 - 350000', '350001 - 400000',
'400001 - 450000', '450001 - 500000']
data['PriceRange'] = pd.cut(data.Price,
bins=bins,
labels=labels,
right=True,
include_lowest=True)
#print(data.PriceRange.value_counts())
output_len = len(labels)
print(output_len)
Everything is correct here until I run the next piece:
predictors = data.drop(['Suburb', 'Address', 'SellerG', 'CouncilArea',
'Propertycount', 'Date', 'Type', 'Price', 'PriceRange'], axis=1).as_matrix()
target = data['PriceRange']
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(target)
encoded_Y = encoder.transform(target)
target = np_utils.to_categorical(data.PriceRange)
n_cols = predictors.shape[1]
And I get the ValueError: invalid literal for int() with base 10: '0 - 50000'
Con someone help me here? Don't really understand what I am doing wrong.
Many thanks
Its because np_utils.to_categorical
takes y of datatype int, but you have strings either convert them into int by giving them a key i.e :
cats = data.PriceRange.values.categories
di = dict(zip(cats,np.arange(len(cats))))
#{'0 - 50000': 0,
# '10000001 - 10050000': 200,
# '1000001 - 1050000': 20,
# '100001 - 150000': 2,
# '10050001 - 10100000': 201,
# '10100001 - 10150000': 202,
target = np_utils.to_categorical(data.PriceRange.map(di))
or since you are using pandas you can use pd.get_dummies
to get one hot encoding.
onehot = pd.get_dummies(data.PriceRange)
target_labels = onehot.columns
target = onehot.as_matrix()
array([[ 1., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 0., 0., 0.],
[ 1., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]])
With only one line of code
tf.keras.utils.to_categorical(data.PriceRange.factorize()[0])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With