I am new to Python and TensorFlow. I recently started working through the TensorFlow examples and came across this one: https://www.tensorflow.org/versions/r0.10/tutorials/wide_and_deep/index.html
I got the error TypeError: argument of type 'float' is not iterable, and I believe the problem is with the following line of code:
df_train[LABEL_COLUMN] = (df_train['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)
(income_bracket is the label column of the census dataset; '>50K' is one of the two possible label values and the other is '<=50K'. The dataset is read into df_train. The explanation the documentation gives for doing the above is: "Since the task is a binary classification problem, we'll construct a label column named "label" whose value is 1 if the income is over 50K, and 0 otherwise.")
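To illustrate what I expect that line to do on clean string values, here is a small standalone sketch (my own, not from the tutorial):
import pandas as pd
# On well-formed rows the mapping is simply string containment -> bool -> int.
labels = pd.Series(['<=50K', '>50K', '<=50K'])
print(labels.apply(lambda x: '>50K' in x).astype(int).tolist())  # prints [0, 1, 0]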
If anyone could explain to me what exactly is happening and how I should fix it, that would be great. I tried Python 2.7 and Python 3.4, and I don't think the problem is with the language version. Also, if anyone is aware of good tutorials for someone who is new to TensorFlow and pandas, please share the links.
Complete program:
import pandas as pd
import urllib
import tempfile
import tensorflow as tf
gender = tf.contrib.layers.sparse_column_with_keys(column_name="gender", keys=["female", "male"])
race = tf.contrib.layers.sparse_column_with_keys(column_name="race", keys=["Amer-Indian-Eskimo", "Asian-Pac-Islander", "Black", "Other", "White"])
education = tf.contrib.layers.sparse_column_with_hash_bucket("education", hash_bucket_size=1000)
marital_status = tf.contrib.layers.sparse_column_with_hash_bucket("marital_status", hash_bucket_size=100)
relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship", hash_bucket_size=100)
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass", hash_bucket_size=100)
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation", hash_bucket_size=1000)
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country", hash_bucket_size=1000)
age = tf.contrib.layers.real_valued_column("age")
age_buckets = tf.contrib.layers.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
education_num = tf.contrib.layers.real_valued_column("education_num")
capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")
wide_columns = [
    gender, native_country, education, occupation, workclass, marital_status, relationship, age_buckets,
    tf.contrib.layers.crossed_column([education, occupation], hash_bucket_size=int(1e4)),
    tf.contrib.layers.crossed_column([native_country, occupation], hash_bucket_size=int(1e4)),
    tf.contrib.layers.crossed_column([age_buckets, race, occupation], hash_bucket_size=int(1e6))]
deep_columns = [
    tf.contrib.layers.embedding_column(workclass, dimension=8),
    tf.contrib.layers.embedding_column(education, dimension=8),
    tf.contrib.layers.embedding_column(marital_status, dimension=8),
    tf.contrib.layers.embedding_column(gender, dimension=8),
    tf.contrib.layers.embedding_column(relationship, dimension=8),
    tf.contrib.layers.embedding_column(race, dimension=8),
    tf.contrib.layers.embedding_column(native_country, dimension=8),
    tf.contrib.layers.embedding_column(occupation, dimension=8),
    age, education_num, capital_gain, capital_loss, hours_per_week]
model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50])
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
"marital_status", "occupation", "relationship", "race", "gender",
"capital_gain", "capital_loss", "hours_per_week", "native_country", "income_bracket"]
LABEL_COLUMN = 'label'
CATEGORICAL_COLUMNS = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "gender", "native_country"]
CONTINUOUS_COLUMNS = ["age", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)
urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", test_file.name)
df_train = pd.read_csv(train_file, names=COLUMNS, skipinitialspace=True)
df_test = pd.read_csv(test_file, names=COLUMNS, skipinitialspace=True, skiprows=1)
df_train[LABEL_COLUMN] = (df_train['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)
df_test[LABEL_COLUMN] = (df_test['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)
def input_fn(df):
    # Continuous features become dense constant tensors.
    continuous_cols = {k: tf.constant(df[k].values)
                       for k in CONTINUOUS_COLUMNS}
    # Categorical features become SparseTensors of shape [num_rows, 1].
    categorical_cols = {k: tf.SparseTensor(
        indices=[[i, 0] for i in range(df[k].size)],
        values=df[k].values,
        shape=[df[k].size, 1])
        for k in CATEGORICAL_COLUMNS}
    # Merge the two dicts (list(...) keeps this working on both Python 2 and 3).
    feature_cols = dict(list(continuous_cols.items()) + list(categorical_cols.items()))
    label = tf.constant(df[LABEL_COLUMN].values)
    return feature_cols, label
def train_input_fn():
    return input_fn(df_train)

def eval_input_fn():
    return input_fn(df_test)
m.fit(input_fn=train_input_fn, steps=200)
results = m.evaluate(input_fn=eval_input_fn, steps=1)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))
Thank you
PS: Full stack trace for the error
Traceback (most recent call last):
File "/home/jaspreet/PycharmProjects/TicTacTensorFlow/census.py", line 73, in <module>
df_train[LABEL_COLUMN] = (df_train['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)
File "/usr/lib/python2.7/dist-packages/pandas/core/series.py", line 2023, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "inference.pyx", line 920, in pandas.lib.map_infer (pandas/lib.c:44780)
File "/home/jaspreet/PycharmProjects/TicTacTensorFlow/census.py", line 73, in <lambda>
df_train[LABEL_COLUMN] = (df_train['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)
TypeError: argument of type 'float' is not iterable
As you can see when you inspect the test data (adult.test), the first row ends up with NaN in the income_bracket field.
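That NaN is what breaks the lambda: pandas stores missing values as floats, and the in operator needs an iterable (such as a string) on its right-hand side. A minimal sketch reproducing the same TypeError:
x = float('nan')   # how pandas represents the missing income_bracket value
'>50K' in x        # TypeError: argument of type 'float' is not iterable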
I further checked that this is the only row containing NaN by doing:
ib = df_test["income_bracket"]
t = type('12')
for idx, i in enumerate(ib):
    if type(i) != t:
        print idx, type(i)
RESULT: 0 <type 'float'>
So you may just skip this row by:
df_test = pd.read_csv(test_file, names=COLUMNS, skipinitialspace=True, skiprows=1)
The program works verbatim with the latest version of pandas, i.e., 0.18.1
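Alternatively, a defensive variant (my own sketch, not part of the original tutorial) is to make the label construction itself NaN-safe with pandas' vectorized string matching, where na=False maps missing values to label 0:
# Assumes the same df_test and LABEL_COLUMN as above.
df_test[LABEL_COLUMN] = df_test['income_bracket'].str.contains('>50K', regex=False, na=False).astype(int)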