I'm getting started on a TensorFlow project, and am in the middle of defining and creating my feature columns. However, I have hundreds and hundreds of features; it's a pretty extensive dataset. Even after preprocessing and scrubbing, I have a lot of columns.
The traditional way of creating a feature_column is described in the TensorFlow tutorial and even in this StackOverflow post: you essentially declare and initialize a TensorFlow object for each feature column:
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
This is all well and good if your dataset has only a few columns, but in my case I certainly don't want hundreds of lines of code initializing different feature_column objects.
What's the best way to resolve this issue? I notice that in the tutorial, all the columns are collected as a list:
base_columns = [
    gender, native_country, education, occupation, workclass, relationship,
    age_buckets,
]
Which is ultimately passed into your estimator:
m = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=base_columns)
So would the ideal way of handling feature_column creation for hundreds of columns be to append them directly into a list? Something like this?
my_columns = []
for col in df.columns:
    if is_string_dtype(df[col]):  # is_string_dtype is a pandas function
        my_column.append(tf.feature_column.categorical_column_with_hash_bucket(
            col, hash_bucket_size=len(df[col].unique())))
    elif is_numeric_dtype(df[col]):  # is_numeric_dtype is a pandas function
        my_column.append(tf.feature_column.numeric_column(col))
Is this the best way of creating these feature columns? Or am I missing some functionality in TensorFlow that allows me to work around this step?
Think of feature columns as the intermediaries between raw data and Estimators. Feature columns are very rich, enabling you to transform a diverse range of raw data into formats that Estimators can use, allowing easy experimentation. In simple words, a feature column is the bridge between the raw data and the estimator or model.
A feature column describes a set of transformations applied to the raw inputs. For example, tf.feature_column.numeric_column('b') passes the numeric feature b through as-is, while to "bucketize" a feature a you wrap its numeric column in tf.feature_column.bucketized_column, which represents discretized dense input bucketed by boundaries.
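For illustration, that bucketization looks roughly like this (a minimal sketch; the feature name 'a' and the boundary values are placeholders, not from the original post):

import tensorflow as tf

# Raw numeric input for a feature called "a".
a_numeric = tf.feature_column.numeric_column('a')

# Discretize "a" into four buckets: (-inf, 0), [0, 10), [10, 100), [100, +inf).
a_bucketized = tf.feature_column.bucketized_column(
    source_column=a_numeric,
    boundaries=[0, 10, 100])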
What you have posted in the question makes sense. Here is a small extension based on your own code:
import pandas.api.types as ptypes

my_columns = []
for col in df.columns:
    if ptypes.is_string_dtype(df[col]):
        # Hash string values into a fixed number of buckets.
        my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
            col, hash_bucket_size=len(df[col].unique())))
    elif ptypes.is_numeric_dtype(df[col]):
        my_columns.append(tf.feature_column.numeric_column(col))
    elif ptypes.is_categorical_dtype(df[col]):
        # Pandas categoricals already carry their vocabulary.
        my_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(
            col, list(df[col].cat.categories)))
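The resulting my_columns list can then be handed to an estimator just like base_columns in the question. A minimal sketch, assuming a TF 1.x setup in which df holds the features and labels is a separate pandas Series (both names are placeholders):

# Feed the DataFrame through TF 1.x's pandas input function; df and labels
# are placeholder names, not part of the answer above.
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=df, y=labels, batch_size=128, num_epochs=None, shuffle=True)

model = tf.estimator.LinearClassifier(feature_columns=my_columns)
model.train(input_fn=train_input_fn, steps=1000)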
I used your own answer. I just edited it a little bit (it should be my_columns instead of my_column in the for loop) and am posting it the way it worked for me.
import pandas.api.types as ptypes

my_columns = []
for col in df.columns:
    if ptypes.is_string_dtype(df[col]):  # is_string_dtype is a pandas function
        my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
            col, hash_bucket_size=len(df[col].unique())))
    elif ptypes.is_numeric_dtype(df[col]):  # is_numeric_dtype is a pandas function
        my_columns.append(tf.feature_column.numeric_column(col))
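As a side note, if these columns are later fed to a deep estimator such as DNNClassifier instead of a linear one, the hashed categorical columns have to be wrapped first. A minimal sketch (the 'occupation' and 'age' feature names, bucket size, and embedding dimension are assumptions for illustration):

# Categorical columns must be wrapped in indicator_column or embedding_column
# before a DNN-based estimator can consume them.
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    'occupation', hash_bucket_size=1000)
occupation_emb = tf.feature_column.embedding_column(occupation, dimension=8)

dnn = tf.estimator.DNNClassifier(
    feature_columns=[occupation_emb, tf.feature_column.numeric_column('age')],
    hidden_units=[128, 64])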