
Creating many feature columns in Tensorflow

I'm getting started on a TensorFlow project and am in the middle of defining my feature columns. However, I have hundreds of features; it's a pretty extensive dataset. Even after preprocessing and scrubbing, I still have a lot of columns.

The traditional way of creating a feature_column is described in the TensorFlow tutorial and in this StackOverflow post. You essentially declare and initialize a TensorFlow object for each feature column:

gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])

This works well if your dataset has only a few columns, but in my case I don't want hundreds of lines of code initializing different feature_column objects.

What's the best way to resolve this issue? I notice that in the tutorial, all the columns are collected as a list:

base_columns = [
    gender, native_country, education, occupation, workclass, relationship,
    age_buckets,
]

This list is ultimately passed to the estimator:

m = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=base_columns)

So would the ideal way of handling feature_column creation for hundreds of columns be to append them directly into a list? Something like this?

my_columns = []

for col in df.columns:
    if is_string_dtype(df[col]): #is_string_dtype is pandas function
        my_column.append(tf.feature_column.categorical_column_with_hash_bucket(col, 
            hash_bucket_size= len(df[col].unique())))

    elif is_numeric_dtype(df[col]): #is_numeric_dtype is pandas function
        my_column.append(tf.feature_column.numeric_column(col))

Is this the best way of creating these feature columns? Or am I missing some functionality in TensorFlow that would let me avoid this step?

Yu Chen asked Oct 19 '17 16:10




2 Answers

What you posted in the question makes sense. Here is a small extension based on your own code:

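import pandas.api.types as ptypes
import tensorflow as tf

my_columns = []
for col in df.columns:  # df is the pandas DataFrame holding the features
  if ptypes.is_string_dtype(df[col]):
    # String column: hash values into one bucket per distinct value seen.
    my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
        col, hash_bucket_size=len(df[col].unique())))

  elif ptypes.is_numeric_dtype(df[col]):
    my_columns.append(tf.feature_column.numeric_column(col))

  elif ptypes.is_categorical_dtype(df[col]):
    # pandas categorical column: use its categories as the vocabulary
    # (assumes the categories are strings or integers).
    my_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(
        col, vocabulary_list=list(df[col].cat.categories)))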
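
As a rough sketch of the same idea, the dtype checks can be separated from the TensorFlow calls, so the column plan can be inspected before anything is built. The names `plan_feature_columns`, `build_feature_columns`, and the `label` argument are hypothetical helpers invented for this illustration, not TensorFlow or pandas APIs, and the builder assumes a TensorFlow release that still ships `tf.feature_column`.

```python
import pandas as pd
import pandas.api.types as ptypes


def plan_feature_columns(df, label=None):
    """Describe one feature column per DataFrame column as (name, kind, detail).

    Hypothetical helper for illustration; `label` names a target column to skip.
    """
    plan = []
    for col in df.columns:
        if col == label:
            continue  # the target is not a feature
        series = df[col]
        if isinstance(series.dtype, pd.CategoricalDtype):
            # pandas categorical: the categories are the vocabulary
            plan.append((col, "vocabulary", list(series.cat.categories)))
        elif ptypes.is_numeric_dtype(series):
            plan.append((col, "numeric", None))
        elif ptypes.is_string_dtype(series):
            # one hash bucket per distinct value, mirroring the loop above
            plan.append((col, "hash_bucket", len(series.unique())))
        # any other dtype (for example datetimes) is skipped
    return plan


def build_feature_columns(plan):
    """Turn a plan into tf.feature_column objects (imports TensorFlow lazily)."""
    import tensorflow as tf

    fc = tf.feature_column
    columns = []
    for name, kind, detail in plan:
        if kind == "numeric":
            columns.append(fc.numeric_column(name))
        elif kind == "hash_bucket":
            columns.append(fc.categorical_column_with_hash_bucket(
                name, hash_bucket_size=detail))
        elif kind == "vocabulary":
            columns.append(fc.categorical_column_with_vocabulary_list(
                name, vocabulary_list=detail))
    return columns
```

Calling `build_feature_columns(plan_feature_columns(df, label="income"))` would then give a list that can be passed as `feature_columns`; `"income"` is a made-up label name here.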
greeness answered Oct 10 '22 03:10


I used your own code with one small edit (the loop should append to my_columns, not my_column). Here it is in the form that worked for me:

import pandas.api.types as ptypes
import tensorflow as tf

my_columns = []

for col in df.columns:
  if ptypes.is_string_dtype(df[col]):  # is_string_dtype is a pandas function
    my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
        col, hash_bucket_size=len(df[col].unique())))

  elif ptypes.is_numeric_dtype(df[col]):  # is_numeric_dtype is a pandas function
    my_columns.append(tf.feature_column.numeric_column(col))
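
One thing worth checking with `hash_bucket_size=len(df[col].unique())`: hashing gives no guarantee that distinct values land in distinct buckets. Under the idealised assumption that each value falls into a uniformly random bucket (TensorFlow's real hash is deterministic, so actual counts will differ), the expected overlap can be estimated with plain Python. The function names are made up for this illustration.

```python
def expected_occupied_buckets(n_values, n_buckets):
    """Expected number of non-empty buckets when n_values distinct keys
    are each placed into one of n_buckets uniformly at random."""
    return n_buckets * (1.0 - (1.0 - 1.0 / n_buckets) ** n_values)


def expected_lost_fraction(n_values, n_buckets):
    """Expected share of distinct values that do not get a new bucket,
    i.e. (n_values - occupied buckets) / n_values."""
    return 1.0 - expected_occupied_buckets(n_values, n_buckets) / n_values


# Same number of buckets as distinct values, as in the loop above.
same_size = expected_lost_fraction(1000, 1000)

# Ten times as many buckets as distinct values.
ten_times = expected_lost_fraction(1000, 10000)

print(f"buckets == values:      {same_size:.1%} of values merged")
print(f"buckets == 10 * values: {ten_times:.1%} of values merged")
```

With as many buckets as values, this model predicts that about 37% of the distinct values are absorbed into a bucket already used by another value. A bucket count several times larger brings that down considerably, at the cost of a wider feature.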
Maxim Zh answered Oct 10 '22 04:10