I am having some trouble deciding how to create embeddings for a categorical feature for my DNN model. The feature consists of a non-fixed set of tags.
The feature looks like this:
column = [['Adventure', 'Animation', 'Comedy'],
          ['Adventure', 'Comedy'],
          ['Adventure', 'Children', 'Comedy']]
I would like to do this with TensorFlow, so I know the tf.feature_column module should work; I just don't know which of its functions to use.
Thanks!
Categorical data refers to input features that represent one or more discrete items from a finite set of choices. For example, it can be the set of movies a user has watched, the set of words in a document, or the occupation of a person.
Step 1: Create a dictionary with each category as the key and its rank as the value. Step 2: Create a new column by mapping the original column through that dictionary. Step 3: Drop the original column. A sketch of this approach is shown below.
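A minimal sketch of those three steps, assuming a pandas DataFrame with a single-valued 'genre' column (the column name, DataFrame, and rank values here are illustrative, not from the question; note this gives ordinal codes rather than learned embeddings, and the question's feature actually holds a list of tags per row):
import pandas as pd

df = pd.DataFrame({'genre': ['Adventure', 'Comedy', 'Animation']})

# Step 1: map each category to an integer rank
rank = {'Adventure': 0, 'Animation': 1, 'Comedy': 2, 'Children': 3}

# Step 2: create a new column by mapping the original column through the dictionary
df['genre_rank'] = df['genre'].map(rank)

# Step 3: drop the original column
df = df.drop(columns=['genre'])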
First you need to pad your features to the same length.
import itertools
import numpy as np

# Pad every row with 'UNK' so that all rows have the same number of tags.
column = np.array(list(itertools.zip_longest(*column, fillvalue='UNK'))).T
print(column)
[['Adventure' 'Animation' 'Comedy']
['Adventure' 'Comedy' 'UNK']
['Adventure' 'Children' 'Comedy']]
Then you can use tf.feature_column.embedding_column to create embeddings for the categorical feature. The input of embedding_column must be a CategoricalColumn created by one of the categorical_column_* functions.
import tensorflow as tf  # this answer uses the TF 1.x feature_column / Session API

# If you have a big vocabulary stored in files, you can use
# tf.feature_column.categorical_column_with_vocabulary_file instead.
cat_fc = tf.feature_column.categorical_column_with_vocabulary_list(
    'cat_data',                                       # key identifying the input feature
    ['Adventure', 'Animation', 'Comedy', 'Children'], # vocabulary list
    dtype=tf.string,
    default_value=-1)

cat_column = tf.feature_column.embedding_column(
    categorical_column=cat_fc,
    dimension=5,
    combiner='mean')
categorical_column_with_vocabulary_list will ignore the 'UNK' padding value, since 'UNK' is not in the vocabulary list. dimension specifies the dimension of the embedding, and combiner specifies how multiple entries in a single row are reduced; 'mean' is the default in embedding_column.
The result (the embedding weights are randomly initialized, so the exact values will differ from run to run):
tensor = tf.feature_column.input_layer({'cat_data': column}, [cat_column])
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    print(session.run(tensor))
[[-0.694761 -0.0711766 0.05720187 0.01770079 -0.09884425]
[-0.8362482 0.11640486 -0.01767573 -0.00548441 -0.05738768]
[-0.71162754 -0.03012567 0.15568805 0.00752804 -0.1422816 ]]
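If you would rather give the 'UNK' padding value (and any unseen tag) its own learned embedding instead of silently dropping it, a sketch using the num_oov_buckets parameter of categorical_column_with_vocabulary_list looks like this (the variable names cat_fc_oov and cat_column_oov are my own, not from the answer above; num_oov_buckets and default_value cannot both be set explicitly):
cat_fc_oov = tf.feature_column.categorical_column_with_vocabulary_list(
    'cat_data',
    ['Adventure', 'Animation', 'Comedy', 'Children'],
    dtype=tf.string,
    num_oov_buckets=1)  # 'UNK' and any other out-of-vocabulary tag map to this extra bucket

cat_column_oov = tf.feature_column.embedding_column(
    categorical_column=cat_fc_oov,
    dimension=5,
    combiner='mean')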