How to create embeddings for a column that is a list of categorical values

I am having trouble deciding how to create embeddings for a categorical feature in my DNN model. Each value of the feature is a variable-length list of tags.

The feature is like:

column = [['Adventure','Animation','Comedy'],
          ['Adventure','Comedy'],
          ['Adventure','Children','Comedy']]

I would like to do this with TensorFlow. I know the tf.feature_column module should work, but I don't know which variant to use.

Thanks!

831

asked May 12 '19 12:05

jiwidi

1 Answer

First, pad the rows so that every list of tags has the same length.

import itertools
import numpy as np

column = np.array(list(itertools.zip_longest(*column, fillvalue='UNK'))).T
print(column)

[['Adventure' 'Animation' 'Comedy']
 ['Adventure' 'Comedy' 'UNK']
 ['Adventure' 'Children' 'Comedy']]
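
The zip_longest call above transposes the rows while padding the short ones, and .T transposes the result back. As a more explicit sketch of the same padding step, using only the standard library (the helper name pad_rows is hypothetical and introduced here for illustration):

```python
def pad_rows(rows, fillvalue='UNK'):
    """Pad every row with fillvalue up to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [list(row) + [fillvalue] * (width - len(row)) for row in rows]


column = [['Adventure', 'Animation', 'Comedy'],
          ['Adventure', 'Comedy'],
          ['Adventure', 'Children', 'Comedy']]

padded = pad_rows(column)
for row in padded:
    print(row)
```

The padded list can be passed to np.array(...) to get the same rectangular string array as above.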

Then you can use tf.feature_column.embedding_column to create embeddings for the categorical feature. The input to embedding_column must be a CategoricalColumn created by one of the categorical_column_* functions.

import tensorflow as tf

# If the vocabulary is large and stored in a file, you can use
# tf.feature_column.categorical_column_with_vocabulary_file instead.
cat_fc = tf.feature_column.categorical_column_with_vocabulary_list(
    'cat_data',  # key identifying the input feature
    ['Adventure', 'Animation', 'Comedy', 'Children'],  # vocabulary list
    dtype=tf.string,
    default_value=-1)  # ID assigned to out-of-vocabulary values

cat_column = tf.feature_column.embedding_column(
    categorical_column=cat_fc,
    dimension=5,
    combiner='mean')

categorical_column_with_vocabulary_list ignores 'UNK' because it is not in the vocabulary list, so the padding does not affect the result. dimension specifies the size of each embedding vector, and combiner specifies how to reduce multiple entries in a single row to one vector; 'mean' is the default in embedding_column.
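
To make the role of combiner='mean' concrete, here is a minimal pure-Python sketch of the idea, not TensorFlow's actual implementation. The embedding_table is a hypothetical random stand-in for the trainable weights, the helper embed_row is introduced only for illustration, and returning a zero vector for a row with no known tags is an assumption made for this sketch.

```python
import random

vocabulary = ['Adventure', 'Animation', 'Comedy', 'Children']
dimension = 5

# Hypothetical stand-in for the trainable embedding table: one random
# vector of length `dimension` per vocabulary entry.
random.seed(0)
embedding_table = [[random.gauss(0.0, 1.0) for _ in range(dimension)]
                   for _ in vocabulary]


def embed_row(tags):
    """Average the embedding vectors of all in-vocabulary tags in one row."""
    ids = [vocabulary.index(tag) for tag in tags if tag in vocabulary]
    if not ids:
        # Assumption for this sketch: a row with no known tag maps to zeros.
        return [0.0] * dimension
    return [sum(embedding_table[i][d] for i in ids) / len(ids)
            for d in range(dimension)]


padded_row = embed_row(['Adventure', 'Comedy', 'UNK'])
unpadded_row = embed_row(['Adventure', 'Comedy'])
print(padded_row == unpadded_row)  # True: the 'UNK' padding is skipped
```

Because out-of-vocabulary tags are dropped before averaging, the padded and unpadded rows produce the same vector, which is why padding with 'UNK' is harmless here.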

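The result (the exact values will differ between runs, because the embedding weights are randomly initialized):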

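# Build the dense input tensor from the feature dict (TF 1.x graph-mode API).
tensor = tf.feature_column.input_layer({'cat_data': column}, [cat_column])

with tf.Session() as session:
    # Initialize the embedding weights and the vocabulary lookup table.
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    print(session.run(tensor))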

[[-0.694761   -0.0711766   0.05720187  0.01770079 -0.09884425]
 [-0.8362482   0.11640486 -0.01767573 -0.00548441 -0.05738768]
 [-0.71162754 -0.03012567  0.15568805  0.00752804 -0.1422816 ]]

133

answered Oct 11 '22 15:10

giser_yugang
