
How to create embeddings for a column that is a list of categorical values

I am having trouble deciding how to create embeddings for a categorical feature in my DNN model. Each value of the feature is a variable-length list of tags.

The feature looks like this:

column = [['Adventure','Animation','Comedy'],
          ['Adventure','Comedy'],
          ['Adventure','Children','Comedy']]

I would like to do this with TensorFlow. I know the tf.feature_column module should work, but I don't know which of its functions to use.

Thanks!

asked May 12 '19 at 12:05 by jiwidi




1 Answer

First, you need to pad every row of the feature to the same length.

import itertools
import numpy as np

column = np.array(list(itertools.zip_longest(*column, fillvalue='UNK'))).T
print(column)

[['Adventure' 'Animation' 'Comedy']
 ['Adventure' 'Comedy' 'UNK']
 ['Adventure' 'Children' 'Comedy']]
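
To see what the zip_longest call above is doing, here is a minimal pure-Python sketch of the same padding step, split into two stages. The names by_position and padded are only for illustration and are not part of the original answer.

```python
from itertools import zip_longest

column = [['Adventure', 'Animation', 'Comedy'],
          ['Adventure', 'Comedy'],
          ['Adventure', 'Children', 'Comedy']]

# Stage 1: zip_longest groups the rows position by position and fills the
# gaps left by shorter rows with 'UNK'. The result is the transposed data.
by_position = list(zip_longest(*column, fillvalue='UNK'))

# Stage 2: transpose back so each entry is one (now padded) row again.
padded = [list(row) for row in zip(*by_position)]

for row in padded:
    print(row)
```

Every row now has the length of the longest row, which is what lets NumPy build a regular 2-D array from it.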

Then you can use tf.feature_column.embedding_column to create embeddings for the categorical feature. The input to embedding_column must be a CategoricalColumn created by one of the categorical_column_* functions.

import tensorflow as tf

# If you have a big vocabulary list stored in a file, you can use
# tf.feature_column.categorical_column_with_vocabulary_file instead
cat_fc = tf.feature_column.categorical_column_with_vocabulary_list(
    'cat_data',  # key identifying the input feature
    ['Adventure', 'Animation', 'Comedy', 'Children'],  # vocabulary list
    dtype=tf.string,
    default_value=-1)  # id assigned to out-of-vocabulary values such as 'UNK'

cat_column = tf.feature_column.embedding_column(
    categorical_column=cat_fc,
    dimension=5,
    combiner='mean')

categorical_column_with_vocabulary_list maps 'UNK' to the default_value of -1 because 'UNK' is not in the vocabulary list, and entries with a negative id are ignored, so the padding does not contribute to the embedding. dimension specifies the size of each embedding vector. combiner specifies how to reduce multiple entries in a single row to one vector; 'mean' is the default in embedding_column.
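
The lookup and the 'mean' combiner can be illustrated without TensorFlow. The following is only a rough sketch of the idea, not TensorFlow's implementation: the 2-dimensional embedding table and its values are made up for the example, and the helper names lookup_ids and mean_embedding are hypothetical. The zero vector returned for a row with no known tags is likewise a choice made for this sketch.

```python
vocab = ['Adventure', 'Animation', 'Comedy', 'Children']
DEFAULT_ID = -1  # plays the role of default_value=-1

# Made-up 2-dimensional embedding table, one row per vocabulary entry.
# Real embeddings are randomly initialized and learned during training.
table = [[1.0, 0.0],   # Adventure
         [0.0, 1.0],   # Animation
         [2.0, 2.0],   # Comedy
         [4.0, 0.0]]   # Children

def lookup_ids(tags):
    """Map each tag to its vocabulary index, or DEFAULT_ID if unknown."""
    return [vocab.index(t) if t in vocab else DEFAULT_ID for t in tags]

def mean_embedding(tags):
    """Average the embeddings of the known tags; negative ids are skipped."""
    ids = [i for i in lookup_ids(tags) if i >= 0]
    dim = len(table[0])
    if not ids:
        # Assumption of this sketch: a row with no known tags yields zeros.
        return [0.0] * dim
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]

row = ['Adventure', 'Comedy', 'UNK']
print(lookup_ids(row))      # [0, 2, -1]
print(mean_embedding(row))  # [1.5, 1.0]
```

Because 'UNK' is dropped before averaging, the padded row gets the same vector it would have had without padding.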

The result (the exact values will differ between runs because the embedding weights are randomly initialized):

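# Build the dense input tensor from the padded string array (TF 1.x graph mode)
tensor = tf.feature_column.input_layer({'cat_data': column}, [cat_column])

with tf.Session() as session:
    # Initialize the embedding weights and the vocabulary lookup table
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    print(session.run(tensor))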

[[-0.694761   -0.0711766   0.05720187  0.01770079 -0.09884425]
 [-0.8362482   0.11640486 -0.01767573 -0.00548441 -0.05738768]
 [-0.71162754 -0.03012567  0.15568805  0.00752804 -0.1422816 ]]

answered Oct 11 '22 at 15:10 by giser_yugang