I have been working with datasets and feature_columns in TensorFlow (https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html). I see they have categorical features and a way to create embedding features from categorical features. But when working on NLP tasks, how do we create a single embedding lookup?
For example, consider a text classification task. Every data point would have a lot of textual columns, but they would not be separate categories. How do we create and use a single embedding lookup for all these columns?
Below is an example of how I am currently using the embedding features. I am building a categorical feature for each column and using that to create an embedding. The problem is that the embeddings for the same word could be different for different columns.
import tensorflow as tf

def create_embedding_features(key, vocab_list=None, embedding_size=20):
    # One categorical column (and hence one separate embedding table) per text column.
    cat_feature = tf.feature_column.categorical_column_with_vocabulary_list(
        key=key,
        vocabulary_list=vocab_list)
    embedding_feature = tf.feature_column.embedding_column(
        categorical_column=cat_feature,
        dimension=embedding_size)
    return embedding_feature

le_features_embd = [create_embedding_features(f, vocab_list=vocab_list)
                    for f in feature_keys]
I think you have some misunderstanding. For a text classification task, if your input is a piece of text (a sentence), you should treat the entire sentence as a single feature column. Thus every data point has only a single textual column, NOT a lot of columns. The value in this column is usually a combined embedding of all the tokens. That is how we convert a variable-length sparse feature (an unknown number of text tokens) into one dense feature (e.g., a fixed 256-dimensional float vector).
Let's start with a _CategoricalColumn.
cat_column_with_vocab = tf.feature_column.categorical_column_with_vocabulary_list(
    key='my-text',
    vocabulary_list=vocab_list)
Note that if your vocabulary is huge, you should use categorical_column_with_vocabulary_file instead.
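For example, a minimal sketch of the file-based variant; the file name vocab.txt and the vocabulary size are assumptions, and the file should contain one token per line:

cat_column_with_vocab = tf.feature_column.categorical_column_with_vocabulary_file(
    key='my-text',
    vocabulary_file='vocab.txt',  # hypothetical path, one token per line
    vocabulary_size=50000)        # optional: number of lines to read from the file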
We create an embedding column by using an initializer that either reads from a checkpoint (if we have pre-trained embeddings) or initializes randomly.
embedding_initializer = None
if has_pretrained_embedding:
    # Load pre-trained embeddings from a checkpoint.
    embedding_initializer = tf.contrib.framework.load_embedding_initializer(
        ckpt_path=xxxx)
else:
    embedding_initializer = tf.random_uniform_initializer(-1.0, 1.0)

embed_column = tf.feature_column.embedding_column(
    categorical_column=cat_column_with_vocab,
    dimension=256,                     # this is your pre-trained embedding dimension
    initializer=embedding_initializer,
    trainable=False)                   # keep the pre-trained embeddings frozen
Suppose you have another dense feature, price:
price_column = tf.feature_column.numeric_column('price')
Create your feature columns:
columns = [embed_column, price_column]
Build the model:
features = tf.parse_example(...,
                            features=tf.feature_column.make_parse_example_spec(columns))
dense_tensor = tf.feature_column.input_layer(features, columns)
for units in [128, 64, 32]:
    dense_tensor = tf.layers.dense(dense_tensor, units, tf.nn.relu)
prediction = tf.layers.dense(dense_tensor, 1)
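The first argument to tf.parse_example (elided above) is a batch of serialized tf.Example protos. A minimal sketch of an input pipeline that could supply them, assuming the records live in a hypothetical TFRecord file train.tfrecord and a batch size of 32:

def input_fn():
    # Read serialized tf.Example records and batch them.
    dataset = tf.data.TFRecordDataset('train.tfrecord')
    dataset = dataset.batch(32)
    serialized = dataset.make_one_shot_iterator().get_next()
    # Parse them with the spec derived from the same feature columns.
    features = tf.parse_example(
        serialized,
        features=tf.feature_column.make_parse_example_spec(columns))
    return features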
By the way, for tf.parse_example to work, this assumes your input data consists of tf.Example protos like this (text protobuf):
features {
  feature {
    key: "price"
    value { float_list {
      value: 29.0
    }}
  }
  feature {
    key: "my-text"
    value { bytes_list {
      value: "this"
      value: "product"
      value: "is"
      value: "for sale"
      value: "within"
      value: "us"
    }}
  }
}
That is, I assume you have two feature types: one is the product price, and the other is the text description of the product. Your vocabulary list would be a superset of
["this", "product", "is", "for sale", "within", "us"].