Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tensorflow pad sequence feature column

How to pad sequences in the feature column and also what is a dimension in the feature_column.

I am using Tensorflow 2.0 and implementing an example of text summarization. Pretty new to machine learning, deep learning, and TensorFlow.

I came across feature_column and found them useful as I think they can be embedded in the processing pipeline of the model.

In a classic scenario where not using feature_column, I can pre-process the text, tokenize it, convert it into a sequence of numbers and then pad them to a maxlen of say 100 words. I am not able to get this done when using the feature_column.

Below is what I have written sofar.


train_dataset = tf.data.experimental.make_csv_dataset(
    'assets/train_dataset.csv', label_name=LABEL, num_epochs=1, shuffle=True, shuffle_buffer_size=10000, batch_size=1, ignore_errors=True)

vocabulary = ds.get_vocabulary()

def text_demo(feature_column):
    feature_layer = tf.keras.experimental.SequenceFeatures(feature_column)
    article, _ = next(iter(train_dataset.take(1)))

    tokenizer = tf_text.WhitespaceTokenizer()

    tokenized = tokenizer.tokenize(article['Text'])

    sequence_input, sequence_length = feature_layer({'Text':tokenized.to_tensor()})

    print(sequence_input)

def categorical_column(feature_column):
    dense_column = tf.keras.layers.DenseFeatures(feature_column)

    article, _ = next(iter(train_dataset.take(1)))

    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='')
    lang_tokenizer.fit_on_texts(article)

    tensor = lang_tokenizer.texts_to_sequences(article)

    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                         padding='post', maxlen=50)

    print(dense_column(tensor).numpy())


text_seq_vocab_list = tf.feature_column.sequence_categorical_column_with_vocabulary_list(key='Text', vocabulary_list=list(vocabulary))
text_embedding = tf.feature_column.embedding_column(text_seq_vocab_list, dimension=8)
text_demo(text_embedding)

numerical_voacb_list = tf.feature_column.categorical_column_with_vocabulary_list(key='Text', vocabulary_list=list(vocabulary))
embedding = tf.feature_column.embedding_column(numerical_voacb_list, dimension=8)
categorical_column(embedding)

I am also confused as to what to use here, sequence_categorical_column_with_vocabulary_list or categorical_column_with_vocabulary_list. In the documentation, SequenceFeatures is also not explained, all though I know it is an experimental feature.

I am also not able to understand what does dimension param do?

like image 689
ghost Avatar asked Dec 23 '22 22:12

ghost


1 Answers

Actually, this

I am also confused as to what to use here, sequence_categorical_column_with_vocabulary_list or categorical_column_with_vocabulary_list.

should be the first question, because it affects interpretation the one from the topic name.

Also it is not exactly clear what do you mean on text summarization. What type of model\layers you are going to pass the processed texts into?

By the way, it is important, because tf.keras.layers.DenseFeatures and tf.keras.experimental.SequenceFeatures is suppoused for the different networks architecture and approaches.

As documentation for SequenceFeatures layer says the outputs of SequenceFeatures layers is supposed to be fed into sequence networks, such as RNN.

And DenseFeatures produces a dense Tensor as an output and so suits for other types of networks.

As you perform tokenization in your code snippet, you are going to use embedding in your model. Then you have two options:

  1. pass learned embeddings forward into Dense layers. This means you will not analyze words order.
  2. pass learned embeddings into Convolution, Reccurent, AveragePooling, LSTM layers and so use the word order to learn as well

The first option would require to use:

  • The tf.keras.layers.DenseFeatures with
  • one of tf.feature_column.categorical_column_*()
  • and tf.feature_column.embedding_column()

The second option would require to use:

  • The tf.keras.experimental.SequenceFeatures with
  • one of tf.feature_column.sequence_categorical_column_*()
  • and tf.feature_column.embedding_column()

Here are examples. The preprocessing and training part are the same for both options:

import tensorflow as tf
print(tf.__version__)

from tensorflow import feature_column

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import text_to_word_sequence
import tensorflow.keras.utils as ku
from tensorflow.keras.utils import plot_model

import pandas as pd
from sklearn.model_selection import train_test_split

DATA_PATH = 'C:\SoloLearnMachineLearning\Stackoverflow\TextDataset.csv'

#it is just two column csv, like:
# text;label
# A wiki is run using wiki software;0
# otherwise known as a wiki engine.;1

dataframe = pd.read_csv(DATA_PATH, delimiter = ';')
dataframe.head()

# Preprocessing before feature_clolumn includes
# - getting the vocabulary
# - tokenization, which means only splitting on tokens.
#   Encoding sentences with vocablary will be done by feature_column!
# - padding
# - truncating

# Build vacabulary
vocab_size = 100
oov_tok = '<OOV>'

sentences = dataframe['text'].to_list()

tokenizer = Tokenizer(num_words = vocab_size, oov_token="<OOV>")

tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# if word_index shorter then default value of vocab_size we'll save actual size
vocab_size=len(word_index)
print("vocab_size = word_index = ",len(word_index))

# Split sentensec on tokens. here token = word
# text_to_word_sequence() has good default filter for 
# charachters include basic punctuation, tabs, and newlines
dataframe['text'] = dataframe['text'].apply(text_to_word_sequence)

dataframe.head()

max_length = 6

# paddind and trancating setnences
# do that directly with strings without using tokenizer.texts_to_sequences()
# the feature_colunm will convert strings into numbers
dataframe['text']=dataframe['text'].apply(lambda x, N=max_length: (x + N * [''])[:N])
dataframe['text']=dataframe['text'].apply(lambda x, N=max_length: x[:N])
dataframe.head()

# Define method to create tf.data dataset from Pandas Dataframe
def df_to_dataset(dataframe, label_column, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    #labels = dataframe.pop(label_column)
    labels = dataframe[label_column]

    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds

# Split dataframe into train and validation sets
train_df, val_df = train_test_split(dataframe, test_size=0.2)

print(len(train_df), 'train examples')
print(len(val_df), 'validation examples')

batch_size = 32
ds = df_to_dataset(dataframe, 'label',shuffle=False,batch_size=batch_size)

train_ds = df_to_dataset(train_df, 'label',  shuffle=False, batch_size=batch_size)
val_ds = df_to_dataset(val_df, 'label', shuffle=False, batch_size=batch_size)

# and small batch for demo
example_batch = next(iter(ds))[0]
example_batch

# Helper methods to print exxample outputs of for defined feature_column

def demo(feature_column):
    feature_layer = tf.keras.layers.DenseFeatures(feature_column)
    print(feature_layer(example_batch).numpy())

def seqdemo(feature_column):
    sequence_feature_layer = tf.keras.experimental.SequenceFeatures(feature_column)
    print(sequence_feature_layer(example_batch))

Here we come with the first option, when we do not use word order to learn

# Define categorical colunm for our text feature, 
# which is preprocessed into lists of tokens
# Note that key name should be the same as original column name in dataframe
text_column = feature_column.
            categorical_column_with_vocabulary_list(key='text', 
                                                    vocabulary_list=list(word_index))
#indicator_column produce one-hot-encoding. These lines just to compare with embedding
#print(demo(feature_column.indicator_column(payment_description_3)))
#print(payment_description_2,'\n')

# argument dimention here is exactly the dimension of the space in which tokens 
# will be presented during model's learning
# see the tutorial at https://www.tensorflow.org/beta/tutorials/text/word_embeddings
text_embedding = feature_column.embedding_column(text_column, dimension=8)
print(demo(text_embedding))

# The define the layers and model it self
# This example uses Keras Functional API instead of Sequential just for more generallity

# Define DenseFeatures layer to pass feature_columns into Keras model
feature_layer = tf.keras.layers.DenseFeatures(text_embedding)

# Define inputs for each feature column.
# See https://github.com/tensorflow/tensorflow/issues/27416#issuecomment-502218673
feature_layer_inputs = {}

# Here we have just one column
# Important to define tf.keras.Input with shape 
# corresponding to lentgh of our sequence of words
feature_layer_inputs['text'] = tf.keras.Input(shape=(max_length,),
                                              name='text',
                                              dtype=tf.string)
print(feature_layer_inputs)

# Define outputs of DenseFeatures layer 
# And accually use them as first layer of the model
feature_layer_outputs = feature_layer(feature_layer_inputs)
print(feature_layer_outputs)

# Add consequences layers.
# See https://keras.io/getting-started/functional-api-guide/
x = tf.keras.layers.Dense(256, activation='relu')(feature_layer_outputs)
x = tf.keras.layers.Dropout(0.2)(x)

# This example supposes binary classification, as labels are 0 or 1
x = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.models.Model(inputs=[v for v in feature_layer_inputs.values()],
                              outputs=x)

model.summary()

# This example supposes binary classification, as labels are 0 or 1
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy']
              #run_eagerly=True
             )

# Note that fit() method looking up features in train_ds and valdation_ds by name in 
# tf.keras.Input(shape=(max_length,), name='text'

# This model of cause will learn nothing because of fake data.

num_epochs = 5
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=num_epochs,
                    verbose=1
                    )

And the second option when we take care about words order and learn it our model.

# Define categorical colunm for our text feature, 
# which is preprocessed into lists of tokens
# Note that key name should be the same as original column name in dataframe
text_column = feature_column.
              sequence_categorical_column_with_vocabulary_list(key='text', 
                                                vocabulary_list=list(word_index))

# arguemnt dimention here is exactly the dimension of the space in 
# which tokens will be presented during model's learning
# see the tutorial at https://www.tensorflow.org/beta/tutorials/text/word_embeddings
text_embedding = feature_column.embedding_column(text_column, dimension=8)
print(seqdemo(text_embedding))

# The define the layers and model it self
# This example uses Keras Functional API instead of Sequential 
# just for more generallity

# Define SequenceFeatures layer to pass feature_columns into Keras model
sequence_feature_layer = tf.keras.experimental.SequenceFeatures(text_embedding)

# Define inputs for each feature column. See
# см. https://github.com/tensorflow/tensorflow/issues/27416#issuecomment-502218673
feature_layer_inputs = {}
sequence_feature_layer_inputs = {}

# Here we have just one column

sequence_feature_layer_inputs['text'] = tf.keras.Input(shape=(max_length,),
                                                       name='text',
                                                       dtype=tf.string)
print(sequence_feature_layer_inputs)

# Define outputs of SequenceFeatures layer 
# And accually use them as first layer of the model

# Note here that SequenceFeatures layer produce tuple of two tensors as output.
# We need just first to pass next.
sequence_feature_layer_outputs, _ = sequence_feature_layer(sequence_feature_layer_inputs)
print(sequence_feature_layer_outputs)
# Add consequences layers. See https://keras.io/getting-started/functional-api-guide/

# Conv1D and MaxPooling1D will learn features from words order
x = tf.keras.layers.Conv1D(8,4)(sequence_feature_layer_outputs)
x = tf.keras.layers.MaxPooling1D(2)(x)
# Add consequences layers. See https://keras.io/getting-started/functional-api-guide/
x = tf.keras.layers.Dense(256, activation='relu')(x)
x = tf.keras.layers.Dropout(0.2)(x)

# This example supposes binary classification, as labels are 0 or 1
x = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.models.Model(inputs=[v for v in sequence_feature_layer_inputs.values()],
                              outputs=x)
model.summary()

# This example supposes binary classification, as labels are 0 or 1
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy']
              #run_eagerly=True
             )

# Note that fit() method looking up features in train_ds and valdation_ds by name in 
# tf.keras.Input(shape=(max_length,), name='text'

# This model of cause will learn nothing because of fake data.

num_epochs = 5
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=num_epochs,
                    verbose=1
                    )

Please find complete jupiter notebooks with this exapmles on my github:

  • Answer. Tensorflow pad sequence feature column. DenseFeatures.ipynb
  • Answer. Tensorflow pad sequence feature column. SequenceFeatures.ipynb

Argument dimention in feature_column.embedding_column() is exactly the dimension of the space in which tokens will be presented during model's learning. See the tutorial at https://www.tensorflow.org/beta/tutorials/text/word_embeddings for detailed explanation

Also note that using feature_column.embedding_column() is an alternative to tf.keras.layers.Embedding(). As you see feature_column make encoding step from a preprocessing pipeline, but you stil should manually do splitting, padding and trancation of sentences.

like image 102
Egor B Eremeev Avatar answered Jan 09 '23 02:01

Egor B Eremeev