I'm honestly trying to figure out how to convert a dataset (format: pandas DataFrame or numpy array) to a form that a simple text-classification TensorFlow model can train on for sentiment analysis. The dataset I'm using is similar to IMDB (it contains both text and labels, positive or negative). Every tutorial I've looked at has either prepared the data differently or skipped data preparation altogether and left it to your imagination. (For instance, all the IMDB tutorials import a preprocessed TensorFlow BatchDataset from tensorflow_datasets, which isn't helpful when I'm using my own set of data.) My own attempts to convert a pandas DataFrame to TensorFlow's Dataset types have resulted in ValueErrors or a negative loss during training. Any help would be appreciated.
I had originally prepared my data as follows, where training and validation are already-shuffled pandas DataFrames containing text and label columns:
# IMPORT STUFF
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf  # (I'm using TensorFlow 2.0)
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
import pandas as pd
import numpy as np
# ... [code for importing and preparing the pandas dataframe omitted]
BUFFER_SIZE = 10000  # example values; the real ones are defined in the omitted code
BATCH_SIZE = 64
# TOKENIZE
train_text = training['text'].to_numpy()
tok = Tokenizer(oov_token='<unk>')
tok.fit_on_texts(train_text)
tok.word_index['<pad>'] = 0
tok.index_word[0] = '<pad>'
train_seqs = tok.texts_to_sequences(train_text)
train_seqs = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
train_labels = training['label'].to_numpy().flatten()
valid_text = validation['text'].to_numpy()
valid_seqs = tok.texts_to_sequences(valid_text)
valid_seqs = tf.keras.preprocessing.sequence.pad_sequences(valid_seqs, padding='post')
valid_labels = validation['label'].to_numpy().flatten()
# CONVERT TO TF DATASETS
train_ds = tf.data.Dataset.from_tensor_slices((train_seqs,train_labels))
valid_ds = tf.data.Dataset.from_tensor_slices((valid_seqs,valid_labels))
train_ds = train_ds.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
valid_ds = valid_ds.batch(BATCH_SIZE)
# PREFETCH
train_ds = train_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
valid_ds = valid_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
This resulted in train_ds and valid_ds being tokenized and of type PrefetchDataset, i.e. <PrefetchDataset shapes: ((None, None, None, 118), (None, None, None)), types: (tf.int32, tf.int64)>.
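(For reference, the types/shapes above came from just printing the datasets; one can also inspect element_spec, which I believe is available on tf.data datasets in TF 2.x, or pull a single batch to eyeball the actual shapes:)
# Print the (shape, dtype) structure tf.data reports for each batch
print(train_ds.element_spec)
# Take one concrete batch and look at the actual shapes and label values
for seqs, labels in train_ds.take(1):
    print(seqs.shape, labels.shape, labels.numpy()[:5])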
I then trained as follows, but got a large negative loss and an accuracy of 0.
model = keras.Sequential([
    layers.Embedding(vocab_size, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation='sigmoid')  # also tried activation='softmax'
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(
    train_ds,
    epochs=1,
    validation_data=valid_ds, validation_steps=1, steps_per_epoch=BUFFER_SIZE)
If I don't do the fancy prefetch stuff, train_ds is instead of type BatchDataset, i.e. <BatchDataset shapes: ((None, 118), (None,)), types: (tf.int32, tf.int64)>, but that also gets me a negative loss and an accuracy of 0.
And if I just do the following:
x, y = training['text'].to_numpy(), training['label'].to_numpy()
x, y = tf.convert_to_tensor(x), tf.convert_to_tensor(y)
then x and y are of type EagerTensor, but I can't seem to figure out how to batch an EagerTensor.
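(I suspect one could batch them by wrapping the tensors back into a tf.data pipeline; a minimal sketch, assuming the x and y from above:)
# from_tensor_slices accepts eager tensors just like numpy arrays,
# and the resulting Dataset supports batching
ds = tf.data.Dataset.from_tensor_slices((x, y))
ds = ds.batch(32)  # 32 is an arbitrary example batch size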
What types and shapes do I really need for train_ds? What am I missing or doing wrong?
The text_classification_with_hub tutorial trains on an already-prepared IMDB dataset, as shown:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)
In this example, train_data is of type tensorflow.python.data.ops.dataset_ops._OptionsDataset, and train_data.shuffle(10000).batch(512) is of type tensorflow.python.data.ops.dataset_ops.BatchDataset (i.e. <BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int64)>).
They apparently didn't bother with tokenization for this dataset, but I doubt tokenization is my issue. Why does their train_data.shuffle(10000).batch(512) work while my train_ds doesn't?
It's possible the issue is with the model setup, the Embedding layer, or with tokenization, but I'm not so sure that's the case. I've already looked at the following tutorials for inspiration:
https://www.tensorflow.org/tutorials/keras/text_classification_with_hub
https://www.kaggle.com/drscarlat/imdb-sentiment-analysis-keras-and-tensorflow
https://www.tensorflow.org/tutorials/text/image_captioning
https://www.tensorflow.org/tutorials/text/word_embeddings#learning_embeddings_from_scratch
https://thedatafrog.com/word-embedding-sentiment-analysis/
A DataFrame as an array: if your data has a uniform datatype, or dtype, it's possible to use a pandas DataFrame anywhere you could use a NumPy array. This works because the pandas.DataFrame class supports the __array__ protocol, and TensorFlow's tf.convert_to_tensor function accepts objects that support that protocol.
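A minimal sketch of that behavior, using a toy all-numeric DataFrame:
import pandas as pd
import tensorflow as tf

# Uniform dtype: every column is float64, so the DataFrame acts like one array
df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
t = tf.convert_to_tensor(df)  # works via the __array__ protocol
print(t.shape, t.dtype)       # (2, 2) <dtype: 'float64'>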
UPDATE: I figured out that the issue was that I neglected to convert my target labels to 0 and 1 for binary cross-entropy. The problem had nothing to do with converting to a TensorFlow Dataset type; my code above works fine for that.
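For anyone hitting the same thing, the missing step was roughly this (a sketch assuming the labels are the strings 'positive'/'negative'; adjust the comparison to whatever your label column actually contains):
# Map string labels to the 0/1 integers that binary_crossentropy expects
training['label'] = (training['label'] == 'positive').astype(int)
validation['label'] = (validation['label'] == 'positive').astype(int)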