
How do I go from Pandas DataFrame to Tensorflow BatchDataset for NLP?

I'm honestly trying to figure out how to convert a dataset (format: pandas DataFrame or numpy array) into a form that a simple text-classification TensorFlow model can train on for sentiment analysis. The dataset I'm using is similar to IMDB: it contains both text and labels (positive or negative). Every tutorial I've looked at has either prepared the data differently or skipped data preparation entirely and left it to your imagination. (For instance, all the IMDB tutorials import a preprocessed TensorFlow BatchDataset from tensorflow_datasets, which isn't helpful when I'm using my own data.) My own attempts to convert a pandas DataFrame to TensorFlow's Dataset types have resulted in ValueErrors or a negative loss during training. Any help would be appreciated.

I had originally prepared my data as follows, where training and validation are already shuffled Pandas DataFrames containing text and label columns:

# IMPORT STUFF

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf # (I'm using tensorflow 2.0)
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
import pandas as pd
import numpy as np
# ... [code for importing and preparing the pandas dataframe omitted]

# TOKENIZE

train_text = training['text'].to_numpy()
tok = Tokenizer(oov_token='<unk>')
tok.fit_on_texts(train_text)
tok.word_index['<pad>'] = 0
tok.index_word[0] = '<pad>'
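# (Keras' Tokenizer starts word indices at 1, so index 0 is free to use for padding)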

train_seqs = tok.texts_to_sequences(train_text)
train_seqs = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
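# train_seqs is now a 2-D int32 numpy array of shape (num_examples, max_len)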

train_labels = training['label'].to_numpy().flatten()

valid_text = validation['text'].to_numpy()
valid_seqs = tok.texts_to_sequences(valid_text)
valid_seqs = tf.keras.preprocessing.sequence.pad_sequences(valid_seqs, padding='post')

valid_labels = validation['label'].to_numpy().flatten()

# CONVERT TO TF DATASETS

train_ds = tf.data.Dataset.from_tensor_slices((train_seqs, train_labels))
valid_ds = tf.data.Dataset.from_tensor_slices((valid_seqs, valid_labels))

train_ds = train_ds.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
valid_ds = valid_ds.batch(BATCH_SIZE)
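# (BUFFER_SIZE and BATCH_SIZE are defined elsewhere; shuffling is only
#  complete if BUFFER_SIZE >= the number of training examples)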

# PREFETCH
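# AUTOTUNE lets tf.data choose the prefetch buffer size at runtime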

train_ds = train_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
valid_ds = valid_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

This resulted in train_ds and valid_ds being tokenized and of type PrefetchDataset, printed as <PrefetchDataset shapes: ((None, None, None, 118), (None, None, None)), types: (tf.int32, tf.int64)>.
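For a model like mine (Embedding → GlobalAveragePooling1D → Dense), I'd expect each batch to be a pair of a (batch_size, seq_len) integer tensor and a (batch_size,) label tensor, so those extra None dimensions already look suspicious to me. A quick way to check what a pipeline actually yields (element_spec should be available on any tf.data.Dataset in TF 2.0):

# Print the structure of one element of the (batched) dataset
print(train_ds.element_spec)
# Roughly what I'd expect for this model:
# (TensorSpec(shape=(None, 118), dtype=tf.int32),
#  TensorSpec(shape=(None,), dtype=tf.int64))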

I then trained as follows, but got a large negative loss and an accuracy of 0.

model = keras.Sequential([
    layers.Embedding(vocab_size, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation='sigmoid') # also tried activation='softmax'
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(
    train_ds,
    epochs=1,
    validation_data=valid_ds,
    validation_steps=1,
    steps_per_epoch=BUFFER_SIZE)

If I skip the prefetch step, train_ds is of type BatchDataset, printed as <BatchDataset shapes: ((None, 118), (None,)), types: (tf.int32, tf.int64)>, but that also gives me a negative loss and an accuracy of 0.

And if I just do the following:

x, y = training['text'].to_numpy(), training['label'].to_numpy()
x, y = tf.convert_to_tensor(x), tf.convert_to_tensor(y)

then x and y are of type EagerTensor, but I can't seem to figure out how to batch an EagerTensor.
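As far as I can tell, the standard way to batch plain tensors like these is to wrap them back into tf.data, which just lands me back at the from_tensor_slices approach above:

# Wrap the eager tensors in a dataset so they can be shuffled and batched
xy_ds = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)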

What types and shapes do I really need for train_ds? What am I missing or doing wrong?

The text_classification_with_hub tutorial trains on an already-prepared IMDB dataset, as shown:

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)

In this example, train_data is of type tensorflow.python.data.ops.dataset_ops._OptionsDataset, and train_data.shuffle(10000).batch(512) is of type tensorflow.python.data.ops.dataset_ops.BatchDataset (printed as <BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int64)>).

They apparently didn't bother with tokenization for this dataset, but I doubt tokenization is my issue. Why does their train_data.shuffle(10000).batch(512) work when my train_ds doesn't?
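(For reference, the hub_layer in that tutorial maps raw strings straight to embedding vectors, which is presumably why no tokenization step is needed there. Quoting roughly from memory, so double-check against the tutorial:)

import tensorflow_hub as hub

# The TF Hub layer consumes a batch of raw tf.string tensors directly
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)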

It's possible the issue is with the model setup, the Embedding layer, or with tokenization, but I'm not sure that's the case. I've already looked at the following tutorials for inspiration:

https://www.tensorflow.org/tutorials/keras/text_classification_with_hub

https://www.kaggle.com/drscarlat/imdb-sentiment-analysis-keras-and-tensorflow

https://www.tensorflow.org/tutorials/text/image_captioning

https://www.tensorflow.org/tutorials/text/word_embeddings#learning_embeddings_from_scratch

https://thedatafrog.com/word-embedding-sentiment-analysis/

asked Oct 13 '19 by bug_spray



1 Answer

UPDATE: I figured out that the issue was that I had neglected to convert my target labels to 0 and 1 for binary cross-entropy. The problem had nothing to do with converting to a TensorFlow Dataset type; the code above works fine for that.
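For anyone who hits the same thing, here's a minimal sketch of the fix, assuming the label column holds strings like 'positive' / 'negative' (adjust the mapping to whatever your column actually contains):

# Map string labels to 0/1 before building the tf.data pipelines,
# so binary_crossentropy sees targets in {0, 1}
label_map = {'negative': 0, 'positive': 1}
training['label'] = training['label'].map(label_map)
validation['label'] = validation['label'].map(label_map)

train_labels = training['label'].to_numpy().astype('int64')
valid_labels = validation['label'].to_numpy().astype('int64')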

answered Sep 27 '22 by bug_spray