
Mapping text data through huggingface tokenizer

I have my encode function that looks like this:

from transformers import BertTokenizer, BertModel

MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)

def encode(texts, tokenizer=tokenizer, maxlen=10):
#     import pdb; pdb.set_trace()
    inputs = tokenizer.encode_plus(
        texts,
        return_tensors='tf',
        return_attention_masks=True, 
        return_token_type_ids=True,
        pad_to_max_length=True,
        max_length=maxlen
    )

    return inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]

I want to get my data encoded on the fly by doing this:

x_train = (tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values)
           .map(encode))

However, this chucks the error:

ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

When I set a breakpoint inside encode, my understanding was that the error occurs because the input I am sending is not a NumPy array. How do I get huggingface transformers to play nice with tensorflow strings as inputs?

If you need a dummy dataframe here it is:

df_train = pd.DataFrame({'comment_text': ['Today was a good day']*5})

What I tried

I tried to use from_generator so that I can pass the strings to the encode_plus function as plain Python strings. However, this does not work with TPUs.

AUTO = tf.data.experimental.AUTOTUNE

def get_gen(df):
    def gen():
        for i in range(len(df)):
            yield encode(df.loc[i, 'comment_text']) , df.loc[i, 'toxic']
    return gen

shapes = ((tf.TensorShape([maxlen]), tf.TensorShape([maxlen]), tf.TensorShape([maxlen])), tf.TensorShape([]))

train_dataset = tf.data.Dataset.from_generator(
    get_gen(df_train),
    ((tf.int32, tf.int32, tf.int32), tf.int32),
    shapes
)
train_dataset = train_dataset.batch(BATCH_SIZE).prefetch(AUTO)

Version Info:

transformers.__version__, tf.__version__ => ('2.7.0', '2.1.0')

sachinruk asked Mar 02 '23 12:03


2 Answers

The BERT tokenizer works on a string, a list/tuple of strings, or a list/tuple of integers, so first check whether your data is actually being converted to a string. To apply the tokenizer to the whole dataset I used Dataset.map, but that runs in graph mode, so I needed to wrap the call in tf.py_function. tf.py_function passes regular tensors (with a value and a .numpy() method to access it) to the wrapped Python function. After going through py_function my data arrived as bytes, so I applied tf.compat.as_str to convert the bytes to strings.

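import tensorflow as tf
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def encode(lang1, lang2):
    # Inside py_function the inputs are eager tensors: .numpy() yields bytes,
    # and tf.compat.as_str decodes them to Python strings.
    lang1 = tokenizer.encode(tf.compat.as_str(lang1.numpy()), add_special_tokens=True)
    lang2 = tokenizer.encode(tf.compat.as_str(lang2.numpy()), add_special_tokens=True)
    return lang1, lang2

def tf_encode(pt, en):
    # Wrap the Python function so it can be called from Dataset.map (graph mode).
    result_pt, result_en = tf.py_function(func=encode, inp=[pt, en], Tout=[tf.int64, tf.int64])
    # py_function drops static shapes, so declare both as variable-length vectors.
    result_pt.set_shape([None])
    result_en.set_shape([None])
    return result_pt, result_en

# dataset3 (not shown) is assumed to be a tf.data.Dataset of (pt, en) string pairs.
train_dataset = dataset3.map(tf_encode)

BUFFER_SIZE = 200
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE,
                                                           padded_shapes=(60, 60))
a, p = next(iter(train_dataset))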
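
For the single-text case in the question, the same idea might look roughly like the sketch below. To keep it runnable without downloading a model, toy_encode_plus is a hypothetical stand-in for tokenizer.encode_plus (it just emits one fake id per word), and MAXLEN, encode and tf_encode are names I made up for illustration. Swap in the real tokenizer call where the stand-in is used.

```python
import tensorflow as tf

MAXLEN = 10  # assumed fixed sequence length (hypothetical name)

def toy_encode_plus(text, maxlen=MAXLEN):
    """Hypothetical stand-in for tokenizer.encode_plus: one fake id per word,
    zero-padded or truncated to `maxlen`."""
    ids = [len(word) for word in text.split()][:maxlen]
    mask = [1] * len(ids)
    pad = maxlen - len(ids)
    return ids + [0] * pad, [0] * maxlen, mask + [0] * pad

def encode(text):
    # Inside tf.py_function the argument is an eager tensor, so .numpy()
    # works and returns bytes, which are decoded to a Python str here.
    text = tf.compat.as_str(text.numpy())
    return toy_encode_plus(text)

def tf_encode(text):
    ids, type_ids, mask = tf.py_function(
        func=encode, inp=[text], Tout=[tf.int32, tf.int32, tf.int32])
    # py_function loses static shape information, so restore it.
    for t in (ids, type_ids, mask):
        t.set_shape([MAXLEN])
    return ids, type_ids, mask

texts = ['Today was a good day'] * 5
dataset = tf.data.Dataset.from_tensor_slices(texts).map(tf_encode).batch(2)
ids, type_ids, mask = next(iter(dataset))
print(ids.shape, mask.numpy()[0])
```

Keep in mind that tf.py_function has to run in the same Python process that created it, so it may hit restrictions similar to from_generator in distributed or TPU setups.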
Varchita Lalwani answered Mar 12 '23 00:03


When you create the TensorFlow dataset with tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values), TensorFlow converts your strings into tensors of string type, which is not an accepted input of tokenizer.encode_plus. As the error message says, it only accepts a string, a list/tuple of strings or a list/tuple of integers. You can verify this by adding print(type(texts)) inside your encode function (Output: <class 'tensorflow.python.framework.ops.Tensor'>).
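
Here is a small sketch of that check which only needs TensorFlow, no tokenizer. The names inspect, seen and mapped are made up for illustration; the point is only to show what Dataset.map hands to the mapped function.

```python
import tensorflow as tf

texts = ['Today was a good day', 'Today was a bad day']
dataset = tf.data.Dataset.from_tensor_slices(texts)

# Each element is described as a scalar tensor of dtype tf.string, not a Python str.
print(dataset.element_spec)

seen = []

def inspect(text):
    # Dataset.map traces this function in graph mode, so `text` is a
    # symbolic string tensor with no concrete value to hand to a tokenizer.
    seen.append((tf.is_tensor(text), text.dtype))
    return text

mapped = dataset.map(inspect)

# Iterating eagerly yields concrete tensors; .numpy() returns bytes,
# which still have to be decoded to get a Python str.
first = next(iter(mapped))
decoded = first.numpy().decode('utf-8')
print(type(first.numpy()), decoded)
```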

I'm not sure what your follow-up plan is or why you need a tf.data.Dataset, but you have to encode your input before you turn it into a tf.data.Dataset:

import tensorflow as tf
from transformers import BertTokenizer, BertModel

MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)

texts = ['Today was a good day', 'Today was a bad day',
       'Today was a rainy day', 'Today was a sunny day',
       'Today was a cloudy day']


#inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]
inputs = tokenizer.batch_encode_plus(
        texts,
        return_tensors='tf',
        return_attention_masks=True, 
        return_token_type_ids=True,
        pad_to_max_length=True,
        max_length=10
    )

dataset = tf.data.Dataset.from_tensor_slices((inputs['input_ids'],
                                              inputs['attention_mask'],
                                              inputs['token_type_ids']))
print(type(dataset))

Output:

<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
cronoik answered Mar 11 '23 22:03