Training TFBertForSequenceClassification with custom X and Y data

I am working on a text classification problem, for which I am trying to train my model on TFBertForSequenceClassification given in the huggingface-transformers library.

I followed the example given on their GitHub page and am able to run the sample code with the given sample data using tensorflow_datasets.load('glue/mrpc'). However, I am unable to find an example of how to load my own custom data and pass it to model.fit(train_dataset, epochs=2, steps_per_epoch=115, validation_data=valid_dataset, validation_steps=7).

How can I define my own X, tokenize it, and prepare a train_dataset from my X and Y, where X represents my input text and Y represents the classification category of a given X?

Sample training dataframe:

    text    category_index
0   Assorted Print Joggers - Pack of 2 ,/ Gray Pri...   0
1   "Buckle" ( Matt ) for 35 mm Width Belt  0
2   (Gagam 07) Barcelona Football Jersey Home 17 1...   2
3   (Pack of 3 Pair) Flocklined Reusable Rubber Ha...   1
4   (Summer special Offer)Firststep new born baby ...   0
asked Feb 29 '20 by Rahul Goel

1 Answer

There really are not many good examples of using HuggingFace transformers with custom dataset files.

Let's import the required libraries first:

import numpy as np
import pandas as pd

import sklearn.model_selection as ms
import sklearn.preprocessing as p

import tensorflow as tf
import transformers as trfs

And define the needed constants:

# Max length of the encoded sequence (including special tokens such as [CLS] and [SEP]):
MAX_SEQUENCE_LENGTH = 64 

# Standard BERT model with lowercase chars only:
PRETRAINED_MODEL_NAME = 'bert-base-uncased' 

# Batch size for fitting:
BATCH_SIZE = 16 

# Number of epochs:
EPOCHS=5

Now it's time to read the dataset:

df = pd.read_csv('data.csv')
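
The rest of the answer assumes data.csv contains the two columns from the question: text and an integer category_index. If your labels are still raw strings, here is a minimal sketch (assuming a hypothetical string column named category) of encoding them into integer indices with the sklearn.preprocessing module imported above as p:

# Sketch, assuming a hypothetical raw string column 'category':
# map each category string to an integer index.
label_encoder = p.LabelEncoder()
df['category_index'] = label_encoder.fit_transform(df['category'])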

Then define the model we need, built on top of pretrained BERT for sequence classification:

def create_model(max_sequence, model_name, num_labels):
    bert_model = trfs.TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

    # This is the input for the tokens themselves (words from the dataset after encoding):
    input_ids = tf.keras.layers.Input(shape=(max_sequence,), dtype=tf.int32, name='input_ids')

    # attention_mask is a binary mask that tells BERT which tokens to attend to and which to ignore.
    # The tokenizer pads any sequence shorter than MAX_SEQUENCE_LENGTH with 0 tokens, and the
    # attention_mask marks which positions hold real tokens and which hold padding:
    attention_mask = tf.keras.layers.Input((max_sequence,), dtype=tf.int32, name='attention_mask')

    # Use the previous inputs as BERT inputs; [0] selects the classification logits:
    output = bert_model([input_ids, attention_mask])[0]

    # We can also add dropout as a regularization technique:
    #output = tf.keras.layers.Dropout(rate=0.15)(output)

    # Provide the number of classes to the final softmax layer:
    output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)

    # Final model:
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output)
    return model

Now we need to instantiate the model using the function defined above and compile it:

model = create_model(MAX_SEQUENCE_LENGTH, PRETRAINED_MODEL_NAME, df.category_index.nunique())

opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
# The labels are integer class indices, so use the sparse variant of the loss:
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
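
As a side note, TFBertForSequenceClassification already ends in a classification head that produces raw logits of size num_labels, so the softmax Dense layer above is applied on top of those logits. If you prefer, you could also skip the Keras wrapper and compile the pretrained model directly; the following is only a sketch of that alternative, not the approach used in the rest of this answer:

# Alternative (sketch): train the pretrained classification head directly.
# The model outputs raw logits, so use a from_logits loss.
alt_model = trfs.TFBertForSequenceClassification.from_pretrained(
    PRETRAINED_MODEL_NAME, num_labels=df.category_index.nunique()
)
alt_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)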

Create a function for tokenization (converting text to tokens):

def batch_encode(X, tokenizer):
    return tokenizer.batch_encode_plus(
        X,
        max_length=MAX_SEQUENCE_LENGTH, # set the length of the sequences
        add_special_tokens=True, # add [CLS] and [SEP] tokens
        return_attention_mask=True,
        return_token_type_ids=False, # not needed for this type of ML task
        pad_to_max_length=True, # add 0 pad tokens to sequences shorter than max_length
        return_tensors='tf'
    )

Load the tokenizer:

tokenizer = trfs.BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)
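
To sanity-check the encoder, you can run it on a small list of strings and inspect the shapes of the returned tensors (a quick sketch with one made-up example):

# Sketch: batch_encode returns a dict-like object with 'input_ids' and 'attention_mask',
# each a tensor of shape (num_examples, MAX_SEQUENCE_LENGTH).
sample = batch_encode(['Barcelona Football Jersey Home'], tokenizer)
print(sample['input_ids'].shape)       # (1, 64)
print(sample['attention_mask'].shape)  # (1, 64)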

Split the data into train and validation parts:

X_train, X_val, y_train, y_val = ms.train_test_split(df.text.values, df.category_index.values, test_size=0.2)
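
If your classes are imbalanced, you may also want to pass stratify (and a fixed random_state for reproducibility) so both splits keep the same label proportions; this is optional:

# Optional sketch: stratified split with a fixed seed.
X_train, X_val, y_train, y_val = ms.train_test_split(
    df.text.values, df.category_index.values,
    test_size=0.2, stratify=df.category_index.values, random_state=42
)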

Encode our sets:

X_train = batch_encode(X_train, tokenizer)
X_val = batch_encode(X_val, tokenizer)

Finally, we can fit our model on the training set and validate after each epoch on the validation set:

model.fit(
    x=[X_train['input_ids'], X_train['attention_mask']],
    y=y_train,
    validation_data=([X_val['input_ids'], X_val['attention_mask']], y_val),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE
)
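
Once training finishes, inference follows the same encode-then-predict pattern. A short sketch, where new_texts is a hypothetical list of unseen product titles:

# Sketch: classify unseen texts with the trained model.
new_texts = ['Printed Cotton T-Shirt Pack of 2']  # hypothetical input
encoded = batch_encode(new_texts, tokenizer)
probs = model.predict([encoded['input_ids'], encoded['attention_mask']])
predicted_category_index = np.argmax(probs, axis=1)
print(predicted_category_index)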
answered Oct 25 '22 by konstantin_doncov