I am working on a text classification problem, for which I am trying to train my model on TFBertForSequenceClassification given in the huggingface-transformers library.
I followed the example given on their GitHub page, and I am able to run the sample code with the given sample data using tensorflow_datasets.load('glue/mrpc').
However, I am unable to find an example of how to load my own custom data and pass it into:
model.fit(train_dataset, epochs=2, steps_per_epoch=115, validation_data=valid_dataset, validation_steps=7)
How can I define my own X, tokenize it, and prepare a train_dataset from my X and Y, where X is my input text and Y is the classification category of a given X?
Sample training dataframe:
text category_index
0 Assorted Print Joggers - Pack of 2 ,/ Gray Pri... 0
1 "Buckle" ( Matt ) for 35 mm Width Belt 0
2 (Gagam 07) Barcelona Football Jersey Home 17 1... 2
3 (Pack of 3 Pair) Flocklined Reusable Rubber Ha... 1
4 (Summer special Offer)Firststep new born baby ... 0
There really aren't many good examples of HuggingFace transformers with custom dataset files.
Let's import the required libraries first:
import numpy as np
import pandas as pd
import sklearn.model_selection as ms
import sklearn.preprocessing as p
import tensorflow as tf
import transformers as trfs
And define the needed constants:
# Max length of the encoded sequence (including special tokens such as [CLS] and [SEP]):
MAX_SEQUENCE_LENGTH = 64
# Standard BERT model with lowercase chars only:
PRETRAINED_MODEL_NAME = 'bert-base-uncased'
# Batch size for fitting:
BATCH_SIZE = 16
# Number of epochs:
EPOCHS = 5
Now it's time to read the dataset:
df = pd.read_csv('data.csv')
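The rest of this answer assumes data.csv has the same two columns as the sample in the question, text and category_index. A quick optional sanity check:
# Assumed layout of data.csv: a 'text' column with the raw strings
# and a 'category_index' column with integer class labels.
print(df[['text', 'category_index']].head())
print('Number of classes:', df.category_index.nunique())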
Then define the required model from pretrained BERT for sequence classification:
def create_model(max_sequence, model_name, num_labels):
    bert_model = trfs.TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    # This is the input for the tokens themselves (words from the dataset after encoding):
    input_ids = tf.keras.layers.Input(shape=(max_sequence,), dtype=tf.int32, name='input_ids')
    # attention_mask is a binary mask that tells BERT which tokens to attend to and which to ignore.
    # The encoder pads sequences shorter than MAX_SEQUENCE_LENGTH with 0 tokens,
    # and the attention_mask tells BERT which positions hold tokens from the original data and which hold the 0 pad token:
    attention_mask = tf.keras.layers.Input((max_sequence,), dtype=tf.int32, name='attention_mask')
    # Use the previous inputs as BERT inputs:
    output = bert_model([input_ids, attention_mask])[0]
    # We could also add dropout as a regularization technique:
    #output = tf.keras.layers.Dropout(rate=0.15)(output)
    # Provide the number of classes to the final layer:
    output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)
    # Final model:
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output)
    return model
Now we need to instantiate the model using the defined function and compile it. Note that category_index holds plain integer labels (not one-hot vectors), so sparse_categorical_crossentropy is the appropriate loss:
model = create_model(MAX_SEQUENCE_LENGTH, PRETRAINED_MODEL_NAME, df.category_index.nunique())
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
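If you want to verify the wiring before training, printing the model summary shows the two input layers feeding the BERT model and the final dense head (optional check, nothing more):
# Inspect the model structure and parameter counts:
model.summary()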
Create a function for the tokenization (converting text to tokens):
def batch_encode(X, tokenizer):
    return tokenizer.batch_encode_plus(
        X,
        max_length=MAX_SEQUENCE_LENGTH, # set the length of the sequences
        add_special_tokens=True, # add [CLS] and [SEP] tokens
        return_attention_mask=True,
        return_token_type_ids=False, # not needed for this type of ML task
        pad_to_max_length=True, # add 0 pad tokens to the sequences shorter than max_length
        return_tensors='tf'
    )
Load the tokenizer:
tokenizer = trfs.BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)
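To see what batch_encode returns, you can run it on a couple of strings (the strings below are made up, purely for illustration); the result behaves like a dict with input_ids and attention_mask tensors of shape (batch_size, MAX_SEQUENCE_LENGTH):
# Quick check: encode two sample strings and inspect the output shapes.
sample = batch_encode(['first product title', 'second product title'], tokenizer)
print(sample['input_ids'].shape)       # (2, MAX_SEQUENCE_LENGTH)
print(sample['attention_mask'].shape)  # (2, MAX_SEQUENCE_LENGTH)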
Split the data into train and validation parts:
X_train, X_val, y_train, y_val = ms.train_test_split(df.text.tolist(), df.category_index.values, test_size=0.2)
Encode our sets:
X_train = batch_encode(X_train, tokenizer)
X_val = batch_encode(X_val, tokenizer)
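If you prefer the model.fit(train_dataset, ...) style from the question, you can also wrap the encodings and labels into tf.data.Dataset objects (a minimal sketch; the keys of the dict match the Input layer names, and batching here replaces the batch_size argument to fit):
# Optional: package encodings + labels as tf.data.Dataset objects,
# so they can be passed to model.fit(train_dataset, ...) as in the question.
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(X_train),   # {'input_ids': ..., 'attention_mask': ...}
    y_train
)).shuffle(len(y_train)).batch(BATCH_SIZE)

valid_dataset = tf.data.Dataset.from_tensor_slices((
    dict(X_val),
    y_val
)).batch(BATCH_SIZE)

# model.fit(train_dataset, validation_data=valid_dataset, epochs=EPOCHS)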
Finally, we can fit our model on the train set and validate after each epoch on the validation set:
model.fit(
    x=[X_train['input_ids'], X_train['attention_mask']],
    y=y_train,
    validation_data=([X_val['input_ids'], X_val['attention_mask']], y_val),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE
)
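After training, new text goes through the same batch_encode step before prediction; for example (a small sketch with made-up product titles):
# Predict classes for a few new titles (illustrative strings only):
new_texts = ['Kids running shoes - size 30', 'Leather belt buckle 40 mm']
encoded = batch_encode(new_texts, tokenizer)
probs = model.predict([encoded['input_ids'], encoded['attention_mask']])
predicted_class = np.argmax(probs, axis=1)
print(predicted_class)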