BERT document embedding

I am trying to do document embedding using BERT. The code I use is a combination of two sources: the BERT Document Classification Tutorial with Code and the BERT Word Embeddings Tutorial. Below is the code; I feed the first 510 tokens of each document to the BERT model. Finally, I apply K-means clustering to these embeddings, but the members of each cluster are totally irrelevant to one another. I am wondering how this is possible; maybe something is wrong with my code. I would appreciate it if you could take a look at my code and tell me if there is something wrong with it. I run this code on Google Colab.

# text_to_embedding function
import torch
from keras.preprocessing.sequence import pad_sequences

def text_to_embedding(tokenizer, model, in_text):
    '''
    Uses the provided BERT 'model' and 'tokenizer' to generate a vector
    representation of the input string, 'in_text'.

    Returns the vector stored as a numpy ndarray.
    '''

    # ===========================
    #   STEP 1: Tokenization
    # ===========================

    MAX_LEN = 510

    # 'encode' will:
    #  (1) Tokenize the sentence
    #  (2) Prepend the '[CLS]' token to the start.
    #  (3) Append the '[SEP]' token to the end.
    #  (4) Map tokens to their IDs.
    input_ids = tokenizer.encode(
        in_text,                         # sentence to encode.
        add_special_tokens = True,       # Add '[CLS]' and '[SEP]'
        max_length = MAX_LEN,            # Truncate all sentences.
        #return_tensors = 'pt'           # Return pytorch tensors.
    )

    # Pad our input tokens. Truncation was handled above by the 'encode'
    # function, which also makes sure that the '[SEP]' token is placed at the
    # end *after* truncating.
    # Note: 'pad_sequences' expects a list of lists, but we only have one
    # piece of text, so we surround 'input_ids' with an extra set of brackets.
    results = pad_sequences([input_ids], maxlen=MAX_LEN, dtype="long",
                          value=0, truncating="post", padding="post")
    
    # Remove the outer list.
    input_ids = results[0]

    # Create attention masks.
    attn_mask = [int(i > 0) for i in input_ids]

    # Cast to tensors.
    input_ids = torch.tensor(input_ids)
    attn_mask = torch.tensor(attn_mask)

    # Add an extra dimension for the "batch" (even though there is only one
    # input in this batch)
    input_ids = input_ids.unsqueeze(0)
    attn_mask = attn_mask.unsqueeze(0)


    # ===========================
    #   STEP 2: Extract Embeddings
    # ===========================

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Copy the inputs to the GPU
    input_ids = input_ids.to(device)
    attn_mask = attn_mask.to(device)

    # telling the model not to build the backward graph will make this
    # a little quicker.
    with torch.no_grad():

        # Forward pass, returns hidden states and predictions
        # This will return the logits rather than the loss because we have
        # not provided labels.
        outputs = model(
            input_ids = input_ids,
            token_type_ids = None,
            attention_mask = attn_mask)
        

        hidden_states = outputs[2]

        #Sentence Vectors
        #To get a single vector for our entire sentence we have multiple 
        #application-dependent strategies, but a simple approach is to 
        #average the second to last hidden layer of each token producing 
        #a single 768 length vector.
        # `hidden_states` has shape [13 x 1 x ? x 768]

        # `token_vecs` is a tensor with shape [? x 768]
        token_vecs = hidden_states[-2][0]

        # Calculate the average of all ? token vectors.
        sentence_embedding = torch.mean(token_vecs, dim=0)
        # Move to the CPU and convert to numpy ndarray.
        sentence_embedding = sentence_embedding.detach().cpu().numpy()

        return sentence_embedding


from transformers import BertTokenizer, BertModel
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )
model.cuda()

# Device the model lives on; used when moving inputs to the GPU above.
device = torch.device('cuda')

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
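
For completeness, the clustering step looks roughly like this (simplified; documents stands in for my actual list of texts, and I use scikit-learn's KMeans):

import numpy as np
from sklearn.cluster import KMeans

# One embedding per document, then K-means on the resulting matrix.
documents = ["first document text ...", "second document text ..."]  # placeholder

embeddings = np.array([text_to_embedding(tokenizer, model, doc)
                       for doc in documents])

kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(embeddings)
print(labels)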
asked Aug 01 '20 by MRM

1 Answer

I don't know if it solves your problem, but here are my two cents:

  • You don't have to calculate the attention mask and do the padding manually. Have a look at the documentation. Just call the tokenizer itself:
results = tokenizer(in_text, max_length=MAX_LEN, truncation=True)
input_ids = results.input_ids
attn_mask = results.attention_mask
# Cast to tensors
...
  • Instead of using the average of the second-to-last hidden layer, you can try the same thing with the last hidden layer; or you can use the vector that represents [CLS] from the last layer (see the sketch after this list).
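
Putting both points together, a rough sketch of a simplified text_to_embedding (untested; it assumes the same model, tokenizer and device as in the question) could look like this:

def text_to_embedding(tokenizer, model, in_text):
    # The tokenizer handles truncation and the attention mask itself.
    encoded = tokenizer(in_text,
                        max_length=512,
                        truncation=True,
                        return_tensors='pt')   # PyTorch tensors directly

    input_ids = encoded.input_ids.to(device)
    attn_mask = encoded.attention_mask.to(device)

    model.eval()
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attn_mask)

    last_hidden = outputs[0]                   # shape [1, seq_len, 768]

    # Option 1: the vector for the [CLS] token from the last layer.
    cls_embedding = last_hidden[0, 0]

    # Option 2: the mean over the last layer's token vectors.
    # mean_embedding = last_hidden[0].mean(dim=0)

    return cls_embedding.cpu().numpy()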
answered Nov 11 '22 by Thành Hưng Dương