BERT document embedding

I am trying to do document embedding using BERT. The code I use is a combination of two sources: the BERT Document Classification Tutorial with Code and the BERT Word Embeddings Tutorial. Below is the code; I feed the first 510 tokens of each document to the BERT model. Finally, I apply K-means clustering to these embeddings, but the members of each cluster are totally irrelevant to one another. I am wondering how this is possible; maybe something is wrong with my code. I would appreciate it if you could take a look at my code and tell me if there is something wrong with it. I run this code on Google Colab.

# text_to_embedding function
import torch
from keras.preprocessing.sequence import pad_sequences

def text_to_embedding(tokenizer, model, in_text):
    '''
    Uses the provided BERT 'model' and 'tokenizer' to generate a vector
    representation of the input string, 'in_text'.

    Returns the vector stored as a numpy ndarray.
    '''

    # ===========================
    #   STEP 1: Tokenization
    # ===========================

    MAX_LEN = 510

    # 'encode' will:
    #  (1) Tokenize the sentence
    #  (2) Prepend the '[CLS]' token to the start.
    #  (3) Append the '[SEP]' token to the end.
    #  (4) Map tokens to their IDs.
    input_ids = tokenizer.encode(
        in_text,                         # sentence to encode.
        add_special_tokens = True,       # Add '[CLS]' and '[SEP]'
        max_length = MAX_LEN,            # Truncate all sentences.
        #return_tensors = 'pt'           # Return pytorch tensors.
    )

    # Pad our input tokens. Truncation was handled above by the 'encode'
    # function, which also makes sure that the '[SEP]' token is placed at the
    # end *after* truncating.
    # Note: 'pad_sequences' expects a list of lists, but we only have one
    # piece of text, so we surround 'input_ids' with an extra set of brackets.
    results = pad_sequences([input_ids], maxlen=MAX_LEN, dtype="long",
                          value=0, truncating="post", padding="post")
    
    # Remove the outer list.
    input_ids = results[0]

    # Create attention masks.
    attn_mask = [int(i > 0) for i in input_ids]

    # Cast to tensors.
    input_ids = torch.tensor(input_ids)
    attn_mask = torch.tensor(attn_mask)

    # Add an extra dimension for the "batch" (even though there is only one
    # input in this batch)
    input_ids = input_ids.unsqueeze(0)
    attn_mask = attn_mask.unsqueeze(0)


    # ===========================
    #   STEP 2: Extract Embeddings
    # ===========================

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Copy the inputs to the GPU
    input_ids = input_ids.to(device)
    attn_mask = attn_mask.to(device)

    # telling the model not to build the backward graph will make this
    # a little quicker.
    with torch.no_grad():

        # Forward pass, returns hidden states and predictions
        # This will return the logits rather than the loss because we have
        # not provided labels.
        outputs = model(
            input_ids = input_ids,
            token_type_ids = None,
            attention_mask = attn_mask)
        

        hidden_states = outputs[2]

        #Sentence Vectors
        #To get a single vector for our entire sentence we have multiple 
        #application-dependent strategies, but a simple approach is to 
        #average the second to last hidden layer of each token producing 
        #a single 768 length vector.
        # `hidden_states` has shape [13 x 1 x ? x 768]

        # `token_vecs` is a tensor with shape [? x 768]
        token_vecs = hidden_states[-2][0]

        # Calculate the average of all ? token vectors.
        sentence_embedding = torch.mean(token_vecs, dim=0)
        # Move to the CPU and convert to numpy ndarray.
        sentence_embedding = sentence_embedding.detach().cpu().numpy()

        return sentence_embedding


from transformers import BertTokenizer, BertModel
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )
model.cuda()

# Device the model lives on; used when moving inputs to the GPU above.
device = torch.device('cuda')

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
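
For completeness, the clustering step looks roughly like this (simplified; documents stands in for my actual list of texts, and I use scikit-learn's KMeans):

import numpy as np
from sklearn.cluster import KMeans

# One embedding per document, then K-means on the resulting matrix.
documents = ["first document text ...", "second document text ..."]  # placeholder

embeddings = np.array([text_to_embedding(tokenizer, model, doc)
                       for doc in documents])

kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(embeddings)
print(labels)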
asked Aug 01 '20 by MRM

1 Answer

I don't know if it solves your problem, but here are my two cents:

  • You don't have to calculate the attention mask and do the padding manually. Have a look at the documentation. Just call the tokenizer itself:
results = tokenizer(in_text, max_length=MAX_LEN, truncation=True)
input_ids = results.input_ids
attn_mask = results.attention_mask
# Cast to tensors
...
  • Instead of using the average of the second-to-last hidden layer, you can try the same thing with the last hidden layer; or you can use the vector that represents [CLS] from the last layer (see the sketch after this list).
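
Putting both points together, a rough sketch of a simplified text_to_embedding (untested; it assumes the same model, tokenizer and device as in the question) could look like this:

def text_to_embedding(tokenizer, model, in_text):
    # The tokenizer handles truncation and the attention mask itself.
    encoded = tokenizer(in_text,
                        max_length=512,
                        truncation=True,
                        return_tensors='pt')   # PyTorch tensors directly

    input_ids = encoded.input_ids.to(device)
    attn_mask = encoded.attention_mask.to(device)

    model.eval()
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attn_mask)

    last_hidden = outputs[0]                   # shape [1, seq_len, 768]

    # Option 1: the vector for the [CLS] token from the last layer.
    cls_embedding = last_hidden[0, 0]

    # Option 2: the mean over the last layer's token vectors.
    # mean_embedding = last_hidden[0].mean(dim=0)

    return cls_embedding.cpu().numpy()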
answered Nov 11 '22 by Thành Hưng Dương