Document similarity with Word Mover Distance and Bert-Embedding

I am trying to calculate document similarity (nearest neighbors) for two arbitrary documents using word embeddings based on Google's BERT. To obtain the word embeddings from BERT, I use bert-as-service. The document similarity should be based on Word Mover's Distance, computed with the Python wmd-relax package.

My attempts so far follow this example from the wmd-relax GitHub repo: https://github.com/src-d/wmd-relax/blob/master/spacy_example.py

import numpy as np
import spacy
import requests
from wmd import WMD
from collections import Counter
from bert_serving.client import BertClient

# Wikipedia titles
titles = ["Germany", "Spain", "Google", "Apple"]

# Standard model from spacy
nlp = spacy.load("en_vectors_web_lg")

# Fetch Wikipedia articles and prepare them as spaCy documents
documents_spacy = {}
print('Create spacy document')
for title in titles:
    print("... fetching", title)
    pages = requests.get(
        "https://en.wikipedia.org/w/api.php?action=query&format=json&titles=%s"
        "&prop=extracts&explaintext" % title).json()["query"]["pages"]
    text = nlp(next(iter(pages.values()))["extract"])
    tokens = [t for t in text if t.is_alpha and not t.is_stop]
    words = Counter(t.text for t in tokens)
    orths = {t.text: t.orth for t in tokens}
    sorted_words = sorted(words)
    documents_spacy[title] = (title, [orths[t] for t in sorted_words],
                              np.array([words[t] for t in sorted_words],
                                       dtype=np.float32))


# This is the original embedding class with the model from spacy
class SpacyEmbeddings(object):
    def __getitem__(self, item):
        return nlp.vocab[item].vector


# BERT embeddings using bert-as-service
class BertEmbeddings:
    def __init__(self, ip='localhost', port=5555, port_out=5556):
        self.server = BertClient(ip=ip, port=port, port_out=port_out)

    def __getitem__(self, item):
        text = nlp.vocab[item].text
        emb = self.server.encode([text])
        return emb


# Get the nearest neighbors of one of the articles
calc_bert = WMD(BertEmbeddings(), documents_spacy)
calc_bert.nearest_neighbors(titles[0])

Unfortunately, the calculation fails with a dimension mismatch in the distance computation: ValueError: shapes (812,1,768) and (768,1,812) not aligned: 768 (dim 2) != 1 (dim 1)

winwin asked Nov 07 '22

1 Answer

The output of bert-as-service has shape (batch_size, sequence_len, embedding_dimension). In your case, sequence_len is 1 since you are pooling the results.
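You can confirm this by inspecting the shape of a single call yourself (a quick sanity check, assuming a bert-as-service server is running with the questioner's settings):

from bert_serving.client import BertClient

bc = BertClient(port=5555, port_out=5556)
emb = bc.encode(["Germany"])
# With the questioner's setup, each single-text call returns an array with a
# leading singleton dimension, e.g. (1, 768), which is what produces the
# (812, 1, 768) stack seen in the traceback.
print(emb.shape)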

Now, you can transpose the other array to match it, using the transpose method of numpy.ndarray.
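Equivalently, instead of transposing inside the distance computation, you can collapse the extra dimension in the embedding lookup itself, so that WMD receives plain 1-D vectors. Here is a minimal sketch of the questioner's BertEmbeddings class with that change (assuming each encode() call on a one-element list returns a single vector with a leading singleton dimension, as the traceback indicates):

class BertEmbeddings:
    def __init__(self, ip='localhost', port=5555, port_out=5556):
        self.server = BertClient(ip=ip, port=port, port_out=port_out)

    def __getitem__(self, item):
        text = nlp.vocab[item].text
        emb = self.server.encode([text])
        # Collapse the (1, 768) result to a 1-D (768,) vector so WMD can
        # stack all lookups into a 2-D (n_words, 768) matrix.
        return emb.reshape(-1).astype(np.float32)

With this change both operands of the distance calculation are 2-D, which avoids the alignment error.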

Razzaghnoori answered Nov 14 '22