How to cluster similar sentences using BERT

For ELMo, FastText and Word2Vec, I'm averaging the word embeddings within a sentence and using HDBSCAN/KMeans clustering to group similar sentences.
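
For reference, a minimal sketch of what I mean (word_vectors here is just a toy stand-in for a loaded FastText/Word2Vec model):

import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in; in practice this is a loaded FastText/Word2Vec/ELMo lookup
word_vectors = {w: np.random.rand(300) for w in
                "the cat sat on mat a dog chased ball".split()}

def sentence_vector(sentence, dim=300):
    # Average the vectors of the words we have embeddings for
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

sentences = ["The cat sat on the mat", "A dog chased the ball"]
X = np.vstack([sentence_vector(s) for s in sentences])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)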

A good example of the implementation can be seen in this short article: http://ai.intelligentonlinetools.com/ml/text-clustering-word-embedding-machine-learning/

I would like to do the same thing using BERT (using the BERT Python package from Hugging Face); however, I am rather unfamiliar with how to extract the raw word/sentence vectors in order to input them into a clustering algorithm. I know that BERT can output sentence representations - so how would I actually extract the raw vectors from a sentence?

Any information would be helpful.

asked Apr 10 '19 by somethingstrang



4 Answers

You can use Sentence Transformers to generate the sentence embeddings. These embeddings are much more meaningful than the ones obtained from bert-as-service, as they have been fine-tuned so that semantically similar sentences have a higher similarity score. If the number of sentences to be clustered is in the millions or more, you can use a FAISS-based clustering algorithm, since a vanilla K-means-like clustering algorithm takes quadratic time.
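
For very large corpora, a minimal sketch of how FAISS k-means could be run over Sentence Transformer embeddings (the model name, cluster count, and iteration count below are placeholder choices, not recommendations):

from sentence_transformers import SentenceTransformer
import faiss

sentences = ["How do I learn Python?",
             "What is the best way to study Python?",
             "What is the capital of France?"]

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')   # placeholder model
embeddings = model.encode(sentences).astype('float32')   # FAISS expects float32

# Train FAISS k-means on the embeddings (k and niter are placeholder values)
n_clusters = 2
kmeans = faiss.Kmeans(d=embeddings.shape[1], k=n_clusters, niter=20)
kmeans.train(embeddings)

# Assign each sentence to its nearest centroid
_, labels = kmeans.index.search(embeddings, 1)
print(labels.ravel())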

answered Oct 10 '22 by Subham Kumar


You will need to generate BERT embeddings for the sentences first. bert-as-service provides a very easy way to generate embeddings for sentences.

This is how you can generate BERT vectors for a list of sentences you need to cluster. It is explained very well in the bert-as-service repository: https://github.com/hanxiao/bert-as-service

Installations:

pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`

Download one of the pre-trained models available at https://github.com/google-research/bert

Start the service:

bert-serving-start -model_dir /your_model_directory/ -num_worker=4 

Generate the vectors for the list of sentences:

from bert_serving.client import BertClient
bc = BertClient()
vectors = bc.encode(your_list_of_sentences)

This gives you a list of vectors; you can write them to a CSV and use any clustering algorithm, since the sentences have been reduced to numbers.
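
For example, the returned matrix can be fed straight into scikit-learn; a minimal sketch, assuming the service started above is running:

from bert_serving.client import BertClient
from sklearn.cluster import KMeans

bc = BertClient()
sentences = ["How do I cluster sentences?",
             "What is sentence clustering?",
             "Where can I buy good coffee?"]
vectors = bc.encode(sentences)   # shape: (num_sentences, 768) for BERT-base

labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
for label, sentence in zip(labels, sentences):
    print(label, sentence)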

answered Oct 10 '22 by Palak Bansal


As Subham Kumar mentioned, one can use this Python 3 library to compute sentence similarity: https://github.com/UKPLab/sentence-transformers

The library has a few code examples to perform clustering:

fast_clustering.py:

"""
This is a more complex example on performing clustering on large scale dataset.

This examples find in a large set of sentences local communities, i.e., groups of sentences that are highly
similar. You can freely configure the threshold what is considered as similar. A high threshold will
only find extremely similar sentences, a lower threshold will find more sentence that are less similar.

A second parameter is 'min_community_size': Only communities with at least a certain number of sentences will be returned.

The method for finding the communities is extremely fast, for clustering 50k sentences it requires only 5 seconds (plus embedding comuptation).

In this example, we download a large set of questions from Quora and then find similar questions in this set.
"""
from sentence_transformers import SentenceTransformer, util
import os
import csv
import time


# Model for computing sentence embeddings. We use one trained for similar questions detection
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# We download the Quora Duplicate Questions Dataset (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)
# and find similar questions in it
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 50000 # We limit our corpus to only the first 50k questions


# Download the dataset if it is not already present
if not os.path.exists(dataset_path):
    print("Download dataset")
    util.http_get(url, dataset_path)

# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        corpus_sentences.add(row['question1'])
        corpus_sentences.add(row['question2'])
        if len(corpus_sentences) >= max_corpus_size:
            break

corpus_sentences = list(corpus_sentences)
print("Encode the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, batch_size=64, show_progress_bar=True, convert_to_tensor=True)


print("Start clustering")
start_time = time.time()

#Two parameters to tune:
#min_community_size: Only consider communities that have at least 25 elements
#threshold: Consider sentence pairs with a cosine similarity larger than the threshold as similar
clusters = util.community_detection(corpus_embeddings, min_community_size=25, threshold=0.75)

print("Clustering done after {:.2f} sec".format(time.time() - start_time))

#Print for all clusters the top 3 and bottom 3 elements
for i, cluster in enumerate(clusters):
    print("\nCluster {}, #{} Elements ".format(i+1, len(cluster)))
    for sentence_id in cluster[0:3]:
        print("\t", corpus_sentences[sentence_id])
    print("\t", "...")
    for sentence_id in cluster[-3:]:
        print("\t", corpus_sentences[sentence_id])

kmeans.py:

"""
This is a simple application for sentence embeddings: clustering

Sentences are mapped to sentence embeddings and then k-mean clustering is applied.
"""
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.'
          ]
corpus_embeddings = embedder.encode(corpus)

# Perform k-means clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")

agglomerative.py:

"""
This is a simple application for sentence embeddings: clustering

Sentences are mapped to sentence embeddings and then agglomerative clustering with a threshold is applied.
"""
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
import numpy as np

embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.'
          ]
corpus_embeddings = embedder.encode(corpus)

# Normalize the embeddings to unit length
corpus_embeddings = corpus_embeddings /  np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)

# Perform agglomerative clustering
clustering_model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.5) #, affinity='cosine', linkage='average', distance_threshold=0.4)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = {}
for sentence_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in clustered_sentences:
        clustered_sentences[cluster_id] = []

    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in clustered_sentences.items():
    print("Cluster ", i+1)
    print(cluster)
    print("")

answered Oct 10 '22 by Franck Dernoncourt


BERT adds a special [CLS] token at the beginning of each sample/sentence. After fine-tuning on a downstream task, the embedding of this [CLS] token (or pooled_output, as it is called in the Hugging Face implementation) represents the sentence embedding.

But I think that you don't have labels, so you won't be able to fine-tune, and therefore you cannot use the pooled_output as a sentence embedding. Instead, you should use the word embeddings in encoded_layers, which is a tensor with dimensions (12, seq_len, 768). This tensor holds the embeddings (dimension 768) from each of the 12 layers in BERT. To get the word embeddings you can use the output of the last layer, concatenate or sum the outputs of the last 4 layers, and so on.

Here is the script for extracting the features: https://github.com/ethanjperez/pytorch-pretrained-BERT/blob/master/examples/extract_features.py
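
If you are on the newer transformers package rather than pytorch-pretrained-BERT, a rough sketch of the same idea (here I sum the last 4 hidden layers and mean-pool over tokens; the model name and pooling choice are just one reasonable option, not the only one):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

sentences = ["A man is eating food.", "A cheetah is running behind its prey."]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**encoded)

# outputs.hidden_states is a tuple of 13 tensors (embedding layer + 12 BERT layers),
# each with shape (batch_size, seq_len, 768)
summed = torch.stack(outputs.hidden_states[-4:]).sum(dim=0)   # sum of the last 4 layers

# Mean-pool over real tokens only, using the attention mask
mask = encoded['attention_mask'].unsqueeze(-1).float()
sentence_embeddings = (summed * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)   # torch.Size([2, 768])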

answered Oct 10 '22 by DSDS