Any way to extract the exhaustive vocabulary of the google universal sentence encoder large?

I have some sentences for which I am creating embeddings, and similarity search works great unless a sentence contains some truly unusual words.

In those cases, the unusual words actually carry the most useful similarity information of any words in the sentence, BUT all of that information is lost during embedding because the words are apparently not in the model's vocabulary.

I'd like to get a list of all of the words known by the GUSE embedding model so that I can mask those known words out of my sentence, leaving only the "novel" words.

I can then do an exact keyword search for those novel words in my target corpus and make my similar-sentence search usable again.

e.g. "I love to use Xapian!" gets embedded as "I love to use UNK".

If I just do a keyword search for "Xapian" instead of a semantic similarity search, I'll get much more relevant results than I would using GUSE and vector KNN.
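
For reference, the masking step I have in mind is roughly this minimal sketch (assuming vocabulary is the set of words I manage to extract from the model):

import re

def split_novel_words(sentence, vocabulary):
    # Lowercase and tokenize, then separate words the model knows from "novel" ones.
    tokens = re.findall(r"[\w']+", sentence.lower())
    known = [t for t in tokens if t in vocabulary]
    novel = [t for t in tokens if t not in vocabulary]
    return known, novel

# e.g. split_novel_words("I love to use Xapian!", vocabulary)
# -> known words go to the GUSE embedding, novel words (like "xapian") go to keyword search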

Any ideas on how I can extract the vocabulary known/used by GUSE?

asked Mar 13 '19 by Steve Madere


2 Answers

I'm assuming you have tensorflow & tensorflow_hub installed, and you have already downloaded the model.

IMPORTANT: I'm assuming you're looking at https://tfhub.dev/google/universal-sentence-encoder/4! There's no guarantee the object graph looks the same for different versions; modifications will likely be needed.

Find its location on disk - it's somewhere under /tmp/tfhub_modules unless you set the TFHUB_CACHE_DIR environment variable (Windows/Mac have different locations). The path should contain a file called saved_model.pb, which is the model, serialized using Protocol Buffers.
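
If you'd rather locate it programmatically, recent tensorflow_hub releases provide hub.resolve, which downloads the module if it isn't cached yet and returns its local directory (a small sketch, assuming your tensorflow_hub version is new enough to have it):

import tensorflow_hub as hub

# Resolves the handle to a local cache directory containing saved_model.pb,
# downloading the module first if necessary.
model_path = hub.resolve("https://tfhub.dev/google/universal-sentence-encoder/4")
print(model_path)  # e.g. /tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c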

Unfortunately, the dictionary is serialized inside the model's Protocol Buffers file and not as an external asset, so we'll have to load the model and get the variable from it.

The strategy is to use tensorflow's code to deserialize the file, and then travel down the serialized object tree all the way to the dictionary.

import importlib

MODEL_PATH = 'path/to/model/dir' # e.g. '/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/'

# Use the tensorflow internal Protobuf loader. A regular import statement will fail.
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')

saved_model = loader_impl.parse_saved_model(MODEL_PATH)

# reach into the object graph to get the tensor
graph = saved_model.meta_graphs[0].graph_def
function = graph.library.function
node_type, node_value = function[5].node_def
# if you print(node_type) you'll see it's called "text_preprocessor/hash_table"
# as well as get insight into this branch of the object graph we're looking at
words_tensor = node_value.attr.get("value").tensor

word_list = [i.decode('utf-8') for i in words_tensor.string_val]
print(len(word_list)) # -> 400004
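
If you want to reuse the vocabulary later (for example, for the keyword-masking idea from the question) without re-parsing the model, one simple option is to dump it to a text file:

# Persist the extracted vocabulary, one word per line.
with open('use_vocabulary.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(word_list))

# Later, load it back as a set for fast membership tests.
with open('use_vocabulary.txt', encoding='utf-8') as f:
    vocabulary = set(f.read().splitlines())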

Some resources that helped:

  1. A GitHub issue relating to changing the vocabulary
  2. A Tensorflow Google-group thread linked from the issue

Extra Notes

Despite what the GitHub issue may lead you to think, the 400k words here are not the GloVe 400k vocabulary. You can verify this by downloading the GloVe 6B embeddings (file link), extracting glove.6B.50d.txt, and then using the following code to compare the two dictionaries:

with open('/path/to/glove.6B.50d.txt') as f:
    glove_vocabulary = set(line.strip().split(maxsplit=1)[0] for line in f)

USE_vocabulary = set(word_list) # from above

print(len(USE_vocabulary - glove_vocabulary)) # -> 281150

Inspecting the different vocabularies is interesting in and of itself, e.g. why does GloVe have an entry for '287.9'?
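
For example, a quick way to eyeball some of the non-overlapping entries:

import random

# Print a random sample of words that are in the USE vocabulary but not in GloVe's.
print(random.sample(sorted(USE_vocabulary - glove_vocabulary), 10))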

answered Nov 15 '22 by Roee Shenberg


I combined the earlier answer from @Roee Shenberg with the solution provided here to arrive at a solution that works for USE v4:

import importlib
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')

saved_model = loader_impl.parse_saved_model("/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/")
graph = saved_model.meta_graphs[0].graph_def

fns = [f for f in graph.library.function if "ptb" in str(f).lower()]
print(len(fns)) # should be 1

nodes_with_sp = [n for n in fns[0].node_def if n.name == "Embeddings_words"]
print(len(nodes_with_sp)) # should be 1

words_tensor = nodes_with_sp[0].attr.get("value").tensor

word_list = [i.decode('utf-8') for i in words_tensor.string_val]
print(len(word_list)) # should be 400004

If you are just curious about the words, I have uploaded them here.

answered Nov 15 '22 by United121