
How to load embeddings (in tsv file) generated from StarSpace

Does anyone know how to load a tsv file with embeddings generated by StarSpace into Gensim? The Gensim documentation mostly covers Word2Vec, and I couldn't find a relevant answer.

Thanks,

Amulya

Just Data asked Mar 03 '18 20:03



1 Answer

You can use the tsv file from a trained StarSpace model and convert that into a txt file in the Word2Vec format Gensim is able to import.

The first line of the new txt file should state the line count (make sure to first delete any empty lines at the end of the file) and the vector size (dimensions) of the tsv file, separated by a space. The rest of the file looks the same as the original tsv file, but with spaces instead of tabs.

The Python code to convert the file would then look something like this:

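# Convert a StarSpace tsv model into the Word2Vec text format.
with open('path/to/starspace-model.tsv', 'r') as inp, open('path/to/word2vec-format.txt', 'w') as outp:
    line_count = '...'    # number of non-empty lines in the tsv file (as string)
    dimensions = '...'    # vector size (as string)
    # Header line: "<line count> <vector size>"
    outp.write(' '.join([line_count, dimensions]) + '\n')
    for line in inp:
        words = line.strip().split()
        if words:    # skip empty lines so the header count stays accurate
            # Each row: the token followed by its vector values, space-separated
            outp.write(' '.join(words) + '\n')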
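
If you would rather not fill in the line count and vector size by hand, they can be derived from the tsv file itself. The following is only a sketch under a few assumptions: the function name `starspace_tsv_to_word2vec` is made up for illustration, the file is assumed to be UTF-8, every row is assumed to be a token followed by its vector values, and all rows are held in memory, which may not suit very large models.

```python
import os
import tempfile


def starspace_tsv_to_word2vec(tsv_path, txt_path, encoding="utf-8"):
    """Write a Word2Vec-format text file from a StarSpace tsv file.

    Hypothetical helper; returns (line_count, dimensions) as written
    to the header line of the new file.
    """
    rows = []
    with open(tsv_path, "r", encoding=encoding) as inp:
        for line in inp:
            fields = line.strip().split()
            if fields:  # ignore empty lines, e.g. at the end of the file
                rows.append(fields)
    if not rows:
        raise ValueError("no embeddings found in " + tsv_path)

    # Every row is "token v1 v2 ... vN", so the vector size is len(row) - 1.
    dimensions = len(rows[0]) - 1
    for fields in rows:
        if len(fields) - 1 != dimensions:
            raise ValueError("inconsistent vector size for " + fields[0])

    with open(txt_path, "w", encoding=encoding) as outp:
        outp.write("{} {}\n".format(len(rows), dimensions))
        for fields in rows:
            outp.write(" ".join(fields) + "\n")
    return len(rows), dimensions


# Tiny demonstration with made-up two-dimensional vectors.
demo_dir = tempfile.mkdtemp()
demo_tsv = os.path.join(demo_dir, "demo.tsv")
demo_txt = os.path.join(demo_dir, "demo.txt")
with open(demo_tsv, "w", encoding="utf-8") as f:
    f.write("cat\t0.1\t0.2\ndog\t0.3\t0.4\n\n")
demo_header = starspace_tsv_to_word2vec(demo_tsv, demo_txt)
print(demo_header)  # (2, 2)
```

The trailing empty line in the demo input is skipped, so the header counts only the two real rows.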

You can then import the new file into Gensim like so:

from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('path/to/word2vec-format.txt', binary=False)

I used Gensim's word_vectors.similarity function to check if the model loaded correctly, and it seemed to work for me. Hope this helps!
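
Gensim's similarity function returns the cosine similarity between two word vectors, so one way to double-check a loaded model is to compute the same value by hand from the converted file and compare the two numbers. Below is a rough sketch using only the standard library; the helper names `cosine_similarity` and `read_word2vec_text` are hypothetical and not part of Gensim, and the sample vectors are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def read_word2vec_text(lines):
    """Parse Word2Vec text-format lines into a {token: vector} dict."""
    lines = iter(lines)
    count, dimensions = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        fields = line.split()
        if fields:
            vectors[fields[0]] = [float(x) for x in fields[1:]]
    # The header must agree with what was actually read.
    assert len(vectors) == count
    assert all(len(v) == dimensions for v in vectors.values())
    return vectors


# Made-up vectors, only to show the calculation.
sample = ["2 2", "cat 1.0 0.0", "dog 1.0 1.0"]
vecs = read_word2vec_text(sample)
sample_similarity = cosine_similarity(vecs["cat"], vecs["dog"])
```

To check a real file, pass an open file object for the converted txt file to `read_word2vec_text` instead of the sample list, then compare the result with what Gensim reports for the same pair of tokens.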

Sascha answered Nov 17 '22 02:11