Does anyone know how to load a tsv file with embeddings generated from StarSpace into Gensim? Gensim documentation seems to use Word2Vec a lot and I couldn't find a pertinent answer.
Thanks,
Amulya
I've not been able to directly load the StarSpace embedding files using Gensim. However, I was able to use the embed_doc utility provided by StarSpace to convert my words/sentences into their vector representations. You can read more about the utility here.
You can use the tsv file from a trained StarSpace model and convert that into a txt file in the Word2Vec format Gensim is able to import.
The first line of the new txt file should state the line count (make sure to first delete any empty lines at the end of the file) and the vector size (dimensions) of the tsv file. The rest of the file looks the same as the original tsv file, but with spaces instead of tabs.
The Python code to convert the file would then look something like this:
with open('path/to/starspace-model.tsv', 'r') as inp, open('path/to/word2vec-format.txt', 'w') as outp:
    line_count = '...'  # line count of the tsv file (as string)
    dimensions = '...'  # vector size (as string)
    outp.write(' '.join([line_count, dimensions]) + '\n')
    for line in inp:
        words = line.strip().split()
        outp.write(' '.join(words) + '\n')
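If you would rather not count the lines and dimensions by hand, a variant along these lines might work. This is only a sketch: the function name convert_starspace_tsv is made up for illustration, and it assumes every non-empty line of the tsv is a token followed by its vector values, with no whitespace inside the token itself. It reads the whole file into memory, so it may not suit very large models.

```python
def convert_starspace_tsv(tsv_path, out_path):
    """Write a word2vec-format text file from a StarSpace tsv.

    Hypothetical helper (not part of StarSpace or Gensim).
    Returns (line_count, dimensions) as written to the header.
    """
    rows = []
    with open(tsv_path, 'r', encoding='utf-8') as inp:
        for line in inp:
            fields = line.strip().split()
            # Skip empty lines so the header count matches the data rows
            if not fields:
                continue
            rows.append(fields)
    if not rows:
        raise ValueError('no embedding rows found in ' + tsv_path)

    # The first field is the token; the remaining fields are the vector
    dimensions = len(rows[0]) - 1
    for fields in rows:
        if len(fields) - 1 != dimensions:
            raise ValueError('inconsistent vector size for token ' + repr(fields[0]))

    with open(out_path, 'w', encoding='utf-8') as outp:
        # Header line: "<line count> <vector size>"
        outp.write('%d %d\n' % (len(rows), dimensions))
        for fields in rows:
            outp.write(' '.join(fields) + '\n')
    return len(rows), dimensions
```

Calling convert_starspace_tsv('path/to/starspace-model.tsv', 'path/to/word2vec-format.txt') should then produce the same kind of file as the snippet above, with the header filled in automatically.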
You can then import the new file into Gensim like so:
from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('path/to/word2vec-format.txt', binary=False)
I used Gensim's word_vectors.similarity function to check if the model loaded correctly, and it seemed to work for me. Hope this helps!
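If the import fails instead, it may be worth checking that the header of the converted file really matches its contents before digging further. Here is a rough sketch of such a check in plain Python. The name check_word2vec_text_file is hypothetical, and the check is deliberately stricter than Gensim may be (it rejects any empty line, for example).

```python
def check_word2vec_text_file(path):
    """Return (count, dims) if the header matches the body, else raise ValueError.

    Hypothetical helper for debugging a converted file; not a Gensim API.
    """
    with open(path, 'r', encoding='utf-8') as f:
        header = f.readline().split()
        if len(header) != 2:
            raise ValueError('header should be "<count> <dims>"')
        count, dims = int(header[0]), int(header[1])

        seen = 0
        # Data rows start on line 2, right after the header
        for line_no, line in enumerate(f, start=2):
            parts = line.split()
            if not parts:
                raise ValueError('empty line at line %d' % line_no)
            if len(parts) - 1 != dims:
                raise ValueError('line %d has %d values, expected %d'
                                 % (line_no, len(parts) - 1, dims))
            seen += 1

    if seen != count:
        raise ValueError('header says %d vectors, file has %d' % (count, seen))
    return count, dims
```

Running check_word2vec_text_file('path/to/word2vec-format.txt') either returns the count and vector size or points at the first line that looks wrong.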