How to load embeddings (in a tsv file) generated by StarSpace

1 Answer

You can convert the tsv file produced by a trained StarSpace model into a txt file in the Word2Vec text format, which Gensim is able to import.

The first line of the new txt file must state the line count of the tsv file (first delete any empty lines at the end of the file) and the vector size (dimensions), separated by a space. The rest of the file is the same as the original tsv file, but with spaces instead of tabs.

The Python code to convert the file would then look something like this:

with open('path/to/starspace-model.tsv', 'r') as inp, open('path/to/word2vec-format.txt', 'w') as outp:
    line_count = '...'    # line count of the tsv file (as string)
    dimensions = '...'    # vector size (as string)
    outp.write(' '.join([line_count, dimensions]) + '\n')
    for line in inp:
        words = line.strip().split()
        outp.write(' '.join(words) + '\n')
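
If you would rather not fill in the line count and vector size by hand, both can be derived from the tsv file itself. The following is only a sketch: the function name convert_starspace_tsv and the toy file contents are made up for illustration, and it assumes every line is one token (without whitespace in it) followed by its vector values. It reads the whole file into memory, so it may not suit very large models.

```python
import tempfile
from pathlib import Path


def convert_starspace_tsv(tsv_path, out_path):
    """Write a Word2Vec-format text file from a StarSpace tsv file.

    Hypothetical helper; returns (line_count, dimensions) as written
    to the header line.
    """
    rows = []
    with open(tsv_path, "r", encoding="utf-8") as inp:
        for line in inp:
            fields = line.strip().split()
            if fields:  # skip blank lines so the count stays accurate
                rows.append(fields)
    if not rows:
        raise ValueError("no embeddings found in " + str(tsv_path))

    # Assumption: each row is one token followed by its vector components.
    dimensions = len(rows[0]) - 1
    for fields in rows:
        if len(fields) - 1 != dimensions:
            raise ValueError("inconsistent vector size for " + fields[0])

    with open(out_path, "w", encoding="utf-8") as outp:
        outp.write(f"{len(rows)} {dimensions}\n")
        for fields in rows:
            outp.write(" ".join(fields) + "\n")
    return len(rows), dimensions


# Toy data (three tokens, two dimensions, one trailing blank line).
workdir = Path(tempfile.mkdtemp())
tsv_file = workdir / "toy-model.tsv"
txt_file = workdir / "toy-word2vec.txt"
tsv_file.write_text(
    "cat\t0.1\t0.2\ndog\t0.3\t0.4\n__label__pet\t0.5\t0.6\n\n",
    encoding="utf-8",
)
header = convert_starspace_tsv(tsv_file, txt_file)
converted = txt_file.read_text(encoding="utf-8")
```

Here the trailing blank line is skipped automatically, so the header line of the toy output reads "3 2".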

You can then import the new file into Gensim like so:

from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('path/to/word2vec-format.txt', binary=False)

I used Gensim's word_vectors.similarity function to check if the model loaded correctly, and it seemed to work for me. Hope this helps!
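
As a cross-check for that similarity test, the same number can be computed by hand from the converted file, since Gensim's similarity is the cosine similarity between the two vectors. This is a rough sketch using only the standard library; read_vectors, cosine_similarity and the sample lines are illustrative names and data, not part of Gensim or StarSpace.

```python
import math


def read_vectors(lines):
    """Parse Word2Vec text-format lines into a {token: [floats]} dict."""
    rows = iter(lines)
    count, dimensions = (int(x) for x in next(rows).split())
    vectors = {}
    for line in rows:
        fields = line.split()
        if fields:
            vectors[fields[0]] = [float(x) for x in fields[1:]]
    # The header must agree with what is actually in the file.
    if len(vectors) != count or any(len(v) != dimensions for v in vectors.values()):
        raise ValueError("header does not match file contents")
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up sample in the converted format (header line, then one token per line).
sample = ["3 2", "cat 1.0 0.0", "dog 0.0 1.0", "kitten 2.0 0.0"]
vectors = read_vectors(sample)
same_direction = cosine_similarity(vectors["cat"], vectors["kitten"])
orthogonal = cosine_similarity(vectors["cat"], vectors["dog"])
```

With a real file you could pass an open file handle to read_vectors instead of the sample list and compare the result for two tokens against word_vectors.similarity; the values should agree up to floating-point precision.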

188

answered Nov 17 '22 02:11

Sascha

Tags:

word-embedding

gensim