
Word2vec Gensim Accuracy Analysis

I'm working on an NLP application where I have a corpus of text files. I would like to create word vectors using Gensim's word2vec implementation.

I did a 90%/10% train/test split. I trained the model on the training set, and now I would like to assess its accuracy on the test set.

I have searched the internet for documentation on accuracy assessment, but I could not find any method that does this. Does anyone know of a function that performs accuracy analysis?

The way I processed my test data was to extract all the sentences from the text files in the test folder and turn them into one giant list of sentences. After that, I used a function that I thought was the right one (it turns out it wasn't, as it gave me this error: TypeError: don't know how to handle uri). Here is how I went about doing this:

import glob
import io
import re
import nltk

# Sentence tokenizer (assumes NLTK's punkt model: nltk.download('punkt'))
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

test_filenames = glob.glob('./testing/*.txt')
print("Found corpus of %s safety/incident reports:" % len(test_filenames))

# readlines() returns a list, and unicode() on a list yields its repr;
# read each file's full text as unicode instead
test_corpus_raw = u""
for text_file in test_filenames:
    with io.open(text_file, 'r', encoding='utf-8') as txt_file:
        test_corpus_raw += txt_file.read()
print("Test Corpus is now {0} characters long".format(len(test_corpus_raw)))

test_raw_sentences = tokenizer.tokenize(test_corpus_raw)

def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

test_sentences = []
for raw_sentence in test_raw_sentences:
    if len(raw_sentence) > 0:
        test_sentences.append(sentence_to_wordlist(raw_sentence))

test_token_count = sum(len(sentence) for sentence in test_sentences)
print("The test corpus contains {0:,} tokens".format(test_token_count))


####### THIS LAST LINE PRODUCES AN ERROR: TypeError: don't know how to handle uri 
texts2vec.wv.accuracy(test_sentences, case_insensitive=True)

I have no idea how to fix this last part. Please help. Thanks in advance!

asked Oct 10 '18 by Sam




1 Answer

The accuracy() method of a gensim word-vectors model (now deprecated in favor of evaluate_word_analogies()) doesn't take your texts as input - it requires a specially formatted file of word-analogy challenges. This file is often named questions-words.txt.
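The file is plain text, with `:`-prefixed section headers followed by four-words-per-line analogies, for example:

: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand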

This is a popular way to test general-purpose word-vectors, going back to the original Word2Vec paper and code-release from Google.
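For example, with a recent gensim (3.4 or later), a minimal sketch of running that evaluation - assuming your trained model is the `texts2vec` from the question, and using the copy of questions-words.txt bundled with gensim's test data:

from gensim.test.utils import datapath

# evaluate_word_analogies() takes a path to an analogy file, not a list
# of sentences; it returns (overall_accuracy, per_section_details)
score, sections = texts2vec.wv.evaluate_word_analogies(datapath('questions-words.txt'))
print("Analogy accuracy: {0:.2%}".format(score))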

However, this evaluation doesn't necessarily indicate which word-vectors will be best for your needs. (For example, it's possible for a set of word-vectors to score better on these kinds of analogies, but be worse for a specific classification or info-retrieval goal.)

To get good vectors for your own purposes, you should devise a task-specific evaluation that gives a score correlated with success on your final goal.
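For instance, one hand-rolled probe would be to score each candidate model on a small set of domain word pairs that a good model should place close together. The pairs and names below are illustrative assumptions, not from the question:

probe_pairs = [('fire', 'smoke'), ('injury', 'accident'), ('ladder', 'fall')]

def probe_score(wv, pairs):
    # Mean cosine similarity over the pairs present in the vocabulary
    sims = [wv.similarity(a, b) for a, b in pairs if a in wv and b in wv]
    return sum(sims) / len(sims) if sims else 0.0

# Higher is better; compare this score across candidate models
print(probe_score(texts2vec.wv, probe_pairs))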

Also, note that as the product of an unsupervised algorithm, word-vectors don't necessarily need a held-out test set to be evaluated. You generally want to use as much data as possible to train the word-vectors - ensuring maximal vocabulary coverage, with the most examples per word. Then you can test the word-vectors against some external standard, like the analogy questions, which weren't part of the training set at all.

Or, you'd use the word-vectors as an additional input to some downstream task, and withhold a test set from the data used to train that task's supervised algorithm. That ensures the supervised method isn't just memorizing/overfitting its labeled inputs, and gives you an indirect quality signal about whether that word-vector set helped the downstream task or not. (Different word-vector sets can then be compared by how well they help that supervised task - not by their own unsupervised training step.)
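A minimal sketch of that pattern, assuming scikit-learn and hypothetical labeled documents `docs` (lists of tokens) with matching `labels` - none of these names come from the question:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def doc_vector(wv, tokens):
    # Represent a document as the mean of its in-vocabulary word-vectors
    vecs = [wv[w] for w in tokens if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

X = np.array([doc_vector(texts2vec.wv, doc) for doc in docs])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.1)

clf = LogisticRegression().fit(X_train, y_train)
# Held-out accuracy is an indirect quality signal for the word-vectors
print(clf.score(X_test, y_test))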

answered Oct 23 '22 by gojomo