
Word2vec Gensim Accuracy Analysis

I'm working on an NLP application where I have a corpus of text files. I would like to create word vectors using Gensim's word2vec implementation.

I did a 90%/10% train/test split. I trained the model on the training set, and now I would like to assess its accuracy on the test set.

I have searched the internet for documentation on accuracy assessment, but I could not find any method that does this. Does anyone know of a function that performs accuracy analysis?

The way I processed my test data was to extract all the sentences from the text files in the test folder and turn them into one giant list of sentences. After that, I used a function that I thought was the right one (it turns out it wasn't, as it gave me this error: TypeError: don't know how to handle uri). Here is how I went about doing this:

import glob
import io
import re
import nltk

# Sentence tokenizer (assumes NLTK's punkt model: nltk.download('punkt'))
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

test_filenames = glob.glob('./testing/*.txt')
print("Found corpus of %s safety/incident reports:" % len(test_filenames))

# readlines() returns a list, and unicode() on a list yields its repr;
# read each file's full text as unicode instead
test_corpus_raw = u""
for text_file in test_filenames:
    with io.open(text_file, 'r', encoding='utf-8') as txt_file:
        test_corpus_raw += txt_file.read()
print("Test Corpus is now {0} characters long".format(len(test_corpus_raw)))

test_raw_sentences = tokenizer.tokenize(test_corpus_raw)

def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

test_sentences = []
for raw_sentence in test_raw_sentences:
    if len(raw_sentence) > 0:
        test_sentences.append(sentence_to_wordlist(raw_sentence))

test_token_count = sum(len(sentence) for sentence in test_sentences)
print("The test corpus contains {0:,} tokens".format(test_token_count))


####### THIS LAST LINE PRODUCES AN ERROR: TypeError: don't know how to handle uri 
texts2vec.wv.accuracy(test_sentences, case_insensitive=True)

I have no idea how to fix this last part. Please help. Thanks in advance!

asked Oct 10 '18 by Sam




1 Answer

The accuracy() method of a gensim word-vectors model (now deprecated in favor of evaluate_word_analogies()) doesn't take your texts as input - it requires a specially formatted file of word-analogy challenges. This file is often named questions-words.txt.
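The file is plain text, with `:`-prefixed section headers followed by four-words-per-line analogies, for example:

: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand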

This is a popular way to test general-purpose word-vectors, going back to the original Word2Vec paper and code-release from Google.
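For example, with a recent gensim (3.4 or later), a minimal sketch of running that evaluation - assuming your trained model is the `texts2vec` from the question, and using the copy of questions-words.txt bundled with gensim's test data:

from gensim.test.utils import datapath

# evaluate_word_analogies() takes a path to an analogy file, not a list
# of sentences; it returns (overall_accuracy, per_section_details)
score, sections = texts2vec.wv.evaluate_word_analogies(datapath('questions-words.txt'))
print("Analogy accuracy: {0:.2%}".format(score))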

However, this evaluation doesn't necessarily indicate which word-vectors will be best for your needs. (For example, it's possible for a set of word-vectors to score better on these kinds of analogies, but be worse for a specific classification or info-retrieval goal.)

To get good vectors for your own purposes, you should devise a task-specific evaluation that gives a score correlated with success on your final goal.
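For instance, one hand-rolled probe would be to score each candidate model on a small set of domain word pairs that a good model should place close together. The pairs and names below are illustrative assumptions, not from the question:

probe_pairs = [('fire', 'smoke'), ('injury', 'accident'), ('ladder', 'fall')]

def probe_score(wv, pairs):
    # Mean cosine similarity over the pairs present in the vocabulary
    sims = [wv.similarity(a, b) for a, b in pairs if a in wv and b in wv]
    return sum(sims) / len(sims) if sims else 0.0

# Higher is better; compare this score across candidate models
print(probe_score(texts2vec.wv, probe_pairs))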

Also, note that as the product of an unsupervised algorithm, word-vectors don't necessarily need a held-out test set to be evaluated. You generally want to use as much data as possible to train the word-vectors - ensuring maximal vocabulary coverage, with the most examples per word. Then you can test the word-vectors against some external standard, like the analogy questions, which weren't part of the training set at all.

Or, you'd use the word-vectors as an additional input to some downstream task, and withhold a test set from the data used to train that task's supervised algorithm. That ensures the supervised method isn't just memorizing/overfitting its labeled inputs, and gives you an indirect quality signal about whether that word-vector set helped the downstream task or not. (Different word-vector sets can then be compared by how well they help that supervised task - not by their own unsupervised training step.)
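A minimal sketch of that pattern, assuming scikit-learn and hypothetical labeled documents `docs` (lists of tokens) with matching `labels` - none of these names come from the question:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def doc_vector(wv, tokens):
    # Represent a document as the mean of its in-vocabulary word-vectors
    vecs = [wv[w] for w in tokens if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

X = np.array([doc_vector(texts2vec.wv, doc) for doc in docs])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.1)

clf = LogisticRegression().fit(X_train, y_train)
# Held-out accuracy is an indirect quality signal for the word-vectors
print(clf.score(X_test, y_test))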

answered Oct 23 '22 by gojomo