
spaCy similarity warning: "Evaluating Doc.similarity based on empty vectors."

I'm trying to do data augmentation on a FAQ dataset. I replace words, specifically nouns, with their most similar words from WordNet, checking the similarity with spaCy. I use nested for loops to go through my dataset.

import spacy
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

nlp = spacy.load('en_core_web_md')
nltk.download('wordnet')
questions = pd.read_csv("FAQ.csv")

list_questions = []
for question in questions.values:
    list_questions.append(nlp(question[0]))

for question in list_questions: 
    for token in question:
        treshold = 0.5
        if token.pos_ == 'NOUN':
            wordnet_syn = wn.synsets(str(token), pos=wn.NOUN)  
            for syn in wordnet_syn:
                for lemma in syn.lemmas():
                    similar_word = nlp(lemma.name())
                    if similar_word.similarity(token) != 1. and similar_word.similarity(token) > treshold:
                        good_word = similar_word
                        treshold = token.similarity(similar_word)

However, the following warning is printed several times and I don't understand why:

UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.

The call similar_word.similarity(token) is what triggers it, but I don't understand why. My list_questions looks like this:

list_questions = [Do you have a paper or other written explanation to introduce your model's details?, Where is the BERT code come from?, How large is a sentence vector?]

I need to check the token as well as similar_word inside the loop. For example, I still get the warning here:

tokens = nlp(u'dog cat unknownword')
similar_word = nlp(u'rabbit')

if(similar_word):
    for token in tokens:
        if (token):
            print(token.text, similar_word.similarity(token))

Jonor asked Apr 30 '19 12:04


2 Answers

You get that warning when similar_word is not a valid spaCy document, for example when it is empty. This is a minimal reproducible example:

import spacy

nlp = spacy.load('en_core_web_md')  # make sure to use larger model!
tokens = nlp(u'dog cat')
#similar_word = nlp(u'rabbit')
similar_word = nlp(u'')

for token in tokens:
  print(token.text, similar_word.similarity(token))

If you change the '' to be 'rabbit' it works fine. (Cats are apparently just a fraction more similar to rabbits than dogs are!)

(UPDATE: As you point out, unknown words also trigger the warning; they will be valid spacy objects, but not have any word vector.)
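
To see why a missing vector matters: similarity here is a cosine between two vectors, and a cosine is undefined when either vector has zero length. The sketch below is plain Python, not spaCy's own code; cosine_or_none is a hypothetical helper and the vectors are made up for illustration. It returns None in the case where, as far as I can tell, spaCy emits W008.

```python
import math

def cosine_or_none(a, b):
    """Cosine similarity of two equal-length vectors, or None if either is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # No direction to compare: this is the "empty vector" situation.
        return None
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)

known = [1.0, 2.0, 2.0]      # made-up vector for a word the model knows
other = [2.0, 4.0, 4.0]      # same direction as `known`
unknown = [0.0, 0.0, 0.0]    # what an out-of-vocabulary word looks like

print(cosine_or_none(known, other))
print(cosine_or_none(known, unknown))
```

An unknown word gets an all-zero vector, so the guard fires for it just as it does for an empty document.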

So, one fix would be to check similar_word is valid, including having a valid word vector, before calling similarity():

import spacy

nlp = spacy.load('en_core_web_md')  # make sure to use larger model!
tokens = nlp(u'dog cat')
similar_word = nlp(u'')

if(similar_word and similar_word.vector_norm):
  for token in tokens:
    if(token and token.vector_norm):
      print(token.text, similar_word.similarity(token))

Alternative Approach:

You could suppress that particular warning, W008. I believe setting the environment variable SPACY_WARNING_IGNORE=W008 before running your script would do it. (Not tested.)

(See source code)
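
If the environment variable has no effect, it may be a version issue: as far as I know, spaCy 2.3 and later emit their warnings through Python's standard warnings module instead of reading SPACY_WARNING_IGNORE. A sketch of filtering on the message text in that case. The warning is simulated with warnings.warn so the snippet runs without spaCy, and noisy_similarity and count_warnings are hypothetical helpers.

```python
import warnings

W008_TEXT = "[W008] Evaluating Doc.similarity based on empty vectors."

def noisy_similarity():
    """Hypothetical stand-in for a similarity() call on an empty vector."""
    warnings.warn(W008_TEXT, UserWarning)
    return 0.0

def count_warnings(ignore_w008):
    """Call noisy_similarity() once and return how many warnings got through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        if ignore_w008:
            # The pattern is a regex matched against the start of the message.
            warnings.filterwarnings("ignore", message=r"\[W008\]")
        noisy_similarity()
    return len(caught)

print(count_warnings(ignore_w008=False))
print(count_warnings(ignore_w008=True))
```

In a real script, a single warnings.filterwarnings("ignore", message=r"\[W008\]") near the top, before the similarity calls, would play the same role.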


By the way, similarity() might cause some CPU load, so it is worth storing its result in a variable instead of calculating it three times as you currently do. (Some people might argue that is premature optimization, but I think it might also make the code more readable.)
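
A sketch of that idea: compute each similarity once, keep it in a variable, and track the best candidate seen so far. best_candidate and the toy score table are hypothetical and exist only so the snippet runs on its own; with spaCy you would pass something like lambda a, b: a.similarity(b) as the similarity argument.

```python
def best_candidate(target, candidates, similarity, threshold=0.5):
    """Return (candidate, score) for the highest score above threshold.

    Exact matches (score == 1.0) are skipped, as in the question's loop.
    Returns (None, threshold) when nothing qualifies.
    """
    best, best_score = None, threshold
    for candidate in candidates:
        score = similarity(target, candidate)  # computed once per candidate
        if score != 1.0 and score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Made-up scores standing in for real word-vector similarities to "dog".
toy_scores = {"hound": 0.8, "dog": 1.0, "puppy": 0.9, "table": 0.1}

def toy_similarity(target, candidate):
    # Ignores `target`; a real implementation would compare both arguments.
    return toy_scores[candidate]

print(best_candidate("dog", ["hound", "dog", "puppy", "table"], toy_similarity))
```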

Darren Cook answered Nov 19 '22 16:11


I suppressed the W008 warning by setting the environment variable with this code in my run file:

import os
from flask import Flask

# Set the variable early, before spaCy is imported, so it can take effect.
os.environ["SPACY_WARNING_IGNORE"] = "W008"

app = Flask(__name__)
app.config['SPACY_WARNING_IGNORE'] = "W008"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)

Ferdous Wahid answered Nov 19 '22 16:11