spacy Entity Linking - Word Vectors

I am very confused about how word vectors work, specifically with regard to spacy's entity linking (https://spacy.io/usage/training#entity-linker).

When adding an entity to a knowledge base, one of the parameters is the entity_vector. How do you get this? I have tried doing

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load('en_core_web_sm')
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=96)
for n in my_entities:
    kb.add_entity(entity=n, freq=___, entity_vector=nlp(n).vector)

The nlp(n).vector call gives me vectors of length 96, and so that's what I use for entity_vector_length, although in the example they use 3. I am just wondering if my approach is okay, but I am kind of confused all around about this.

formicaman asked Oct 16 '22 07:10
1 Answer

We'll have to document this better, but let me try to explain: the KnowledgeBase stores pretrained entity vectors. These vectors are condensed versions of the descriptions of the entities. While such a description can be one or multiple words (varying length), its vector should always have a fixed size. A length of 3 is unrealistic; something like 64 or 96 makes more sense. With that, each entity description is mapped into a 96D space, so that we can use these descriptions in further downstream neural networks.

As shown in the example you linked, you can use the EntityEncoder to create this mapping of a multi-word description to a 96D vector, and you can play around with the length of the embeddings. Larger embeddings mean that you can capture more information, but will also require more storage.

The creation of these embedding vectors for the entity descriptions is done as an offline step, once, when creating the KnowledgeBase. Then when you actually want to train a neural network to do entity linking, the size of that network will depend on the size you've chosen for your description embeddings.
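To make that dependency concrete, here is a minimal sketch (not spaCy's actual architecture) of why the network size is tied to the embedding size. The scoring layer, the variable names, and the single-linear-layer setup are all assumptions for illustration: the only point is that choosing a 96D description embedding fixes the input width of whatever network consumes it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption for illustration: a toy linking model that scores an
# (entity, context) pair with one linear layer. The chosen
# description-embedding size fixes the input width of that layer.
ENT_DIM = 96
W = rng.standard_normal((ENT_DIM * 2, 1))  # hypothetical scoring weights

entity_vec = rng.standard_normal(ENT_DIM)   # pretrained description embedding
context_vec = rng.standard_normal(ENT_DIM)  # encoding of the mention's context
score = np.concatenate([entity_vec, context_vec]) @ W
print(score.shape)  # (1,)
```

If you later decide to re-encode your descriptions at a different length, the weight matrix shapes change with it, which is why the encoding is fixed once, up front.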

Intuitively, the "entity embeddings" are a sort of averaged, condensed version of the word vectors of all the words in the entity's description.

Also, I don't know if you've seen this, but if you're looking for a more realistic way of running Entity Linking, you can check out the scripts for processing Wikipedia & Wikidata here.

Sofie VL answered Oct 21 '22 09:10