I'm working with some domain-specific language which has a lot of OOV words as well as some typos. I have noticed spaCy will just assign an all-zero vector for these OOV words, so I'm wondering what's the proper way to handle this. I would appreciate clarification on all of these points if possible:
Pre-train the “token to vector” (tok2vec) layer of pipeline components, using an approximate language-modeling objective. Specifically, we load pretrained vectors, and train a component like a CNN, BiLSTM, etc to predict vectors which match the pretrained ones
Isn't the tok2vec the part that generates the vectors? So shouldn't this command change the produced vectors? What does it mean to load pretrained vectors and then train a component to predict these vectors? What's the purpose of doing this?
What does the --use-vectors flag do? What does the --init-tok2vec flag do? Is this included by mistake in the documentation?
It seems pretrain is not what I'm looking for, since it doesn't change the vectors for a given word. What would be the easiest way to generate a new set of vectors which includes my OOV words but still contains the general knowledge of the language?
As far as I can see, spaCy's pretrained models use fastText vectors. The fastText website mentions:
A nice feature is that you can also query for words that did not appear in your data! Indeed words are represented by the sum of its substrings. As long as the unknown word is made of known substrings, there is a representation of it!
But it seems spaCy does not use this feature. Is there a way to still make use of it for OOV words?
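For example, with the standalone fasttext Python package (just a sketch on my side, assuming the official cc.en.300.bin model has been downloaded), even a made-up word gets a non-zero vector built from its subwords:

    import numpy as np
    import fasttext

    # assumes the pretrained English vectors were downloaded beforehand
    ft = fasttext.load_model("cc.en.300.bin")

    vec = ft.get_word_vector("asparagusology")  # not in the vocabulary
    print(vec.shape, np.linalg.norm(vec))       # 300-dim vector with a non-zero norm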
Thanks a lot
There are many techniques to handle out-of-vocabulary words. Typically, a special out-of-vocabulary token is added to the language model. Often the first occurrence of a word is treated as the out-of-vocab token, to ensure the out-of-vocab token occurs somewhere in the training data and gets a positive probability.
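A minimal sketch of that idea (purely illustrative, not spaCy-specific; the names are my own):

    from collections import Counter

    UNK = "<unk>"

    def build_vocab(tokens, min_count=2):
        # keep only words seen at least min_count times;
        # everything rarer will be mapped to the UNK token
        counts = Counter(tokens)
        return {w for w, c in counts.items() if c >= min_count}

    def replace_oov(tokens, vocab):
        return [t if t in vocab else UNK for t in tokens]

    corpus = "the cat sat on the mat the dog sat".split()
    vocab = build_vocab(corpus)
    print(replace_oov(corpus, vocab))
    # rare words become '<unk>', so the language model sees the unknown
    # token during training and assigns it a positive probability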
Word2vec is a technique for natural language processing published in 2013 by researcher Tomáš Mikolov. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.
In natural language processing (NLP), word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning.
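A quick sketch of what "closer in the vector space" means in practice (assuming the en_core_web_md model is installed):

    import spacy

    nlp = spacy.load("en_core_web_md")
    doc = nlp("cat dog banana")

    # words with related meanings end up closer in the vector space
    print(doc[0].similarity(doc[1]))  # cat vs dog: relatively high
    print(doc[0].similarity(doc[2]))  # cat vs banana: lower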
Out-of-vocabulary (OOV) words are terms that are not part of the normal lexicon found in a natural language processing environment. In speech recognition, these are terms that occur in the audio signal but not in the recognizer's vocabulary. Word vectors are a numerical representation of word meaning.
I think there is some confusion about the different components - I'll try to clarify:
1. The nlp model in spaCy can have predefined (static) word vectors that are accessible on the Token level. Every token with the same Lexeme gets the same vector. Some tokens/lexemes may indeed be OOV, like misspellings. If you want to redefine/extend all vectors used in a model, you can use something like init-model (init vectors in spaCy v3) - see also the sketch right after this list.

2. The tok2vec layer is a machine learning component that learns how to produce suitable (dynamic) vectors for tokens. It does this by looking at lexical attributes of the token, but may also include the static vectors of the token (cf item 1). This component is generally not used by itself, but is part of another component, such as an NER. It will be the first layer of the NER model, and it can be trained as part of training the NER, to produce vectors that are suitable for your NER task.

3. In spaCy v2, you can first train a tok2vec component with pretrain, and then use this component in a subsequent train command. Note that all settings need to be the same across both commands, for the layers to be compatible.
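For instance, static vectors can also be added or overridden programmatically through Vocab.set_vector (a small sketch - the word and the random vector are placeholders for e.g. vectors taken from fastText; init-model / init vectors does the same thing in bulk from a vectors file):

    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_md")

    # placeholder: in practice this would be a meaningful vector for the
    # domain-specific or misspelled word, e.g. a fastText subword vector
    new_vector = np.random.uniform(-1, 1, (300,)).astype("float32")

    nlp.vocab.set_vector("mydomainword", new_vector)

    doc = nlp("mydomainword")
    print(doc[0].has_vector)  # True - the token now has a static vector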
To answer your questions:
Isn't the tok2vec the part that generates the vectors?
If you mean the static vectors, then no. The tok2vec component produces new vectors (possibly with a different dimension) on top of the static vectors, but it won't change the static ones.
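To make the distinction concrete, here is a quick sketch (assuming a spaCy v2 model such as en_core_web_md, where the dynamic tok2vec output of the pipeline is exposed as doc.tensor):

    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_md")
    doc = nlp("asparagus asparagusology")  # the second token is presumably OOV

    for i, token in enumerate(doc):
        # token.vector is the static vector (all zeros for an OOV token);
        # doc.tensor[i] is the dynamic tok2vec output for the same token
        print(token.text, token.has_vector,
              np.linalg.norm(token.vector), np.linalg.norm(doc.tensor[i]))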
What does it mean loading pretrained vectors and then train a component to predict these vectors? What's the purpose of doing this?
The purpose is to get a tok2vec component that is already pretrained from external vectors data. The external vectors data already embeds some "meaning" or "similarity" of the tokens, and this is - so to say - transferred into the tok2vec component, which learns to produce the same similarities. The point is that this new tok2vec component can then be used & further fine-tuned in the subsequent train command (cf item 3).
Is there a way to still make use of this for OOV words?
It really depends on what your "use" is. As https://stackoverflow.com/a/57665799/7961860 mentions, you can set the vectors yourself, or you can implement a user hook which will decide on how to define token.vector.
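As a rough sketch of the user-hook route in spaCy v2 (the oov_vector helper is a placeholder - it could for example call into the fastText subword lookup mentioned in the question):

    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_md")

    def oov_vector(word):
        # placeholder: return a meaningful vector for an unknown word,
        # e.g. ft.get_word_vector(word) from a fastText model
        return np.zeros((300,), dtype="float32")

    def vector_with_fallback(token):
        lex = token.vocab[token.orth]
        if lex.has_vector:
            return lex.vector          # keep the static vector when it exists
        return oov_vector(token.text)  # fall back for OOV tokens

    def oov_vector_hook(doc):
        # override how token.vector is computed for tokens of this doc
        doc.user_token_hooks["vector"] = vector_with_fallback
        return doc

    nlp.add_pipe(oov_vector_hook, last=True)  # spaCy v2-style add_pipe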
I hope this helps. I can't really recommend the best approach for you to follow, without understanding why you want the OOV vectors / what your use-case is. Happy to discuss further in the comments!