I'm trying to train an NER model using spaCy to identify locations, (person) names, and organisations, and I'd like to understand how spaCy recognises entities in text, but I've not been able to find an answer. From this issue on GitHub and this example, it appears that spaCy uses a number of features such as POS tags, prefixes, suffixes, and other character- and word-based features to train an Averaged Perceptron.
However, nowhere in the code does it appear that spaCy uses the GloVe embeddings (although each word in the sentence/document appears to have them, if present in the GloVe corpus).
My question is: is spaCy using the word vectors? I've tried looking through the Cython code, but I'm not able to understand whether the labelling system uses word embeddings.
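For what it's worth, this is the kind of quick check I've been running to confirm the vectors are at least present in the pipeline (it assumes a model that ships with static vectors, e.g. en_core_web_md; the small model has no vector table):

```python
import spacy

# Assumes a pipeline with static vectors, e.g. en_core_web_md
nlp = spacy.load("en_core_web_md")
doc = nlp("Sundar Pichai works at Google in California.")

print(nlp.vocab.vectors.shape)  # (number of rows, vector dimensions)
for token in doc:
    print(token.text, token.has_vector, token.vector[:3])
```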
spaCy comes with an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens. It also lets you add arbitrary classes to the entity recogniser and update the model with new examples beyond the entity types it was originally trained on, as sketched below.
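As a rough sketch of what that looks like with the spaCy v3 training API (the label "GADGET", the example sentence and the character offsets are made up for illustration; a real update needs many annotated examples):

```python
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")

# Register a new, made-up entity label
ner.add_label("GADGET")

# One toy example; real training needs many annotated sentences
text = "I just bought a Framework laptop"
annotations = {"entities": [(16, 32, "GADGET")]}
example = Example.from_dict(nlp.make_doc(text), annotations)

# Continue training the existing weights with the new example
optimizer = nlp.resume_training()
nlp.update([example], sgd=optimizer)

print([(ent.text, ent.label_) for ent in nlp("Apple opened an office in Dublin.").ents])
```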
spaCy is a natural language processing (NLP) library for Python designed for fast performance, and with word embedding models built in, it's well suited to a quick and easy start.
You can train a word vectors table using tools such as floret, Gensim, FastText or GloVe, or download existing pretrained vectors. The `init vectors` command lets you convert vectors for use with spaCy and will give you a directory you can load or refer to in your training configs.
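For example, converting a downloaded vectors file might look roughly like this (the file and directory names are placeholders for your own paths):

```python
# Shell command to convert a vectors file for spaCy (placeholder file names):
#   python -m spacy init vectors en glove.840B.300d.txt ./my_vectors

import spacy

# The output directory is a loadable pipeline containing the vocab and vectors
nlp = spacy.load("./my_vectors")
print(nlp.vocab.vectors.shape)
print(nlp("coffee")[0].vector[:5])
```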
Which learning algorithm does spaCy use? spaCy has its own deep learning library called Thinc, which is used under the hood for its NLP models. For most (if not all) tasks, spaCy uses a deep neural network based on a CNN with a few tweaks.
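If you want to peek at that network yourself, each trainable pipe wraps a Thinc Model whose name and sub-layers give a rough picture of the architecture (the exact layout varies across spaCy and model versions):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")

# Inspect the Thinc model behind the NER pipe
print(ner.model.name)
print([layer.name for layer in ner.model.layers])
```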
spaCy does use word embeddings for its NER model, which is a multilayer CNN. There's quite a nice video by Matthew Honnibal, the creator of spaCy, about how its NER works here. All three English models use GloVe vectors trained on Common Crawl, but the smaller models "prune" the number of vectors by mapping similar words to the same vector.
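You can see the pruning in the vector table itself: in a pruned package the number of keys exceeds the number of rows, because several lexemes point at the same row. A small check, using en_core_web_md as an example of a pruned package (whether the two specific words below share a row depends on the package):

```python
import spacy

nlp = spacy.load("en_core_web_md")
vectors = nlp.vocab.vectors

# More keys than rows means several words share a row in the table
print(vectors.n_keys, "keys ->", vectors.shape[0], "rows")

# key2row maps each word's hash to its row; pruned words point at the row of
# a similar word, so their .vector values are identical
row_cat = vectors.key2row.get(nlp.vocab.strings["cat"])
row_kitten = vectors.key2row.get(nlp.vocab.strings["kitten"])
print("cat ->", row_cat, "| kitten ->", row_kitten)
```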
It's quite doable to add custom vectors. There's an overview of the process in the spaCy docs, plus some example code on GitHub.
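A minimal sketch of setting a single custom vector by hand (the word and the random 300-dimensional values are placeholders; for a whole table you'd normally go through `init vectors` as above):

```python
import numpy
import spacy

nlp = spacy.blank("en")

# Attach a hand-made vector to one word; a real table would come from
# trained embeddings
vector = numpy.random.uniform(-1, 1, (300,)).astype("float32")
nlp.vocab.set_vector("llama", vector)

doc = nlp("a llama walked by")
print(doc[1].text, doc[1].has_vector)
print(doc[1].vector[:5])
```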