I would like to use spaCy's POS tagging, NER, and dependency parsing without its word tokenization. My input is a list of tokens representing a sentence, and I would like to respect the user's tokenization. Is this possible at all, either with spaCy or any other NLP package?
For now, I am using this spaCy-based function to put a sentence (a unicode string) into CoNLL format:
import spacy
nlp = spacy.load('en')

def toConll(string_doc, nlp):
    doc = nlp(string_doc)
    block = []
    for i, word in enumerate(doc):
        # the root token points to itself; by CoNLL convention its head index is 0
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - doc[0].i + 1
        head_idx = str(head_idx)
        line = [str(i + 1), word.text, word.lemma_, word.tag_,
                word.ent_type_, head_idx, word.dep_]
        block.append(line)
    return block

conll_format = toConll(u"Donald Trump is the new president of the United States of America", nlp)
Output:
[['1', 'Donald', u'donald', u'NNP', u'PERSON', '2', u'compound'],
['2', 'Trump', u'trump', u'NNP', u'PERSON', '3', u'nsubj'],
['3', 'is', u'be', u'VBZ', u'', '0', u'ROOT'],
['4', 'the', u'the', u'DT', u'', '6', u'det'],
['5', 'new', u'new', u'JJ', u'', '6', u'amod'],
['6', 'president', u'president', u'NN', u'', '3', u'attr'],
['7', 'of', u'of', u'IN', u'', '6', u'prep'],
['8', 'the', u'the', u'DT', u'GPE', '10', u'det'],
['9', 'United', u'united', u'NNP', u'GPE', '10', u'compound'],
['10', 'States', u'states', u'NNP', u'GPE', '7', u'pobj'],
['11', 'of', u'of', u'IN', u'GPE', '10', u'prep'],
['12', 'America', u'america', u'NNP', u'GPE', '11', u'pobj']]
I would like to do the same, but with a list of tokens as input...
You can run spaCy's processing pipeline against already tokenised text. You need to understand, though, that the underlying statistical models have been trained on a reference corpus that was tokenised with a particular strategy, and if your tokenisation strategy differs significantly, you can expect some performance degradation.
Here's how to go about it using spaCy 2.0.5 and Python 3. If using Python 2, you may need to use unicode literals.
import spacy
nlp = spacy.load('en_core_web_sm')

# spaces is a list of booleans indicating whether each token
# is followed by whitespace
# so, create a spaCy Doc from your own tokenisation
doc = spacy.tokens.doc.Doc(
    nlp.vocab, words=['nuts', 'itch'], spaces=[True, False])

# run the standard pipeline against it
for name, proc in nlp.pipeline:
    doc = proc(doc)
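To tie this back to the CoNLL output in the question, the same pre-tokenised Doc can be passed through the extraction loop from toConll. The sketch below is only an illustration: the toConllFromTokens helper is not part of spaCy, it assumes spaCy 2.x as above, and it simply guesses that every token except the last is followed by a space.

def toConllFromTokens(tokens, nlp):
    # assumed whitespace pattern: a space after every token except the last
    spaces = [True] * (len(tokens) - 1) + [False]
    # build a Doc from the user's tokens instead of running spaCy's tokenizer
    doc = spacy.tokens.doc.Doc(nlp.vocab, words=tokens, spaces=spaces)
    # run the tagger, parser and NER over the pre-tokenised Doc
    for name, proc in nlp.pipeline:
        doc = proc(doc)
    block = []
    for i, word in enumerate(doc):
        head_idx = 0 if word.head == word else word.head.i - doc[0].i + 1
        block.append([str(i + 1), word.text, word.lemma_, word.tag_,
                      word.ent_type_, str(head_idx), word.dep_])
    return block

conll_format = toConllFromTokens(
    ['Donald', 'Trump', 'is', 'the', 'new', 'president', 'of',
     'the', 'United', 'States', 'of', 'America'], nlp)

This should produce the same kind of rows as the toConll function in the question, but starting from the user's own token list.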