Does spacy take as input a list of tokens?

I would like to use spaCy's POS tagging, NER, and dependency parsing without its word tokenization. My input is a list of tokens representing a sentence, and I would like to respect the user's tokenization. Is this possible at all, either with spaCy or any other NLP package?

For now, I am using this spacy-based function to put a sentence (a unicode string) in the Conll format:

import spacy

nlp = spacy.load('en')

def toConll(string_doc, nlp):
    doc = nlp(string_doc)
    block = []
    for i, word in enumerate(doc):
        # the root token is its own head; CoNLL uses head index 0 for it
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - doc[0].i + 1
        line = [str(i + 1), str(word), word.lemma_, word.tag_,
                word.ent_type_, str(head_idx), word.dep_]
        block.append(line)
    return block

conll_format = toConll(u"Donald Trump is the new president of the United States of America", nlp)

Output:
[['1', 'Donald', u'donald', u'NNP', u'PERSON', '2', u'compound'],
 ['2', 'Trump', u'trump', u'NNP', u'PERSON', '3', u'nsubj'],
 ['3', 'is', u'be', u'VBZ', u'', '0', u'ROOT'],
 ['4', 'the', u'the', u'DT', u'', '6', u'det'],
 ['5', 'new', u'new', u'JJ', u'', '6', u'amod'],
 ['6', 'president', u'president', u'NN', u'', '3', u'attr'],
 ['7', 'of', u'of', u'IN', u'', '6', u'prep'],
 ['8', 'the', u'the', u'DT', u'GPE', '10', u'det'],
 ['9', 'United', u'united', u'NNP', u'GPE', '10', u'compound'],
 ['10', 'States', u'states', u'NNP', u'GPE', '7', u'pobj'],
 ['11', 'of', u'of', u'IN', u'GPE', '10', u'prep'],
 ['12', 'America', u'america', u'NNP', u'GPE', '11', u'pobj']]

I would like to do the same, but taking a list of tokens as input...

asked Jan 09 '18 by dada

People also ask

How does the spaCy algorithm work?

Which learning algorithm does spaCy use? spaCy has its own deep learning library, thinc, which is used under the hood for its NLP models. For most (if not all) tasks, spaCy uses a deep neural network based on CNNs with a few tweaks.

What are tokens in spaCy?

An individual token, i.e. a word, punctuation symbol, whitespace, etc.
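
For illustration, here is a minimal snippet (assuming the small English model en_core_web_sm is installed) that prints the per-token attributes used in the question's toConll function:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Donald Trump is the new president")
for token in doc:
    # each Token object carries its own annotations
    print(token.text, token.lemma_, token.tag_, token.dep_)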

Is spaCy better than NLTK?

While NLTK provides access to many algorithms to get something done, spaCy provides the best way to do it. It provides the fastest and most accurate syntactic analysis of any NLP library released to date. It also offers access to larger word vectors that are easier to customize.

What does spaCy load return?

Essentially, spacy.load() is a convenience wrapper that reads the pipeline's config.cfg, uses the language and pipeline information to construct a Language object, loads in the model data and weights, and returns it.
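
As a quick sanity check, you can inspect what spacy.load() returns (a sketch assuming en_core_web_sm is installed; the exact component names in the comments vary by model and spaCy version):

import spacy

nlp = spacy.load('en_core_web_sm')  # returns a Language object
print(type(nlp))       # e.g. <class 'spacy.lang.en.English'>
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner']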


1 Answer

You can run spaCy's processing pipeline against already tokenised text. Bear in mind, though, that the underlying statistical models were trained on a reference corpus that was tokenised with a particular strategy; if your tokenisation strategy differs significantly from it, you should expect some performance degradation.

Here's how to go about it using spaCy 2.0.5 and Python 3. If using Python 2, you may need to use unicode literals.

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

# spaces is a list of boolean values indicating whether each token
# is followed by whitespace, so create a spaCy Doc from your own tokenisation
doc = Doc(nlp.vocab, words=['nuts', 'itch'], spaces=[True, False])

# run the standard pipeline components against it
for name, proc in nlp.pipeline:
    doc = proc(doc)
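
Tying this back to the question, here is a minimal sketch (the helper name toConllFromTokens is made up, spaCy 2.x assumed) that feeds a user-supplied token list through the pipeline and reuses the question's CoNLL formatting:

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

def toConllFromTokens(tokens, nlp):
    # build a Doc from the user's tokens instead of re-tokenising
    doc = Doc(nlp.vocab, words=tokens)
    for name, proc in nlp.pipeline:
        doc = proc(doc)
    block = []
    for i, word in enumerate(doc):
        head_idx = 0 if word.head == word else word.head.i - doc[0].i + 1
        block.append([str(i + 1), word.text, word.lemma_, word.tag_,
                      word.ent_type_, str(head_idx), word.dep_])
    return block

print(toConllFromTokens(['Donald', 'Trump', 'is', 'the', 'new', 'president'], nlp))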
answered Oct 19 '22 by adam.ra