I would like to use spaCy's POS tagging, NER, and dependency parsing without its word tokenization. My input is a list of tokens representing a sentence, and I would like to respect the user's tokenization. Is this possible at all, either with spaCy or any other NLP package?
For now, I am using this spaCy-based function to put a sentence (a unicode string) into CoNLL format:
import spacy
nlp = spacy.load('en')

def toConll(string_doc, nlp):
    doc = nlp(string_doc)
    block = []
    for i, word in enumerate(doc):
        # the root token points to itself; by CoNLL convention its head index is 0
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - doc[0].i + 1
        head_idx = str(head_idx)
        line = [str(i + 1), word.text, word.lemma_, word.tag_,
                word.ent_type_, head_idx, word.dep_]
        block.append(line)
    return block

conll_format = toConll(u"Donald Trump is the new president of the United States of America", nlp)
Output:
[['1', 'Donald', u'donald', u'NNP', u'PERSON', '2', u'compound'],
['2', 'Trump', u'trump', u'NNP', u'PERSON', '3', u'nsubj'],
['3', 'is', u'be', u'VBZ', u'', '0', u'ROOT'],
['4', 'the', u'the', u'DT', u'', '6', u'det'],
['5', 'new', u'new', u'JJ', u'', '6', u'amod'],
['6', 'president', u'president', u'NN', u'', '3', u'attr'],
['7', 'of', u'of', u'IN', u'', '6', u'prep'],
['8', 'the', u'the', u'DT', u'GPE', '10', u'det'],
['9', 'United', u'united', u'NNP', u'GPE', '10', u'compound'],
['10', 'States', u'states', u'NNP', u'GPE', '7', u'pobj'],
['11', 'of', u'of', u'IN', u'GPE', '10', u'prep'],
['12', 'America', u'america', u'NNP', u'GPE', '11', u'pobj']]
I would like to do the same, but with a list of tokens as input...
You can run spaCy's processing pipeline against already tokenised text. You need to understand, though, that the underlying statistical models have been trained on a reference corpus that was tokenised with a particular strategy, and if your tokenisation strategy differs significantly, you can expect some performance degradation.
Here's how to go about it using spaCy 2.0.5 and Python 3. If using Python 2, you may need to use unicode literals.
import spacy
nlp = spacy.load('en_core_web_sm')

# spaces is a list of booleans indicating whether each token
# is followed by whitespace
# so, create a spaCy Doc from your own tokenisation
doc = spacy.tokens.doc.Doc(
    nlp.vocab, words=['nuts', 'itch'], spaces=[True, False])

# run the standard pipeline against it
for name, proc in nlp.pipeline:
    doc = proc(doc)
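To tie this back to the CoNLL output in the question, the same pre-tokenised Doc can be passed through the extraction loop from toConll. The sketch below is only an illustration: the toConllFromTokens helper is not part of spaCy, it assumes spaCy 2.x as above, and it simply guesses that every token except the last is followed by a space.

def toConllFromTokens(tokens, nlp):
    # assumed whitespace pattern: a space after every token except the last
    spaces = [True] * (len(tokens) - 1) + [False]
    # build a Doc from the user's tokens instead of running spaCy's tokenizer
    doc = spacy.tokens.doc.Doc(nlp.vocab, words=tokens, spaces=spaces)
    # run the tagger, parser and NER over the pre-tokenised Doc
    for name, proc in nlp.pipeline:
        doc = proc(doc)
    block = []
    for i, word in enumerate(doc):
        head_idx = 0 if word.head == word else word.head.i - doc[0].i + 1
        block.append([str(i + 1), word.text, word.lemma_, word.tag_,
                      word.ent_type_, str(head_idx), word.dep_])
    return block

conll_format = toConllFromTokens(
    ['Donald', 'Trump', 'is', 'the', 'new', 'president', 'of',
     'the', 'United', 'States', 'of', 'America'], nlp)

This should produce the same kind of rows as the toConll function in the question, but starting from the user's own token list.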