I am using spaCy 2.0 with a quoted string as input.
Example string:
"The quoted text 'AA XX' should be tokenized"
and expecting to extract
[The, quoted, text, 'AA XX', should, be, tokenized]
However, I get some strange results while experimenting: the noun chunks and the ents lose one of the quotes.
import spacy
nlp = spacy.load('en')
s = "The quoted text 'AA XX' should be tokenized"
doc = nlp(s)
print([t for t in doc])
print([t for t in doc.noun_chunks])
print([t for t in doc.ents])
Result:
[The, quoted, text, ', AA, XX, ', should, be, tokenized]
[The quoted text 'AA XX]
[AA XX']
What is the best way to achieve what I need?
In spaCy, tokenizing a text into segments of words and punctuation is done in several steps, processing the text from left to right. First, the tokenizer splits the text on whitespace, much like the split() function. Then, for each substring, it checks whether a tokenizer exception rule applies and whether punctuation such as prefixes and suffixes (including quote characters) can be split off.
spaCy breaks your document into tokens automatically when the document is created with the model.
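For illustration, here is a minimal sketch (assuming the 'en' model is installed) showing that contractions are split by an exception rule rather than by whitespace:
import spacy

nlp = spacy.load('en')
# whitespace splitting alone would keep "don't" together; the exception
# rule splits it into "do" and "n't"
print([t.text for t in nlp("We don't split on whitespace only")])
# ['We', 'do', "n't", 'split', 'on', 'whitespace', 'only']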
The dep_ property of each token describes its syntactic relationship with its head; for instance, a dep_ of 'nsubj' means that a token is the nominal subject of its head.
orth is an integer ID for the token's verbatim text as stored in spaCy's vocabulary; the corresponding string is available as orth_.
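A quick sketch of both attributes (the exact dependency labels depend on the model's predictions):
import spacy

nlp = spacy.load('en')
doc = nlp("The cat sat")
for token in doc:
    # dep_ is the label of the arc from the token to its head
    print(token.text, token.dep_, token.head.text)
# typically: The/det/cat, cat/nsubj/sat, sat/ROOT/sat

# orth is the integer ID of the verbatim text, orth_ the string itself
print(doc[1].orth, doc[1].orth_)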
While you could modify the tokenizer and add your own custom prefix, suffix and infix rules that exclude quotes, I'm not sure this is the best solution here.
For your use case, it might make more sense to add a component to your pipeline that merges (certain) quoted strings into one token before the tagger, parser and entity recognizer are called. To accomplish this, you can use the rule-based Matcher and find combinations of tokens surrounded by '. The following pattern looks for one or more alphabetic tokens enclosed in single quotes:
pattern = [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}]
Here's a visual example of the pattern in the interactive matcher demo. To do the merging, you can then set up the Matcher, add the pattern and write a function that takes a Doc object, extracts the matched spans and merges them into one token by calling their .merge method.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matcher.add('QUOTED', None, [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])

def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    return doc

nlp.add_pipe(quote_merger, first=True)  # add it right after the tokenizer

doc = nlp("The quoted text 'AA XX' should be tokenized")
print([token.text for token in doc])
# ['The', 'quoted', 'text', "'AA XX'", 'should', 'be', 'tokenized']
For a more elegant solution, you can also refactor the component as a reusable class that sets up the matcher in its __init__ method (see the docs for examples).
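A minimal sketch of what such a class could look like (the name QuoteMerger is my own, not from the docs):
import spacy
from spacy.matcher import Matcher

class QuoteMerger(object):
    def __init__(self, nlp):
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('QUOTED', None,
                         [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])

    def __call__(self, doc):
        # collect the spans first, then merge, so the match offsets stay valid
        spans = [doc[start:end] for match_id, start, end in self.matcher(doc)]
        for span in spans:
            span.merge()
        return doc

nlp = spacy.load('en')
nlp.add_pipe(QuoteMerger(nlp), first=True)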
If you add the component first in the pipeline, all other components like the tagger, parser and entity recognizer will only get to see the retokenized Doc. That's also why you might want to write more specific patterns that only merge certain quoted strings you care about. In your example, the new token boundaries improve the predictions, but I can also think of many other cases where they don't, especially if the quoted string is longer and contains a significant part of the sentence.
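For example, if you only care about short, all-uppercase codes like 'AA XX', a stricter pattern along these lines (my own example, adjust to your data) leaves other quoted text untouched:
pattern = [{'ORTH': "'"},
           {'IS_UPPER': True},
           {'IS_UPPER': True, 'OP': '?'},
           {'ORTH': "'"}]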