 

spaCy - Tokenization of Hyphenated words

Tags: python, spacy

Good day SO,

I am trying to post-process hyphenated words that are split into separate tokens when they should have been a single token. For example:

Example:

Sentence: "up-scaled"
Tokens: ['up', '-', 'scaled']
Expected: ['up-scaled']

For now, my solution is to use the matcher:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# match an alphabetic token, a hyphen, then another alphabetic token
pattern = [{'IS_ALPHA': True, 'IS_SPACE': False},
           {'ORTH': '-'},
           {'IS_ALPHA': True, 'IS_SPACE': False}]

matcher.add('HYPHENATED', None, pattern)

def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    #print(doc)
    return doc

nlp.add_pipe(quote_merger, first=True)  # add it right after the tokenizer
doc = nlp(text)

However, this causes an unexpected issue in cases like the one below:

Example 2:

Sentence: "I know I will be back - I had a very pleasant time"
Tokens: ['i', 'know', 'I', 'will', 'be', 'back - I', 'had', 'a', 'very', 'pleasant', 'time']
Expected: ['i', 'know', 'I', 'will', 'be', 'back', '-', 'I', 'had', 'a', 'very', 'pleasant', 'time']

Is there a way to handle only words joined by a hyphen with no spaces around it, so that words like 'up-scaled' are matched and merged into a single token, but not '.. back - I ..'?

Thank you very much

EDIT: I have tried the solution posted: Why does spaCy not preserve intra-word-hyphens during tokenization like Stanford CoreNLP does?

However, I didn't use that solution because it resulted in incorrect tokenization of words with apostrophes (') and numbers with decimals:

Sentence: "It's"
Tokens: ["I", "t's"]
Expected: ["It", "'s"]

Sentence: "1.50"
Tokens: ["1", ".", "50"]
Expected: ["1.50"]

That is why I used Matcher instead of trying to edit the regex.

asked Sep 25 '19 by Benji Tan



2 Answers

The Matcher is not really the right tool for this. You should modify the tokenizer instead.

If you want to preserve how everything else is handled and only change the behavior for hyphens, you should modify the existing infix pattern and preserve all the other settings. The current English infix pattern definition is here:

https://github.com/explosion/spaCy/blob/58533f01bf926546337ad2868abe7fc8f0a3b3ae/spacy/lang/punctuation.py#L37-L49
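
If it helps to see what is currently in use, you can also inspect the default infix patterns at runtime. A minimal sketch, assuming spaCy v2.x with the en_core_web_sm model installed:

import spacy

nlp = spacy.load("en_core_web_sm")
# the raw infix pattern strings the default English tokenizer is built from
print(nlp.Defaults.infixes)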

You can add new patterns without defining a custom tokenizer, but there's no way to remove a pattern without defining one. So, commenting out the hyphen pattern and defining a custom tokenizer:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

def custom_tokenizer(nlp):
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r"(?<=[0-9])[+\-\*^](?=[0-9-])",
            r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        ]
    )

    infix_re = compile_infix_regex(infixes)

    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)


nlp = spacy.load("en")
nlp.tokenizer = custom_tokenizer(nlp)
print([t.text for t in nlp("It's 1.50, up-scaled haven't")])
# ['It', "'s", '1.50', ',', 'up-scaled', 'have', "n't"]
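
As a quick check against the second example from the question (a sketch, assuming the custom tokenizer above has been installed), the spaced hyphen should remain a separate token:

print([t.text for t in nlp("I know I will be back - I had a very pleasant time")])
# expected: ['I', 'know', 'I', 'will', 'be', 'back', '-', 'I', 'had', 'a', 'very', 'pleasant', 'time']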

You do need to provide the current prefix/suffix/token_match settings when initializing the new Tokenizer to preserve the existing tokenizer behavior. See also (for German, but very similar): https://stackoverflow.com/a/57304882/461847

Edited to add (since this does seem unnecessarily complicated and you really should be able to redefine the infix patterns without loading a whole new custom tokenizer):

If you have just loaded the model (for v2.1.8) and you haven't called nlp() yet, you can also just replace the tokenizer's infix_finditer without creating a custom tokenizer:

nlp = spacy.load('en')
nlp.tokenizer.infix_finditer = infix_re.finditer
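
Put together, a minimal self-contained sketch of that variant, assuming the same infix list as above (with the hyphen rule dropped) and the en_core_web_sm model:

import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# same infix patterns as in custom_tokenizer above, minus the intra-word hyphen rule
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
infix_re = compile_infix_regex(infixes)

# replace only the infix matcher on the freshly loaded tokenizer
nlp.tokenizer.infix_finditer = infix_re.finditer
print([t.text for t in nlp("It's 1.50, up-scaled haven't")])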

There's a caching bug that should hopefully be fixed in v2.2 that will let this work correctly at any point rather than just with a newly loaded model. (The behavior is extremely confusing otherwise, which is why creating a custom tokenizer has been a better general-purpose recommendation for v2.1.8.)

answered Sep 28 '22 by aab


If nlp = spacy.load('en') throws an error, use nlp = spacy.load("en_core_web_sm") instead.
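
For reference, a minimal sketch of installing and loading that model by its full name (assuming pip and a standard spaCy install):

# download the small English model once from the command line:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")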

answered Sep 28 '22 by Anurag verma