 

spaCy - Tokenization of Hyphenated words

Tags: python, spacy

Good day SO,

I am trying to post-process hyphenated words that are split into separate tokens when they should have been a single token. For example:

Example:

Sentence: "up-scaled"
Tokens: ['up', '-', 'scaled']
Expected: ['up-scaled']

For now, my solution is to use the matcher:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# match an alphabetic token, a hyphen, then another alphabetic token
pattern = [{'IS_ALPHA': True, 'IS_SPACE': False},
           {'ORTH': '-'},
           {'IS_ALPHA': True, 'IS_SPACE': False}]

matcher.add('HYPHENATED', None, pattern)

def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    #print(doc)
    return doc

nlp.add_pipe(quote_merger, first=True)  # add it right after the tokenizer
doc = nlp(text)

However, this causes an unexpected issue in cases like the one below:

Example 2:

Sentence: "I know I will be back - I had a very pleasant time"
Tokens: ['i', 'know', 'I', 'will', 'be', 'back - I', 'had', 'a', 'very', 'pleasant', 'time']
Expected: ['i', 'know', 'I', 'will', 'be', 'back', '-', 'I', 'had', 'a', 'very', 'pleasant', 'time']

Is there a way to handle only words joined by a hyphen with no spaces around it, so that words like 'up-scaled' are matched and merged into a single token, but not '.. back - I ..'?

Thank you very much

EDIT: I have tried the solution posted: Why does spaCy not preserve intra-word-hyphens during tokenization like Stanford CoreNLP does?

However, I didn't use that solution because it resulted in incorrect tokenization of words with apostrophes (') and numbers with decimals:

Sentence: "It's"
Tokens: ["I", "t's"]
Expected: ["It", "'s"]

Sentence: "1.50"
Tokens: ["1", ".", "50"]
Expected: ["1.50"]

That is why I used Matcher instead of trying to edit the regex.

asked Sep 25 '19 by Benji Tan



2 Answers

The Matcher is not really the right tool for this. You should modify the tokenizer instead.

If you want to preserve how everything else is handled and only change the behavior for hyphens, you should modify the existing infix pattern and preserve all the other settings. The current English infix pattern definition is here:

https://github.com/explosion/spaCy/blob/58533f01bf926546337ad2868abe7fc8f0a3b3ae/spacy/lang/punctuation.py#L37-L49
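
If it helps to see what is currently in use, you can also inspect the default infix patterns at runtime. A minimal sketch, assuming spaCy v2.x with the en_core_web_sm model installed:

import spacy

nlp = spacy.load("en_core_web_sm")
# the raw infix pattern strings the default English tokenizer is built from
print(nlp.Defaults.infixes)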

You can add new patterns without defining a custom tokenizer, but there's no way to remove a pattern without defining one. So, commenting out the hyphen pattern and defining a custom tokenizer:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

def custom_tokenizer(nlp):
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r"(?<=[0-9])[+\-\*^](?=[0-9-])",
            r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        ]
    )

    infix_re = compile_infix_regex(infixes)

    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)


nlp = spacy.load("en")
nlp.tokenizer = custom_tokenizer(nlp)
print([t.text for t in nlp("It's 1.50, up-scaled haven't")])
# ['It', "'s", '1.50', ',', 'up-scaled', 'have', "n't"]
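
As a quick check against the second example from the question (a sketch, assuming the custom tokenizer above has been installed), the spaced hyphen should remain a separate token:

print([t.text for t in nlp("I know I will be back - I had a very pleasant time")])
# expected: ['I', 'know', 'I', 'will', 'be', 'back', '-', 'I', 'had', 'a', 'very', 'pleasant', 'time']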

You do need to provide the current prefix/suffix/token_match settings when initializing the new Tokenizer to preserve the existing tokenizer behavior. See also (for German, but very similar): https://stackoverflow.com/a/57304882/461847

Edited to add (since this does seem unnecessarily complicated and you really should be able to redefine the infix patterns without loading a whole new custom tokenizer):

If you have just loaded the model (for v2.1.8) and you haven't called nlp() yet, you can also just replace the tokenizer's infix_finditer without creating a custom tokenizer:

nlp = spacy.load('en')
nlp.tokenizer.infix_finditer = infix_re.finditer
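
Put together, a minimal self-contained sketch of that variant, assuming the same infix list as above (with the hyphen rule dropped) and the en_core_web_sm model:

import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# same infix patterns as in custom_tokenizer above, minus the intra-word hyphen rule
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
infix_re = compile_infix_regex(infixes)

# replace only the infix matcher on the freshly loaded tokenizer
nlp.tokenizer.infix_finditer = infix_re.finditer
print([t.text for t in nlp("It's 1.50, up-scaled haven't")])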

There's a caching bug that should hopefully be fixed in v2.2 that will let this work correctly at any point rather than just with a newly loaded model. (The behavior is extremely confusing otherwise, which is why creating a custom tokenizer has been a better general-purpose recommendation for v2.1.8.)

answered Sep 28 '22 by aab


If nlp = spacy.load('en') throws an error, use nlp = spacy.load("en_core_web_sm") instead.
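
For reference, a minimal sketch of installing and loading that model by its full name (assuming pip and a standard spaCy install):

# download the small English model once from the command line:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")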

answered Sep 28 '22 by Anurag verma