Spacy custom tokenizer to include only hyphen words as tokens using Infix regex

Question

I want to include hyphenated words for example: long-term, self-esteem, etc. as a single token in Spacy. After looking at some similar posts on StackOverflow, Github, its documentation and elsewhere, I also wrote a custom tokenizer as below:

import re
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[$$\("']''')
suffix_re = re.compile(r'''[$$\)"']$''')
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_lg')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

So for this sentence: 'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.'

Now, the tokens after incorporating the custom Spacy Tokenizer are:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“medicine', '”', 'has', ';', 'become', 'a', 'profession', ',', 'and', 'more', 'importantly', ',', "it's", 'a', 'male-dominated', 'profession', '.'

Earlier, the tokens before this change were:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male', '-', 'dominated', 'profession', '.'

And, the expected tokens should be:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.'

Summary: As one can see...

the hyphen word is included and so are the other punctuation marks except for the double quotes and apostrophe...
...but now, the apostrophe and double quotes don't have the earlier or expected behaviour.
I have tried different permutations and combinations for the regex compile for the Infix but no progress to fix this issue.

Nicholas Morley · Accepted Answer

Using the default prefix_re and suffix_re gives me the expected output:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

['Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.']

If you want to dig into to why your regexes weren't working like SpaCy's, here are links to the relevant source code:

Prefixes and suffixes defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py

With reference to characters (e.g, quotes, hyphens, etc.) defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/char_classes.py

And the functions used to compile them (e.g., compile_prefix_regex):

https://github.com/explosion/spaCy/blob/master/spacy/util.py

Spacy custom tokenizer to include only hyphen words as tokens using Infix regex

Tags:

regex

tokenize

nlp

linguistics

spacy

Vishal

1 Answers

Nicholas Morley

Recent Activity

Donate For Us

Spacy custom tokenizer to include only hyphen words as tokens using Infix regex

Tags:

regex

tokenize

nlp

linguistics

spacy

Vishal

1 Answers

Nicholas Morley

Related questions

Recent Activity

Donate For Us