
spaCy custom tokenizer to keep hyphenated words as single tokens using an infix regex

I want spaCy to treat hyphenated words such as long-term and self-esteem as single tokens. After reading similar posts on Stack Overflow and GitHub, the spaCy documentation, and other sources, I wrote the custom tokenizer below:

import re
import spacy
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_lg')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

The test sentence is: 'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it's a male-dominated profession.'

With the custom tokenizer, the tokens are:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“medicine', '”', 'has', ';', 'become', 'a', 'profession', ',', 'and', 'more', 'importantly', ',', "it's", 'a', 'male-dominated', 'profession', '.'

Before this change, the default tokenizer produced:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male', '-', 'dominated', 'profession', '.'

The expected tokens are:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.'

Summary:

  • The hyphenated word is now kept as a single token, and most other punctuation is still split off.
  • The apostrophe and the double quotes, however, no longer get the earlier behaviour, which is also the expected one.
  • I have tried several variations of the infix regex, but none of them fixes this.
Vishal asked Jun 24 '18 17:06



1 Answer

Using spaCy's default prefix and suffix patterns instead of the hand-written prefix_re and suffix_re gives me the expected output:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

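def custom_tokenizer(nlp):
    # Custom infix pattern: there is no hyphen in the class, so hyphens never split a word.
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    # Reuse the language defaults for prefixes and suffixes instead of hand-written patterns.
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)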

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

# 'en' is a spaCy v2 shortcut link; in v3 load a full package name such as 'en_core_web_sm'.
nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

['Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.']
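
To see why the hyphenated word survives, it can help to run the infix pattern on its own. The sketch below uses only the standard re module and assumes, as the spaCy documentation describes, that the tokenizer splits a chunk at each match returned by infix_finditer. The helper name infix_spans is hypothetical and not part of spaCy.

```python
import re

# Infix pattern copied from the answer above; note that the class contains no hyphen.
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def infix_spans(chunk):
    """Return (start, end, text) for every infix match in a whitespace-free chunk.

    Hypothetical helper for illustration only; spaCy itself just receives
    infix_re.finditer and handles the resulting matches internally.
    """
    return [(m.start(), m.end(), m.group()) for m in infix_re.finditer(chunk)]

hyphen_hits = infix_spans("male-dominated")
semicolon_hits = infix_spans("profession;")

print(hyphen_hits)     # [] -> no split point, the word stays whole
print(semicolon_hits)  # [(10, 11, ';')] -> one split point at the semicolon
```

An empty list for male-dominated means the pattern offers no place to split, which matches the single token in the output above.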

If you want to dig into why your regexes weren't working like spaCy's, here are links to the relevant source code:

Prefixes and suffixes defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py

With reference to characters (e.g., quotes, hyphens, etc.) defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/char_classes.py

And the functions used to compile them (e.g., compile_prefix_regex):

https://github.com/explosion/spaCy/blob/master/spacy/util.py
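
As a quick illustration of one mismatch, the hand-written prefix and suffix patterns from the question can be tested directly against the quoted word. This is a minimal sketch with the standard re module only; affix_hits is a hypothetical helper, and the check says nothing about how spaCy combines these patterns internally.

```python
import re

# Hand-written patterns from the question: ASCII brackets and straight quotes only.
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')

def affix_hits(chunk):
    """Return (prefix, suffix) found in chunk, with None where a pattern does not match.

    Hypothetical helper for illustration only.
    """
    pre = prefix_re.search(chunk)
    suf = suffix_re.search(chunk)
    return (pre.group() if pre else None, suf.group() if suf else None)

curly = affix_hits("“medicine”")     # curly quotes, as in the test sentence
straight = affix_hits('"medicine"')  # straight ASCII quotes

print(curly)     # (None, None)
print(straight)  # ('"', '"')
```

The curly quotes in the test sentence are not in either character class, so these patterns never strip them as a prefix or suffix.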

Nicholas Morley answered Nov 15 '22 19:11
