I want to design a custom tokenizer module in Python that lets users specify what tokenizer(s) to use for the input. For instance, consider the following input:
Q: What is a good way to achieve this? A: I am not so sure. I think I will use Python.
I want to be able to provide NLTK's sentence tokenization, sent_tokenize(), as an option because it works well in many situations and I don't want to re-invent the wheel. In addition, I also want to provide a finer-grained tokenization builder (something along the lines of a rule engine). Let me explain:
Assume that I provide a couple of tokenizers:
SENTENCE # Tokenizes the given input by using sent_tokenize()
WORD # Tokenizes the given input by using word_tokenize()
QA # Tokenizes using a custom regular expression. E.g., Q: (.*?) A: (.*?)
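For illustration, here is a minimal sketch of how such a QA regular expression could be applied to the sample input above (note that I anchor the answer group to the end of the string so the non-greedy match does not come up empty):

import re

text = 'Q: What is a good way to achieve this? A: I am not so sure. I think I will use Python.'
m = re.search(r'Q:\s*(.*?)\s*A:\s*(.*)$', text)  # anchored variant of Q: (.*?) A: (.*?)
if m:
    print m.group(1)  # -> What is a good way to achieve this?
    print m.group(2)  # -> I am not so sure. I think I will use Python.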
I want to support rules such as QA -> SENTENCE (apply the QA tokenizer first, then the SENTENCE tokenizer to each piece) and QA (apply just the QA tokenizer). The expected output for each rule is as follows:
1. QA -> SENTENCE
[
    ('QUESTION',
        ('SENTENCE', 'What is a good way to achieve this?'),
    ),
    ('ANSWER',
        ('SENTENCE', 'I am not so sure', 'I think I will use Python')
    )
]
2. QA
[
    ('QUESTION', 'What is a good way to achieve this?'),
    ('ANSWER', 'I am not so sure. I think I will use Python')
]
What is a good design to achieve this efficiently?
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
Tokenization is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded.
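As a minimal illustration of that definition, a plain regular expression already does this kind of chopping (punctuation is simply dropped):

import re

print re.findall(r'\w+', 'I am not so sure. I think I will use Python.')
# ['I', 'am', 'not', 'so', 'sure', 'I', 'think', 'I', 'will', 'use', 'Python']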
As tokenizing is easy in Python, I'm wondering what your module is intended to provide beyond that. I mean, when starting a piece of software, a good design comes from thinking about the usage scenarios rather than from considering data structures first.
Your examples of expected output are a bit confusing. I assume you want each tokenizer to return its name on the left side and a list of tokens on the right side. I played around a bit to achieve similar results (in Python 2), but using lists for easier handling:
import re
# some tokenizers
def tokzr_WORD(txt): return ('WORD', re.findall(r'(?ms)\W*(\w+)', txt)) # split words
def tokzr_SENT(txt): return ('SENTENCE', re.findall(r'(?ms)\s*(.*?(?:\.|\?|!))', txt)) # split sentences
def tokzr_QA(txt):
    l_qa = []
    for m in re.finditer(r'(?ms)^[\s#\-\*]*(?:Q|Question)\s*:\s*(?P<QUESTION>\S.*?\?)[\s#\-\*]+(?:A|Answer)\s*:\s*(?P<ANSWER>\S.*?)$', txt): # split (Q, A) sequences
        for k in ['QUESTION', 'ANSWER']:
            l_qa.append(m.groupdict()[k])
    return ('QA', l_qa)
def tokzr_QA_non_canonical(txt): # Note: not supported by tokenize_recursively() as not canonical.
    l_qa = []
    for m in re.finditer(r'(?ms)^[\s#\-\*]*(?:Q|Question)\s*:\s*(?P<QUESTION>\S.*?\?)[\s#\-\*]+(?:A|Answer)\s*:\s*(?P<ANSWER>\S.*?)$', txt): # split (Q, A) sequences
        for k in ['QUESTION', 'ANSWER']:
            l_qa.append((k, m.groupdict()[k]))
    return l_qa

dict_tokzr = { # control string: tokenizer function
    'WORD'    : tokzr_WORD,
    'SENTENCE': tokzr_SENT,
    'QA'      : tokzr_QA,
}
# the core function
def tokenize_recursively(l_tokzr, work_on, lev=0):
    if isinstance(work_on, basestring):
        ctrl, work_on = dict_tokzr[l_tokzr[0]](work_on) # tokenize
    else:
        ctrl, work_on = work_on[0], work_on[1:] # get right part
    ret = [ctrl]
    if len(l_tokzr) == 1:
        ret.append(work_on) # add right part
    else:
        for wo in work_on: # dive into tree
            t = tokenize_recursively(l_tokzr[1:], wo, lev + 1)
            ret.append(t)
    return ret
# just for printing
def nestedListLines(aList, ind='    ', d=0):
    """ Returns multi-line string representation of \param aList.
        Use \param ind to indent per level. """
    sRet = '\n' + d * ind + '['
    nested = 0
    for i, e in enumerate(aList):
        if i:
            sRet += ', '
        if type(e) == type(aList):
            sRet += nestedListLines(e, ind, d + 1)
            nested = 1
        else:
            sRet += '\n' + (d + 1) * ind + repr(e) if nested else repr(e)
    sRet += '\n' + d * ind + ']' if nested else ']'
    return sRet
# main()
inp1 = """
* Question: I want to try something. Should I?
* Answer : I'd assume so. Give it a try.
"""
inp2 = inp1 + 'Q: What is a good way to achieve this? A: I am not so sure. I think I will use Python.'
print repr(tokzr_WORD(inp1))
print repr(tokzr_SENT(inp1))
print repr(tokzr_QA(inp1))
print repr(tokzr_QA_non_canonical(inp1)) # Really this way?
print
for ctrl, inp in [ # example control sequences
        ('SENTENCE-WORD', inp1),
        ('QA-SENTENCE',   inp2)
    ]:
    res = tokenize_recursively(ctrl.split('-'), inp)
    print nestedListLines(res)
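If you prefer NLTK's sent_tokenize() from your question over my regex-based sentence splitter, it should drop straight into dict_tokzr, since it only has to follow the same (name, token list) convention (a sketch, assuming NLTK and its punkt data are installed):

from nltk.tokenize import sent_tokenize

def tokzr_SENT_NLTK(txt):
    return ('SENTENCE', sent_tokenize(txt))  # same canonical (name, tokens) shape as tokzr_SENT

dict_tokzr['SENTENCE'] = tokzr_SENT_NLTK  # replaces the regex-based sentence tokenizer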
Btw., Python's own Lib/tokenize.py (the tokenizer for Python source code itself) might be worth a look to see how such things can be handled.
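Its generator interface, for example, yields (type, value) style tuples, which is a nice pattern for chaining tokenizers (a quick Python 2 sketch):

import tokenize, token, StringIO

for toknum, tokval, _, _, _ in tokenize.generate_tokens(StringIO.StringIO('x = 1 + 2\n').readline):
    print token.tok_name[toknum], repr(tokval)  # e.g. NAME 'x', OP '=', NUMBER '1', ...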