I tried to implement a regular expression tokenizer with NLTK in Python, but the result is this:
>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]
But the expected result is this:
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
Why? Where is the mistake?
You should turn all the capturing groups into non-capturing ones:
([A-Z]\.)+           ->  (?:[A-Z]\.)+
\w+(-\w+)*           ->  \w+(?:-\w+)*
\$?\d+(\.\d+)?%?     ->  \$?\d+(?:\.\d+)?%?
The issue is that regexp_tokenize appears to use re.findall, which returns lists of capture-group tuples when the pattern defines multiple capturing groups. See the nltk.tokenize package reference:
pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead.)
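To see the difference outside NLTK, here is a minimal sketch using plain re.findall (standard library only) with and without a capturing group:
>>> import re
>>> re.findall(r'\w+(-\w+)*', 'poster-print costs')    # capturing group: findall returns only group 1
['-print', '']
>>> re.findall(r'\w+(?:-\w+)*', 'poster-print costs')  # non-capturing: findall returns the full matches
['poster-print', 'costs']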
Also, I am not sure you meant to use :-_ inside the character class: it defines a range from ':' to '_' that includes all the uppercase letters. Move the - to the end of the character class so it matches a literal hyphen.
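A quick illustrative check of what that range actually matches, again with plain re:
>>> import re
>>> re.findall(r'[:-_]', 'ABC xyz')   # ':-_' is the range ':'..'_', which covers A-Z
['A', 'B', 'C']
>>> re.findall(r'[:_-]', 'A-B_C')     # with the hyphen at the end it is a literal '-'
['-', '_']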
Thus, use:
pattern = r'''(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(?:-\w+)* # words with optional internal hyphens
| \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():_`-] # these are separate tokens; includes ], [
'''
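With the capturing groups made non-capturing and the hyphen moved, running the tokenizer again should give the tokens you expected:
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']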