 

nltk regular expression tokenizer

I tried to implement a regular expression tokenizer with NLTK in Python, but this is the result:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]

But the desired result is this:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Why? Where is the mistake?

Juan Menashsheh asked Apr 01 '16




1 Answer

You should turn all the capturing groups into non-capturing ones:

  • ([A-Z]\.)+ -> (?:[A-Z]\.)+
  • \w+(-\w+)* -> \w+(?:-\w+)*
  • \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?

The issue is that regexp_tokenize appears to use re.findall, which returns a list of capture-group tuples when multiple capturing groups are defined in the pattern. See the nltk.tokenize package reference:

pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:...), instead)
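For reference, here is a minimal sketch using the plain re module (no NLTK needed) that shows how re.findall switches from returning matched strings to returning tuples of group captures once the pattern contains capturing groups. The ellipsis and punctuation branches are omitted here for brevity:

import re

text = 'That U.S.A. poster-print costs $12.40...'

# Capturing groups: findall returns one tuple per match, one slot per
# group, with empty strings for groups that did not participate.
print(re.findall(r'([A-Z]\.)+|\w+(-\w+)*|\$?\d+(\.\d+)?%?', text))
# [('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '.40')]

# Non-capturing groups: findall returns the full match text instead.
print(re.findall(r'(?:[A-Z]\.)+|\w+(?:-\w+)*|\$?\d+(?:\.\d+)?%?', text))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40']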

Also, I am not sure you wanted to use :-_ here: inside a character class it defines a range from : (ASCII 58) to _ (ASCII 95), which includes all the uppercase letters. Put the - at the end of the character class so it matches a literal hyphen.
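A quick way to see the range problem in isolation (again plain re; the test string is made up for illustration):

import re

# Inside a character class, ':-_' is a range from ':' (ASCII 58) to
# '_' (ASCII 95), which includes 'A'..'Z' (65..90) but not '-' (45).
print(re.findall(r'[:-_]', 'ABC-xyz'))   # ['A', 'B', 'C']

# With '-' moved to the end, it is a literal hyphen.
print(re.findall(r'[:_-]', 'ABC-xyz'))   # ['-']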

Thus, use

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
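With the groups made non-capturing and the hyphen moved, the tokenizer returns the tokens you wanted; a quick check in the same session as the question:

>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']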
Wiktor Stribiżew answered Oct 18 '22