 

nltk regular expression tokenizer

I tried to implement a regular expression tokenizer with NLTK in Python, but this is the result:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]

But the desired result is this:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Why? Where is the mistake?

Juan Menashsheh asked Apr 01 '16




1 Answer

You should turn all the capturing groups into non-capturing ones:

  • ([A-Z]\.)+ -> (?:[A-Z]\.)+
  • \w+(-\w+)* -> \w+(?:-\w+)*
  • \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?

The issue is that regexp_tokenize appears to use re.findall, which returns a list of capture-group tuples when multiple capturing groups are defined in the pattern. See the nltk.tokenize package reference:

pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:...), instead)
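For reference, here is a minimal sketch using the plain re module (no NLTK needed) that shows how re.findall switches from returning matched strings to returning tuples of group captures once the pattern contains capturing groups. The ellipsis and punctuation branches are omitted here for brevity:

import re

text = 'That U.S.A. poster-print costs $12.40...'

# Capturing groups: findall returns one tuple per match, one slot per
# group, with empty strings for groups that did not participate.
print(re.findall(r'([A-Z]\.)+|\w+(-\w+)*|\$?\d+(\.\d+)?%?', text))
# [('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '.40')]

# Non-capturing groups: findall returns the full match text instead.
print(re.findall(r'(?:[A-Z]\.)+|\w+(?:-\w+)*|\$?\d+(?:\.\d+)?%?', text))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40']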

Also, I am not sure you wanted to use :-_ here: inside a character class it defines a range from : (ASCII 58) to _ (ASCII 95), which includes all the uppercase letters. Put the - at the end of the character class so it matches a literal hyphen.
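A quick way to see the range problem in isolation (again plain re; the test string is made up for illustration):

import re

# Inside a character class, ':-_' is a range from ':' (ASCII 58) to
# '_' (ASCII 95), which includes 'A'..'Z' (65..90) but not '-' (45).
print(re.findall(r'[:-_]', 'ABC-xyz'))   # ['A', 'B', 'C']

# With '-' moved to the end, it is a literal hyphen.
print(re.findall(r'[:_-]', 'ABC-xyz'))   # ['-']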

Thus, use

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
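With the groups made non-capturing and the hyphen moved, the tokenizer returns the tokens you wanted; a quick check in the same session as the question:

>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']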
Wiktor Stribiżew answered Oct 18 '22