I have blocks of text I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want to be tokenized as a single token, instead of the regular tokenization.
For example, given the sentence "The West Wing is an American television serial drama created by Aaron Sorkin that was originally broadcast on NBC from September 22, 1999 to May 14, 2006," and after adding the phrase "the west wing" to the tokenizer, I'd want "the west wing" to come out as a single token while the rest of the sentence is tokenized as usual.
What's the best way to accomplish this? I'd prefer to stay within the bounds of tools like NLTK.
Some background on tokenization: NLTK's word_tokenize() splits a sentence into words, and the resulting tokens can be converted to a data frame for easier text analysis in machine-learning applications. Tokenization can be done at either the word or the sentence level: splitting text into words is word tokenization, and splitting it into sentences is sentence tokenization.
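A minimal sketch of the two levels (assuming the punkt tokenizer data has been downloaded with nltk.download('punkt')):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "The West Wing premiered in 1999. It ran for seven seasons."
print(sent_tokenize(text))  # typically: ['The West Wing premiered in 1999.', 'It ran for seven seasons.']
print(word_tokenize(text))  # ['The', 'West', 'Wing', 'premiered', 'in', '1999', '.', 'It', ...]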
You can use NLTK's Multi-Word Expression Tokenizer, MWETokenizer:
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer()                  # joins matched phrases with '_' by default
tokenizer.add_mwe(('the', 'west', 'wing'))  # register the multi-word expression
tokenizer.tokenize('Something about the west wing'.split())
You will get:
['Something', 'about', 'the_west_wing']
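Applied to the sentence from the question, a sketch might look like this (it assumes the punkt data is available for word_tokenize; note that MWETokenizer matches tokens literally, so the phrase must use the same casing as the text, and you can pass a separator other than the default '_'):

from nltk.tokenize import MWETokenizer, word_tokenize

sentence = ("The West Wing is an American television serial drama created "
            "by Aaron Sorkin that was originally broadcast on NBC from "
            "September 22, 1999 to May 14, 2006.")

tokenizer = MWETokenizer([('The', 'West', 'Wing')], separator=' ')
tokens = tokenizer.tokenize(word_tokenize(sentence))
print(tokens[:5])  # ['The West Wing', 'is', 'an', 'American', 'television']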
If you have a fixed set of phrases that you're looking for, then the simple solution is to tokenize your input and "reassemble" the multi-word tokens. Alternatively, do a regexp search and replace before tokenizing that turns The West Wing into The_West_Wing, as sketched below.
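A rough sketch of that search-and-replace approach (the phrase list and the underscore convention are just illustrative assumptions):

import re
from nltk.tokenize import word_tokenize

phrases = ['The West Wing']
text = "The West Wing is an American television serial drama."
for p in phrases:
    # join the phrase with underscores so it survives tokenization as one token
    text = re.sub(re.escape(p), p.replace(' ', '_'), text, flags=re.IGNORECASE)
print(word_tokenize(text)[:3])  # ['The_West_Wing', 'is', 'an']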
For more advanced options, use regexp_tokenize or see chapter 7 of the NLTK book.
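For example, you can give regexp_tokenize a pattern that lists the multi-word phrase as an alternative ahead of the generic word pattern (the pattern here is only an illustrative assumption):

from nltk.tokenize import regexp_tokenize

text = "Something about The West Wing on NBC"
pattern = r'[Tt]he [Ww]est [Ww]ing|\w+'
print(regexp_tokenize(text, pattern))
# ['Something', 'about', 'The West Wing', 'on', 'NBC']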