Python: Tokenizing with phrases

I have blocks of text I want to tokenize, but I don't want to tokenize purely on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want kept together as a single token instead of being split by the regular tokenization.

For example, given the sentence "The West Wing is an American television serial drama created by Aaron Sorkin that was originally broadcast on NBC from September 22, 1999 to May 14, 2006," and adding the phrase "the west wing" to the tokenizer, the resulting tokens would be:

  • the west wing
  • is
  • an
  • american
  • ...

What's the best way to accomplish this? I'd prefer to stay within the bounds of tools like NLTK.

asked Apr 03 '11 by yavoh

2 Answers

You can use NLTK's multi-word expression tokenizer, MWETokenizer:

from nltk.tokenize import MWETokenizer
tokenizer = MWETokenizer()
tokenizer.add_mwe(('the', 'west', 'wing'))
tokenizer.tokenize('Something about the west wing'.split())

You will get:

['Something', 'about', 'the_west_wing']
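A sketch applying the same idea to the question's sentence. Note that add_mwe matches tokens case-sensitively, so lowercasing the input first (as the question's expected output suggests) is needed; the join separator, which defaults to '_', can also be changed:

```python
from nltk.tokenize import MWETokenizer

# Use a space as the separator so the merged token reads "the west wing"
# rather than the default "the_west_wing".
tokenizer = MWETokenizer(separator=' ')
tokenizer.add_mwe(('the', 'west', 'wing'))

sentence = "The West Wing is an American television serial drama"
# add_mwe matches case-sensitively, so lowercase before tokenizing.
tokens = tokenizer.tokenize(sentence.lower().split())
print(tokens)  # ['the west wing', 'is', 'an', 'american', ...]
```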
answered Oct 10 '22 by liudong

If you have a fixed set of phrases that you're looking for, then the simple solution is to tokenize your input and "reassemble" the multi-word tokens. Alternatively, do a regexp search & replace before tokenizing that turns The West Wing into The_West_Wing.

For more advanced options, use regexp_tokenize or see chapter 7 of the NLTK book.
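The regexp search-and-replace approach can be sketched with the standard library alone; the protect_phrases helper and the phrase list here are illustrative names, not part of any NLTK API:

```python
import re

# A fixed set of phrases to protect, as this answer assumes.
phrases = ["The West Wing"]

def protect_phrases(text, phrases):
    """Join each known phrase with underscores so it survives
    whitespace tokenization as a single token."""
    for p in phrases:
        # \b guards against matching inside longer words; re.IGNORECASE
        # also catches "the west wing" in lowercased text.
        pattern = re.compile(r'\b' + re.escape(p) + r'\b', re.IGNORECASE)
        text = pattern.sub(p.replace(' ', '_'), text)
    return text

text = "The West Wing is an American television serial drama"
print(protect_phrases(text, phrases).split())
# ['The_West_Wing', 'is', 'an', 'American', ...]
```

After tokenizing, the underscores can be turned back into spaces if the original phrase text is needed.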

answered Oct 10 '22 by Fred Foo