
How to get rid of punctuation using NLTK tokenizer?

People also ask

How do you remove punctuation in Python?

One of the easiest ways to remove punctuation from a string in Python is to use the str.translate() method. The translate method takes a translation table, which we can build with the str.maketrans() method.
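
For instance (a minimal sketch; the sample string is made up):

import string

s = "Hello, world! How's it going?"
# Build a table that maps every punctuation character to None, then apply it.
table = str.maketrans('', '', string.punctuation)
print(s.translate(table))  # Hello world Hows it going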

How do I remove special characters from a string in Python NLTK?

Remove special characters using Python's isalnum. Python has a string method, .isalnum(), which returns True if every character in the string is alphanumeric and False otherwise. We can use it to loop over a string and append only the alphanumeric characters to a new string.
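
A short sketch of that loop (the input string is just an example):

word = "platform.s!"
# Keep only the characters that are alphanumeric.
cleaned = ''.join(ch for ch in word if ch.isalnum())
print(cleaned)  # platforms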

Which Tokenizer is used to split the punctuation?

A punctuation-based tokenizer splits the given text on punctuation as well as whitespace. It will also split words that contain punctuation: for example, 'platform.s' is a single word, but a punctuation-based tokenizer turns it into 'platform', '.', 's'.
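
For example, NLTK's wordpunct_tokenize behaves this way:

from nltk.tokenize import wordpunct_tokenize

# Splits on whitespace and breaks out runs of punctuation as separate tokens.
print(wordpunct_tokenize('platform.s'))  # ['platform', '.', 's']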

How do you punctuate Tokenize?

Thus, the tokenizer can replace each punctuation mark with itself surrounded by spaces. It then splits the text on whitespace, so every resulting token matches the pattern \S+. The following code pads the punctuation marks as described above.
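
A minimal sketch of that padding step (the sample text is invented):

import re
import string

text = "Hello, world! It's a test."
# Surround every punctuation character with spaces...
padded = re.sub('([%s])' % re.escape(string.punctuation), r' \1 ', text)
# ...then split on whitespace; each resulting token matches \S+.
print(padded.split())
# ['Hello', ',', 'world', '!', 'It', "'", 's', 'a', 'test', '.']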


Take a look at the other tokenizing options that nltk provides in its documentation. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

from nltk.tokenize import RegexpTokenizer

# \w+ matches runs of letters, digits, and underscores; everything else is dropped.
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
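
If you would rather keep hyphenated words such as 'Eighty-seven' intact, a pattern along these lines should work (a sketch, not from the original answer):

from nltk.tokenize import RegexpTokenizer

# Allow hyphens and apostrophes inside a word, but not at its edges.
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)*")
print(tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!'))
# ['Eighty-seven', 'miles', 'to', 'go', 'yet', 'Onward']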

You do not really need NLTK to remove punctuation; you can do it with plain Python. For str objects in Python 2:

import string
s = '... some string with punctuation ...'
# Python 2 only: str.translate takes a second argument of characters to delete.
s = s.translate(None, string.punctuation)

Or for unicode strings (this is also the form that works in Python 3):

import string
# Map each punctuation code point to None; translate returns a new string.
translate_table = {ord(char): None for char in string.punctuation}
s = s.translate(translate_table)

and then use this string in your tokenizer.
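
For example, a minimal sketch of that combination (the sample string is made up):

import string
from nltk.tokenize import word_tokenize

s = 'some string, with punctuation!'
# Strip the punctuation first, then tokenize the cleaned string.
s = s.translate({ord(char): None for char in string.punctuation})
print(word_tokenize(s))  # ['some', 'string', 'with', 'punctuation']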

P.S. The string module has some other sets of characters that can be removed in the same way (such as string.digits).
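
For instance, extending the same translation table to strip digits as well (a sketch; the sample string is invented):

import string

s = 'room 101, now!'
# Delete punctuation and digits in one pass.
remove_chars = string.punctuation + string.digits
print(s.translate({ord(char): None for char in remove_chars}))  # 'room  now'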


The code below removes all punctuation marks as well as any non-alphabetic tokens. Adapted from the NLTK book:

http://www.nltk.org/book/ch01.html

import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

words = nltk.word_tokenize(s)
words = [word.lower() for word in words if word.isalpha()]
print(words)

Output (note that word_tokenize splits "can't" into 'ca' and "n't", so only the alphabetic piece 'ca' survives the isalpha filter):

['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']

As noted in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). And if you have unicode strings, make sure each is a unicode object (not a 'str' encoded with some encoding like 'utf-8').

from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print(list(filter(lambda word: word not in ',-', tokens)))
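
To drop every punctuation token rather than just commas and hyphens, one could filter against string.punctuation instead (a sketch, not part of the original answer):

import string

# Continuing from the snippet above: keep only tokens that are not punctuation marks.
print([word for word in tokens if word not in string.punctuation])
# ['It', 'is', 'a', 'blue', 'small', 'and', 'extraordinary', 'ball', 'Like', 'no', 'other']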