
How to get rid of punctuation using NLTK tokenizer?

People also ask

How do you remove punctuation in Python?

One of the easiest ways to remove punctuation from a string in Python is to use the str.translate() method. The translate method takes a translation table, which we can build with the str.maketrans() method.
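
For instance (a minimal sketch; the sample string is made up):

import string

s = "Hello, world! How's it going?"
# Build a table that maps every punctuation character to None, then apply it.
table = str.maketrans('', '', string.punctuation)
print(s.translate(table))  # Hello world Hows it going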

How do I remove special characters from a string in Python NLTK?

Remove special characters using Python's isalnum. Python has a string method, .isalnum(), which returns True if every character in the string is alphanumeric and False otherwise. We can use it to loop over a string and append only the alphanumeric characters to a new string.
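
A short sketch of that loop (the input string is just an example):

word = "platform.s!"
# Keep only the characters that are alphanumeric.
cleaned = ''.join(ch for ch in word if ch.isalnum())
print(cleaned)  # platforms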

Which Tokenizer is used to split the punctuation?

A punctuation-based tokenizer splits the given text on punctuation as well as whitespace. It will also split words that contain punctuation: for example, 'platform.s' is a single word, but a punctuation-based tokenizer turns it into 'platform', '.', 's'.
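
For example, NLTK's wordpunct_tokenize behaves this way:

from nltk.tokenize import wordpunct_tokenize

# Splits on whitespace and breaks out runs of punctuation as separate tokens.
print(wordpunct_tokenize('platform.s'))  # ['platform', '.', 's']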

How do you punctuate Tokenize?

Thus, the tokenizer can replace each punctuation mark with itself surrounded by spaces. It then splits the text on whitespace, so every resulting token matches the pattern \S+. The following code pads the punctuation marks as described above.
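
A minimal sketch of that padding step (the sample text is invented):

import re
import string

text = "Hello, world! It's a test."
# Surround every punctuation character with spaces...
padded = re.sub('([%s])' % re.escape(string.punctuation), r' \1 ', text)
# ...then split on whitespace; each resulting token matches \S+.
print(padded.split())
# ['Hello', ',', 'world', '!', 'It', "'", 's', 'a', 'test', '.']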


Take a look at the other tokenizing options that nltk provides in its documentation. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

from nltk.tokenize import RegexpTokenizer

# \w+ matches runs of letters, digits, and underscores; everything else is dropped.
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
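
If you would rather keep hyphenated words such as 'Eighty-seven' intact, a pattern along these lines should work (a sketch, not from the original answer):

from nltk.tokenize import RegexpTokenizer

# Allow hyphens and apostrophes inside a word, but not at its edges.
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)*")
print(tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!'))
# ['Eighty-seven', 'miles', 'to', 'go', 'yet', 'Onward']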

You do not really need NLTK to remove punctuation; you can do it with plain Python. For str objects in Python 2:

import string
s = '... some string with punctuation ...'
# Python 2 only: str.translate takes a second argument of characters to delete.
s = s.translate(None, string.punctuation)

Or for unicode strings (this is also the form that works in Python 3):

import string
# Map each punctuation code point to None; translate returns a new string.
translate_table = {ord(char): None for char in string.punctuation}
s = s.translate(translate_table)

and then use this string in your tokenizer.
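
For example, a minimal sketch of that combination (the sample string is made up):

import string
from nltk.tokenize import word_tokenize

s = 'some string, with punctuation!'
# Strip the punctuation first, then tokenize the cleaned string.
s = s.translate({ord(char): None for char in string.punctuation})
print(word_tokenize(s))  # ['some', 'string', 'with', 'punctuation']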

P.S. The string module has some other sets of characters that can be removed in the same way (such as string.digits).
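
For instance, extending the same translation table to strip digits as well (a sketch; the sample string is invented):

import string

s = 'room 101, now!'
# Delete punctuation and digits in one pass.
remove_chars = string.punctuation + string.digits
print(s.translate({ord(char): None for char in remove_chars}))  # 'room  now'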


The code below removes all punctuation marks as well as any non-alphabetic tokens. Adapted from the NLTK book:

http://www.nltk.org/book/ch01.html

import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

words = nltk.word_tokenize(s)
words = [word.lower() for word in words if word.isalpha()]
print(words)

Output (note that word_tokenize splits "can't" into 'ca' and "n't", so only the alphabetic piece 'ca' survives the isalpha filter):

['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']

As noted in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). And if you have unicode strings, make sure each is a unicode object (not a 'str' encoded with some encoding like 'utf-8').

from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print(list(filter(lambda word: word not in ',-', tokens)))
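
To drop every punctuation token rather than just commas and hyphens, one could filter against string.punctuation instead (a sketch, not part of the original answer):

import string

# Continuing from the snippet above: keep only tokens that are not punctuation marks.
print([word for word in tokens if word not in string.punctuation])
# ['It', 'is', 'a', 'blue', 'small', 'and', 'extraordinary', 'ball', 'Like', 'no', 'other']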