I am using the tokenizer from NLTK in Python.
There are already a whole bunch of answers about removing punctuation on the forum. However, none of them addresses all of the following issues together:
'*u*', '''','""'
Is there an elegant way of solving both problems?
Solution 1: Tokenize and strip punctuation off the tokens
>>> from nltk import word_tokenize
>>> import string
>>> punctuations = list(string.punctuation)
>>> punctuations
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> punctuations.append("''")
>>> sent = '''He said,"that's it."'''
>>> word_tokenize(sent)
['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]
>>> [i for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', "'s", 'it']
>>> [i.strip("".join(punctuations)) for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', 's', 'it']
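As a variation on Solution 1, the filter and the strip can be folded into one helper that strips first and then drops whatever ends up empty, so multi-character punctuation tokens such as '' need no special entry in the list. This is only a sketch: strip_punct is a hypothetical name, and the token list is hard-coded from the word_tokenize output above so the snippet runs without NLTK installed.

```python
import string

def strip_punct(tokens):
    """Strip leading/trailing ASCII punctuation from each token; drop empty results."""
    stripped = (tok.strip(string.punctuation) for tok in tokens)
    return [tok for tok in stripped if tok]

# Token list copied from the word_tokenize(sent) output shown above.
tokens = ['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]
result = strip_punct(tokens)
print(result)  # ['He', 'said', 'that', 's', 'it']
```

Note that strip only touches the ends of a token, so punctuation inside a token (for example the inner period of U.S.) is left alone.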
Solution 2: Remove punctuation, then tokenize
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> sent = '''He said,"that's it."'''
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split())
'He said that s it'
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split()).split()
['He', 'said', 'that', 's', 'it']
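The same remove-then-split idea can also be sketched with str.maketrans and str.translate, which replace every punctuation character with a space in a single pass. remove_then_tokenize is a hypothetical helper name, and like the one-liner above it only knows about the ASCII characters in string.punctuation.

```python
import string

# Translation table mapping each ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_then_tokenize(text):
    """Replace punctuation with spaces, then split on whitespace."""
    return text.translate(_PUNCT_TO_SPACE).split()

sent = '''He said,"that's it."'''
print(remove_then_tokenize(sent))  # ['He', 'said', 'that', 's', 'it']
```

Because this approach has no notion of letters wrapped in punctuation, an input like *u* would still yield the token u.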
If you want to tokenize your string all in one shot, I think your only choice will be to use nltk.tokenize.RegexpTokenizer. The following approach will allow you to use punctuation as a marker to remove characters of the alphabet (as noted in your third requirement) before removing the punctuation altogether. In other words, this approach will remove *u* before stripping all punctuation.
One way to go about this, then, is to tokenize on gaps like so:
>>> from nltk.tokenize import RegexpTokenizer
>>> s = '''He said,"that's it." *u* Hello, World.'''
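>>> # Non-capturing groups (?:...) are used because RegexpTokenizer patterns must not contain capturing parentheses.
>>> toker = RegexpTokenizer(r'(?:(?<=[^\w\s])\w(?=[^\w\s])|\W)+', gaps=True)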
>>> toker.tokenize(s)
['He', 'said', 'that', 's', 'it', 'Hello', 'World'] # omits *u* per your third requirement
This should meet all three of the criteria you specified above. Note, however, that this tokenizer will not return a token for a single letter wrapped in punctuation, such as "A". Furthermore, the pattern only discards single letters that are immediately preceded and followed by punctuation; if it discarded longer runs of letters as well, "Go." would not return a token. You may need to nuance the regex in other ways, depending on what your data looks like and what your expectations are.
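To see these caveats concretely, the gap pattern can be tried with the standard re module alone. This sketch assumes that splitting on the pattern and discarding empty strings is a fair stand-in for RegexpTokenizer with gaps=True; the groups are written as non-capturing so that re.split does not return the separators, and split_on_gaps is a hypothetical name.

```python
import re

# A gap is any run of non-word characters, optionally swallowing a single
# letter that sits directly between two punctuation characters (e.g. *u*).
GAP = re.compile(r'(?:(?<=[^\w\s])\w(?=[^\w\s])|\W)+')

def split_on_gaps(text):
    """Split text on GAP matches and drop the empty strings re.split leaves behind."""
    return [tok for tok in GAP.split(text) if tok]

print(split_on_gaps('''He said,"that's it." *u* Hello, World.'''))
# ['He', 'said', 'that', 's', 'it', 'Hello', 'World']
print(split_on_gaps('"A"'))    # []            single letter wrapped in punctuation is dropped
print(split_on_gaps('A dog'))  # ['A', 'dog']  a bare single letter survives
print(split_on_gaps('"Go."'))  # ['Go']        longer words are kept
```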