Use the re.findall() method to split a string into words and punctuation, e.g. result = re.findall(r"[\w'\"]+|[,.!?]", my_str). Rather than splitting the string on delimiters, findall() returns every substring that matches the pattern, so the words and the punctuation marks come back as separate list items.
To convert a string into a list of words, you just need to split it on whitespace. You can use the str.split() method. Its default delimiter is whitespace, i.e., when called with no arguments it splits the string at any run of whitespace characters.
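For example (note how the punctuation stays attached to the words, which is exactly what the regex approaches below avoid):
>>> "Hello, I'm a string!".split()
['Hello,', "I'm", 'a', 'string!']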
This is more or less the way to do it:
>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']
The trick is not to think about where to split the string, but about what to include in the tokens.
One caveat: \w also matches digits and the underscore, and any punctuation not listed in [.,!?;] (a colon, for example) is silently dropped from the output.
Here is a Unicode-aware version:
re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace. (In Python 3, \w is already Unicode-aware for str patterns, so the re.UNICODE flag is redundant there.)
Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
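Applied to the top answer's test string, it behaves as described:
>>> import re
>>> re.findall(r"\w+|[^\w\s]", "Hello, I'm a string!", re.UNICODE)
['Hello', ',', 'I', "'", 'm', 'a', 'string', '!']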
If you are going to work in English (or some other common languages), you can use NLTK (there are many other tools to do this, such as FreeLing).
import nltk
nltk.download('punkt')
sentence = "help, me"
print(nltk.word_tokenize(sentence))
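This prints ['help', ',', 'me']: words and punctuation come back as separate tokens. The download('punkt') call only needs to run once to fetch the tokenizer models.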
Here's my entry.
I have my doubts about how well this will hold up in terms of efficiency, or whether it catches all cases (note that "!!!" stays grouped together; this may or may not be a good thing).
>>> import re
>>> s = "Hello, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item for item in map(str.strip, re.split(r"(\W+)", s)) if len(item) > 0]
>>> l
['Hello', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.
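A minimal sketch of that optimization, reusing the same pattern:
>>> import re
>>> token_re = re.compile(r"(\W+)")
>>> [t for t in map(str.strip, token_re.split("Hello, world!")) if t]
['Hello', ',', 'world', '!']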
Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into NLTK, which le dorfier suggested.
This might only be a little faster, since it builds words with ''.join() in place of +=; join is generally the faster way to concatenate strings.
import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # punctuation: flush the current word, then emit the symbol itself
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        # whitespace ends the current word
        if word:
            result.append(word)
        word = ''

# flush a trailing word; the loop only flushes on punctuation or whitespace
if word:
    result.append(word)

print(result)
# ['Hello', ',', "I'm", 'a', 'string', '!']