I am expecting a user input string which I need to split into separate words. The user may input text delimited by commas or spaces.
So for instance the text may be:
hello world this is John.
or
hello world this is John
or even
hello world, this, is John
How can I efficiently parse that text into the following list?
['hello', 'world', 'this', 'is', 'John']
Thanks in advance.
Use the regular expression r'[\s,]+' to split on one or more whitespace characters (\s) or commas (,).
import re
s = 'hello world, this, is John'
print(re.split(r'[\s,]+', s))
['hello', 'world', 'this', 'is', 'John']
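One caveat with this approach: if the input begins or ends with a delimiter, re.split returns empty strings at the edges of the list. A minimal sketch that filters them out, assuming a helper name (split_words) that is made up here for illustration:

```python
import re

def split_words(text):
    """Split text on runs of whitespace and/or commas, dropping empty pieces."""
    # re.split yields '' at the edges when text starts or ends with a delimiter,
    # so keep only the non-empty pieces.
    return [word for word in re.split(r'[\s,]+', text) if word]

print(split_words('hello world, this, is John'))
print(split_words('  hello, world,  '))
```

Whether empty pieces should be dropped depends on the input you expect; for free-form user text it is usually what you want.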
Since you need to split on spaces and other special characters, a convenient regex is \W+, which matches any run of non-word characters. Quoting from the Python re documentation:
\W
When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.
For example,
data = "hello world, this, is John"
import re
print(re.split(r"\W+", data))
# ['hello', 'world', 'this', 'is', 'John']
Or, if you know the specific characters on which the string should be split, you can list them in a character class:
print(re.split(r"[\s,]+", data))
This splits on any run of whitespace characters (\s) or commas (,).
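The two patterns are not interchangeable: \W+ splits on every non-word character, so apostrophes and a trailing period also act as delimiters, while [\s,]+ leaves them attached to the words. A small comparison, using a sample string made up for illustration:

```python
import re

# Hypothetical input containing an apostrophe and a trailing period.
sample = "hello world, this is John's text."

# \W+ treats the apostrophe and the final period as delimiters too,
# which splits "John's" in two and leaves a trailing empty string.
by_non_word = re.split(r"\W+", sample)

# [\s,]+ only splits on whitespace and commas.
by_space_comma = re.split(r"[\s,]+", sample)

print(by_non_word)
print(by_space_comma)
```

If punctuation should stay with the words, the explicit character class is the safer choice; if all punctuation should be discarded, \W+ is shorter but may need the empty strings filtered out.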