Python regex splitting on multiple whitespaces

I expect a string of user input that I need to split into separate words. The words may be delimited by commas, spaces, or both.

So for instance the text may be:

hello world this is John.

or

hello world     this is John

or even

hello world,   this,   is John

How can I efficiently parse that text into the following list?

['hello', 'world', 'this', 'is', 'John']

Thanks in advance.

stratis asked Sep 12 '25 09:09


2 Answers

Use the regular expression r'[\s,]+' to split on one or more whitespace characters (\s) or commas (,).

import re

s = 'hello world,    this, is       John'
print(re.split(r'[\s,]+', s))

['hello', 'world', 'this', 'is', 'John']
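One caveat: when the input starts or ends with a delimiter, re.split returns empty strings at the corresponding end of the list. A minimal sketch of one way to handle that, assuming empty tokens are unwanted (the helper name split_words is made up for illustration):

```python
import re

def split_words(text):
    """Split on runs of whitespace and/or commas, dropping empty tokens."""
    # re.split yields '' at either end when text starts or ends with a delimiter,
    # so filter those out.
    return [word for word in re.split(r'[\s,]+', text) if word]

print(split_words('  hello world,    this, is       John, '))
# ['hello', 'world', 'this', 'is', 'John']
```

Calling text.strip() first would not be enough here, since it removes surrounding whitespace but not a leading or trailing comma.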

Mr. Polywhirl answered Sep 13 '25 23:09


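Since you need to split on spaces and other special characters, a simple choice is \W+, which matches runs of non-word characters. Be aware that it also splits on punctuation such as apostrophes and periods. Quoting the Python re documentation: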

\W

When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.

For example:

import re

data = "hello world,    this, is       John"
# Raw string so the backslash reaches the regex engine unchanged
print(re.split(r"\W+", data))
# ['hello', 'world', 'this', 'is', 'John']

Or, if you know exactly which characters the string should be split on, list them in a character class:

print(re.split(r"[\s,]+", data))

This splits on any run of whitespace characters (\s) and commas (,).
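The two patterns behave differently once the input contains other punctuation. A short sketch of the difference, using a made-up input string with an apostrophe and a trailing period:

```python
import re

data = "it's John."

# \W+ treats the apostrophe and the period as delimiters too; the trailing
# period leaves an empty string at the end of the result.
by_non_word = re.split(r'\W+', data)

# [\s,]+ only splits on whitespace and commas, so punctuation stays attached.
by_space_comma = re.split(r'[\s,]+', data)

print(by_non_word)     # ['it', 's', 'John', '']
print(by_space_comma)  # ["it's", 'John.']
```

Which one fits depends on whether punctuation inside or next to words should be kept as part of the token.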

thefourtheye answered Sep 14 '25 00:09