In Python, I'd like to split a string using a list of separators. The separators could be either commas or semicolons. Whitespace should be removed unless it is in the middle of non-whitespace, non-separator characters, in which case it should be preserved.
Test case 1: ABC,DEF123,GHI_JKL,MN OP
Test case 2: ABC;DEF123;GHI_JKL;MN OP
Test case 3: ABC ; DEF123,GHI_JKL ; MN OP
Sounds like a case for regular expressions, which is fine, but if it's easier or cleaner to do it another way that would be even better.
Thanks!
This should be much faster than regex and you can pass a list of separators as you wanted:
def split(txt, seps):
default_sep = seps[0]
# we skip seps[0] because that's the default separator
for sep in seps[1:]:
txt = txt.replace(sep, default_sep)
return [i.strip() for i in txt.split(default_sep)]
How to use it:
>>> split('ABC ; DEF123,GHI_JKL ; MN OP', (',', ';'))
['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
Performance test:
import timeit
import re
TEST = 'ABC ; DEF123,GHI_JKL ; MN OP'
SEPS = (',', ';')
rsplit = re.compile("|".join(SEPS)).split
print(timeit.timeit(lambda: [s.strip() for s in rsplit(TEST)]))
# 1.6242462980007986
print(timeit.timeit(lambda: split(TEST, SEPS)))
# 1.3588597209964064
And with a much longer input string:
TEST = 100 * 'ABC ; DEF123,GHI_JKL ; MN OP , '
print(timeit.timeit(lambda: [s.strip() for s in rsplit(TEST)]))
# 130.67168392999884
print(timeit.timeit(lambda: split(TEST, SEPS)))
# 50.31940778599528
Using regular expressions, try
[s.strip() for s in re.split(",|;", string)]
or
[t.strip() for s in string.split(",") for t in s.split(";")]
without.
Taking the above answer, with your test cases, you want to use a regular expression, and one or more separation characters. In your case, the separation characters seem to be ',', '|', ';' and whitespace. Whitespace in python is '\w', so the comprehension is:
import re
list = [s for s in re.split("[,|;\W]+", string)]
I cannot reply to sven's answer above, but I split on one or more of the characters inside the brackets, and don't have to use the strip() method.
Yikes, I didn't read the question correctly... Sven's answer with the strip works; mine assumes the whitespace is another separation.
>>> re.split('\s*,\s*|\s*;\s*', 'a , b; cdf')
['a', 'b', 'cdf']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With