
Python: Split string by list of separators

In Python, I'd like to split a string using a list of separators. The separators could be either commas or semicolons. Whitespace should be removed unless it is in the middle of non-whitespace, non-separator characters, in which case it should be preserved.

Test case 1: ABC,DEF123,GHI_JKL,MN OP
Test case 2: ABC;DEF123;GHI_JKL;MN OP
Test case 3: ABC ; DEF123,GHI_JKL ; MN OP

Sounds like a case for regular expressions, which is fine, but if it's easier or cleaner to do it another way that would be even better.

Thanks!

blah238 asked Jan 14 '11 23:01


4 Answers

This should be faster than a regex (see the timings below), and you can pass a list of separators, as you wanted:

def split(txt, seps):
    default_sep = seps[0]

    # we skip seps[0] because that's the default separator
    for sep in seps[1:]:
        txt = txt.replace(sep, default_sep)
    return [i.strip() for i in txt.split(default_sep)]

How to use it:

>>> split('ABC ; DEF123,GHI_JKL ; MN OP', (',', ';'))
['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
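As a quick check, here is a minimal sketch that runs the same function over all three test cases from the question. The names TEST_CASES, EXPECTED and results are only illustrative and are not part of the original answer.

```python
def split(txt, seps):
    """Split txt on any separator in seps and strip each piece."""
    default_sep = seps[0]
    # Rewrite every other separator to the first one, then split once.
    for sep in seps[1:]:
        txt = txt.replace(sep, default_sep)
    return [piece.strip() for piece in txt.split(default_sep)]


# The three test cases from the question (illustrative names).
TEST_CASES = [
    'ABC,DEF123,GHI_JKL,MN OP',
    'ABC;DEF123;GHI_JKL;MN OP',
    'ABC ; DEF123,GHI_JKL ; MN OP',
]
EXPECTED = ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']

results = [split(case, (',', ';')) for case in TEST_CASES]
for case, result in zip(TEST_CASES, results):
    print(repr(case), '->', result)
```

All three inputs should come out as the same four fields, with the space inside 'MN OP' preserved.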

Performance test:

import timeit
import re


TEST = 'ABC ; DEF123,GHI_JKL ; MN OP'
SEPS = (',', ';')


rsplit = re.compile("|".join(SEPS)).split
print(timeit.timeit(lambda: [s.strip() for s in rsplit(TEST)]))
# 1.6242462980007986

print(timeit.timeit(lambda: split(TEST, SEPS)))
# 1.3588597209964064

And with a much longer input string:

TEST = 100 * 'ABC ; DEF123,GHI_JKL ; MN OP , '

print(timeit.timeit(lambda: [s.strip() for s in rsplit(TEST)]))
# 130.67168392999884

print(timeit.timeit(lambda: split(TEST, SEPS)))
# 50.31940778599528
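Timings like these depend on the machine and Python version, so it may be worth re-running them yourself. The sketch below first checks that both approaches return the same result and then times them with a smaller repeat count. The regex_split helper, the use of re.escape and number=1000 are my assumptions, not part of the original benchmark.

```python
import re
import timeit


def split(txt, seps):
    default_sep = seps[0]
    for sep in seps[1:]:
        txt = txt.replace(sep, default_sep)
    return [i.strip() for i in txt.split(default_sep)]


SEPS = (',', ';')
# Assumption: re.escape is added so separators such as '|' or '.'
# would be matched literally; it makes no difference for ',' and ';'.
rsplit = re.compile("|".join(map(re.escape, SEPS))).split


def regex_split(txt):
    return [s.strip() for s in rsplit(txt)]


SHORT = 'ABC ; DEF123,GHI_JKL ; MN OP'
LONG = 100 * 'ABC ; DEF123,GHI_JKL ; MN OP , '

# Both approaches should agree before their speed is compared.
same_short = regex_split(SHORT) == split(SHORT, SEPS)
same_long = regex_split(LONG) == split(LONG, SEPS)
print(same_short, same_long)

for label, text in (('short', SHORT), ('long', LONG)):
    t_regex = timeit.timeit(lambda: regex_split(text), number=1000)
    t_replace = timeit.timeit(lambda: split(text, SEPS), number=1000)
    print(label, 'regex:', t_regex, 'replace:', t_replace)
```

Note that the long input ends with a separator, so both approaches return an empty string as the last field.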
Joschua answered Oct 12 '22 12:10


Using regular expressions, try

[s.strip() for s in re.split(",|;", string)]

or

[t.strip() for s in string.split(",") for t in s.split(";")]

without.
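To compare the two one-liners side by side, here is a small sketch that wraps each in a function and runs both on the mixed-separator test case. The function names split_with_regex and split_without_regex are made up for illustration.

```python
import re


def split_with_regex(text):
    # Split on either separator, then trim whitespace around each piece.
    return [s.strip() for s in re.split(",|;", text)]


def split_without_regex(text):
    # Split on commas first, then split each chunk on semicolons.
    return [t.strip() for s in text.split(",") for t in s.split(";")]


sample = "ABC ; DEF123,GHI_JKL ; MN OP"
regex_result = split_with_regex(sample)
plain_result = split_without_regex(sample)
print(regex_result)
print(plain_result)
```

Both should print the same four fields. The nested comprehension needs one extra level per separator, so it probably reads best when there are only two or three.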

Sven Marnach answered Oct 12 '22 11:10


Building on the answer above and your test cases, you could use a regular expression that matches one or more separator characters. I took the separators to be ',', ';' and whitespace. (The '|' in Sven's pattern is alternation, not a separator, and inside a character class it would be a literal pipe, so it is left out here.) Whitespace in a Python regex is '\s'; '\w' matches word characters and '\W' matches everything else. The call is then:

import re
parts = re.split(r"[,;\s]+", string)

I cannot reply to Sven's answer above, but this splits on one or more of the characters inside the brackets, so there is no need to call strip().

Yikes, I didn't read the question correctly... Sven's answer with strip() works; mine treats whitespace as another separator, so it would split 'MN OP' into two fields.
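The difference is easy to see on the third test case. This is a minimal sketch, using a pattern that treats whitespace as a separator next to one that only splits on ',' and ';' and strips afterwards. The variable names are illustrative.

```python
import re

sample = "ABC ; DEF123,GHI_JKL ; MN OP"

# Whitespace counted as a separator: the space inside 'MN OP' also splits.
whitespace_as_sep = re.split(r"[,;\s]+", sample)

# Only ',' and ';' are separators; surrounding whitespace is stripped later.
strip_afterwards = [s.strip() for s in re.split(r"[,;]", sample)]

print(whitespace_as_sep)   # ['ABC', 'DEF123', 'GHI_JKL', 'MN', 'OP']
print(strip_afterwards)    # ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
```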

tmarthal answered Oct 12 '22 13:10


>>> import re
>>> re.split(r'\s*,\s*|\s*;\s*', 'a , b; cdf')
['a', 'b', 'cdf']
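The same idea can be generalised to a list of separators. The sketch below is one possible way to do it, not part of the original answer: make_splitter and split_fields are hypothetical names, re.escape is used so that separators like '|' are matched literally, and the input is stripped first because the pattern itself does not remove leading or trailing whitespace.

```python
import re


def make_splitter(seps):
    """Return a function that splits on any of seps, eating nearby whitespace."""
    alternation = "|".join(re.escape(sep) for sep in seps)
    pattern = re.compile(r"\s*(?:" + alternation + r")\s*")

    def splitter(text):
        # The pattern only eats whitespace next to a separator, so the
        # outer whitespace is removed separately.
        return pattern.split(text.strip())

    return splitter


split_fields = make_splitter([",", ";"])
fields = split_fields("  ABC ; DEF123,GHI_JKL ; MN OP  ")
print(fields)
```

Because the whitespace is consumed by the pattern, no strip() call per field is needed.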
Raph Levien answered Oct 12 '22 12:10