Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Split a comma-delimited string directly into a set

I have some code that does something like:

if string in comma_delimited_string.split(','):
    return True

This website says that membership testing with sets and dicts is much faster that with lists or tuples. I know doing set(comma_delimited_string.split(',')) wouldn't improve speed because a list is still being created before it's converted into a set (or at least, it appeared to slow it down when I timed it).

I was wondering then, (mostly out of curiosity than real benefit to my code), is there a way to achieve the same effect of comma_delimited_string.split(',') but directly creating a set, instead of a list, with the intention of speeding up the above operation?

like image 670
dieggsy Avatar asked Dec 19 '22 15:12

dieggsy


2 Answers

You're ignoring the fact that in order to convert anything to a set, you need to iterate it. And that iteration is exactly the same as you are already doing in order to search the original list. So there cannot be any advantage in doing this, only overhead.

Searching against a set is more efficient if you're doing it multiple times, as that allows you to amortise the cost of the conversion. But the conversion itself is always going to be a linear scan; there's no way of avoiding that.

like image 114
Daniel Roseman Avatar answered Dec 21 '22 10:12

Daniel Roseman


No, the str.split operation always returns a list and trying to convert that into a set will cost time. Also writing your own handmade split that produces directly a set will be slower, because str.split is implemente in C (the source code should be under Objects/stringlib/split.h)

Note however that if your string does not contain a comma and you expect string to not be a substring of the elements returned by split, then you can just do:

if string in comma_delimited_string:

If string contains a comma, then your test will always fail (because by definition the elements of text.split(',') will never contain one.

The case in which the above condition fail is when you have something like:

if "a" in "aaa,bb,c".split(',')

because in this case "a" in ["aaa", "bb", "c"] fails.

Alternatively you could use a regex:

import re
if re.search(r'(^{0},)|(,{0},)|(,{0}$)|(^{0}$)'.format(re.escape(string)), comma_delimited_string):

However I don't know whether this would be faster, it probably depends on your inputs.

like image 29
Bakuriu Avatar answered Dec 21 '22 10:12

Bakuriu