My current Python Project will require a lot of string splitting to process incoming packages. Since I will be running it on a pretty slow system, I was wondering what the most efficient way to go about this would be. The strings would be formatted something like this:
Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5
Explanation: This particular example would come from a list where the first two Items are a title and a date, while Item 3 to Item 5 would be invited people (The number of those can be anything from zero to n, where n is the number of registered users on the server).
From what I see, I have the following options:
split()
Solution 1 would include splitting at |
and then splitting the last element of the resulting list at <>
for this example, while solution 2 would probably result in a regular expression like:
((.+)|)+((.+)(<>)?)+
Okay, this RegEx is horrible, I can see that myself. It is also untested. But you get the idea.
Now, I am looking for the way that a) takes the least amount of time and b) ideally uses the least amount of memory. If only one of the two is possible, I would prefer less time. The ideal solution would also work for Strings that have more Items seperated with |
and strings that completely lack the <>
. At least the Regular Expression-based Solution would do that
My understanding would be that split()
would use more memory (since you basically get two resulting lists, one that splits at |
and the second one that splits at <>
), but I don't know enough about Pythons implementation of regular Expressions to judge how the RegEx would perform. split()
is also less dynamic than a regular expression if it somes to different numbers of Items and the absence of the second seperator. Still, I can't shake the impression that python can do this better without regular expressions, that's why I am asking
Some notes:
|
from the parts with the seperator <>
, so a simple flat list as generated by re.split(\||<>,input)
(as proposed by @obmarg) will not work too well. Solutions fitting this criterium are much appreciated.To sum the question up: Which solution would be the most efficient one, for what reasons.
Due to multiple requests, I have run some timeit on the split()
-solution and the first proposed regular expression by @obmarg, as well as the solutions by @mgibsonbr and @duncan:
import timeit import re def splitit(input): res0 = input.split("|") res = [] for element in res0: t = element.split("<>") if t != [element]: res0.remove(element) res.append(t) return (res0, res) def regexit(input): return re.split( "\||<>", input ) def mgibsonbr(input): # Solution by @mgibsonbr items = re.split(r'\||<>', input) # Split input in items offset = 0 result = [] # The result: strings for regular itens, lists for <> separated ones acc = None for i in items: delimiter = '|' if offset+len(i) < len(input) and input[offset+len(i)] == '|' else '<>' offset += len(i) + len(delimiter) if delimiter == '<>': # Will always put the item in a list if acc is None: acc = [i] # Create one if doesn't exist result.append(acc) else: acc.append(i) else: if acc is not None: # If there was a list, put the last item in it acc.append(i) else: result.append(i) # Add the regular items acc = None # Clear the list, since what will come next is a regular item or a new list return result def split2(input): # Solution by @duncan res0 = input.split("|") res1, res2 = [], [] for r in res0: if "<>" in r: res2.append(r.split("<>")) else: res1.append(r) return res1, res2 print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit() print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit() print "split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit() print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit() print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit() print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit() print "split:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit() print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
The results:
mgibs: 14.7349407408 split: 6.403942732 split2: 3.68306812233 regex: 5.28414318792 mgibs: 107.046683735 split: 46.0844590775 split2: 26.5595985591 regex: 28.6513302646
At the moment, it looks like split2 by @duncan beats all other algorithms, regardless of length (with this limited dataset at least), and it also looks like @mgibsonbr's solution has some performance issues (Sorry 'bout that, but thanks for the solution regardless).
Thanks for the input, everyone.
Syntax. The split() function takes two parameters as an argument, i.e., separator and maxsplit.
Method 1: Split multiple characters from string using re. split() This is the most efficient and commonly used method to split multiple characters at once. It makes use of regex(regular expressions) in order to do this.
The split() method returns a list of all the words in the string, using str as the separator (splits on all whitespace if left unspecified), optionally limiting the number of splits to num.
I was slightly surprised that split()
performed so badly in your code so I looked at it a bit more closely and noticed that you're calling list.remove()
in the inner loop. Also you're calling split()
an extra time on each string. Get rid of those and a solution using split()
beats the regex hands down on shorter strings and comes a pretty close second on the longer one.
import timeit import re def splitit(input): res0 = input.split("|") res = [] for element in res0: t = element.split("<>") if t != [element]: res0.remove(element) res.append(t) return (res0, res) def split2(input): res0 = input.split("|") res1, res2 = [], [] for r in res0: if "<>" in r: res2.append(r.split("<>")) else: res1.append(r) return res1, res2 def regexit(input): return re.split( "\||<>", input ) rSplitter = re.compile("\||<>") def regexit2(input): return rSplitter.split(input) print("split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()) print("split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()) print("regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()) print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit()) print("split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()) print("split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()) print("regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()) print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit())
Which gives the following result:
split: 1.8427431439631619 split2: 1.0897291360306554 regex: 1.6694280610536225 regex2: 1.2277749050408602 split: 14.356198082969058 split2: 8.009285948995966 regex: 9.526430513011292 regex2: 9.083608677960001
and of course split2()
gives the nested lists that you wanted whereas the regex solution doesn't.
Edit: I've updated this answer to include @F1Rumors suggestion that compiling the regex will improve performance. It does make a slight difference, but Python caches compiled regular expressions so the saving is not as much as you might expect. I think usually it isn't worth doing it for speed (though it can be in some cases), but it is often worthwhile to make the code clearer.
Also I updated the code so it runs on Python 3.
I'm not sure if it's the most efficient, but certainly the easiest to code seems to be something like this:
>>> input = "Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5" >>> re.split( "\||<>", input ) >>> ['Item 1 ', ' Item 2 ', ' Item 3 ', ' Item 4 ', ' Item 5']
I would think there's a fair chance of it being more efficient than a plain old split as well (depending on the input data) since you'd need to perform the second split operation on every string output from the first split, which doesn't seem likely to be efficient for either memory or time.
Though having said that I could easily be wrong, and the only way to be sure would be to time it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With