My current Python Project will require a lot of string splitting to process incoming packages. Since I will be running it on a pretty slow system, I was wondering what the most efficient way to go about this would be. The strings would be formatted something like this: <pre class="prettyprint"><code>Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5 </code></pre> Explanation: This particular example would come from a list where the first two Items are a title and a date, while Item 3 to Item 5 would be invited people (The number of those can be anything from zero to n, where n is the number of registered users on the server). From what I see, I have the following options: <ol> <li>repeatedly use <code>split()</code> </li> <li>Use a regular expression and Regex functions</li> <li>Some other Python functions I have not thought about yet (There are probably some)</li> </ol> Solution 1 would include splitting at <code>|</code> and then splitting the last element of the resulting list at <code><></code> for this example, while solution 2 would probably result in a regular expression like: <code>((.+)|)+((.+)(<>)?)+</code> Okay, this RegEx is horrible, I can see that myself. It is also untested. But you get the idea. Now, I am looking for the way that a) takes the least amount of time and b) ideally uses the least amount of memory. If only one of the two is possible, I would prefer less time. The ideal solution would also work for Strings that have more Items seperated with <code>|</code> and strings that completely lack the <code><></code>. At least the Regular Expression-based Solution would do that My understanding would be that <code>split()</code> would use more memory (since you basically get two resulting lists, one that splits at <code>|</code> and the second one that splits at <code><></code>), but I don't know enough about Pythons implementation of regular Expressions to judge how the RegEx would perform. <code>split()</code> is also less dynamic than a regular expression if it somes to different numbers of Items and the absence of the second seperator. Still, I can't shake the impression that python can do this better without regular expressions, that's why I am asking Some notes: <ul> <li>Yes, I could just benchmark both solutions, but I'm trying to learn something about python in general and how it works here, and if I just benchmark these two, I still don't know what python functions I have missed.</li> <li>Yes, optimizing at this level is only really required for high-performance stuff, but as I said, I am trying to learn things about python.</li> <li> Addition: in the original question, I completely forgot to mention that I need to be able to distinguish the parts that were seperated by <code>|</code> from the parts with the seperator <code><></code>, so a simple flat list as generated by <code>re.split(\||<>,input)</code> (as proposed by @obmarg) will not work too well. Solutions fitting this criterium are much appreciated.</li> </ul> To sum the question up: Which solution would be the most efficient one, for what reasons. Due to multiple requests, I have run some timeit on the <code>split()</code>-solution and the first proposed regular expression by @obmarg, as well as the solutions by @mgibsonbr and @duncan: <pre class="prettyprint"><code>import timeit import re def splitit(input): res0 = input.split("|") res = [] for element in res0: t = element.split("<>") if t != [element]: res0.remove(element) res.append(t) return (res0, res) def regexit(input): return re.split( "\||<>", input ) def mgibsonbr(input): # Solution by @mgibsonbr items = re.split(r'\||<>', input) # Split input in items offset = 0 result = [] # The result: strings for regular itens, lists for <> separated ones acc = None for i in items: delimiter = '|' if offset+len(i) < len(input) and input[offset+len(i)] == '|' else '<>' offset += len(i) + len(delimiter) if delimiter == '<>': # Will always put the item in a list if acc is None: acc = [i] # Create one if doesn't exist result.append(acc) else: acc.append(i) else: if acc is not None: # If there was a list, put the last item in it acc.append(i) else: result.append(i) # Add the regular items acc = None # Clear the list, since what will come next is a regular item or a new list return result def split2(input): # Solution by @duncan res0 = input.split("|") res1, res2 = [], [] for r in res0: if "<>" in r: res2.append(r.split("<>")) else: res1.append(r) return res1, res2 print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit() print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit() print "split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit() print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit() print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit() print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit() print "split:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit() print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit() </code></pre> The results: <pre class="prettyprint"><code>mgibs: 14.7349407408 split: 6.403942732 split2: 3.68306812233 regex: 5.28414318792 mgibs: 107.046683735 split: 46.0844590775 split2: 26.5595985591 regex: 28.6513302646 </code></pre> At the moment, it looks like split2 by @duncan beats all other algorithms, regardless of length (with this limited dataset at least), and it also looks like @mgibsonbr's solution has some performance issues (Sorry 'bout that, but thanks for the solution regardless). Thanks for the input, everyone.

I was slightly surprised that <code>split()</code> performed so badly in your code so I looked at it a bit more closely and noticed that you're calling <code>list.remove()</code> in the inner loop. Also you're calling <code>split()</code> an extra time on each string. Get rid of those and a solution using <code>split()</code> beats the regex hands down on shorter strings and comes a pretty close second on the longer one. <pre class="prettyprint"><code>import timeit import re def splitit(input): res0 = input.split("|") res = [] for element in res0: t = element.split("<>") if t != [element]: res0.remove(element) res.append(t) return (res0, res) def split2(input): res0 = input.split("|") res1, res2 = [], [] for r in res0: if "<>" in r: res2.append(r.split("<>")) else: res1.append(r) return res1, res2 def regexit(input): return re.split( "\||<>", input ) rSplitter = re.compile("\||<>") def regexit2(input): return rSplitter.split(input) print("split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()) print("split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()) print("regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()) print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit()) print("split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()) print("split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()) print("regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()) print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit()) </code></pre> Which gives the following result: <pre class="prettyprint"><code>split: 1.8427431439631619 split2: 1.0897291360306554 regex: 1.6694280610536225 regex2: 1.2277749050408602 split: 14.356198082969058 split2: 8.009285948995966 regex: 9.526430513011292 regex2: 9.083608677960001 </code></pre> and of course <code>split2()</code> gives the nested lists that you wanted whereas the regex solution doesn't. Edit: I've updated this answer to include @F1Rumors suggestion that compiling the regex will improve performance. It does make a slight difference, but Python caches compiled regular expressions so the saving is not as much as you might expect. I think usually it isn't worth doing it for speed (though it can be in some cases), but it is often worthwhile to make the code clearer. Also I updated the code so it runs on Python 3.

Most efficient way to split strings in Python

Tags:

python

optimization

split

My current Python Project will require a lot of string splitting to process incoming packages. Since I will be running it on a pretty slow system, I was wondering what the most efficient way to go about this would be. The strings would be formatted something like this:

Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5

Explanation: This particular example would come from a list where the first two Items are a title and a date, while Item 3 to Item 5 would be invited people (The number of those can be anything from zero to n, where n is the number of registered users on the server).

From what I see, I have the following options:

repeatedly use split()
Use a regular expression and Regex functions
Some other Python functions I have not thought about yet (There are probably some)

Solution 1 would include splitting at | and then splitting the last element of the resulting list at <> for this example, while solution 2 would probably result in a regular expression like:

((.+)|)+((.+)(<>)?)+

Okay, this RegEx is horrible, I can see that myself. It is also untested. But you get the idea.

Now, I am looking for the way that a) takes the least amount of time and b) ideally uses the least amount of memory. If only one of the two is possible, I would prefer less time. The ideal solution would also work for Strings that have more Items seperated with | and strings that completely lack the <>. At least the Regular Expression-based Solution would do that

My understanding would be that split() would use more memory (since you basically get two resulting lists, one that splits at | and the second one that splits at <>), but I don't know enough about Pythons implementation of regular Expressions to judge how the RegEx would perform. split() is also less dynamic than a regular expression if it somes to different numbers of Items and the absence of the second seperator. Still, I can't shake the impression that python can do this better without regular expressions, that's why I am asking

Some notes:

Yes, I could just benchmark both solutions, but I'm trying to learn something about python in general and how it works here, and if I just benchmark these two, I still don't know what python functions I have missed.
Yes, optimizing at this level is only really required for high-performance stuff, but as I said, I am trying to learn things about python.
Addition: in the original question, I completely forgot to mention that I need to be able to distinguish the parts that were seperated by | from the parts with the seperator <>, so a simple flat list as generated by re.split(\||<>,input) (as proposed by @obmarg) will not work too well. Solutions fitting this criterium are much appreciated.

To sum the question up: Which solution would be the most efficient one, for what reasons.

Due to multiple requests, I have run some timeit on the split()-solution and the first proposed regular expression by @obmarg, as well as the solutions by @mgibsonbr and @duncan:

import timeit import re  def splitit(input):     res0 = input.split("|")     res = []     for element in res0:         t = element.split("<>")         if t != [element]:             res0.remove(element)             res.append(t)     return (res0, res)  def regexit(input):     return re.split( "\||<>", input )   def mgibsonbr(input): # Solution by @mgibsonbr     items = re.split(r'\||<>', input) # Split input in items     offset = 0     result = [] # The result: strings for regular itens, lists for <> separated ones     acc = None     for i in items:         delimiter = '|' if offset+len(i) < len(input) and input[offset+len(i)] == '|' else '<>'         offset += len(i) + len(delimiter)         if delimiter == '<>': # Will always put the item in a list             if acc is None:                 acc = [i] # Create one if doesn't exist                 result.append(acc)             else:                 acc.append(i)         else:             if acc is not None: # If there was a list, put the last item in it                 acc.append(i)             else:                 result.append(i) # Add the regular items             acc = None # Clear the list, since what will come next is a regular item or a new list     return result  def split2(input): # Solution by @duncan     res0 = input.split("|")     res1, res2 = [], []     for r in res0:         if "<>" in r:             res2.append(r.split("<>"))         else:             res1.append(r)     return res1, res2  print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit() print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit() print "split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit() print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit() print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit() print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit() print "split:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit() print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()

The results:

mgibs: 14.7349407408 split: 6.403942732 split2: 3.68306812233 regex: 5.28414318792 mgibs: 107.046683735 split: 46.0844590775 split2: 26.5595985591 regex: 28.6513302646

At the moment, it looks like split2 by @duncan beats all other algorithms, regardless of length (with this limited dataset at least), and it also looks like @mgibsonbr's solution has some performance issues (Sorry 'bout that, but thanks for the solution regardless).

Thanks for the input, everyone.

625

asked Mar 07 '12 14:03

malexmave

2 Answers

I was slightly surprised that split() performed so badly in your code so I looked at it a bit more closely and noticed that you're calling list.remove() in the inner loop. Also you're calling split() an extra time on each string. Get rid of those and a solution using split() beats the regex hands down on shorter strings and comes a pretty close second on the longer one.

import timeit import re  def splitit(input):     res0 = input.split("|")     res = []     for element in res0:         t = element.split("<>")         if t != [element]:             res0.remove(element)             res.append(t)     return (res0, res)  def split2(input):     res0 = input.split("|")     res1, res2 = [], []     for r in res0:         if "<>" in r:             res2.append(r.split("<>"))         else:             res1.append(r)     return res1, res2  def regexit(input):     return re.split( "\||<>", input )  rSplitter = re.compile("\||<>")  def regexit2(input):     return rSplitter.split(input)  print("split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()) print("split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()) print("regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()) print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit()) print("split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()) print("split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()) print("regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()) print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit())

Which gives the following result:

split: 1.8427431439631619 split2: 1.0897291360306554 regex: 1.6694280610536225 regex2: 1.2277749050408602 split: 14.356198082969058 split2: 8.009285948995966 regex: 9.526430513011292 regex2: 9.083608677960001

and of course split2() gives the nested lists that you wanted whereas the regex solution doesn't.

Edit: I've updated this answer to include @F1Rumors suggestion that compiling the regex will improve performance. It does make a slight difference, but Python caches compiled regular expressions so the saving is not as much as you might expect. I think usually it isn't worth doing it for speed (though it can be in some cases), but it is often worthwhile to make the code clearer.

Also I updated the code so it runs on Python 3.

answered Sep 19 '22 19:09

Duncan

I'm not sure if it's the most efficient, but certainly the easiest to code seems to be something like this:

>>> input = "Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5" >>> re.split( "\||<>", input ) >>> ['Item 1 ', ' Item 2 ', ' Item 3 ', ' Item 4 ', ' Item 5']

I would think there's a fair chance of it being more efficient than a plain old split as well (depending on the input data) since you'd need to perform the second split operation on every string output from the first split, which doesn't seem likely to be efficient for either memory or time.

Though having said that I could easily be wrong, and the only way to be sure would be to time it.

answered Sep 23 '22 19:09

obmarg

Related questions
                            
                                Create conda environment: "Found conflicts!" when solving environment and "Finding shortest conflict path" running forever
                            
                                Is there a way of drawing a caption box in matplotlib
                            
                                Can SQLAlchemy be configured to be non-blocking?
                            
                                How to install the Python development headers on Mac OS X?
                            
                                What is the multiplication operator actually doing with numpy arrays? [duplicate]
                            
                                Positive integer from Python hash() function
                            
                                Why are Python and Ruby so slow, while Lisp implementations are fast?
                            
                                mean calculation in pandas excluding zeros
                            
                                Append 2D array to 3D array, extending third dimension
                            
                                How to make python .post() requests to retry?
                            
                                How can I access different Anaconda environment from Pycharm (on Windows 10)
                            
                                Keras confusion about number of layers
                            
                                Pickle alternatives
                            
                                Python with selenium: unable to locate element which really exist
                            
                                built-in max heap API in Python
                            
                                mean from pandas and numpy differ
                            
                                How do I validate xml against a DTD file in Python
                            
                                Python NotImplemented constant
                            
                                How do I read the output of the IPython %prun (profiler) command?
                            
                                cx_Oracle and Exception Handling - Good practices?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With