Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient way to split strings in Python

My current Python Project will require a lot of string splitting to process incoming packages. Since I will be running it on a pretty slow system, I was wondering what the most efficient way to go about this would be. The strings would be formatted something like this:

Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5 

Explanation: This particular example would come from a list where the first two Items are a title and a date, while Item 3 to Item 5 would be invited people (The number of those can be anything from zero to n, where n is the number of registered users on the server).

From what I see, I have the following options:

  1. repeatedly use split()
  2. Use a regular expression and Regex functions
  3. Some other Python functions I have not thought about yet (There are probably some)

Solution 1 would include splitting at | and then splitting the last element of the resulting list at <> for this example, while solution 2 would probably result in a regular expression like:

((.+)|)+((.+)(<>)?)+

Okay, this RegEx is horrible, I can see that myself. It is also untested. But you get the idea.

Now, I am looking for the way that a) takes the least amount of time and b) ideally uses the least amount of memory. If only one of the two is possible, I would prefer less time. The ideal solution would also work for Strings that have more Items seperated with | and strings that completely lack the <>. At least the Regular Expression-based Solution would do that

My understanding would be that split() would use more memory (since you basically get two resulting lists, one that splits at | and the second one that splits at <>), but I don't know enough about Pythons implementation of regular Expressions to judge how the RegEx would perform. split() is also less dynamic than a regular expression if it somes to different numbers of Items and the absence of the second seperator. Still, I can't shake the impression that python can do this better without regular expressions, that's why I am asking

Some notes:

  • Yes, I could just benchmark both solutions, but I'm trying to learn something about python in general and how it works here, and if I just benchmark these two, I still don't know what python functions I have missed.
  • Yes, optimizing at this level is only really required for high-performance stuff, but as I said, I am trying to learn things about python.
  • Addition: in the original question, I completely forgot to mention that I need to be able to distinguish the parts that were seperated by | from the parts with the seperator <>, so a simple flat list as generated by re.split(\||<>,input) (as proposed by @obmarg) will not work too well. Solutions fitting this criterium are much appreciated.

To sum the question up: Which solution would be the most efficient one, for what reasons.

Due to multiple requests, I have run some timeit on the split()-solution and the first proposed regular expression by @obmarg, as well as the solutions by @mgibsonbr and @duncan:

import timeit import re  def splitit(input):     res0 = input.split("|")     res = []     for element in res0:         t = element.split("<>")         if t != [element]:             res0.remove(element)             res.append(t)     return (res0, res)  def regexit(input):     return re.split( "\||<>", input )   def mgibsonbr(input): # Solution by @mgibsonbr     items = re.split(r'\||<>', input) # Split input in items     offset = 0     result = [] # The result: strings for regular itens, lists for <> separated ones     acc = None     for i in items:         delimiter = '|' if offset+len(i) < len(input) and input[offset+len(i)] == '|' else '<>'         offset += len(i) + len(delimiter)         if delimiter == '<>': # Will always put the item in a list             if acc is None:                 acc = [i] # Create one if doesn't exist                 result.append(acc)             else:                 acc.append(i)         else:             if acc is not None: # If there was a list, put the last item in it                 acc.append(i)             else:                 result.append(i) # Add the regular items             acc = None # Clear the list, since what will come next is a regular item or a new list     return result  def split2(input): # Solution by @duncan     res0 = input.split("|")     res1, res2 = [], []     for r in res0:         if "<>" in r:             res2.append(r.split("<>"))         else:             res1.append(r)     return res1, res2  print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit() print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit() print "split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit() print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit() print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit() print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit() print "split:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit() print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit() 

The results:

mgibs: 14.7349407408 split: 6.403942732 split2: 3.68306812233 regex: 5.28414318792 mgibs: 107.046683735 split: 46.0844590775 split2: 26.5595985591 regex: 28.6513302646 

At the moment, it looks like split2 by @duncan beats all other algorithms, regardless of length (with this limited dataset at least), and it also looks like @mgibsonbr's solution has some performance issues (Sorry 'bout that, but thanks for the solution regardless).

Thanks for the input, everyone.

like image 625
malexmave Avatar asked Mar 07 '12 14:03

malexmave


People also ask

How many arguments does split () take in Python?

Syntax. The split() function takes two parameters as an argument, i.e., separator and maxsplit.

How do you split multiple strings in Python?

Method 1: Split multiple characters from string using re. split() This is the most efficient and commonly used method to split multiple characters at once. It makes use of regex(regular expressions) in order to do this.

How do you split a string into 3 parts in Python?

The split() method returns a list of all the words in the string, using str as the separator (splits on all whitespace if left unspecified), optionally limiting the number of splits to num.


2 Answers

I was slightly surprised that split() performed so badly in your code so I looked at it a bit more closely and noticed that you're calling list.remove() in the inner loop. Also you're calling split() an extra time on each string. Get rid of those and a solution using split() beats the regex hands down on shorter strings and comes a pretty close second on the longer one.

import timeit import re  def splitit(input):     res0 = input.split("|")     res = []     for element in res0:         t = element.split("<>")         if t != [element]:             res0.remove(element)             res.append(t)     return (res0, res)  def split2(input):     res0 = input.split("|")     res1, res2 = [], []     for r in res0:         if "<>" in r:             res2.append(r.split("<>"))         else:             res1.append(r)     return res1, res2  def regexit(input):     return re.split( "\||<>", input )  rSplitter = re.compile("\||<>")  def regexit2(input):     return rSplitter.split(input)  print("split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()) print("split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()) print("regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()) print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit()) print("split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()) print("split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()) print("regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()) print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit()) 

Which gives the following result:

split: 1.8427431439631619 split2: 1.0897291360306554 regex: 1.6694280610536225 regex2: 1.2277749050408602 split: 14.356198082969058 split2: 8.009285948995966 regex: 9.526430513011292 regex2: 9.083608677960001 

and of course split2() gives the nested lists that you wanted whereas the regex solution doesn't.

Edit: I've updated this answer to include @F1Rumors suggestion that compiling the regex will improve performance. It does make a slight difference, but Python caches compiled regular expressions so the saving is not as much as you might expect. I think usually it isn't worth doing it for speed (though it can be in some cases), but it is often worthwhile to make the code clearer.

Also I updated the code so it runs on Python 3.

like image 84
Duncan Avatar answered Sep 19 '22 19:09

Duncan


I'm not sure if it's the most efficient, but certainly the easiest to code seems to be something like this:

>>> input = "Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5" >>> re.split( "\||<>", input ) >>> ['Item 1 ', ' Item 2 ', ' Item 3 ', ' Item 4 ', ' Item 5'] 

I would think there's a fair chance of it being more efficient than a plain old split as well (depending on the input data) since you'd need to perform the second split operation on every string output from the first split, which doesn't seem likely to be efficient for either memory or time.

Though having said that I could easily be wrong, and the only way to be sure would be to time it.

like image 28
obmarg Avatar answered Sep 23 '22 19:09

obmarg