
Python re.split() vs split()

In my quest for optimization, I discovered that the built-in split() method is about 40% faster than the re.split() equivalent.

A dummy benchmark (easily copy-pasteable):

import random
import re
import time

def random_string(length):
    # Build a random string of the given length from the letters A, B and C.
    letters = "ABC"
    return "".join(random.choice(letters) for _ in range(length))

r = random_string(2000000)
pattern = re.compile(r"A")

# Time the precompiled regex split.
start = time.time()
pattern.split(r)
print("with re.split : ", time.time() - start)

# Time the built-in string split on the same input.
start = time.time()
r.split("A")
print("with built-in split : ", time.time() - start)
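
For anyone re-running this on Python 3, here is a minimal sketch of the same comparison using the standard timeit module, which repeats each call and is less sensitive to one-off noise than a single time.time() delta. The string length, the repeat counts, and the helper name compare_splits are illustrative choices, not part of the original benchmark, and absolute numbers will vary by machine and Python version.

```python
import random
import re
import timeit

def compare_splits(length=200_000, number=5, repeat=3):
    """Return (regex_seconds, builtin_seconds), the best of `repeat` runs each."""
    # Same kind of input as the benchmark above: random letters from "ABC".
    text = "".join(random.choice("ABC") for _ in range(length))
    pattern = re.compile(r"A")

    # For a literal delimiter both approaches must return the same list.
    assert pattern.split(text) == text.split("A")

    regex_time = min(
        timeit.repeat(lambda: pattern.split(text), number=number, repeat=repeat)
    )
    builtin_time = min(
        timeit.repeat(lambda: text.split("A"), number=number, repeat=repeat)
    )
    return regex_time, builtin_time

if __name__ == "__main__":
    regex_time, builtin_time = compare_splits()
    print("with re.split       :", regex_time)
    print("with built-in split :", builtin_time)
```

Taking the minimum of several repeats is the usual way to read timeit results, since the slower runs mostly reflect interference from other processes.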

Why this difference?

hymloth asked Sep 21 '11 14:09


2 Answers

re.split() is expected to be slower, since going through the regular expression engine incurs some overhead.

Of course, if you are splitting on a constant string, there is no point in using re.split().
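
To illustrate the point about constant delimiters, here is a small sketch: for a literal separator the two calls return the same list, so the regex engine buys nothing. The sample string and variable names are made up for illustration; re.escape is used so that a separator containing regex metacharacters is matched literally.

```python
import re

line = "a.b..c"   # hypothetical sample input
sep = "."         # a constant delimiter that is also a regex metacharacter

builtin_fields = line.split(sep)
regex_fields = re.split(re.escape(sep), line)

# For a fixed, non-empty delimiter both approaches produce the same fields.
assert builtin_fields == regex_fields
print(builtin_fields)   # ['a', 'b', '', 'c']
```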

NullUserException answered Sep 18 '22 11:09


When in doubt, check the source code. You can see that Python's s.split() is inlined and has a fast path optimized for whitespace. But s.split() handles fixed delimiters only.
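
The whitespace special case is visible from Python itself: calling split() with no argument applies a different splitting algorithm from split(" "). A short sketch with a made-up sample string:

```python
text = "  one two   three "   # hypothetical sample input

# No argument: runs of whitespace act as a single separator, and
# leading/trailing whitespace produces no empty strings.
no_arg = text.split()

# Explicit fixed delimiter: every single space is a separator,
# so adjacent spaces yield empty fields.
fixed = text.split(" ")

print(no_arg)   # ['one', 'two', 'three']
print(fixed)    # ['', '', 'one', 'two', '', '', 'three', '']
```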

In exchange for that speed, a regular-expression-based re.split() is far more flexible:

>>> re.split(':+',"One:two::t h r e e:::fourth field")
['One', 'two', 't h r e e', 'fourth field']
>>> "One:two::t h r e e:::fourth field".split(':')
['One', 'two', '', 't h r e e', '', '', 'fourth field']
# would require an additional step to find the empty fields...
>>> re.split(r'[:\d]+',"One:two:2:t h r e e:3::fourth field")
['One', 'two', 't h r e e', 'fourth field']
# try that without a regex split in an understandable way...
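
The "additional step" mentioned for the first example can be sketched as follows: split on the fixed delimiter, then drop the empty fields. This is only one possible way to do it, shown on the same sample string as above.

```python
import re

s = "One:two::t h r e e:::fourth field"

# Fixed-delimiter split followed by the extra filtering step.
fields = [field for field in s.split(":") if field]

# Matches the regex-based result for this particular input.
assert fields == re.split(":+", s)
print(fields)   # ['One', 'two', 't h r e e', 'fourth field']
```

The two are not identical in general: if the string starts or ends with ":", re.split(":+", ...) keeps an empty string at that end, while the filter above drops it.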

What should be amazing is that re.split() is only 29% slower (or, put the other way, that s.split() is only 40% faster).
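
The two percentages describe the same gap from opposite directions. Here is a quick sketch of the arithmetic, assuming "40% faster" means re.split() takes 1.4 times as long as s.split() (my reading; the question does not spell this out):

```python
ratio = 1.4   # assumed: re.split time divided by s.split time

# Time saved by s.split, expressed as a fraction of the re.split time.
saved = 1 - 1 / ratio

print(f"s.split takes {saved:.0%} less time than re.split")
```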

the wolf answered Sep 21 '22 11:09