In my quest for optimization, I discovered that the built-in split() method is about 40% faster than the re.split() equivalent.
A dummy benchmark (easily copy-pasteable):
import re, time, random

def random_string(_len):
    # Build a random string of length _len from the letters A, B and C
    letters = "ABC"
    return "".join(random.choice(letters) for _ in range(_len))

r = random_string(2000000)
pattern = re.compile(r"A")

# Time the regex-based split
start = time.time()
pattern.split(r)
print("with re.split : ", time.time() - start)

# Time the built-in string split
start = time.time()
r.split("A")
print("with built-in split : ", time.time() - start)
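A single time.time() measurement can be noisy, so here is a minimal sketch of the same comparison using the standard timeit module and taking the best of several runs. The helper name compare_splits and its default sizes are assumptions made for illustration, not part of the original benchmark, and the absolute numbers will vary by machine and Python version.

```python
import random
import re
import timeit


def random_string(length, letters="ABC"):
    """Return a random string of the given length drawn from `letters`."""
    return "".join(random.choice(letters) for _ in range(length))


def compare_splits(length=200_000, repeats=5):
    """Return best-of-`repeats` timings in seconds as (re_time, str_time).

    Hypothetical helper: the name and defaults are illustrative only.
    """
    text = random_string(length)
    pattern = re.compile("A")
    # For a literal one-character delimiter both approaches give the same list.
    assert pattern.split(text) == text.split("A")
    re_time = min(timeit.repeat(lambda: pattern.split(text), number=1, repeat=repeats))
    str_time = min(timeit.repeat(lambda: text.split("A"), number=1, repeat=repeats))
    return re_time, str_time


if __name__ == "__main__":
    re_time, str_time = compare_splits()
    print("with re.split       :", re_time)
    print("with built-in split :", str_time)
```

Taking the minimum of several repeats reduces the influence of other processes on the measurement.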
Why this difference?
The split() function of the "re" module: Python's "re" module provides regular expression matching operations similar to those found in Perl. Both patterns and the strings to be searched can be Unicode strings as well as 8-bit strings. re.split() uses a regex pattern to split a given string into a list.
The re.split() function splits the given string at each occurrence of a particular character or pattern and returns the pieces between the matches as a list.
re.search() and re.match() are used differently. Both return a match object for the first match found, but re.match() only matches at the beginning of the string, while re.search() scans the whole string.
re.split is expected to be slower, as the usage of regular expressions incurs some overhead.
Of course if you are splitting on a constant string, there is no point in using re.split().
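To illustrate the constant-string case, here is a small sketch showing that a regex split on an escaped literal gives the same result as str.split(). The helper name split_literal is hypothetical and exists only for this example; it assumes a non-empty separator.

```python
import re


def split_literal(text, sep):
    """Split `text` on the literal string `sep` via the re module.

    Hypothetical helper for illustration; `sep` is assumed to be non-empty.
    """
    # re.escape makes regex metacharacters such as "." match literally.
    return re.split(re.escape(sep), text)


sample = "a.b.c"

# With the separator escaped, the regex split matches the built-in split.
assert split_literal(sample, ".") == sample.split(".")

# Without escaping, "." matches every character, so the results differ.
assert re.split(".", sample) != sample.split(".")
```

Since the results are identical for a literal separator, the regex version only adds escaping work and matching overhead.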
When in doubt, check the source code. You can see that Python's s.split() is optimized for whitespace and inlined. But s.split() is for fixed delimiters only.
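The whitespace mode and the fixed-delimiter mode of s.split() also behave differently, not just perform differently. A short sketch of the behavior (the sample string is made up for illustration):

```python
text = "  one   two\tthree\n"

# No argument: split on runs of any whitespace, dropping leading/trailing runs.
no_arg = text.split()

# Fixed delimiter: split on every single space, keeping empty fields
# and leaving other whitespace (tab, newline) untouched.
fixed = text.split(" ")

print(no_arg)
print(fixed)
```

Neither mode accepts a pattern, which is where re.split() comes in.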
For the speed tradeoff, a re.split regular expression based split is far more flexible.
>>> re.split(':+',"One:two::t h r e e:::fourth field")
['One', 'two', 't h r e e', 'fourth field']
>>> "One:two::t h r e e:::fourth field".split(':')
['One', 'two', '', 't h r e e', '', '', 'fourth field']
# would require an additional step to find the empty fields...
>>> re.split('[:\d]+',"One:two:2:t h r e e:3::fourth field")
['One', 'two', 't h r e e', 'fourth field']
# try that without a regex split in an understandable way...
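For comparison, this is roughly what the two regex splits above might look like without the re module. It is only one possible approach, sketched here for illustration: filter out the empty fields after the split, and for the digit case first map every digit to ':' with str.translate.

```python
import re

text = "One:two::t h r e e:::fourth field"

# Extra step: drop the empty strings produced by consecutive delimiters.
fields = [f for f in text.split(":") if f]
assert fields == re.split(":+", text)

text2 = "One:two:2:t h r e e:3::fourth field"

# Map every digit to ':' so that one fixed delimiter covers both cases,
# then split and drop the empty fields as before.
digits_to_colon = str.maketrans("0123456789", ":" * 10)
fields2 = [f for f in text2.translate(digits_to_colon).split(":") if f]
assert fields2 == re.split(r"[:\d]+", text2)
```

Note that the filtering version also drops empty fields at the start or end of the string, which re.split() would keep, so the two are only equivalent for inputs like these that do not begin or end with a delimiter.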
That re.split() is only 29% slower (or that s.split() is only 40% faster) is what should be amazing.