I have a large string with a length on the order of 5*10^6.
I have to do some processing by dividing it into blocks of 16 characters. I used a custom-made function to split the string, assuming its performance would be better than the splice approach.
The functions are as follows:
def spliceSplitter(s):
    sum = 0
    while len(s) > 0:
        block = s[:16]
        # Assuming the process to be done with data block is calculating its length.
        sum += len(block)
        s = s[16:]
    return sum
And the custom function:
def normalSplitter(s):
    sum = 0
    l = len(s)
    data = ""
    for i in xrange(l):
        if i % 16 == 0:
            # Assuming the process to be done with data block is calculating its length.
            sum += len(data)
            data = ""
        data += s[i]
    return sum + len(data)
I ran cProfile on both of them, and the results are as follows (time in seconds):
String Length | Splice Splitter | Normal Splitter
--------------------------------------------------
      5000000 |         289.0   |          1.274
       500000 |           0.592 |          0.134
        50000 |           0.25  |          0.28
         5000 |           0.001 |          0.003
I am generating the string as follows:
import random

s = ''.join([str(random.randint(1, 9)) for x in xrange(5000000)])
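For reference, a minimal sketch of how such a comparison might be driven with cProfile (my own harness, not necessarily the one used for the numbers above, and assuming s and both splitter functions are defined at module level):

import cProfile

# Reusing the string s built above.  Both splitters return the same total
# (here just len(s)), so the interesting difference is the runtime.
cProfile.run('spliceSplitter(s)')
cProfile.run('normalSplitter(s)')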
My question: Why does the splice splitter become so much slower than the character-by-character splitter as the string grows, and what is the most efficient way to split the string into blocks of 16 characters?
NOTE: The process(data) that I have to perform on each block has no return value.
Using yield and an improved splice splitter, the performance came out as follows:
String Length | Splice Splitter | Normal Splitter | Yield/Generator
--------------------------------------------------------------------
      5000000 |           0.148 |           1.274 |          0.223
       500000 |           0.016 |           0.134 |          0.29
        50000 |           0.003 |           0.28  |          0.005
         5000 |          ~0.000 |           0.003 |         ~0.000
Code:
def pythonicSplitter(s):
    gen = (s[i:i+16] for i in xrange(0, len(s), 16))
    sum = 0
    for data in gen:
        sum += len(data)
    return sum

def spliceSplitter(s):
    sum = 0
    for x in xrange(0, len(s), 16):
        block = s[x:x+16]
        # Assuming the process to be done with data block is calculating its length.
        sum += len(block)
    return sum
Reason for the improved performance:
As pointed out in Patrick Collins's answer, the line s = s[16:] copies the remaining string on every iteration, leading to a time complexity of ~O(N^2). Once it was replaced with s[x:x+16], the complexity was reduced to O(N * 16), i.e. O(N), improving performance by a huge margin. The yield/generator function (pythonicSplitter()) does the same thing, but due to the large number of calls into the generator, it takes slightly longer than the improved splice splitter. A string is a sequence type: indexing only extracts a single character, whereas slicing accesses a whole range of characters (a substring) at once, which gives more flexibility when working with string data.
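Since the real process(data) has no return value, the same block-wise loop can call it directly instead of accumulating lengths; a minimal sketch, where process() is only a placeholder for whatever per-block work is actually done:

def process(data):
    # Stand-in for the real per-block work, which has no return value.
    pass

def processBlocks(s, n=16):
    # Walk the string in fixed-size steps; each slice copies only n characters,
    # never the whole remainder of s.
    for i in xrange(0, len(s), n):
        process(s[i:i + n])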
Short answer: str slices, in general, copy. That means that your function that does a slice for each of your string's n suffixes is doing O(n^2) work.
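To put a rough number on that for the original spliceSplitter (my own back-of-the-envelope illustration, not part of the answer): each iteration copies a 16-character block plus the entire remainder of the string for s = s[16:], so the total number of characters copied is roughly n^2 / 32:

# Rough count of characters copied by the original spliceSplitter for n = 5,000,000:
# iteration k copies 16 chars (the block) plus n - 16*k - 16 chars (the new s).
n = 5000000
copied = sum(16 + (n - 16 * k - 16) for k in xrange(n // 16))
print copied        # ~7.8e11 characters copied in total
print n ** 2 // 32  # same order of magnitude: ~7.8e11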
The time complexity of slicing in Python is O(k), where k is the length of the slice; see https://wiki.python.org/moin/TimeComplexity#list for more.
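A quick way to see that slicing cost scales with the length of the slice (a rough sketch; the absolute numbers will vary by machine):

import timeit

text = 'x' * 5000000
for k in (16, 1000, 100000, 5000000):
    t = timeit.timeit(lambda: text[:k], number=1000)
    print k, t  # the time grows roughly linearly with the slice length k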
I would guess that this line: s = s[16:] is causing s to be overwritten on every loop iteration, copying the whole string over. block = s[:16] is also copying a string, so you're essentially writing the string into memory twice on every loop iteration. data = "" in normalSplitter() is probably also ensuring that you never keep more than 16 characters of your copy of the string in memory at one time, and you never do any copy operations on the whole string.
That's pushing a lot of data around, and, I expect, causing you to start getting cache misses on the largest size string (although evidently the smaller strings were able to fit inside the cache comfortably). Try using a solution like this one.
def newSplitter(s, n=16):
    for i in xrange(0, len(s), n):
        yield s[i:i+n]
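For example, the generator could then be consumed like this (process() again being just a placeholder for whatever is actually done with each block):

for block in newSplitter(s):
    process(block)  # placeholder for the real per-block work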