
Efficiency: String Slice vs Custom Function

I have a large string with length on the order of 5*10^6.

I have to do some processing by dividing it into blocks of 16 characters. I used a custom-made function to split the string, assuming its performance would be better than the splice approach.

The functions are as follows:

def spliceSplitter(s):
    sum = 0
    while len(s) > 0:
        block = s[:16]
        # Assuming the process to be done with data block is calculating its length.
        sum += len(block)
        s = s[16:]
    return sum

And the custom function:

def normalSplitter(s):
    sum = 0
    l = len(s)
    data = ""
    for i in xrange(l):
        if i % 16 == 0:
            # Assuming the process to be done with data block is calculating its length.
            sum += len(data)
            data = ""
        data += s[i]
    return sum + len(data)

I profiled both of them with cProfile; the results are as follows (time in seconds):

String Length     |  Splice Splitter   |  Normal Splitter
---------------------------------------------------------
5000000           |  289.0             |  1.274 
500000            |  0.592             |  0.134
50000             |  0.25              |  0.28
5000              |  0.001             |  0.003 

I am generating the string as follows:

s = ''.join([str(random.randint(1,9)) for x in xrange(5000000)])

My Questions:

  • Is there a Pythonic way to get the same or better efficiency than the custom normal splitter? Perhaps splitting the entire string beforehand, storing it in a list, and then operating on it iteratively.
  • Why does the splice splitter perform better for smaller strings? (Just curious about this one.)

NOTE: The process(data) that I have to perform has no return value.
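For reference, the pre-splitting idea from the first bullet can be sketched like this (a sketch only, written in Python 3 syntax with range instead of xrange; listSplitter and the placeholder per-block processing are illustrative names, not from the original code):

```python
def listSplitter(s, n=16):
    # Build all n-character blocks up front in one slicing pass;
    # each slice is O(n), so the whole split is O(len(s)).
    blocks = [s[i:i + n] for i in range(0, len(s), n)]
    total = 0
    for block in blocks:
        # Placeholder for the real process(block), which has no return value;
        # here we just count characters so the result can be checked.
        total += len(block)
    return total
```

This trades memory (the whole list of blocks is held at once) for a simple, linear-time split.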

EDIT

Using yield and an improved splice splitter, the performance was as follows:

String Length     |  Splice Splitter   |  Normal Splitter  |  Yield/Generator
-------------------------------------------------------------------------------
5000000           |  0.148             |  1.274            |  0.223
500000            |  0.016             |  0.134            |  0.29
50000             |  0.003             |  0.28             |  0.005
5000              |  ~0.000            |  0.003            |  ~0.000

Code:

def pythonicSplitter(s):
    gen = (s[i:i+16] for i in xrange(0, len(s), 16))
    sum = 0
    for data in gen:
        sum += len(data)
    return sum

def spliceSplitter(s):
    sum = 0
    for x in xrange(0, len(s), 16):
        block = s[x:x+16]
        # Assuming the process to be done with data block is calculating its length.
        sum += len(block)
    return sum

Reasons for the improved performance:

  • The original splice splitter repeatedly created a new string on every iteration with s = s[16:], as pointed out in Patrick Collins's answer. This led to a time complexity of ~O(N^2).
  • Once the repeated re-creation of s was replaced with the slice s[x:x+16], the total work dropped to O(N) (N/16 iterations, each slicing 16 characters), improving performance by a huge margin. The yield/generator function (pythonicSplitter()) does the same thing, but due to the large number of calls into the generator, it takes slightly longer than the improved splice splitter.
  • The normal splitter also builds blocks of length 16, but since Python strings are immutable, building each block one character at a time with data += s[i] creates a new string on every append, which is significantly slower than the built-in optimized slicing.
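The quadratic-versus-linear distinction in the first two bullets can be demonstrated with a small benchmark (a sketch in Python 3 using timeit; the function names and the test string are illustrative, not from the original code):

```python
import timeit

def quadratic_splitter(s):
    total = 0
    while len(s) > 0:
        total += len(s[:16])
        s = s[16:]  # re-creates the remaining string each time: ~O(N^2) overall
    return total

def linear_splitter(s):
    total = 0
    for x in range(0, len(s), 16):
        total += len(s[x:x + 16])  # each slice copies only 16 chars: O(N) overall
    return total

s = "1" * 100_000
t_quad = timeit.timeit(lambda: quadratic_splitter(s), number=1)
t_lin = timeit.timeit(lambda: linear_splitter(s), number=1)
# On a typical machine t_quad is far larger than t_lin, and the gap
# widens rapidly as the string grows.
```

Both functions return the same count; only the amount of copying differs.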
asked Jun 16 '14 by Kyuubi


1 Answer

I would guess that this line: s = s[16:] is the culprit. It re-creates the remainder of the string on every loop iteration, copying nearly the whole string each time. block = s[:16] also copies a string, so you're essentially writing the string into memory twice on every iteration. In normalSplitter(), by contrast, data = "" ensures you never hold more than 16 characters of copied data at a time, and you never do any copy operations on the whole string.

That's pushing a lot of data around and, I expect, causing you to start getting cache misses on the largest string (the smaller strings were evidently able to fit inside the cache comfortably). Try using a solution like this one:

def newSplitter(s, n=16):
    for i in xrange(0, len(s), n):
        yield s[i:i+n]
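An equivalent chunking generator can also be written with itertools.islice (a sketch, not part of the original answer; shown in Python 3, and chunks is an illustrative name):

```python
from itertools import islice

def chunks(s, n=16):
    # Pull n characters at a time from a single iterator over s,
    # stopping once the string is exhausted. Avoids index arithmetic
    # and works for any iterable of characters, not just strings.
    it = iter(s)
    while True:
        block = ''.join(islice(it, n))
        if not block:
            return
        yield block
```

For example, list(chunks('abcdefg', 3)) yields ['abc', 'def', 'g'].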
answered Oct 22 '22 by Patrick Collins