I have a large string with a length on the order of 5*10^6.
I have to do some processing by dividing it into blocks of 16 characters. I used a custom-made function to split the string, assuming its performance would be better than the splice approach.
The functions are as follows:
def spliceSplitter(s):
    sum = 0
    while len(s) > 0:
        block = s[:16]
        # Assuming the process to be done with data block is calculating its length.
        sum += len(block)
        s = s[16:]
    return sum
And the custom function:
def normalSplitter(s):
    sum = 0
    l = len(s)
    data = ""
    for i in xrange(l):
        if i % 16 == 0:
            # Assuming the process to be done with data block is calculating its length.
            sum += len(data)
            data = ""
        data += s[i]
    return sum + len(data)
I ran cProfile on both of them, and the results are as follows (time in seconds):
String Length | Splice Splitter | Normal Splitter
--------------------------------------------------
      5000000 |         289.0   |          1.274
       500000 |           0.592 |          0.134
        50000 |           0.25  |          0.28
         5000 |           0.001 |          0.003
I am generating the string as follows:
import random

s = ''.join([str(random.randint(1, 9)) for x in xrange(5000000)])
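For reference, a minimal sketch of how such a comparison might be driven with cProfile (my own harness, not necessarily the one used for the numbers above, and assuming s and both splitter functions are defined at module level):

import cProfile

# Reusing the string s built above.  Both splitters return the same total
# (here just len(s)), so the interesting difference is the runtime.
cProfile.run('spliceSplitter(s)')
cProfile.run('normalSplitter(s)')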
My question: Why does the splice splitter become so much slower than the character-by-character splitter as the string grows, and what is the most efficient way to split the string into blocks of 16 characters?
NOTE: The process(data) that I have to perform on each block has no return value.
Using yield and an improved splice splitter, the performance came out as follows:
String Length | Splice Splitter | Normal Splitter | Yield/Generator
--------------------------------------------------------------------
      5000000 |           0.148 |           1.274 |          0.223
       500000 |           0.016 |           0.134 |          0.29
        50000 |           0.003 |           0.28  |          0.005
         5000 |          ~0.000 |           0.003 |         ~0.000
Code:
def pythonicSplitter(s):
    gen = (s[i:i+16] for i in xrange(0, len(s), 16))
    sum = 0
    for data in gen:
        sum += len(data)
    return sum

def spliceSplitter(s):
    sum = 0
    for x in xrange(0, len(s), 16):
        block = s[x:x+16]
        # Assuming the process to be done with data block is calculating its length.
        sum += len(block)
    return sum
Reason for the improved performance:
As pointed out in Patrick Collins's answer, the line s = s[16:] copies the remaining string on every iteration, leading to a time complexity of ~O(N^2). Once it was replaced with s[x:x+16], the complexity was reduced to O(N * 16), i.e. O(N), improving performance by a huge margin. The yield/generator function (pythonicSplitter()) does the same thing, but due to the large number of calls into the generator, it takes slightly longer than the improved splice splitter. A string is a sequence type: indexing only extracts a single character, whereas slicing accesses a whole range of characters (a substring) at once, which gives more flexibility when working with string data.
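Since the real process(data) has no return value, the same block-wise loop can call it directly instead of accumulating lengths; a minimal sketch, where process() is only a placeholder for whatever per-block work is actually done:

def process(data):
    # Stand-in for the real per-block work, which has no return value.
    pass

def processBlocks(s, n=16):
    # Walk the string in fixed-size steps; each slice copies only n characters,
    # never the whole remainder of s.
    for i in xrange(0, len(s), n):
        process(s[i:i + n])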
Short answer: str slices, in general, copy. That means that your function that does a slice for each of your string's n suffixes is doing O(n^2) work.
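To put a rough number on that for the original spliceSplitter (my own back-of-the-envelope illustration, not part of the answer): each iteration copies a 16-character block plus the entire remainder of the string for s = s[16:], so the total number of characters copied is roughly n^2 / 32:

# Rough count of characters copied by the original spliceSplitter for n = 5,000,000:
# iteration k copies 16 chars (the block) plus n - 16*k - 16 chars (the new s).
n = 5000000
copied = sum(16 + (n - 16 * k - 16) for k in xrange(n // 16))
print copied        # ~7.8e11 characters copied in total
print n ** 2 // 32  # same order of magnitude: ~7.8e11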
The time complexity of slicing in Python is O(k), where k is the length of the slice; see https://wiki.python.org/moin/TimeComplexity#list for more.
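A quick way to see that slicing cost scales with the length of the slice (a rough sketch; the absolute numbers will vary by machine):

import timeit

text = 'x' * 5000000
for k in (16, 1000, 100000, 5000000):
    t = timeit.timeit(lambda: text[:k], number=1000)
    print k, t  # the time grows roughly linearly with the slice length k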
I would guess that this line: s = s[16:] is causing s to be overwritten on every loop iteration, copying the whole string over. block = s[:16] is also copying a string, so you're essentially writing the string into memory twice on every loop iteration. data = "" in normalSplitter() is probably also ensuring that you never keep more than 16 characters of your copy of the string in memory at one time, and you never do any copy operations on the whole string.
That's pushing a lot of data around, and, I expect, causing you to start getting cache misses on the largest size string (although evidently the smaller strings were able to fit inside the cache comfortably). Try using a solution like this one.
def newSplitter(s, n=16):
    for i in xrange(0, len(s), n):
        yield s[i:i+n]
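For example, the generator could then be consumed like this (process() again being just a placeholder for whatever is actually done with each block):

for block in newSplitter(s):
    process(block)  # placeholder for the real per-block work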