Time complexity of string slice

Tags:

time-complexity

What's the time complexity of slicing a Python string? Given that Python strings are immutable, I can imagine slicing them being either O(1) or O(n) depending upon how slicing is implemented.

I need to write a function that iterates over all suffixes of a (potentially big) string. I could avoid slicing the string by representing a suffix as a tuple of the entire string plus an index to start reading characters from, but that's ugly. If instead I naively write my function like this:

def do_something_on_all_suffixes(big_string):     for i in range(len(big_string)):         suffix = big_string[i:]         some_constant_time_operation(suffix)

... will its time complexity be O(n) or O(n²), where n is len(big_string)?

228

asked Feb 03 '16 14:02

Mark Amery

2 Answers

Short answer: str slices, in general, copy. That means that your function that does a slice for each of your string's n suffixes is doing O(n²) work. That said, you can avoid copies if you can work with bytes-like objects using memoryviews to get zero-copy views of the original bytes data. See How you can do zero copy slicing below for how to make it work.

Long answer: (C)Python str do not slice by referencing a view of a subset of the data. There are exactly three modes of operation for str slicing:

Complete slice, e.g. mystr[:]: Returns a reference to the exact same str (not just shared data, the same actual object, mystr is mystr[:] since str is immutable so there is no risk to doing so)
The zero length slice and (implementation dependent) cached length 1 slices; the empty string is a singleton (mystr[1:1] is mystr[2:2] is ''), and low ordinal strings of length one are cached singletons as well (on CPython 3.5.0, it looks like all characters representable in latin-1, that is Unicode ordinals in range(256), are cached)
All other slices: The sliced str is copied at creation time, and thereafter unrelated to the original str

The reason why #3 is the general rule is to avoid issues with large str being kept in memory by a view of a small portion of it. If you had a 1GB file, read it in and sliced it like so (yes, it's wasteful when you can seek, this is for illustration):

with open(myfile) as f:     data = f.read()[-1024:]

then you'd have 1 GB of data being held in memory to support a view that shows the final 1 KB, a serious waste. Since slices are usually smallish, it's almost always faster to copy on slice instead of create views. It also means str can be simpler; it needs to know its size, but it need not track an offset into the data as well.

How you can do zero copy slicing

There are ways to perform view based slicing in Python, and in Python 2, it will work on str (because str is bytes-like in Python 2, supporting the buffer protocol). With Py2 str and Py3 bytes (as well as many other data types like bytearray, array.array, numpy arrays, mmap.mmaps, etc.), you can create a memoryview that is a zero copy view of the original object, and can be sliced without copying data. So if you can use (or encode) to Py2 str/Py3 bytes, and your function can work with arbitrary bytes-like objects, then you could do:

def do_something_on_all_suffixes(big_string):     # In Py3, may need to encode as latin-1 or the like     remaining_suffix = memoryview(big_string)     # Rather than explicit loop, just replace view with one shorter view     # on each loop     while remaining_suffix:  # Stop when we've sliced to empty view         some_constant_time_operation(remaining_suffix)         remaining_suffix = remaining_suffix[1:]

The slices of memoryviews do make new view objects (they're just ultra-lightweight with fixed size unrelated to the amount of data they view), just not any data, so some_constant_time_operation can store a copy if needed and it won't be changed when we slice it down later. Should you need a proper copy as Py2 str/Py3 bytes, you can call .tobytes() to get the raw bytes obj, or (in Py3 only it appears), decode it directly to a str that copies from the buffer, e.g. str(remaining_suffix[10:20], 'latin-1').

180

answered Oct 05 '22 17:10

ShadowRanger

It all depends on how big your slices are. I threw together the following two benchmarks. The first slices the entire string and the second only a little bit. Curve fitting with this tool gives

# s[1:-1] y = 0.09 x^2 + 10.66 x - 3.25  # s[1:1000] y = -0.15 x + 17.13706461

The first looks quite linear for slices of strings up to 4MB. I guess this really measures the time taken to construct a second string. The second is pretty constant, although it's so fast it's probably not that stable.

enter image description here

import time  def go(n):     start = time.time()     s = "abcd" * n     for j in xrange(50000):          #benchmark one         a = s[1:-1]          #benchmark two         a = s[1:1000]      end = time.time()     return (end - start) * 1000  for n in range(1000, 100000, 5000):     print n/1000.0, go(n)

answered Oct 05 '22 16:10

jozxyqk

Related questions
                            
                                Django unique=True not working
                            
                                Is it safe to just implement __lt__ for a class that will be sorted?
                            
                                How to share secondary y-axis between subplots in matplotlib
                            
                                Difference between various numpy random functions
                            
                                Why Python's list does not have shift/unshift methods?
                            
                                How can i process multi loss in pytorch?
                            
                                Inspect python class attributes
                            
                                How to compare a list of lists/sets in python?
                            
                                How can I subclass a Pandas DataFrame?
                            
                                Write dictionary of lists to a CSV file
                            
                                What is the intended use of the DEFAULT section in config files used by ConfigParser?
                            
                                How do I send HTML Formatted emails, through the gmail-api for python
                            
                                What is the difference between postgres and postgresql_psycopg2 as a database engine for django?
                            
                                Use lambda expression to count the elements that I'm interested in Python
                            
                                How to make a copy of a python module at runtime?
                            
                                Simulate Python keypresses for controlling a game
                            
                                Retrieve name of column from its Index in Pandas
                            
                                Cross-platform subprocess with hidden window
                            
                                Recursive list comprehension in Python?
                            
                                How come unpacking is faster than accessing by index?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With