Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python's immutable strings and their slices

The strings in Python are immutable and support the buffer interface. It could be efficient to return not the new strings, but the buffers pointing to the parts of the old string when using slices or the .split() method. However, a new string object is constructed each time. Why? The single reason I see is that it can make garbage collection a bit more difficult.

True: in regular situations the memory overhead is linear and isn't noticeable. Copying is fast, and so is allocation. But there is already too much done in Python, so maybe such buffers are worth the effort?

EDIT:

It seems that forming substrings this way would make memory management much more complicated. The case where only 20% of the arbitrary string is used, and we can't deallocate the rest of the string, is a simple example. We can improve the memory allocator, so it would be able to deallocate strings partially, but probably it would be mostly a disprovement. All the standard functions can anyway be emulated with buffer or memoryview if memory becomes critical. The code wouldn't be that concise, but one has to give up something in order to get something.

like image 705
gukoff Avatar asked Aug 04 '13 10:08

gukoff


2 Answers

The underlying string representation is null-terminated, even though it keeps track of the length, hence you cannot have a string object that references a sub-string that isn't a suffix. This already limits the usefulness of your proposal since it would add a lot of complications to deal differently with suffices and non-suffices (and giving up with null-terminating strings brings other consequences).

Allowing to refer to sub-strings of a string means to complicate a lot garbage collection and string handling. For every string you'd have to keep track how many objects refer to each character, or to each range of indices. This means complicating a lot the struct of string objects and any operation that deals with them, meaning a, probably big, slow down.

Add the fact that starting with python3 strings have 3 different internal representations, and things are going to be too messy to be maintainable, and your proposal probably doesn't give enough benefits to be accepted.


An other problem with this kind of "optimization" is when you want to deallocate "big strings":

a = "Some string" * 10 ** 7
b = a[10000]
del a

After this operations you have the substring b that prevents a, a huge string, to be deallocated. Surely you could do copies of small strings, but what if b = a[:10000](or another big number)? 10000 characters looks like a big string which ought to use the optimization to avoid copying, but it is preventing to realease megabytes of data. The garbage collector would have to keep checking whether it is worth to deallocate a big string object and make copies or not, and all these operations must be as fast as possible, otherwise you end up decreasing time-performances.

99% of the times the strings used in the programs are "small"(max 10k characters), hence copying is really fast, while the optimizations you propose start to become effective with really big strings(e.g. take substrings of size 100k from huge texts) and are much slower with really small strings, which is the common case, i.e. the case that should be optimized.


If you think important then you are free to propose a PEP, show an implementation and the resultant changes in speed/memory usage of your proposal. If it is really worth the effort it may be included in a future version of python.

like image 102
Bakuriu Avatar answered Nov 14 '22 21:11

Bakuriu


That's how slices work. Slices always perform a shallow copy, allowing you to do things like

>>> x = [1,2,3]
>>> y = x[:]

Now it would be possible to make an exception for strings, but is it really worth it? Eric Lippert blogged about his decision not to do that for .NET; I guess his argument is valid for Python as well.

See also this question.

like image 31
Tim Pietzcker Avatar answered Nov 14 '22 20:11

Tim Pietzcker