Python strings are immutable, so why does s.split() return a list of new strings?

From looking at the CPython implementation, it seems that the return value of str.split() is a list of newly allocated strings. However, since strings are immutable, the substrings could instead have been built as views into the original string, each one pointing at an offset.

Am I understanding the current behavior of CPython correctly? Are there reasons for not opting for this space optimization? One reason I can think of is that the parent string cannot be freed until all of its substrings are.
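For comparison, here is a minimal sketch of what offset-based sharing looks like where Python does offer it: memoryview slices over a bytes-like object. This is an illustration only, not how str works. The buffer contents and variable names are made up for the example.

```python
# Copies versus views, using a bytearray so the sharing is observable.
data = bytearray(b"alpha beta gamma")

# bytes.split() returns independent objects, just like str.split().
copies = bytes(data).split()

# A memoryview slice is an (offset, length) window onto the same buffer.
view = memoryview(data)[6:10]

# Change one byte of the underlying buffer.
data[6] = ord("B")

copy_after = copies[1]        # unaffected: it owns its own storage
view_after = bytes(view)      # reflects the change: it shares storage
parent_is_kept = view.obj is data  # the view holds a reference to its parent
```

The last line is the trade-off raised in the question: as long as the view exists, its parent buffer cannot be freed.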

asked May 08 '17 06:05

san

2 Answers

Without a crystal ball I can't tell you why CPython does it that way. However, there are some reasons why you might choose to do it that way.

The problem is that a small string might hold a reference to a much larger backing array. For example, suppose I read in an 8 GB HTTP access log file to analyze which user agents access my file the most, and I do that just by calling fp.read() and then running a regex on the whole file at once rather than going one line at a time.

I want to know about the top 10 most common user agents, so I keep those 10 strings around in a list.

Then I want to do the same analysis for 100 other files, to see how the top 10 user agents have changed over time. Boom! My program is trying to use 800 GB of memory and gets killed. Why? If substrings shared memory with their parent, each list of 10 short strings would keep its entire 8 GB file's contents alive. How do I debug this?
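The retention problem can be sketched in a few lines. Everything here is hypothetical: `Buffer` stands in for the big string (a plain class, so it can be weakly referenced), and `SharedSlice` is an imagined offset-based substring, not anything CPython provides.

```python
import gc
import weakref


class Buffer:
    """Stand-in for a large string read from a log file."""

    def __init__(self, text):
        self.text = text


class SharedSlice:
    """Hypothetical substring that stores offsets into its parent."""

    def __init__(self, parent, start, stop):
        self.parent = parent  # this reference keeps the whole parent alive
        self.start = start
        self.stop = stop

    def __str__(self):
        return self.parent.text[self.start:self.stop]


def top_agent_shared(log):
    # Tiny result, but it drags the entire log along with it.
    return SharedSlice(log, 0, 7)


def top_agent_copied(log):
    # Independent str, which is what slicing and split() produce today.
    return log.text[0:7]


def parent_survives(extract):
    """Return (is the log still alive?, the extracted text)."""
    log = Buffer("Mozilla" + " filler" * 100_000)
    probe = weakref.ref(log)
    kept = extract(log)
    del log
    gc.collect()
    return probe() is not None, str(kept)


shared_result = parent_survives(top_agent_shared)
copied_result = parent_survives(top_agent_copied)
```

With the shared version the log outlives the `del`, because the seven-character result still references it. With the copied version the log is released as soon as the last direct reference goes away.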

Java used this sharing technique prior to Java 7, so the same reasoning applies. See "Java 7 String - substring complexity" and "JDK-4513622: (str) keeping a substring of a field prevents GC for object".

Also note that having strings share memory would require you to follow a pointer from the string object to the string data. In CPython, the string data is usually placed directly after a header in memory, so you don't need to follow a pointer. This reduces the number of allocations required and reduces data dependencies when reading strings.
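One rough way to see the inline layout from Python, assuming CPython 3.3 or later and ASCII-only text: the size reported for the string object itself grows with the text, which suggests the characters live inside the object rather than behind a separate pointer. The variable names are just for this example.

```python
import sys

short = "a" * 99
longer = "a" * 100

# sys.getsizeof() reports the size of the str object itself.
# If the character data were stored elsewhere, the object size
# would not change with the length of the text.
growth = sys.getsizeof(longer) - sys.getsizeof(short)
print(growth)
```

On CPython I would expect this to report one extra byte for the one extra ASCII character, but the exact header size varies between versions, so treat the numbers as an implementation detail.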

answered Sep 18 '22 17:09

Dietrich Epp

In the current CPython implementation, strings are reference-counted; it is assumed that a string cannot hold references to other objects because a string is not a container. This means that the garbage collector does not need to inspect or trace over string objects (because they're entirely covered by reference counting). But it's actually worse than that: old versions of Python did not have a tracing garbage collector at all; the cyclic GC was new in Python 2.0. Before that, any cyclic garbage would simply leak.
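You can check the "not a container" part from Python with gc.is_tracked(), which reports whether the cyclic collector tracks an object. A small sketch (the sample values are arbitrary):

```python
import gc

# Strings cannot refer to other objects, so the cyclic GC ignores them.
text_tracked = gc.is_tracked("some string")

# A list can refer to other objects, so the cyclic GC tracks it.
list_tracked = gc.is_tracked(["some string"])

print(text_tracked, list_tracked)
```

A substring that pointed at its parent would be a string that holds a reference to another object, which is exactly the assumption described above.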

A competently-implemented substring-to-offset algorithm should not form cycles. So in theory, a cyclic garbage collector is not a prerequisite for this. However, because we're doing reference counting instead of tracing, the child objects become responsible for Py_DECREF()ing their parent objects at end-of-life. Otherwise the parent leaks. This means you cannot just chuck the whole string into the free list when it reaches end-of-life; you have to check whether it's a substring, and branching is potentially expensive. Python was historically designed to do string processing (like Perl, but with nicer syntax), which means creating and destroying a lot of strings. Furthermore, all variable names are internally stored as strings, so even if the user is not doing string processing, the interpreter is. Slowing down the string deallocation process by even a little could have a serious impact on performance.
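To make the bookkeeping concrete, here is a small sketch of the reference a child would have to hold and release. `Parent` and `Substring` are hypothetical classes for illustration, and sys.getrefcount() is CPython-specific.

```python
import sys


class Parent:
    """Stand-in for the original string."""

    def __init__(self, text):
        self.text = text


class Substring:
    """Hypothetical offset-based child that references its parent."""

    def __init__(self, parent, start, stop):
        self.parent = parent  # the equivalent of Py_INCREF(parent)
        self.start = start
        self.stop = stop


parent = Parent("hello world")

before = sys.getrefcount(parent)
child = Substring(parent, 0, 5)
during = sys.getrefcount(parent)  # one higher while the child exists

# Destroying the child has to release the parent again,
# the equivalent of Py_DECREF(parent) in the child's deallocator.
del child
after = sys.getrefcount(parent)
```

Here the interpreter does the release for us when the instance is torn down. In the C implementation of str, that extra check and decrement would have to be added to every string deallocation, including the ones for strings that are not substrings.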

answered Sep 21 '22 17:09

Kevin


Tags: python, string, split, python-internals
