
Python difference between mutating and re-assigning a list ( _list = and _list[:] = )

So I frequently write code following a pattern like this:

_list = list(range(10)) # Or whatever
_list = [some_function(x) for x in _list]
_list = [some_other_function(x) for x in _list]

etc

I saw now on a different question a comment that explained how this approach creates a new list each time and it is better to mutate the existing list, like so:

_list[:] = [some_function(x) for x in _list]

It's the first time I've seen this explicit recommendation and I'm wondering what the implications are:

1) Does the mutation save memory? Presumably the references to the "old" list would drop to zero after a re-assignment and the "old" list is disregarded, but is there a delay before that happens where I'm potentially using more memory than I need to when I use re-assignment instead of mutating the list?

2) Is there a computational cost to using mutation? I suspect that changing something in place is more expensive than creating a new list and simply dropping the old one, but is it?

In terms of safety, I wrote a script to test this:

def some_function(number: int):
    return number*10

def main():
    _list1 = list(range(10))
    _list2 = list(range(10))

    a = _list1
    b = _list2 

    _list1 = [some_function(x) for x in _list1]
    _list2[:] = [some_function(x) for x in _list2]

    print(f"list a: {a}")
    print(f"list b: {b}")


if __name__=="__main__":
    main()

Which outputs:

list a: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
list b: [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

So mutation does seem to have the drawback of being more likely to cause side effects. Although these might be desirable. Are there any PEPs that discuss this safety aspect, or other best practice guides?

Thank you.

EDIT: Conflicting answers, so more tests on memory

I have received two conflicting answers so far. In the comments, jasonharper has written that the right-hand side of an assignment does not know about the left-hand side, and therefore memory usage cannot possibly be affected by what appears on the left. However, in the answers, Masoud has written that "when [reassignment] is used, two new and old _lists with two different identities and values are created. Afterward, old _list is garbage collected. But when a container is mutated, every single value is retrieved, changed in CPU and updated one-by-one. So the list is not duplicated." This seems to indicate that there is a big memory cost to doing reassignment.

I decided to try using memory-profiler to dig deeper. Here is the test script:

from memory_profiler import profile


def normalise_number(number: int):
    return number%1000


def change_to_string(number: int):
    return "Number as a string: " + str(number) + "something" * number


def average_word_length(string: str):
    return len(string)/len(string.split())


@profile(precision=8)
def mutate_list(_list):
    _list[:] = [normalise_number(x) for x in _list]
    _list[:] = [change_to_string(x) for x in _list]
    _list[:] = [average_word_length(x) for x in _list]


@profile(precision=8)
def replace_list(_list):
    _list = [normalise_number(x) for x in _list]
    _list = [change_to_string(x) for x in _list]
    _list = [average_word_length(x) for x in _list]
    return _list


def main():
    _list1 = list(range(1000))
    mutate_list(_list1)

    _list2 = list(range(1000))
    _list2 = replace_list(_list2)

if __name__ == "__main__":
    main()

Please note that I am aware that, e.g., the average-word-length function isn't particularly well written; it's just for testing's sake.

Here are the results:

Line #    Mem usage    Increment   Line Contents
================================================
    16  32.17968750 MiB  32.17968750 MiB   @profile(precision=8)
    17                             def mutate_list(_list):
    18  32.17968750 MiB   0.00000000 MiB       _list[:] = [normalise_number(x) for x in _list]
    19  39.01953125 MiB   0.25781250 MiB       _list[:] = [change_to_string(x) for x in _list]
    20  39.01953125 MiB   0.00000000 MiB       _list[:] = [average_word_length(x) for x in _list]


Filename: temp2.py

Line #    Mem usage    Increment   Line Contents
================================================
    23  32.42187500 MiB  32.42187500 MiB   @profile(precision=8)
    24                             def replace_list(_list):
    25  32.42187500 MiB   0.00000000 MiB       _list = [normalise_number(x) for x in _list]
    26  39.11328125 MiB   0.25781250 MiB       _list = [change_to_string(x) for x in _list]
    27  39.11328125 MiB   0.00000000 MiB       _list = [average_word_length(x) for x in _list]
    28  32.46484375 MiB   0.00000000 MiB       return _list

What I found is that even if I increase the list size to, say, 100000, reassignment consistently uses more memory, but only by about 1%. This makes me think that the additional memory cost is probably just an extra pointer somewhere, not the cost of an entire list.

To further test the hypothesis, I performed time-based profiling at intervals of 0.00001 seconds and graphed the results. I wanted to see whether there was perhaps a momentary spike in memory usage that disappeared instantly due to garbage collection (reference counting). But alas, I have not found such a spike.
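For what it's worth, tracemalloc can catch a momentary peak directly instead of sampling for it. A sketch (the 100 000-element list and the x * 2 transform are arbitrary choices, not part of the original test):

```python
import tracemalloc


def transform(values):
    return [x * 2 for x in values]  # hypothetical stand-in for some_function


# Rebinding: the comprehension's result replaces the name's old target.
_list = list(range(100_000))
tracemalloc.start()
_list = transform(_list)
_, peak_rebind = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Slice assignment: the comprehension still builds a full temporary list
# before the old list's contents are overwritten.
_list = list(range(100_000))
tracemalloc.start()
_list[:] = transform(_list)
_, peak_mutate = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(peak_rebind, peak_mutate)  # both peaks are on the order of one new list
```

In both cases the peak includes a full second list, which is consistent with the ~1% difference above being bookkeeping rather than a duplicated list surviving in one case but not the other.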

Can anyone explain these results? What exactly is happening under the hood here that causes this slight but consistent increase in memory usage?

Asked May 25 '19 by Neil



2 Answers

It's hard to answer this canonically because the actual details are implementation-dependent or even type-dependent.

For example in CPython, when an object's reference count reaches zero it is disposed of and the memory is freed immediately. However some types maintain an additional "pool" that references instances without you knowing it. For example CPython has a "pool" of unused list instances. When the last reference to a list is dropped in Python code, it may be added to this "free list" instead of the memory being released (one would need to invoke something like PyList_ClearFreeList to reclaim that memory).

But a list is not just the memory that is needed for the list itself; a list contains objects. Even when the memory of the list is reclaimed, the objects that were in the list could remain, for example if there is still a reference to such an object somewhere else, or if that object's type itself also has a "free list".

If you look at other implementations like PyPy then even in the absence of a "pool" an object isn't disposed of immediately when no-one references it anymore, it's only disposed of "eventually".
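On CPython the "disposed immediately" behaviour can be observed with a weak reference. A sketch (list itself does not support weak references, so a small class stands in; on PyPy the reference may only die "eventually"):

```python
import weakref


class Payload:
    """Plain class, because list instances cannot be weakly referenced."""


obj = Payload()
ref = weakref.ref(obj)
assert ref() is obj   # the object is alive while a strong reference exists

del obj               # last strong reference gone: refcount hits zero
print(ref())          # on CPython this prints None immediately
```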

So how does this relate to your examples, you may wonder?

Let's have a look at your examples:

_list = [some_function(x) for x in _list]

Before this line runs there is one list instance assigned to the variable _list. Then you create a new list using the list comprehension and bind it to the name _list. Shortly before this assignment there are two lists in memory: the old list and the list created by the comprehension. After the assignment there is one list referenced by the name _list (the new list) and one list whose reference count has been decremented by 1. If the old list isn't referenced anywhere else and thus reaches a reference count of 0, it may be returned to the pool, disposed immediately, or disposed eventually. The same goes for the contents of the old list.
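The reference-count bookkeeping described above can be watched with sys.getrefcount, which always reports one extra reference for its own argument:

```python
import sys

_list = list(range(10))
alias = _list                       # a second reference to the same object

# _list + alias + getrefcount's own argument = 3
rc_before = sys.getrefcount(_list)

_list = [x * 2 for x in _list]      # rebinding: old list now held only by alias

# alias + getrefcount's own argument = 2
rc_after = sys.getrefcount(alias)

print(rc_before, rc_after)
```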

What about the other example:

_list[:] = [some_function(x) for x in _list]

Before this line runs there is again one list assigned to the name _list. When the line executes it also creates a new list through the list comprehension. But instead of binding the name _list to the new list, it replaces the contents of the old list with those of the new list. However, while it's replacing those contents, two lists have to be kept in memory. After this assignment the old list is still available through the name _list, but the list created by the list comprehension isn't referenced anymore, so it reaches a reference count of 0, and what happens to it depends: it can be put in the "pool" of free lists, it could be disposed immediately, or it could be disposed at some unknown point in the future. The same goes for the original contents of the old list, which were replaced.

So where is the difference?

Actually there is not a lot of difference. In both cases Python has to keep two lists completely in memory. However, the first approach releases the reference to the old list faster than the second approach releases the reference to the intermediate list, simply because the intermediate list has to be kept alive while its contents are copied.

However releasing the reference faster will not guarantee that it actually results in "less memory" since it might be returned to the pool or the implementation only frees memory at some (unknown) point in the future.

A less memory expensive alternative

Instead of creating and discarding lists you could chain iterators/generators and consume them when you need to iterate them (or you need the actual list).

So instead of doing:

_list = list(range(10)) # Or whatever
_list = [some_function(x) for x in _list]
_list = [some_other_function(x) for x in _list]

You could do:

def generate_values(it):
    for x in it:
        x = some_function(x)
        x = some_other_function(x)
        yield x

And then simply consume that:

for item in generate_values(range(10)):
    print(item)

Or consume it with a list:

list(generate_values(range(10)))

These will not (except when you pass it to list) create any lists at all. A generator is a state-machine that processes the elements one at a time when requested.
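The size difference is easy to see with sys.getsizeof: a generator is a fixed-size frame regardless of how many elements it will yield (the x * 2 transform is an arbitrary stand-in):

```python
import sys


def generate_values(it):
    for x in it:
        yield x * 2  # arbitrary stand-in transform


gen = generate_values(range(1_000_000))
materialised = [x * 2 for x in range(1_000_000)]

print(sys.getsizeof(gen))           # fixed-size state, independent of length
print(sys.getsizeof(materialised))  # grows with the number of elements
```

Note that sys.getsizeof measures only the container itself, not the referenced elements, but the point stands: the generator never stores the elements at all.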

Answered Oct 31 '22 by MSeifert


According to the CPython documentation:

Some objects contain references to other objects; these are called containers. Examples of containers are tuples, lists and dictionaries. The references are part of a container’s value. In most cases, when we talk about the value of a container, we imply the values, not the identities of the contained objects; however, when we talk about the mutability of a container, only the identities of the immediately contained objects are implied.

So when a list is mutated, the references contained in the list change, while the identity of the list object itself is unchanged. Interestingly, while distinct mutable objects always have distinct identities even when their values are equal, equal immutable objects may share the same identity (because they are immutable!).

a = [1, 'hello world!']
b = [1, 'hello world!']
print([hex(id(_)) for _ in a])
print([hex(id(_)) for _ in b])
print(a is b)

#on my machine, I got:
#['0x55e210833380', '0x7faa5a3c0c70']
#['0x55e210833380', '0x7faa5a3c0c70']
#False

when code:

_list = [some_function(x) for x in _list]

is used, two new and old _lists with two different identities and values are created. Afterward, old _list is garbage collected. But when a container is mutated, every single value is retrieved, changed in CPU and updated one-by-one. So the list is not duplicated.

Regarding processing efficiency, it's easy to compare:

import time

my_list = [_ for _ in range(1000000)]

start = time.time()
my_list[:] = [_ for _ in my_list]
print(time.time()-start)  # on my machine 0.0968618392944336 s


start = time.time()
my_list = [_ for _ in my_list]
print(time.time()-start)  # on my machine 0.05194497108459473 s
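timeit averages many runs and avoids the noise of single time.time() deltas; a sketch of the same comparison (list size and repeat count are arbitrary choices):

```python
import timeit

# A fresh list is built in the setup so both statements start from equal state.
setup = "my_list = list(range(100000))"

t_mutate = timeit.timeit("my_list[:] = [x for x in my_list]", setup=setup, number=20)
t_rebind = timeit.timeit("my_list = [x for x in my_list]", setup=setup, number=20)

print(t_mutate, t_rebind)  # slice assignment pays an extra element-copy step
```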

update: A list can be considered to be made of two parts: the references to (the ids of) other objects, and the values of those referred objects. The following snippet shows the percentage of memory that the list object itself directly occupies relative to the total memory consumed (list object + referred objects):

import sys
my_list = [str(_) for _ in range(10000)]

values_mem = 0
for item in my_list:
    values_mem+= sys.getsizeof(item)

list_mem = sys.getsizeof(my_list)

list_to_total = 100 * list_mem/(list_mem+values_mem)
print(list_to_total) #result ~ 14%

Answered Oct 31 '22 by Masoud