
Updating a set while iterating over its elements

Tags: python, iteration, python-internals, set

When I try to update a set while iterating over its elements, what should be its behavior?

I tried various scenarios: the loop does not visit elements added after iteration has started, nor elements removed during iteration. If I remove an element and put it back during iteration, that element is still considered. What is the exact behavior, and how does it work?

This prints all the permutations of a string:

def permutations(s):
    ans = []
    def helper(created, remaining):
        if len(created) == len(s):
            ans.append(''.join(created))
            return
        for ch in remaining:
            remaining.remove(ch)
            created.append(ch)
            helper(created, remaining)
            remaining.add(ch)
            created.pop()
    helper([], set(s))
    return ans

Here the behavior is unpredictable; sometimes e is printed and sometimes it is not:

ab = set(['b','c','d'])
x = True
for ch in ab:
    if x:
        ab.remove('c')
        ab.add('e')
        x = False
    print(ch)

Here I always see 'c' only once, even when the first character is 'c':

ab = set(['b','c','d'])
x = True
for ch in ab:
    if x:
        ab.remove('c')
        ab.add('c')
        x = False
    print(ch)

And an alternative way to achieve the same objective as the function above:

def permwdups(s):
    ans = []
    def helper(created, remaining):
        if len(created) == len(s):
            ans.append(''.join(created))
            return
        for ch in remaining:
            if (remaining[ch]<=0):
                continue
            remaining[ch] -=1
            created.append(ch)
            helper(created, remaining)
            remaining[ch] +=1
            created.pop()
    counts = {}
    for i in range(len(s)):
        if s[i] not in counts:
            counts[s[i]] = 1
        else:
            counts[s[i]]+= 1
    helper([], counts)
    print(len(set(ans)))
    return ans


asked Dec 29 '17 06:12

user3285099

1 Answer

It's actually fairly simple: in CPython, sets are implemented as a hash-item table:

  hash  |  item  
-----------------
    -   |    -
-----------------
    -   |    -
       ...

CPython uses open addressing, so not all rows in the table are filled. It determines the position of an element from the (truncated) hash of the item, with a "pseudo-randomized" choice of position in case of collisions. I will ignore truncated-hash collisions in this answer.

I'll also ignore the specifics of the hash truncation and just work with integers, whose hash (with some exceptions) equals their value:

>>> hash(1)
1
>>> hash(2)
2
>>> hash(20)
20
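The exceptions can be checked directly. A minimal sketch, assuming CPython: -1 is reserved as an error code at the C level, so hash(-1) is -2, and large integers are reduced modulo sys.hash_info.modulus.

```python
import sys

# Small non-negative integers hash to themselves in CPython.
small = [hash(n) for n in (1, 2, 20)]

# -1 signals an error inside CPython's C API, so hash(-1) is remapped to -2.
minus_one = hash(-1)

# Integers are reduced modulo a fixed prime (2**61 - 1 on typical 64-bit builds).
modulus = sys.hash_info.modulus
wrapped = hash(modulus + 5)

print(small, minus_one, wrapped)
```

For the small positive integers used in the rest of this answer, none of these exceptions apply.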

So when you create a set with the values 1, 2 and 3 you'll have (roughly) the following table:

  hash  |  item  
-----------------
    -   |    -
-----------------
    1   |    1
-----------------
    2   |    2
-----------------
    3   |    3
       ...
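How an item ends up in a row can be sketched in a few lines. This is a simplified model, not CPython's code: it assumes a table of 8 rows (the smallest table CPython uses for a set), truncates the hash by masking its low bits, and ignores collisions. The names TABLE_SIZE, row_for and build_table are made up for illustration.

```python
TABLE_SIZE = 8          # assumed: the smallest table CPython allocates for a set
MASK = TABLE_SIZE - 1   # truncating the hash keeps only the low 3 bits

def row_for(item):
    """Row the item would land in if no collision occurs (simplified model)."""
    return hash(item) & MASK

def build_table(items):
    """Return a list of rows; None marks an empty row."""
    table = [None] * TABLE_SIZE
    for item in items:
        table[row_for(item)] = item   # collisions are ignored in this model
    return table

print(build_table([1, 2, 3]))
# [None, 1, 2, 3, None, None, None, None]
print(row_for(8))
# 0
```

Note that 8 wraps around to row 0, because only the low bits of the hash are used.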

The set is iterated from the top of the table to the bottom, and empty "rows" are skipped. So if you iterate over that set without modifying it, you get the numbers 1, 2 and 3:

>>> for item in {1, 2, 3}: print(item)
1
2
3

Basically, the iterator starts in row 0, sees that the row is empty, and goes to row 1, which contains the item 1. This item is returned by the iterator. In the next iteration it is in row 2 and returns the value there, which is 2; then it goes to row 3 and returns the 3 that is stored there. In the following iteration the iterator is in row 4, which is empty, so it goes to row 5, which is also empty, then to row 6, and so on, until it reaches the end of the table and raises a StopIteration exception, which signals that the iterator is finished.

[Image: https://i.stack.imgur.com/z5kxG.gif]

However, if you change the set while iterating over it, the iterator only remembers its current position (row). That means an item added in an earlier row will not be returned, while an item added in a later row will be returned (unless it is removed again before the iterator gets there).
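One detail matters for the examples that follow: CPython's set iterator compares the number of elements on every step and raises RuntimeError if it has changed. Loops that pair each removal with an addition keep the size constant and therefore pass this check. A minimal sketch of the failing case (the function name is made up for illustration):

```python
def grow_while_iterating():
    """Add without removing: the size changes, so the iterator complains."""
    s = {1, 2, 3}
    try:
        for item in s:
            s.add(item + 10)   # the set now has 4 elements instead of 3
    except RuntimeError as exc:
        return str(exc)
    return None

message = grow_while_iterating()
print(message)
```

The check only looks at the size, not at the contents, which is why the behavior described in this answer can be observed at all.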

Assume you always remove the current item and add the integer item + 1 to the set. Something like this:

s = {1}
for item in s: 
    print(item)
    s.discard(item)
    s.add(item+1)

The set before the iteration looks like this:

  hash  |  item  
-----------------
    -   |    -
-----------------
    1   |    1
-----------------
    -   |    -
       ...

The iterator starts in row 0, finds it empty, and goes to row 1, which contains the value 1; that value is returned and printed. If an arrow indicates the position of the iterator, it looks like this in the first iteration:

  hash  |  item  
-----------------
    -   |    -
-----------------
    1   |    1      <----------
-----------------
    -   |    -

Then the 1 is removed and the 2 is added:

  hash  |  item  
-----------------
    -   |    -
-----------------
    -   |    -      <----------
-----------------
    2   |    2

So in the next iteration the iterator finds the value 2 and returns it. Then the 2 is removed and a 3 is added:

  hash  |  item  
-----------------
    -   |    -
-----------------
    -   |    -
-----------------
    -   |    -      <----------
-----------------
    3   |    3

And so on, until it reaches 7. At that point something interesting happens: the truncated hash of 8 puts the 8 in row 0, but row 0 has already been iterated over, so the loop stops with 7. The actual value might depend on the Python version and the add/removal history of the set. For example, just swapping the order of the set.add and set.discard operations gives a different result (it goes up to 15 because the set is resized).

[Image: https://i.stack.imgur.com/rvoI0.gif]

For the same reason the iterator would only display 1 if you added item - 1 in each iteration, because 0 (having hash 0) would be put in the first row:

s = {1}
for item in s: 
    print(item)
    s.discard(item)
    s.add(item-1)

  hash  |  item  
-----------------
    -   |    -
-----------------
    1   |    1      <----------
-----------------
    -   |    -

  hash  |  item  
-----------------
    0   |    0
-----------------
    -   |    -
-----------------
    -   |    -      <----------

Visualized with a simple GIF:

[Image: https://i.stack.imgur.com/lj1yo.gif]
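Both loops can also be replayed on a toy model of the table. This is only a sketch under the same simplifications as the diagrams (a fixed table of 8 rows, no collisions, no resizing, no dummies); simulate is a hypothetical helper, not a CPython API.

```python
TABLE_SIZE = 8           # assumed fixed; the real table can be resized
MASK = TABLE_SIZE - 1

def simulate(start, step):
    """Model `for item in s: s.discard(item); s.add(item + step)`."""
    table = [None] * TABLE_SIZE
    table[hash(start) & MASK] = start
    seen = []
    row = 0                          # the iterator only remembers its row
    while row < TABLE_SIZE:
        item = table[row]
        row += 1
        if item is None:
            continue                 # empty rows are skipped
        seen.append(item)
        table[row - 1] = None        # s.discard(item)
        new = item + step
        table[hash(new) & MASK] = new  # s.add(item + step)
    return seen

print(simulate(1, +1))   # [1, 2, 3, 4, 5, 6, 7]
print(simulate(1, -1))   # [1]
```

The model reproduces the two outcomes described above: the upward loop ends at 7 because 8 lands in row 0, and the downward loop ends immediately because 0 lands in a row the iterator has already passed.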

Note that these examples are very simplistic. If the set resizes during the iteration, it re-distributes the stored items based on the "new" truncated hash, and it also removes the dummies that are left behind when you remove an item from a set. In that case you can still (roughly) predict what will happen, but it becomes a lot more convoluted.

One additional but very important fact is that Python (by default since Python 3.3) randomizes the hash values of strings for each interpreter process. That means that each Python session produces different hash values for strings. So if you add/remove strings while iterating, the behavior will also be random.
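The randomization can be observed, and pinned, through the PYTHONHASHSEED environment variable, which is a documented CPython setting. The following sketch starts fresh interpreters with a chosen seed; the helper name is made up for illustration.

```python
import os
import subprocess
import sys

def string_hash_in_new_interpreter(text, seed):
    """Return hash(text) as computed by a fresh interpreter with this seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

# The same seed reproduces the same hash; different seeds usually do not.
first = string_hash_in_new_interpreter("a", seed=1)
second = string_hash_in_new_interpreter("a", seed=1)
other = string_hash_in_new_interpreter("a", seed=2)
print(first == second, first == other)
```

Pinning the seed makes a run reproducible for debugging, but it does not make code that mutates a set during iteration any more correct.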

Suppose you have this script:

s = {'a'}
for item in s: 
    print(item)
    s.discard(item)
    s.add(item*2)

and you run it multiple times from the command line, the result will vary between runs.

For example my first run:

a
aa

My second / third / fourth run:

a

My fifth run:

a
aa

That's because scripts from the command line always start a new interpreter session. If you run the script multiple times in the same session the results won't differ. For example:

>>> def fun():
...     s = {'a'}
...     for item in s: 
...         print(item)
...         s.discard(item)
...         s.add(item*2)

>>> for i in range(10):
...     fun()

produces:

a
aa
a
aa
a
aa
a
aa
a
aa
a
aa
a
aa
a
aa
a
aa
a
aa

But it could also have given 10 times a or 10 times a, aa and aaaa, ...

To summarize it:

A value added to a set during iteration will be visited if it is put in a position that has not been iterated over yet. The position depends on the truncated hash of the item and the collision strategy.
The truncation of the hash depends on the size of the table, and that size depends on the add/removal history of the set.
With strings one cannot predict the position, because their hash values are randomized per interpreter session in recent Python versions.
And most importantly: avoid modifying a set / list / dict / ... while iterating over it. It almost always leads to problems, and even if it doesn't, it will confuse anyone reading the code. There are a few, very rare cases where adding elements to a list while iterating over it makes sense, but that requires very specific comments alongside it, otherwise it looks like a mistake. With sets and dicts in particular you would rely on implementation details that might change at any point.
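For the last point, two safer patterns are usually enough: build a new set instead of editing the old one, or iterate over a snapshot of the set. A sketch follows; the permutations variant is an adaptation of the question's function, not the asker's original, and it permutes the distinct characters of s.

```python
# Pattern 1: build a new set instead of mutating the one being iterated.
numbers = {1, 2, 3}
shifted = {n + 1 for n in numbers}

# Pattern 2: iterate over a snapshot, mutate the real set.
def permutations(s):
    """All orderings of the distinct characters in s."""
    ans = []

    def helper(created, remaining):
        if not remaining:
            ans.append(''.join(created))
            return
        # sorted() returns a new list, so changing `remaining` is safe here.
        for ch in sorted(remaining):
            remaining.remove(ch)
            created.append(ch)
            helper(created, remaining)
            remaining.add(ch)
            created.pop()

    helper([], set(s))
    return ans

print(shifted)
print(permutations('abc'))
# ['abc', 'acb', 'bac', 'bca', 'cab', 'cba']
```

Iterating over sorted(remaining) also makes the output order independent of string hash randomization.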

Just in case you're curious: I inspected the internals of the set using Cython introspection in a Jupyter Notebook. This is somewhat fragile, probably only works on Python 3.6, and is definitely not usable in production code:

%load_ext Cython

%%cython

from cpython cimport PyObject, PyTypeObject
cimport cython

cdef extern from "Python.h":
    ctypedef Py_ssize_t Py_hash_t

    struct setentry:
        PyObject *key
        Py_hash_t hash

    ctypedef struct PySetObject:
        Py_ssize_t ob_refcnt
        PyTypeObject *ob_type
        Py_ssize_t fill
        Py_ssize_t used
        Py_ssize_t mask
        setentry *table
        Py_hash_t hash
        Py_ssize_t finger

        setentry smalltable[8]
        PyObject *weakreflist

cpdef print_set_table(set inp):
    cdef PySetObject* innerset = <PySetObject *>inp
    for idx in range(innerset.mask+1):
        if (innerset.table[idx].key == NULL):
            print(idx, '<EMPTY>')
        else:
            print(idx, innerset.table[idx].hash, <object>innerset.table[idx].key)

This prints the hash-item table inside the set:

a = {1}
print_set_table(a)

for idx, item in enumerate(a):
    print('\nidx', idx)
    a.discard(item)
    a.add(item+1)
    print_set_table(a)

Note that the output will contain dummies (leftovers from deleted set items), and they will sometimes disappear (when the set gets too full or is resized).

answered Sep 17 '22 13:09

MSeifert
