'Bizarre' ordering of sets in python

Tags:

python

When I convert a Python 3.8.0 list to a set, the resulting set ordering* is highly structured in a non-trivial way. How is this structure being extracted from the pseudo-random list?

As part of an experiment I am running, I am generating a random set. I was surprised to see that plotting the set suddenly showed unexpected linear structure. So there are two things puzzling me: why does converting to a set result in an ordering* which ends up highlighting this structure; and, to a lesser extent, why does the pseudo-random set have this "hidden" structure at all?

The code:

from random import randrange

X = [randrange(250) for i in range(30)]
print(X)
print(set(X))

which outputs, for example

[238, 202, 245, 94, 111, 106, 148, 164, 154, 113, 128, 10, 196, 141, 69, 38, 106, 8, 40, 53, 160, 87, 85, 13, 38, 147, 204, 50, 162, 91]

{128, 8, 10, 141, 13, 147, 148, 154, 160, 162, 164, 38, 40, 50, 53, 196, 69, 202, 204, 85, 87, 91, 94, 106, 238, 111, 113, 245}

A plot** of the above list looks fairly random, as expected:

[Image: WolframAlpha plot of randomly generated list, https://i.stack.imgur.com/Vo2fW.png]

whereas plotting the set (as it is ordered in the output) exhibits the structure present in the set:

[Image: WolframAlpha plot of set from random list, https://i.stack.imgur.com/ZyLxQ.png]

This behaviour is 100% consistent on my machine (more examples below) with the values 250 and 30 used in the above code (the example I used is not cherry-picked; it is just the last one I ran). Tuning these values sometimes results in slightly different structure (e.g. a subset of three arithmetic progressions*** instead of two).

Is this reproducible on other people's machines? Of course, that such structure exists seems indicative of not-so-great pseudo-random number generation, but this does not explain how converting to a set would in some sense 'extract' this structure. As far as I am aware, there is no formal guarantee that the ordering of a set (when converted from a list) is deterministic (and even if it is, there is no sophisticated ordering being done in the background). So how is this happening?!

(*): I know, sets are unordered collections, but I mean "ordered" in the sense that, when calling the print function, the set is output in some order which consistently highlights the underlying set structure.

(**): These plots are from Wolfram Alpha. Two more examples are below:

[Image: two more example plots, https://i.stack.imgur.com/swPSL.png]

(***): Two plots when changing the range of the random numbers from 250 to 500:

[Image: two plots with the range changed from 250 to 500, https://i.stack.imgur.com/b0bgM.png]

363

asked Feb 21 '20 00:02

John Don

1 Answer

Basically, this is because of two things:

A set in Python is implemented using a hashtable,
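The hash of an integer is the integer itself (in CPython this holds for the small values used here; hash(-1) is -2, and very large integers are reduced modulo sys.hash_info.modulus).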

Therefore, the index at which an integer appears in the underlying array is determined by the integer's value, modulo the length of the underlying array. So, integers will tend to stay in ascending order when you put a contiguous range of them into a set:

>>> list(set(range(10000))) == list(range(10000))
True # this can't be an accident!
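To make the two points above concrete, here is a minimal sketch. The helper `slot` and the table length of 32 are my own illustrative assumptions, not anything the interpreter exposes; the sketch only relies on `hash()` and on the fact that, for a power-of-two length, masking the low bits is the same as taking the value modulo the length.

```python
# In CPython, small non-negative ints hash to themselves.
small_ints_hash_to_self = all(hash(n) == n for n in range(250))
print(small_ints_hash_to_self)

TABLE_LEN = 32  # assumed table length, for illustration only


def slot(value, table_len=TABLE_LEN):
    """Hypothetical helper: the slot a value would first be tried at."""
    # For a power-of-two length this equals hash(value) % table_len.
    return hash(value) & (table_len - 1)


# 0 and 32 want the same slot, as do 4 and 36.
print([slot(v) for v in (0, 4, 32, 36)])
```

Values that are a multiple of the table length apart compete for the same starting slot, which is where the interleaving in the next example comes from.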

If you don't have all of the numbers from a contiguous range, then the "modulo the length of the underlying array" part comes into play:

>>> r = range(0, 50, 4)
>>> set(r)
{0, 32, 4, 36, 8, 40, 12, 44, 16, 48, 20, 24, 28}
>>> sorted(r, key=lambda x: x % 32)
[0, 32, 4, 36, 8, 40, 12, 44, 16, 48, 20, 24, 28]

The sequence is predictable if you know the length of the underlying array, and the (deterministic) algorithm for adding elements. In this case the array's length is 32, because it's initially 8 and is quadrupled while elements are added.

Except for a blip near the end (because the numbers 52 and 56 aren't in the set), the range is divided into two sequences 0, 4, 8, ... and 32, 36, 40, ... which alternate because the hashes, which are the numbers' values themselves, are taken modulo 32 to choose indices in the array. There are collisions; for example, 4 and 36 are equal modulo 32, but 4 was added to the set first so 36 ends up at a different index.
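The collision behaviour described above can be imitated with a toy open-addressing table. This is a simplified sketch under stated assumptions, not CPython's actual code: the function name `simulated_set_order` is made up, the table length is fixed at 32 rather than grown, and collisions are resolved by stepping to the next free slot only (the real implementation is more elaborate for long collision chains).

```python
def simulated_set_order(values, table_len=32):
    """Insert values into a fixed-size table and return them in slot order.

    Toy model: start at hash(v) % table_len and step forward one slot
    at a time until a free slot (or the same value) is found.
    """
    slots = [None] * table_len
    for v in values:
        i = hash(v) % table_len
        for _ in range(table_len):
            if slots[i] is None:
                slots[i] = v      # free slot: store the value here
                break
            if slots[i] == v:
                break             # already present: sets ignore duplicates
            i = (i + 1) % table_len
        else:
            raise ValueError("table is full")
    return [v for v in slots if v is not None]


order = simulated_set_order(range(0, 50, 4))
print(order)
```

In this model 4 lands in slot 4 first, so 36 is pushed to slot 5, and the resulting slot order is the same as the set output shown above.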

Here's a chart for this sequence. The structure in your charts is just a noisier version, because you generated your numbers randomly rather than from a range with a step.

[Image: chart of the set order for range(0, 50, 4), https://i.stack.imgur.com/fQHll.png]
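The same reasoning seems to account for the exact output in the question. The sketch below copies the list and the printed set from the question and assumes a table length of 128; that length is my inference for 28 distinct values and is not stated anywhere in the question. A stable sort by `v % 128` stands in for the real insertion process, which is only a fair approximation here because the single collision (141 and 13 both map to slot 13) is resolved in insertion order.

```python
# Data copied from the question.
X = [238, 202, 245, 94, 111, 106, 148, 164, 154, 113, 128, 10, 196, 141, 69,
     38, 106, 8, 40, 53, 160, 87, 85, 13, 38, 147, 204, 50, 162, 91]
printed_set_order = [128, 8, 10, 141, 13, 147, 148, 154, 160, 162, 164, 38,
                     40, 50, 53, 196, 69, 202, 204, 85, 87, 91, 94, 106, 238,
                     111, 113, 245]

TABLE_LEN = 128  # assumption: inferred table length, not given in the question

# Drop duplicates while keeping first-occurrence order, then sort by slot.
unique_in_insertion_order = list(dict.fromkeys(X))
predicted = sorted(unique_in_insertion_order, key=lambda v: v % TABLE_LEN)

print(predicted == printed_set_order)
```

Since the values are drawn from 0 to 249, they wrap around a 128-slot table roughly twice, which would explain the two interleaved rising lines in the plot.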

The number of interleaved sequences will depend on the size of the set in proportion to the length of the range the numbers are sampled from, since that determines how many times the range's length "wraps around" modulo the length of the hashtable's underlying array. Here's an example with three interleaved sequences (0, 6, 12, ...), (66, 72, 78, ...) and (36, 42, 48, ...):

>>> set(range(0, 90, 6))
{0, 66, 36, 6, 72, 42, 12, 78, 48, 18, 84, 54, 24, 60, 30}
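As a rough check of the "wraps around" idea, the sketch below again assumes a table length of 32 (consistent with the earlier example, but still an assumption on my part) and splits the values by how many times they have wrapped. The names `order` and `runs` are just illustrative.

```python
TABLE_LEN = 32  # assumed table length, as in the earlier example
values = range(0, 90, 6)

# No two of these values share a slot, so sorting by slot predicts the order.
order = sorted(values, key=lambda v: v % TABLE_LEN)

# Group values by how many whole table lengths they exceed.
runs = {}
for v in values:
    runs.setdefault(v // TABLE_LEN, []).append(v)

print(order)
print(runs)
```

The range 0 to 84 spans just under three table lengths, so there are three groups, and each group becomes one of the interleaved sequences.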

163

answered Oct 28 '22 19:10

kaya3
