When I convert a Python 3.8.0 list to a set, the resulting set ordering* is highly structured in a non-trivial way. How is this structure being extracted from the pseudo-random list?
As part of an experiment I am running, I am generating a random set. I was surprised to see that plotting the set suddenly showed unexpected linear structure. So there are two things puzzling me: why does converting to a set result in an ordering* which ends up highlighting this structure; and, to a lesser extent, why does the pseudo-random set have this "hidden" structure at all?
The code:
from random import randrange
X = [randrange(250) for i in range(30)]
print(X)
print(set(X))
which outputs, for example
[238, 202, 245, 94, 111, 106, 148, 164, 154, 113, 128, 10, 196, 141, 69, 38, 106, 8, 40, 53, 160, 87, 85, 13, 38, 147, 204, 50, 162, 91]
{128, 8, 10, 141, 13, 147, 148, 154, 160, 162, 164, 38, 40, 50, 53, 196, 69, 202, 204, 85, 87, 91, 94, 106, 238, 111, 113, 245}
A plot** of the above list looks fairly random, as expected:
whereas plotting the set (as it is ordered in the output) exhibits the structure present in the set:
This behaviour is 100% consistent on my machine (more examples below) with the values 250 and 30 used in the above code (the example I used is not cherry picked - it is just the last one I ran). Tuning these values sometimes results in slightly different structure (e.g. a subset of three arithmetic progressions*** instead of two).
Is this reproducible on other people's machines? Of course, that such structure exists seems indicative of not-so-great pseudo-random number generation, but this does not explain how converting to a set would in some sense 'extract' this structure. As far as I am aware, there is no formal guarantee that the ordering of a set (when converted from a list) is deterministic (and even if it is, there is no sophisticated ordering being done in the background). So how is this happening?!
(*): I know, sets are unordered collections, but I mean "ordered" in the sense that, when calling the print statement, the set is output in some order which consistently highlights the underlying set structure.
(**): These plots are from Wolfram Alpha. Two more examples are below:
(***): Two plots when changing the range of the random numbers from 250 to 500:
The order in which the items appear depends on the hash function used. Within the same run of the program, the hash function probably does not change, hence you get the same order.
Unlike lists, ordinary sets do not preserve the order in which we insert the elements. This is because the elements in a set are usually not stored in the order in which they appear.
This question is about why a particular implementation doesn't iterate over a set in the same order that sorted() would, and the answer is that whoever wrote it didn't want it to. This is why Python sets are considered unsorted.
In earlier versions of Python, the order of pop is definite. In recent Python versions, dict objects are insertion ordered, but set objects are not; their order is still determined by the hash table. In both cases, the result of pop is predictable, and thus not random.
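Basically, this is because of two things: a set in Python is implemented using a hash table, and the hash of a (small, non-negative) integer is the integer itself.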
Therefore, the index that an integer appears in the underlying array will be determined by the integer's value, modulo the length of the underlying array. So, integers will tend to stay in ascending order when you put a contiguous range of them into a set:
>>> list(set(range(10000))) == list(range(10000))
True # this can't be an accident!
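A minimal sketch of that index rule, assuming CPython (where small non-negative integers hash to themselves). The name `first_slot` and the table length of 32 are hypothetical, chosen only for illustration:

```python
# Assumption: CPython, where a small non-negative int hashes to itself.
# TABLE_SIZE is a hypothetical table length chosen for illustration.
TABLE_SIZE = 32

def first_slot(n, size=TABLE_SIZE):
    """Slot a value is tried at first, ignoring collisions."""
    return hash(n) % size

values = [0, 1, 2, 3, 250]
slots = [first_slot(v) for v in values]
print(slots)  # [0, 1, 2, 3, 26]
```

Values smaller than the table length land in ascending slots, which is why a contiguous range comes back in ascending order.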
If you don't have all of the numbers from a contiguous range, then the "modulo the length of the underlying array" part comes into play:
>>> r = range(0, 50, 4)
>>> set(r)
{0, 32, 4, 36, 8, 40, 12, 44, 16, 48, 20, 24, 28}
>>> sorted(r, key=lambda x: x % 32)
[0, 32, 4, 36, 8, 40, 12, 44, 16, 48, 20, 24, 28]
The sequence is predictable if you know the length of the underlying array, and the (deterministic) algorithm for adding elements. In this case the array's length is 32, because it's initially 8 and is quadrupled while elements are added.
Except for a blip near the end (because the numbers 52 and 56 aren't in the set), the range is divided into two sequences 0, 4, 8, ... and 32, 36, 40, ... which alternate because the hashes, which are the numbers' values themselves, are taken modulo 32 to choose indices in the array. There are collisions; for example, 4 and 36 are equal modulo 32, but 4 was added to the set first so 36 ends up at a different index.
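The mechanism can be sketched with a toy hash table. This is a simplified model, not CPython's actual code: the initial length of 8, the resize threshold, the growth rule and the plain "try the next slot" probing are all assumptions made for illustration, and `toy_set_order` is a hypothetical name.

```python
def toy_set_order(values):
    """Iteration order of a toy open-addressing table (not CPython's real code)."""
    size = 8                       # assumed initial table length
    table = [None] * size
    used = 0

    def place(tbl, v):
        i = hash(v) % len(tbl)     # small ints hash to themselves in CPython
        while tbl[i] is not None:  # collision: try the next slot (simplified probing)
            i = (i + 1) % len(tbl)
        tbl[i] = v

    for v in values:
        if v in table:             # duplicates are ignored, as in a real set
            continue
        place(table, v)
        used += 1
        if used * 5 >= (size - 1) * 3:   # assumed resize threshold
            size = 8
            while size <= used * 4:      # assumed growth rule: next power of two
                size *= 2
            old = [x for x in table if x is not None]
            table = [None] * size
            for x in old:                # re-insert everything into the larger table
                place(table, x)
    return [x for x in table if x is not None]

order = toy_set_order(range(0, 50, 4))
print(order)  # [0, 32, 4, 36, 8, 40, 12, 44, 16, 48, 20, 24, 28]
```

Under these assumptions, the toy table reproduces the order shown for `set(range(0, 50, 4))`: 36 collides with 4 at slot 4 and moves to slot 5, so it is listed right after 4.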
Here's a chart for this sequence. The structure in your charts is just a noisier version, because you generated your numbers randomly rather than from a range with a step.
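As a rough check against the data in the question, the sketch below sorts the question's list by value modulo the table length. The length of 128 is an assumption (it is not stated in the question; it is what the growth behaviour described above would suggest for 28 distinct values), and `predicted` is a hypothetical name.

```python
# List and printed set order copied from the question's example output.
X = [238, 202, 245, 94, 111, 106, 148, 164, 154, 113, 128, 10, 196, 141, 69,
     38, 106, 8, 40, 53, 160, 87, 85, 13, 38, 147, 204, 50, 162, 91]
printed = [128, 8, 10, 141, 13, 147, 148, 154, 160, 162, 164, 38, 40, 50, 53,
           196, 69, 202, 204, 85, 87, 91, 94, 106, 238, 111, 113, 245]

TABLE_SIZE = 128  # assumption: table length for 28 distinct values

unique = list(dict.fromkeys(X))  # drop duplicates, keep first-seen order
# sorted() is stable, so values with equal remainders keep insertion order
predicted = sorted(unique, key=lambda x: x % TABLE_SIZE)
print(predicted == printed)  # True
```

The values below 128 and the values from 128 upwards form the two interleaved runs visible in the plots.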
The number of interleaved sequences will depend on the size of the set in proportion to the length of the range the numbers are sampled from, since that determines how many times the range's length "wraps around" modulo the length of the hashtable's underlying array. Here's an example with three interleaved sequences 0, 6, 12, ..., 66, 72, 78, ... and 36, 42, 48, ...:
>>> set(range(0, 90, 6))
{0, 66, 36, 6, 72, 42, 12, 78, 48, 18, 84, 54, 24, 60, 30}
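To see where the three sequences come from, one can group the values by how many times they have wrapped around the table. This is only a sketch; the table length of 32 is an assumption for a set of this size, and `runs` is a hypothetical name.

```python
from collections import defaultdict

TABLE_SIZE = 32  # assumed table length for this 15-element set

runs = defaultdict(list)
for v in range(0, 90, 6):
    runs[v // TABLE_SIZE].append(v)  # number of times v has wrapped around

for wrap, seq in sorted(runs.items()):
    print(wrap, seq)
# 0 [0, 6, 12, 18, 24, 30]
# 1 [36, 42, 48, 54, 60]
# 2 [66, 72, 78, 84]
```

Each group is one of the interleaved sequences; a range spanning more multiples of the table length would produce more of them.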