I have a list of 5 million string elements, which are stored as a pickle object.
a = ['https://en.wikipedia.org/wiki/Data_structure','https://en.wikipedia.org/wiki/Data_mining','https://en.wikipedia.org/wiki/Statistical_learning_theory','https://en.wikipedia.org/wiki/Machine_learning','https://en.wikipedia.org/wiki/Computer_science','https://en.wikipedia.org/wiki/Information_theory','https://en.wikipedia.org/wiki/Statistics','https://en.wikipedia.org/wiki/Mathematics','https://en.wikipedia.org/wiki/Signal_processing','https://en.wikipedia.org/wiki/Sorting_algorithm','https://en.wikipedia.org/wiki/Data_structure','https://en.wikipedia.org/wiki/Quicksort','https://en.wikipedia.org/wiki/Merge_sort','https://en.wikipedia.org/wiki/Heapsort','https://en.wikipedia.org/wiki/Insertion_sort','https://en.wikipedia.org/wiki/Introsort','https://en.wikipedia.org/wiki/Selection_sort','https://en.wikipedia.org/wiki/Timsort','https://en.wikipedia.org/wiki/Cubesort','https://en.wikipedia.org/wiki/Shellsort']
To remove duplicates, I use set(a), then I turn it back into a list with list(set(a)).
My question is:
Even if I restart Python and read the list from the pickle file, will the order of list(set(a)) be the same every time?
I'm eager to know how this hash -> list ordering works.
I tested with a small dataset and it seems to have a consistent ordering.
In [50]: a = ['x','y','z','k']
In [51]: a
['x', 'y', 'z', 'k']
In [52]: list(set(a))
['y', 'x', 'k', 'z']
In [53]: b=list(set(a))
In [54]: list(set(b))
['y', 'x', 'k', 'z']
In [55]: del b
In [56]: b=list(set(a))
In [57]: b
['y', 'x', 'k', 'z']
Unlike a standard set, an ordered set preserves the order of its data. Use an ordered set when the order in which the data was entered must be maintained over the course of the program. Iterating over an ordered set always yields the elements in insertion order, which a standard (unordered) set does not guarantee.
Lists are ordered. The order in which you specify the elements when you define a list is an innate characteristic of that list and is maintained for that list's lifetime.
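The answer is simply no. String hashes are randomized per interpreter process by default (since Python 3.3), so the iteration order of a set of strings, and therefore of list(set(a)), can differ after a restart. Within a single session the order stays consistent, which is why the small test looks stable.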
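One way to check this yourself is to run the same snippet in fresh interpreters with different values of the PYTHONHASHSEED environment variable. The sketch below is only an illustration: the helper name order_with_seed is made up here, and the orders it prints depend on your Python build, so no expected output is shown.

```python
import os
import subprocess
import sys

# Code executed in each fresh interpreter (same data as the small test above).
SNIPPET = "print(list(set(['x', 'y', 'z', 'k'])))"


def order_with_seed(seed):
    """Hypothetical helper: run SNIPPET in a new process with the given hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # A fixed integer seed is reproducible; "random" is the default behaviour.
    for seed in (0, 1, 2, "random"):
        print(seed, order_with_seed(seed))
```

A fixed seed gives the same order on every run of the same interpreter, while different seeds (or "random") may give different orders. Pinning PYTHONHASHSEED makes runs reproducible, but relying on it for a stored list is fragile.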
list.sort() established the convention that sort() sorts the object in place, but a set cannot be sorted in place because sets are unordered.
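If all you need is a duplicate-free list whose order does not change between runs, two common options can be sketched as follows. Neither comes from the question itself: dict.fromkeys relies on dictionaries keeping insertion order (guaranteed from Python 3.7), and sorted() returns a new list instead of sorting in place. The sample data is shortened for illustration.

```python
a = ['x', 'y', 'z', 'k', 'x', 'z']  # small stand-in for the real list

# Option 1: keep the order of first appearance.
unique_in_order = list(dict.fromkeys(a))

# Option 2: impose a sorted order that does not depend on hashing.
unique_sorted = sorted(set(a))

print(unique_in_order)  # ['x', 'y', 'z', 'k']
print(unique_sorted)    # ['k', 'x', 'y', 'z']
```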
I would suggest an auxiliary set() to ensure uniqueness when adding items to the list, thus preserving the order of your list(), and not storing the set() per se.
First, load your list and create a set with its contents. Before adding items to your list, check that they are not in the set (a membership test with "in" is much faster on a set than on a list, especially if there are many elements). Then pickle your list; the order will be exactly the one you want.
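Drawback: this takes more memory than handling only a set(), because the list and the set are both kept in memory. They share the same string objects, so the extra cost is the second container, not a second copy of the strings.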
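A minimal sketch of this approach is shown below. The file name urls.pkl, the helper add_unique, and the sample values are placeholders invented for the example, and it assumes the stored list contains no duplicates to begin with.

```python
import os
import pickle
import tempfile


def add_unique(items, seen, new_item):
    """Append new_item to items only if it has not been seen; return True if added."""
    if new_item in seen:  # set lookup, O(1) on average
        return False
    seen.add(new_item)
    items.append(new_item)
    return True


# Create a throwaway pickle file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "urls.pkl")  # hypothetical file
with open(path, "wb") as f:
    pickle.dump(['x', 'y', 'z'], f)

# Load the list and build the auxiliary set from its contents.
with open(path, "rb") as f:
    urls = pickle.load(f)
seen = set(urls)

# Add new items through the helper so the list keeps its order.
for candidate in ['y', 'k', 'x', 'k']:
    add_unique(urls, seen, candidate)

# Pickle only the list; the set can be rebuilt after loading.
with open(path, "wb") as f:
    pickle.dump(urls, f)

with open(path, "rb") as f:
    print(pickle.load(f))  # ['x', 'y', 'z', 'k']
```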