I have a list of dictionaries, but some of them are duplicates and I want to remove them (the duplicates).
The keys of the dict are a sequential number.
An example is the following:
[{1: {a:[1,2,3], b: 4}},
{2: {a:[4,5,6], d: 5}},
{3: {a:[1,2,3], b: 4}},
.....,
{1000: {a:[2,5,1], b: 99}},
]
Considering the previous example I would like to obtain:
[{1: {a:[1,2,3], b: 4}},
{2: {a:[4,5,6], d: 5}},
.....,
{1000: {a:[2,5,1], b: 99}},
]
In fact the dictionaries with keys 1 and 3 are identically in their values.
I tried with a set, but since dict is a not hashable type I am not able to do so.
How can i fix the problem?
EDIT
In my case the number of items inside the dict is not fix, so I can have:
[{1: {a:[1,2,3], b: 4}},
{2: {a:[4,5,6], d: 5}},
.....,
{1000: {a:[2,5,1], b: 99, c:["a","v"]}},
]
where the dict with keys 100 has three elements inside insted of two as the other shown.
To get around the limitation of @jdehesa's solution, where [1, 2]
would be treated as a duplicate as (1, 2)
, you can preserve the data types by using pprint.pformat
instead to serialize the data structure. Since pprint.pformat
sorts dicts by keys and sets by items, {1: 2, 3: 4}
would be properly considered the same as {3: 4, 1: 2}
, but [1, 2]
would not be considered a duplicate to (1, 2)
:
from pprint import pformat
lst = [
{1: {'a': [1, 2, 3], 'b': 4}},
{2: {'a': [4, 5, 6], 'd': 5}},
{3: {'b': 4, 'a': [1, 2, 3]}},
{4: {'a': (4, 5, 6), 'd': 5}},
]
seen = set()
output = []
for d in lst:
for k, v in d.items():
signature = pformat(v)
if signature not in seen:
seen.add(signature)
output.append({k: v})
output
becomes:
[{1: {'a': [1, 2, 3], 'b': 4}},
{2: {'a': [4, 5, 6], 'd': 5}},
{4: {'a': (4, 5, 6), 'd': 5}}]
You can maybe use a function like this to turn your objects into something hashable:
def make_hashable(o):
if isinstance(o, dict):
return frozenset((k, make_hashable(v)) for k, v in o.items())
elif isinstance(o, list):
return tuple(make_hashable(elem) for elem in o)
elif isinstance(o, set):
return frozenset(make_hashable(elem) for elem in o)
else:
return o
Then you keep a set of seen objects and keep only the keys of each dictionary containing objects that you did not see before:
lst = [
{1: {'a':[1,2,3], 'b': 4}},
{2: {'a':[4,5,6], 'd': 5}},
{3: {'a':[1,2,3], 'b': 4}},
]
seen = set()
result_keys = []
for elem in lst:
keep_keys = []
for k, v in elem.items():
v_hashable = make_hashable(v)
if v_hashable not in seen:
seen.add(v_hashable)
keep_keys.append(k)
result_keys.append(keep_keys)
result = [{k: elem[k] for k in keys} for elem, keys in zip(lst, result_keys) if keys]
print(result)
# [{1: {'a': [1, 2, 3], 'b': 4}}, {2: {'a': [4, 5, 6], 'd': 5}}]
Note that, as blhsing notes in the comments, this has some limitations, such as considering (1, 2)
and [1, 2]
equals, as well as {1: 2}
and {(1, 2)}
. Also, some types may not be convertible to an equivalent hashable type.
EDIT: As a_guest suggests, you can work around the type ambiguity by returning the type itself along with the hashable object in make_hashable
:
def make_hashable(o):
t = type(o)
if isinstance(o, dict):
o = frozenset((k, make_hashable(v)) for k, v in o.items())
elif isinstance(o, list):
o = tuple(make_hashable(elem) for elem in o)
elif isinstance(o, set):
o = frozenset(make_hashable(elem) for elem in o)
return t, o
If you don't need to look into the hashable object, this will easily provide strict type comparison. Note in this case even things like {1, 2}
and frozenset({1, 2})
will be different.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With