Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

list comprehension to build a nested dictionary from a list of tuples

I have data (counts) indexed by user_id and analysis_type_id obtained from a database. It's a list of 3-tuple. Sample data:

counts  = [(4, 1, 4), (3, 5, 4), (2, 10, 4), (2, 10, 5)]

where the first item of each tuple is the count, the second the analysis_type_id, and the last the user_id.

I'd like to place that into a dictionary, so i can retrieve the counts quickly: given a user_id and analysis_type_id. It would have to be a two-level dictionary. Is there any better structure?

To construct the two-level dictionary "by hand", I would code:

dict = {4:{1:4,5:3,10:2},5:{10:2}}

Where user_id is the first dict key level, analysis_type_id is the second (sub-) key, and the count is the value inside the dict.

How would I create the "double-depth" in dict keys through list comprehension? Or do I need to resort to a nested for-loop, where I first iterate through unique user_id values, then find matching analysis_type_id and fill in the counts ... one-at-a-time into the dict?

like image 522
Ant Avatar asked Nov 01 '17 01:11

Ant


2 Answers

Two Tuple Keys

I would suggest abandoning the idea of nesting dictionaries and simply use two tuples as the keys directly. Like so:

d = { (user_id, analysis_type_id): count for count, analysis_type_id, user_id in counts}

The dictionary is a hash table. In python, each two tuple has a single hash value (not two hash values) and thus each two tuple is looked up based on its (relatively) unique hash. Therefore this is faster (2x faster, most of the time) than looking up the hash of TWO separate keys (first the user_id, then the analysis_type_id).

However, beware of premature optimization. Unless you're doing millions of lookups, the increase in performance of the flat dict is unlikely to matter. The real reason to favor the use of the two tuple here is that the syntax and readability of a two tuple solution is far superior than other solutions- that is, assuming the vast majority of the time you will be wanting to access items based on a pair of values and not groups of items based on a single value.

Consider Using a namedtuple

It may be convenient to create a named tuple for storing those keys. Do that this way:

from collections import namedtuple
IdPair = namedtuple("IdPair", "user_id, analysis_type_id")

Then use it in your dictionary comprehension:

d = { IdPair(user_id, analysis_type_id): count for count, analysis_type_id, user_id in counts}

And access a count you're interested in like this:

somepair = IdPair(user_id = 4, analysis_type_id = 1)
d[somepair]

The reason this is sometimes useful is you can do things like this:

user_id = somepair.user_id # very nice syntax

Some Other Useful Options

One downside of the above solution is the case in which your lookup fails. In that case, you will only get a traceback like the following:

>>> d[IdPair(0,0)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: IdPair(user_id=0, analysis_type_id=0)

This isn't very helpful; was it the user_id that was unmatched, or the analysis_type_id, or both?

You can create a better tool for yourself by creating your own dict type that gives you a nice traceback with more information. It might look something like this:

class CountsDict(dict):
    """A dict for storing IdPair keys and count values as integers.

    Provides more detailed traceback information than a regular dict.
    """
    def __getitem__(self, k):
        try:
            return super().__getitem__(k)
        except KeyError as exc:
            raise self._handle_bad_key(k, exc) from exc
    def _handle_bad_key(self, k, exc):
        """Provides a custom exception when a bad key is given."""
        try:
            user_id, analysis_type_id = k
        except:
            return exc
        has_u_id = next((True for u_id, _ in self if u_id==user_id), False)
        has_at_id  = next((True for _, at_id in self if at_id==analysis_type_id), False)
        exc_lookup = {(False, False):KeyError(f"CountsDict missing pair: {k}"),
                      (True, False):KeyError(f"CountsDict missing analysis_type_id: "
                                             f"{analysis_type_id}"),
                      (False, True):KeyError(f"CountsDict missing user_id: {user_id}")}
        return exc_lookup[(user_id, analysis_type_id)]

Use it just like a regular dict.

However, it may make MORE sense to simply add new pairs to your dict (with a count of zero) when you try to access a missing pair. If this is the case, I'd use a defaultdict and have it set the count to zero (using the default value of int as the factory function) when a missing key is accessed. Like so:

from collections import defaultdict
my_dict = defaultdict(default_factory=int, 
                      ((user_id, analysis_type_id), count) for count, analysis_type_id, user_id in counts))

Now if you attempt to access a key that is missing, the count will be set to zero. However, one problem with this method is that ALL keys will be set to zero:

value = my_dict['I'm not a two tuple, sucka!!!!'] # <-- will be added to my_dict

To prevent this, we go back to the idea of making a CountsDict, except in this case, your special dict will be a subclass of defaultdict. However, unlike a regular defaultdict, it will check to make sure the key is a valid kind before it is added. And as a bonus, we can make sure ANY two tuple that is added as a key becomes an IdPair.

from collections import defaultdict

class CountsDict(defaultdict):
    """A dict for storing IdPair keys and count values as integers.

    Missing two-tuple keys are converted to an IdPair. Invalid keys raise a KeyError.
    """
    def __getitem__(self, k):
        try:
            user_id, analysis_type_id = k
        except:
            raise KeyError(f"The provided key {k!r} is not a valid key.")
        else:
            # convert two tuple to an IdPair if it was not already
            k = IdPair(user_id, analysis_type_id)
        return super().__getitem__(k)

Use it just like the regular defaultdict:

my_dict = CountsDict(default_factory=int, 
                     ((user_id, analysis_type_id), count) for count, analysis_type_id, user_id in counts))

NOTE: In the above I have not made it so that two tuple keys are converted to IdPairs upon instance creation (because __setitem__ is not utilized during instance creation). To create this functionality, we would also need to implement an override of the __init__ method.

Wrap Up

Out of all of these, the more useful option depends entirely on your use case.

like image 198
Rick supports Monica Avatar answered Sep 22 '22 22:09

Rick supports Monica


The most readable solution utilizes a defaultdict which saves you nested loops and bumpy checking if keys already exist:

from collections import defaultdict
dct = defaultdict(dict)  # do not shadow the built-in 'dict'
for x, y, z in counts:
    dct[z][y] = x
dct
# defaultdict(dict, {4: {1: 4, 5: 3, 10: 2}, 5: {10: 2}})

If you really want a one-liner comprehension you can use itertools.groupby and this clunkiness:

from itertools import groupby
dct = {k: {y: x for x, y, _ in g} for k, g in groupby(sorted(counts, key=lambda c: c[2]), key=lambda c: c[2])}

If your initial data is already sorted by user_id, you can save yourself the sorting.

like image 44
user2390182 Avatar answered Sep 22 '22 22:09

user2390182