
Literate way to index a list where each element has an interpretation?

Tl;dr is bold-faced text.

I'm working with an image dataset that comes with boolean "one-hot" image annotations (CelebA, to be specific). The annotations encode facial features like bald, male, young. Now I want to make a custom one-hot list (to test my GAN model). I want to provide a literate interface. I.e., rather than specifying features[12]=True knowing that 12 - counting from zero - corresponds to the male feature, I want something like features[male]=True or features.male=True.

Suppose the header of my .txt file is

Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Chubby Male Wearing_Necktie Young

and I want to codify Young, Bald, and Chubby. The expected output is

[ 0.  0.  0.  1.  0.  1.  0.  0.  1.]

since Bald is the fourth entry of the header, Chubby is the sixth, and so on. What is the clearest way to do this without expecting a user to know Bald is the fourth entry, etc.?

I'm looking for a Pythonic way, not necessarily the fastest way.

Ideal Features

In rough order of importance:

  1. A way to accomplish my stated goal that is already standard in the Python community will take precedence.
  2. A user/programmer should not need to count to an attribute in the .txt header. This is the point of what I'm trying to design.
  3. A user should not be expected to have non-standard libraries like aenum.
  4. A user/programmer should not need to reference the .txt header for attribute names/available attributes. One example: if a user wants to specify the gender attribute but does not know whether to use male or female, it should be easy to find out.
  5. A user/programmer should be able to find out the available attributes via documentation (ideally generated by Sphinx api-doc). That is, point 4 should be achievable while reading as little code as possible. Attribute exposure with dir() sufficiently satisfies this point.
  6. The programmer should find the indexing tool natural. Specifically, zero-indexing should be preferred over subtracting from one-indexing.
  7. Between two otherwise completely identical solutions, one with better performance would win.

Examples:

I'm going to compare and contrast the ways that immediately came to my mind. All examples use:

import numpy as np
header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
NUM_CLASSES = len(header.split())  # 9

1: Dict Comprehension

Obviously we could use a dictionary to accomplish this:

binary_label = np.zeros([NUM_CLASSES])
classes = {head: idx for (idx, head) in enumerate(header.split())}
binary_label[[classes["Young"], classes["Bald"], classes["Chubby"]]] = True
print(binary_label)

For what it's worth, this has the fewest lines of code and is the only one that relies on builtins alone rather than a standard-library module. As for negatives, it isn't exactly self-documenting. To see the available options, you must print(classes.keys()) - they are not exposed with dir(). This borders on not satisfying feature 5 because, AFAIK, it requires a user to know classes is a dict in order to expose the features.
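One way the discoverability gap of the dict approach might be narrowed is a small wrapper that reports the valid names when it is handed an unknown one. This is only a sketch: the helper name one_hot is hypothetical (not part of any library), and it builds a plain list rather than a NumPy array purely to stay dependency-free.

```python
header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
classes = {head: idx for idx, head in enumerate(header.split())}


def one_hot(names, classes=classes):
    """Hypothetical helper: return a 0/1 list with the named attributes set."""
    unknown = [name for name in names if name not in classes]
    if unknown:
        # List the valid names so the user never has to open the .txt header.
        raise KeyError(
            f"unknown attributes {unknown}; choose from {sorted(classes)}")
    label = [0.0] * len(classes)
    for name in names:
        label[classes[name]] = 1.0
    return label


print(one_hot(["Young", "Bald", "Chubby"]))
```

The same index list could be passed to a NumPy array instead; the point is only that a failed lookup tells the user which names exist.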

2: Enum:

Since I'm learning C++ right now, Enum is the first thing that came to mind:

import enum
binary_label = np.zeros([NUM_CLASSES])
Classes = enum.IntEnum("Classes", header)
features = [Classes.Young, Classes.Bald, Classes.Chubby]
zero_idx_feats = [feat-1 for feat in features]
binary_label[zero_idx_feats] = True
print(binary_label)

This gives dot notation and the image options are exposed with dir(Classes). However, enum uses one-indexing by default (the reason is documented). The work-around makes me feel like enum is not the Pythonic way to do this, and entirely fails to satisfy feature 6.
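For reference, the functional enum API also accepts a start keyword (Python 3.5 and later), which may remove the need for the subtract-one work-around. A minimal sketch, using a plain list in place of the NumPy array so that it runs without third-party packages:

```python
import enum

header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")

# start=0 numbers the members from 0 instead of the default 1.
Classes = enum.IntEnum("Classes", header, start=0)

features = [Classes.Young, Classes.Bald, Classes.Chubby]
binary_label = [0.0] * len(Classes)
for feat in features:
    # IntEnum members are ints, so they can be used directly as indices.
    binary_label[feat] = 1.0
print(binary_label)
```

Whether Sphinx renders the zero-based values correctly is a separate matter that this sketch does not address.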

3: Named Tuple

Here's another one out of the standard Python library:

import collections
binary_label = np.zeros([NUM_CLASSES])
clss = collections.namedtuple(
    "Classes", header)._make(range(NUM_CLASSES))
binary_label[[clss.Young, clss.Bald, clss.Chubby]] = True
print(binary_label)

Using namedtuple, we again get dot notation and self-documentation with dir(clss). But, the namedtuple class is heavier than enum. By this I mean, namedtuple has functionality I do not need. This solution appears to be a leader among my examples, but I do not know if it satisfies feature 1 or if an alternative could "win" via feature 7.
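As a side note on discoverability with namedtuple, the attribute names are also available programmatically through the _fields attribute, and string-based lookup works with getattr. A small sketch of both, assuming the same header as above:

```python
import collections

header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
names = header.split()

Classes = collections.namedtuple("Classes", names)
clss = Classes._make(range(len(names)))  # field i holds the value i

# _fields lists the attribute names in header order.
print(clss._fields)

# Dot notation and string lookup give the same zero-based index.
male_idx = getattr(clss, "Male")
chosen = [clss.Young, clss.Bald, clss.Chubby]
print(male_idx, chosen)
```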

4: Custom Enum

I could really break my back:

import enum
binary_label = np.zeros([NUM_CLASSES])
class Classes(enum.IntEnum):
    Arched_Eyebrows = 0
    Attractive = 1
    Bags_Under_Eyes = 2
    Bald = 3
    Bangs = 4
    Chubby = 5
    Male = 6
    Wearing_Necktie = 7
    Young = 8
binary_label[
    [Classes.Young, Classes.Bald, Classes.Chubby]] = True
print(binary_label)

This has all the advantages of Ex. 2. But it comes with the obvious drawback: I have to write out all the features (there are 40 in the real dataset) just to zero-index! Sure, this is how to make an enum in C++ (AFAIK), but it shouldn't be necessary in Python. This is a slight failure on feature 6.

Summary

There are many ways to accomplish literate zero-indexing in Python. Would you provide a code snippet of how you would accomplish what I'm after and tell me why your way is right?

(edit:) Or explain why one of my examples is the right tool for the job?


Status Update:

I'm not ready to accept an answer yet in case anyone wants to address the following feedback/update, or any new solution appears. Maybe another 24 hours? All the responses have been helpful, so I upvoted everyone's so far. You may want to look over this repo I'm using to test solutions. Feel free to tell me if my following remarks are (in)accurate or unfair:

zero-enum:

Oddly, Sphinx documents this incorrectly (one-indexed in docs), but it does document it! I suppose that "issue" doesn't fail any ideal feature.

dotdict:

I feel that Map is overkill, but dotdict is acceptable. Thanks to both answerers that got this solution working with dir(). However, it doesn't appear that it "works seamlessly" with Sphinx.

Numpy record:

As written, this solution takes significantly longer than the other solutions. It comes in at 10x slower than a namedtuple (fastest behind pure dict) and 7x slower than standard IntEnum (slowest behind numpy record). That's not drastic at current scale, nor a priority, but a quick Google search indicates np.in1d is in fact slow. Let's stick with

_label = np.zeros([NUM_CLASSES])
_label[[header_rec[key].item() for key in ["Young", "Bald", "Chubby"]]] = True

unless I've implemented something wrong in the linked repo. This brings the execution speed into a range that compares with the other solutions. Again, no Sphinx.

namedtuple (and rassar's critiques)

I'm not convinced of your enum critique. It seems to me that you believe I'm approaching the problem wrong. It's fine to call me out on that, but I don't see how using the namedtuple is fundamentally different from "Enum [which] will provide separate values for each constant." Have I misunderstood you?

Regardless, namedtuple appears in Sphinx (correctly numbered, for what it's worth). On the Ideal Features list, this chalks up identically to zero-enum and profiles ahead of zero-enum.

Accepted Rationale

I accepted the zero-enum answer because the answer gave me the best challenger for namedtuple. By my standards, namedtuple is marginally the best solution. But salparadise wrote the answer that helped me feel confident in that assessment. Thanks to all who answered.

asked Dec 27 '17 22:12 by Dylan F


1 Answer

How about a factory function to create a zero-indexed IntEnum? IntEnum is the object that suits your needs, and Enum provides flexibility in construction:

from enum import IntEnum

def zero_indexed_enum(name, items):
    # splits on space, so it won't take any iterable. Easy to change depending on need.
    return IntEnum(name, ((item, value) for value, item in enumerate(items.split())))

Then:

In [43]: header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
    ...:           "Bald Bangs Chubby Male Wearing_Necktie Young")
In [44]: Classes = zero_indexed_enum('Classes', header)

In [45]: list(Classes)
Out[45]:
[<Classes.Arched_Eyebrows: 0>,
 <Classes.Attractive: 1>,
 <Classes.Bags_Under_Eyes: 2>,
 <Classes.Bald: 3>,
 <Classes.Bangs: 4>,
 <Classes.Chubby: 5>,
 <Classes.Male: 6>,
 <Classes.Wearing_Necktie: 7>,
 <Classes.Young: 8>]
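To connect this back to the label in the question, here is a sketch of how the resulting enum might be used. The factory is restated so the snippet runs on its own (it passes a list of pairs rather than a generator, which should behave the same), and a plain list stands in for the NumPy array to avoid a third-party dependency.

```python
from enum import IntEnum


def zero_indexed_enum(name, items):
    # Whitespace-separated names, numbered from 0 in the order given.
    return IntEnum(name, [(item, value)
                          for value, item in enumerate(items.split())])


header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
Classes = zero_indexed_enum("Classes", header)

label = [0.0] * len(Classes)
for feat in (Classes.Young, Classes.Bald, Classes.Chubby):
    label[feat] = 1.0  # members are ints, so they index directly
print(label)

# Members can also be looked up by string, e.g. when names come from user input.
print(Classes["Male"].value)
```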
answered Sep 25 '22 23:09 by salparadise