Tl;dr is bold-faced text.
I'm working with an image dataset that comes with boolean "one-hot" image annotations (CelebA, to be specific). The annotations encode facial features like bald, male, young. Now I want to make a custom one-hot list (to test my GAN model). I want to provide a literate interface. I.e., rather than specifying features[12]=True knowing that 12 - counting from zero - corresponds to the male feature, I want something like features[male]=True or features.male=True.
Suppose the header of my .txt file is
Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Chubby Male Wearing_Necktie Young
and I want to codify Young, Bald, and Chubby. The expected output is
[ 0. 0. 0. 1. 0. 1. 0. 0. 1.]
since Bald is the fourth entry of the header, Chubby is the sixth, and so on. What is the clearest way to do this without expecting a user to know Bald is the fourth entry, etc.?
I'm looking for a Pythonic way, not necessarily the fastest way.
In rough order of importance:
[...] .txt header. This is the point of what I'm trying to design.
[...] aenum [...].
[...] .txt header for attribute names/available attributes. One example: if a user wants to specify the gender attribute but does not know whether to use male or female, it should be easy to find out.
[...] dir() sufficiently satisfies this point.
I'm going to compare and contrast the ways that immediately came to my mind. All examples use:
import numpy as np
header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
NUM_CLASSES = len(header.split()) # 9
Obviously we could use a dictionary to accomplish this:
binary_label = np.zeros([NUM_CLASSES])
classes = {head: idx for (idx, head) in enumerate(header.split())}
binary_label[[classes["Young"], classes["Bald"], classes["Chubby"]]] = True
print(binary_label)
For what it's worth, this has the fewest lines of code and is the only one that relies on builtins alone rather than a standard library module. As for negatives, it isn't exactly self-documenting. To see the available options, you must print(classes.keys()); they are not exposed with dir(). This borders on not satisfying feature 5 because it requires a user to know classes is a dict in order to expose the features, AFAIK.
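One way to soften the discoverability problem of the dict approach is to wrap the lookup in a small function that reports the valid names when an unknown one is passed. This is only a sketch: the helper name one_hot and its error message are hypothetical, not part of any library.

```python
import numpy as np

header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
# Map each attribute name to its zero-based column index.
classes = {head: idx for idx, head in enumerate(header.split())}


def one_hot(names, classes=classes):
    """Return a float vector with 1.0 at each named attribute (hypothetical helper)."""
    unknown = [name for name in names if name not in classes]
    if unknown:
        # Include the valid names so a user can discover them from the error.
        raise KeyError(f"unknown attributes {unknown}; choose from {sorted(classes)}")
    label = np.zeros(len(classes))
    label[[classes[name] for name in names]] = 1.0
    return label


binary_label = one_hot(["Young", "Bald", "Chubby"])
print(binary_label)
```

A misspelled name such as "young" then fails loudly and lists the available attributes, instead of requiring the user to know that classes is a dict.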
Since I'm learning C++ right now, Enum is the first thing that came to mind:
import enum
binary_label = np.zeros([NUM_CLASSES])
Classes = enum.IntEnum("Classes", header)
features = [Classes.Young, Classes.Bald, Classes.Chubby]
zero_idx_feats = [feat-1 for feat in features]
binary_label[zero_idx_feats] = True
print(binary_label)
This gives dot notation, and the image options are exposed with dir(Classes). However, enum uses one-indexing by default (the reason is documented). The work-around makes me feel like enum is not the Pythonic way to do this, and it entirely fails to satisfy feature 6.
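For reference, the functional Enum API also accepts a start keyword (Python 3.5 and later, as far as I know), which would remove the need for the feat-1 work-around. A minimal sketch under that assumption:

```python
import enum

import numpy as np

header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")

# start=0 makes the automatically assigned values begin at zero instead of one.
Classes = enum.IntEnum("Classes", header, start=0)

binary_label = np.zeros(len(Classes))
# IntEnum members are ints, so they can index the array directly.
binary_label[[Classes.Young, Classes.Bald, Classes.Chubby]] = True
print(binary_label)
```

Note that the first member then has the value 0 and is therefore falsy, which only matters if members are ever tested for truthiness.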
Here's another one out of the standard Python library:
import collections
binary_label = np.zeros([NUM_CLASSES])
clss = collections.namedtuple(
    "Classes", header)._make(range(NUM_CLASSES))
binary_label[[clss.Young, clss.Bald, clss.Chubby]] = True
print(binary_label)
Using namedtuple, we again get dot notation and self-documentation with dir(clss). But the namedtuple class is heavier than enum. By this I mean namedtuple has functionality I do not need. This solution appears to be a leader among my examples, but I do not know if it satisfies feature 1 or if an alternative could "win" via feature 7.
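Besides dir(clss), a namedtuple exposes its field names in header order through _fields and as a name-to-index mapping through _asdict(), which may help with discovering the available attributes. A small sketch (the variable names are only for illustration):

```python
import collections

import numpy as np

header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
names = header.split()

# Build the tuple type from the header, then fill it with zero-based indices.
Classes = collections.namedtuple("Classes", names)
clss = Classes(*range(len(names)))

print(clss._fields)     # attribute names, in header order
print(clss._asdict())   # name -> zero-based index

binary_label = np.zeros(len(names))
binary_label[[clss.Young, clss.Bald, clss.Chubby]] = True
print(binary_label)
```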
I could really break my back:
binary_label = np.zeros([NUM_CLASSES])
class Classes(enum.IntEnum):
    Arched_Eyebrows = 0
    Attractive = 1
    Bags_Under_Eyes = 2
    Bald = 3
    Bangs = 4
    Chubby = 5
    Male = 6
    Wearing_Necktie = 7
    Young = 8
binary_label[
    [Classes.Young, Classes.Bald, Classes.Chubby]] = True
print(binary_label)
This has all the advantages of Ex. 2, but it comes with obvious drawbacks. I have to write out all the features (there are 40 in the real dataset) just to zero-index! Sure, this is how to make an enum in C++ (AFAIK), but it shouldn't be necessary in Python. This is a slight failure on feature 6.
There are many ways to accomplish literate zero-indexing in Python. Would you provide a code snippet of how you would accomplish what I'm after and tell me why your way is right?
(edit:) Or explain why one of my examples is the right tool for the job?
I'm not ready to accept an answer yet in case anyone wants to address the following feedback/update, or any new solution appears. Maybe another 24 hours? All the responses have been helpful, so I upvoted everyone's so far. You may want to look over this repo I'm using to test solutions. Feel free to tell me if my following remarks are (in)accurate or unfair:
Oddly, Sphinx documents this incorrectly (one-indexed in docs), but it does document it! I suppose that "issue" doesn't fail any ideal feature.
I feel that Map is overkill, but dotdict is acceptable. Thanks to both answerers that got this solution working with dir(). However, it doesn't appear that it "works seamlessly" with Sphinx.
As written, this solution takes significantly longer than the other solutions. It comes in at 10x slower than a namedtuple (fastest behind pure dict) and 7x slower than standard IntEnum (slowest behind numpy record). That's not drastic at current scale, nor a priority, but a quick Google search indicates np.in1d is in fact slow. Let's stick with
_label = np.zeros([NUM_CLASSES])
_label[[header_rec[key].item() for key in ["Young", "Bald", "Chubby"]]] = True
unless I've implemented something wrong in the linked repo. This brings the execution speed into a range that compares with the other solutions. Again, no Sphinx.
I'm not convinced of your enum critique. It seems to me that you believe I'm approaching the problem wrong. It's fine to call me out on that, but I don't see how using the namedtuple is fundamentally different from "Enum [which] will provide separate values for each constant." Have I misunderstood you?
Regardless, namedtuple appears in Sphinx (correctly numbered, for what it's worth). On the Ideal Features list, this chalks up identically to zero-enum and profiles ahead of zero-enum.
I accepted the zero-enum answer because the answer gave me the best challenger for namedtuple. By my standards, namedtuple is marginally the best solution. But salparadise wrote the answer that helped me feel confident in that assessment. Thanks to all who answered.
How about a factory function to create a zero-indexed IntEnum, since that is the object that suits your needs, and Enum provides flexibility in construction:
from enum import IntEnum
def zero_indexed_enum(name, items):
    # splits on space, so it won't take any iterable. Easy to change depending on need.
    return IntEnum(name, ((item, value) for value, item in enumerate(items.split())))
Then:
In [43]: header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
...: "Bald Bangs Chubby Male Wearing_Necktie Young")
In [44]: Classes = zero_indexed_enum('Classes', header)
In [45]: list(Classes)
Out[45]:
[<Classes.Arched_Eyebrows: 0>,
<Classes.Attractive: 1>,
<Classes.Bags_Under_Eyes: 2>,
<Classes.Bald: 3>,
<Classes.Bangs: 4>,
<Classes.Chubby: 5>,
<Classes.Male: 6>,
<Classes.Wearing_Necktie: 7>,
<Classes.Young: 8>]
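To connect this back to the question, the resulting members can index the label array directly, with no offset correction. A minimal sketch that repeats the factory so it runs on its own (the variable wanted is only illustrative):

```python
from enum import IntEnum

import numpy as np


def zero_indexed_enum(name, items):
    # Pair each space-separated name with its zero-based position.
    return IntEnum(name, ((item, value) for value, item in enumerate(items.split())))


header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
Classes = zero_indexed_enum("Classes", header)

wanted = [Classes.Young, Classes.Bald, Classes.Chubby]
binary_label = np.zeros(len(Classes))
binary_label[wanted] = 1.0  # members are ints, so they work as indices
print(binary_label)

# Members can also be looked up by string name, e.g. from user input.
print(Classes["Male"].value)
# Listing the names shows a user which attributes are available.
print([member.name for member in Classes])
```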