Question 1:
In wide_n_deep_tutorial.py, there is a hyper-parameter named hash_bucket_size for both the tf.feature_column.categorical_column_with_hash_bucket and tf.feature_column.crossed_column methods, and the value is hash_bucket_size=1000.
But why 1000? How should this parameter be set?
Question 2:
The second question is about the crossed_columns in wide_n_deep_tutorial.py, that is:

crossed_columns = [
    tf.feature_column.crossed_column(["education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column([age_buckets, "education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(["native_country", "occupation"], hash_bucket_size=1000),
]

Why choose ["education", "occupation"], [age_buckets, "education", "occupation"] and ["native_country", "occupation"] as the crossed_columns? Is there any rule of thumb?
Regarding the hash_bucket_size: if you set it too low, there will be many hash collisions, where different categories are mapped to the same bucket, forcing the neural network to use other features to distinguish them. If you set it too high, you will use a lot of RAM for nothing: I am assuming that you will wrap the categorical_column_with_hash_bucket() in an embedding_column() (as you generally should), in which case the hash_bucket_size will determine the number of rows of the embedding matrix.
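To see why hash_bucket_size drives memory use, here is a minimal NumPy sketch of the embedding lookup (not TensorFlow's implementation; the embedding_dim of 8 and the hashing scheme are assumptions for illustration). The embedding matrix has one row per hash bucket, so its memory grows linearly with hash_bucket_size:

```python
import numpy as np

hash_bucket_size = 1000   # number of rows in the embedding matrix
embedding_dim = 8         # hypothetical embedding size, not from the tutorial

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(hash_bucket_size, embedding_dim)).astype(np.float32)

# A category string is hashed to a bucket, and the bucket indexes one row:
bucket_id = hash("Prof-specialty") % hash_bucket_size
vector = embedding_matrix[bucket_id]

print(embedding_matrix.nbytes)  # 1000 rows * 8 floats * 4 bytes = 32000 bytes
```

Doubling hash_bucket_size doubles this matrix, which is why an unnecessarily large value wastes RAM.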
The probability of a collision when there are k categories is approximately 1 - exp(-k*(k-1)/(2*hash_bucket_size)) (source), so if there are 40 categories and you use hash_bucket_size=1000, the probability is surprisingly high: about 54%! To convince yourself, try running len(np.unique(np.random.randint(1000, size=40))) several times (it picks 40 random numbers between 0 and 999 and counts how many unique numbers there are), and you will see that the result is quite often less than 40. You can use this equation to choose a value of hash_bucket_size that does not cause too many collisions.
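Both the formula and the empirical check above can be run with the standard library alone (a small sketch; the stdlib set-based experiment plays the same role as the np.unique one-liner):

```python
import math
import random

def collision_probability(k, hash_bucket_size):
    # Birthday-problem approximation: P(at least one collision)
    # when k categories are hashed into hash_bucket_size buckets.
    return 1 - math.exp(-k * (k - 1) / (2 * hash_bucket_size))

print(collision_probability(40, 1000))   # ~0.54

# Empirical check: how often do 40 random bucket ids contain a duplicate?
random.seed(0)
trials = 10_000
collided = sum(
    len({random.randrange(1000) for _ in range(40)}) < 40
    for _ in range(trials)
)
print(collided / trials)   # also ~0.54
```

The close agreement between the formula and the simulation is what makes the formula a practical tool for sizing hash_bucket_size.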
That said, if there are just a couple of collisions, it will probably not be too bad in practice, as the neural network will still be able to use other features to distinguish the colliding categories. The best option may be to experiment with different values of hash_bucket_size to find the value below which performance starts to degrade, then increase it by 10-20% to be safe.
For the hash_bucket_size
The general idea is that ideally the result of the hash function should not produce any collisions (otherwise you/the algorithm would not be able to distinguish between two cases). Hence the 1000 is in this case 'just' a value. If you look at the number of unique entries for occupation and native_country (16 and 43), you'll see that this number is high enough:
edb@lapelidb:/tmp$ cat adult.data | cut -d , -f 7 | sort | uniq -c | wc -l
16
edb@lapelidb:/tmp$ cat adult.data | cut -d , -f 14 | sort | uniq -c | wc -l
43
Feature crossing
I think the rule of thumb there is that crossing makes sense if the combination of the features actually has meaning. In this example, education and occupation are linked. As for the second one, it probably makes sense to distinguish people like 'junior engineer with a Ph.D.' from 'senior cleaning staff without a degree'. Another typical example you see quite often is the crossing of longitude and latitude, since they have more meaning together than individually.
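The idea behind a crossed column can be sketched in plain Python (an illustrative sketch, not TensorFlow's actual hashing; the cross_and_hash name and the MD5-based hashing are made up for this example). Every combination of the input categories becomes one new category, which is then hashed into hash_bucket_size buckets:

```python
import hashlib

def cross_and_hash(values, hash_bucket_size=1000):
    # Join the categorical values into a single crossed category,
    # then hash it deterministically into a bucket index.
    crossed = "_X_".join(str(v) for v in values)
    digest = hashlib.md5(crossed.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

# The pair acts as one feature: "education x occupation"
b1 = cross_and_hash(["Doctorate", "Prof-specialty"])
b2 = cross_and_hash(["HS-grad", "Handlers-cleaners"])
print(b1, b2)  # two bucket indices in [0, 1000)
```

With three inputs, like [age_buckets, "education", "occupation"], the same idea applies: the crossed category encodes combinations such as 'young engineer with a Ph.D.', which is exactly why a cross only helps when the combination carries meaning that the individual features do not.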