Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I load categorical data from a numpy array into an Indicator or Embedding column?

Using Tensorflow 1.8.0, we are running into an issue whenever we attempt to build a categorical column. Here is a full example demonstrating the problem. It runs as-is (using only numeric columns). Uncommenting the indicator column definition and data generates a stack trace ending in tensorflow.python.framework.errors_impl.InternalError: Unable to get element as bytes.

import tensorflow as tf
import numpy as np

def feature_numeric(key):
  return tf.feature_column.numeric_column(key=key, default_value=0)

def feature_indicator(key, vocabulary):
  return tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
      key=key, vocabulary_list=vocabulary ))


labels = ['Label1','Label2','Label3']

model = tf.estimator.DNNClassifier(
  feature_columns=[
    feature_numeric("number"),
    # feature_indicator("indicator", ["A","B","C"]),
  ],
  hidden_units=[64, 16, 8],
  model_dir='./models',
  n_classes=len(labels),
  label_vocabulary=labels)

def train(inputs, training):
  model.train(
    input_fn=tf.estimator.inputs.numpy_input_fn(
        x=inputs,
        y=training,
        shuffle=True
      ), steps=1)

inputs = {
  "number": np.array([1,2,3,4,5]),
  # "indicator": np.array([
  #     ["A"],
  #     ["B"],
  #     ["C"],
  #     ["A", "A"],
  #     ["A", "B", "C"],
  #   ]),
}

training = np.array(['Label1','Label2','Label3','Label2','Label1'])

train(inputs, training)

Attempts to use an embedding fare no better. Using only numeric inputs, we can successfully scale to thousands of input nodes, and in fact we have temporarily expanded our categorical features in the preprocessor to simulate indicators.

The documentation for categorical_column_*() and indicator_column() are awash in references to features we're pretty sure we're not using (proto inputs, whatever bytes_list is) but maybe we're wrong on that?

like image 492
Theodore Lief Gannon Avatar asked Feb 17 '26 17:02

Theodore Lief Gannon


2 Answers

The issue here is related to the ragged shape of the "indicator" input array (some elements are of length 1, one is length 2, one is length 3). If you pad your input lists with some non-vocabulary string (I used "Z" for example since your vocabulary is "A", "B", "C"), you'll get the expected results:

inputs = {
  "number": np.array([1,2,3,4,5]),
  "indicator": np.array([
    ["A", "Z", "Z"],
    ["B", "Z", "Z"],
    ["C", "Z", "Z"],
    ["A", "A", "Z"],
    ["A", "B", "C"]
  ])
}

You can verify that this works by printing the resulting tensor:

dense = tf.feature_column.input_layer(
  inputs,
  [
    feature_numeric("number"),
    feature_indicator("indicator", ["A","B","C"]),
  ])

with tf.train.MonitoredTrainingSession() as sess:
  print(dense)
  print(sess.run(dense))
like image 163
Eric Le Fort Avatar answered Feb 20 '26 05:02

Eric Le Fort


From what I can tell, the difficulty is that you are trying to make an indicator column from an array of arrays.

I collapsed your indicator array to

"indicator": np.array([
  "A",
  "B",
  "C",
  "AA",
  "ABC",
])

... and the thing ran.

More, I can't find any example where the vocabulary array is anything but a flat array of strings.

like image 44
Hack Saw Avatar answered Feb 20 '26 06:02

Hack Saw



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!