How do I load categorical data from a numpy array into an Indicator or Embedding column?

Question

Using TensorFlow 1.8.0, we run into an issue whenever we attempt to build a categorical column. Here is a full example demonstrating the problem. It runs as-is (using only numeric columns). Uncommenting the indicator column definition and its data produces a stack trace ending in tensorflow.python.framework.errors_impl.InternalError: Unable to get element as bytes.

import tensorflow as tf
import numpy as np

def feature_numeric(key):
  return tf.feature_column.numeric_column(key=key, default_value=0)

def feature_indicator(key, vocabulary):
  return tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
      key=key, vocabulary_list=vocabulary ))


labels = ['Label1','Label2','Label3']

model = tf.estimator.DNNClassifier(
  feature_columns=[
    feature_numeric("number"),
    # feature_indicator("indicator", ["A","B","C"]),
  ],
  hidden_units=[64, 16, 8],
  model_dir='./models',
  n_classes=len(labels),
  label_vocabulary=labels)

def train(inputs, training):
  model.train(
    input_fn=tf.estimator.inputs.numpy_input_fn(
      x=inputs,
      y=training,
      shuffle=True),
    steps=1)

inputs = {
  "number": np.array([1,2,3,4,5]),
  # "indicator": np.array([
  #     ["A"],
  #     ["B"],
  #     ["C"],
  #     ["A", "A"],
  #     ["A", "B", "C"],
  #   ]),
}

training = np.array(['Label1','Label2','Label3','Label2','Label1'])

train(inputs, training)

Attempts to use an embedding fare no better. Using only numeric inputs, we can successfully scale to thousands of input nodes, and in fact we have temporarily expanded our categorical features in the preprocessor to simulate indicators.

The documentation for categorical_column_*() and indicator_column() is awash in references to features we're pretty sure we're not using (proto inputs, whatever bytes_list is), but maybe we're wrong on that?

Eric Le Fort · Accepted Answer

The issue here is the ragged shape of the "indicator" input array: three rows have length 1, one has length 2, and one has length 3, so NumPy cannot build a rectangular array of strings from it. If you pad your input lists with some non-vocabulary string (I used "Z", since your vocabulary is "A", "B", "C"), you'll get the expected results:

inputs = {
  "number": np.array([1,2,3,4,5]),
  "indicator": np.array([
    ["A", "Z", "Z"],
    ["B", "Z", "Z"],
    ["C", "Z", "Z"],
    ["A", "A", "Z"],
    ["A", "B", "C"]
  ])
}
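Padding by hand does not scale past a toy example, so here is a minimal sketch of doing it programmatically. It first shows what the ragged input actually is (a 1-D object array of Python lists, which recent NumPy releases refuse to build unless dtype=object is requested explicitly), then pads it to a rectangle. The helper name pad_ragged and its pad_value parameter are hypothetical, and the sketch assumes the pad token is not part of your vocabulary.

```python
import numpy as np

# The ragged rows from the question.
ragged = [["A"], ["B"], ["C"], ["A", "A"], ["A", "B", "C"]]

# Forcing dtype=object shows what a ragged input really is: a 1-D array
# whose elements are Python lists, not a 2-D array of strings.
as_objects = np.array(ragged, dtype=object)
print(as_objects.shape, as_objects.dtype)  # (5,) object


def pad_ragged(rows, pad_value="Z"):
    """Pad variable-length lists of strings into a rectangular 2-D array.

    pad_value is an assumption: it must NOT appear in the vocabulary,
    otherwise the padding would be counted as a real category.
    """
    width = max(len(row) for row in rows)
    padded = [list(row) + [pad_value] * (width - len(row)) for row in rows]
    return np.array(padded)


indicator = pad_ragged(ragged)
print(indicator.shape)  # (5, 3)
```

The resulting array can be used as the "indicator" entry of the inputs dict in place of the hand-written version.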

You can verify that this works by printing the resulting tensor:

dense = tf.feature_column.input_layer(
  inputs,
  [
    feature_numeric("number"),
    feature_indicator("indicator", ["A","B","C"]),
  ])

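# MonitoredTrainingSession runs the default initializers (including the
# vocabulary lookup table) before the tensor is evaluated.
with tf.train.MonitoredTrainingSession() as sess:
  print(dense)            # the symbolic tensor (shape and dtype)
  print(sess.run(dense))  # the evaluated values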
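To know what to look for in that printout, it helps to compute the indicator part by hand. As far as I understand it, indicator_column sums the one-hot encodings of every value in a row, and with the default settings of categorical_column_with_vocabulary_list an out-of-vocabulary value such as the "Z" padding contributes nothing. The sketch below emulates that in plain NumPy. It is not TensorFlow's implementation, and multi_hot_counts is a hypothetical helper name.

```python
import numpy as np


def multi_hot_counts(rows, vocabulary):
    """Emulate (by assumption) an indicator column: per-row vocabulary counts.

    Tokens missing from the vocabulary, such as padding, are ignored.
    """
    index = {token: i for i, token in enumerate(vocabulary)}
    out = np.zeros((len(rows), len(vocabulary)), dtype=np.float32)
    for r, row in enumerate(rows):
        for token in row:
            if token in index:
                out[r, index[token]] += 1.0
    return out


padded = np.array([
    ["A", "Z", "Z"],
    ["B", "Z", "Z"],
    ["C", "Z", "Z"],
    ["A", "A", "Z"],
    ["A", "B", "C"],
])

expected = multi_hot_counts(padded, ["A", "B", "C"])
print(expected)
# Rows: [1 0 0], [0 1 0], [0 0 1], [2 0 0], [1 1 1]
```

Note the fourth row: under this reading, a repeated value is counted twice rather than clipped to 1.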

Hack Saw · Answer

From what I can tell, the difficulty is that you are trying to make an indicator column from an array of arrays.

I collapsed your indicator array to:

"indicator": np.array([
  "A",
  "B",
  "C",
  "AA",
  "ABC",
])

... and the thing ran.

Moreover, I can't find any example where the vocabulary array is anything but a flat array of strings.
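One caveat worth checking before relying on the flattened form: the vocabulary lookup matches whole strings, so "AA" and "ABC" are not the same as the separate values "A", "A" or "A", "B", "C". A quick membership check in plain Python (no TensorFlow involved) makes this visible. My assumption is that, with the default out-of-vocabulary handling, such rows end up with no active category at all.

```python
vocabulary = ["A", "B", "C"]
flat = ["A", "B", "C", "AA", "ABC"]

# Whole-string membership, which is what a vocabulary lookup compares.
in_vocab = [token in vocabulary for token in flat]
print(in_vocab)  # [True, True, True, False, False]
```

So the flattened array runs without error, but the last two rows would likely lose their categorical information.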

Tags:

python

tensorflow

python-2.7

tensorflow-estimator

Theodore Lief Gannon
