We're trying to build a pipeline that takes data from BigQuery, runs it through TensorFlow Transform, and then trains a model in TensorFlow.
The pipeline is up and running, but we're having difficulty with null values in BigQuery.
We're using Beam to load from BigQuery:
    raw_data = (pipeline
                | '{}_read_from_bq'.format(step) >> beam.io.Read(
                    beam.io.BigQuerySource(query=source_query,
                                           use_standard_sql=True,
                                           )))
I'm playing with the dataset metadata, trying FixedLenFeature and VarLenFeature for various columns:
    # Categorical feature schema
    categorical_features = {
        column_name: tf.io.FixedLenFeature([], tf.string)
        for column_name in categorical_columns
    }
    raw_data_schema.update(categorical_features)

    # Numerical feature schema
    numerical_features = {
        column_name: tf.io.VarLenFeature(tf.float32)
        for column_name in numerical_columns
    }
    raw_data_schema.update(numerical_features)

    # Create dataset_metadata given raw_data_schema
    raw_metadata = dataset_metadata.DatasetMetadata(
        schema_utils.schema_from_feature_spec(raw_data_schema))
As expected, if you try to feed a BigQuery NULL into a FixedLenFeature, it breaks.
However, when I try to feed strings or integers into a VarLenFeature, it breaks too. This seems to be because VarLenFeature expects a list, but BigQuerySource gives a Python primitive. The exact point where it breaks is here (the error is from when I tried with an integer):
    File "/usr/local/lib/python3.7/site-packages/tensorflow_transform/impl_helper.py", line 157, in <listcomp>
      indices = [range(len(value)) for value in values]
    TypeError: object of type 'int' has no len()
    [while running 'train_transform/AnalyzeDataset/ApplySavedModel[Phase0]/ApplySavedModel/ApplySavedModel']
When I try VarLenFeature with my string inputs, e.g. "UK", the output is a SparseTensor like this:
SparseTensorValue(indices=[(0, 0), (0, 1)], values=['U', 'K'], dense_shape=(1, 2))
So it seems like I need to be passing a list into VarLenFeature for this to work, but BigQuerySource does not do this by default.
Is there a simple way of achieving this? Or am I totally missing the mark on reading nullable columns from BigQuery?
Thank you very much in advance!
You might need to handle NULL (missing) values yourself. For numerical columns, you could replace NULLs with the mean or median. For categorical (STRING) columns, you could use a default value such as an empty string, or a dedicated value that acts as a missing-value indicator.
I'm not very familiar with VarLenFeature, but you can probably replace the NULLs (NULL imputation) directly in source_query. Something like:
IFNULL(col, col_mean) AS col_imputed
The downside is that you have to calculate col_mean first using SQL and fill it in here as a constant. You also need to remember this mean and apply the same value at prediction time, because the imputation is not part of tf.transform (your graph).
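If you would rather do the same imputation on the Python side instead of in SQL, a minimal sketch could look like the one below. It assumes each BigQuery row arrives as a plain dict with None for NULL. The function name impute_row, the column names, and the default values are all hypothetical; the defaults stand in for constants you have already computed and stored somewhere.

```python
def impute_row(row, numeric_defaults, categorical_columns, categorical_default=""):
    """Return a copy of `row` with None values replaced by stored defaults.

    numeric_defaults: dict of column name -> precomputed constant (e.g. a mean).
    categorical_columns: names of string columns that receive categorical_default.
    """
    imputed = dict(row)  # copy so the input row is left untouched
    for name, default in numeric_defaults.items():
        if imputed.get(name) is None:
            imputed[name] = default
    for name in categorical_columns:
        if imputed.get(name) is None:
            imputed[name] = categorical_default
    return imputed


# Hypothetical constants, assumed to be computed once from the training data.
NUMERIC_DEFAULTS = {"age": 35.0}
CATEGORICAL_COLUMNS = ["country"]

example = impute_row({"age": None, "country": None},
                     NUMERIC_DEFAULTS, CATEGORICAL_COLUMNS)
```

In a Beam pipeline this could probably be applied right after the read step with something like beam.Map(impute_row, NUMERIC_DEFAULTS, CATEGORICAL_COLUMNS). The same caveat applies as with the SQL version: the constants live outside the transform graph, so the identical function and defaults have to be used at prediction time.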
BigQuery itself has BigQuery ML (BQML) as an ML platform. It supports a TRANSFORM clause and automatic imputation. Maybe you could also take a look :)
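On the VarLenFeature route from the question: the traceback calls len(value) on each value, which suggests the values are expected to be lists. One thing that may be worth trying is wrapping each primitive in a single-element list, and turning NULL into an empty list, before the rows reach tf.Transform. This is only a sketch under that assumption. The helper names and column names are hypothetical, and whether your tf.Transform version accepts this input format is something you would need to check.

```python
def to_varlen_value(value):
    """Convert a BigQuery cell into the list form assumed for VarLenFeature."""
    if value is None:
        return []           # NULL becomes an empty (missing) feature
    if isinstance(value, (list, tuple)):
        return list(value)  # already a repeated field; keep its elements
    return [value]          # primitives such as "UK" or 7 become ["UK"] or [7]


def wrap_row(row, varlen_columns):
    """Return a copy of `row` with the given columns wrapped as lists."""
    wrapped = dict(row)
    for name in varlen_columns:
        wrapped[name] = to_varlen_value(row.get(name))
    return wrapped


# Hypothetical row as it might come out of the BigQuery read step.
example = wrap_row({"country": "UK", "age": None, "id": 7}, ["country", "age"])
```

Wrapping the string as ["UK"] rather than passing "UK" directly is also what should avoid the character-by-character SparseTensor shown in the question, since the string is no longer iterated as a sequence. The function could be applied with beam.Map(wrap_row, varlen_columns) after the read.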