With categorical features, running ML.WEIGHTS on the created model shows that BigQuery ML automatically creates a "_null_filler" dummy variable, which makes sense.
For numeric features, are missing values imputed using the mean or something else? And are these two behaviors mentioned anywhere in the official documentation?
In statistics, imputation is the process of replacing missing data with substituted values. In training, a missing value occurs when BigQuery encounters a null value in the dataset. In prediction, a missing value can occur when BigQuery encounters either a null value or a previously unseen value. The following describes how BigQuery ML handles missing data in each case.
For numerical types (which BigQuery ML standardizes automatically), null values are replaced with the mean of the feature column, calculated from the original input dataset, for both training and prediction.
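To make the numeric case concrete, here is a minimal Python sketch of mean imputation followed by standardization. It is only an illustration of the behavior described above, not BigQuery ML's actual implementation: the function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def fit_numeric(train_values):
    """Compute the mean and standard deviation from the non-null training values."""
    observed = [v for v in train_values if v is not None]
    # Assumption: population standard deviation; the source does not say which is used.
    return mean(observed), pstdev(observed)

def transform_numeric(values, mu, sigma):
    """Replace nulls with the training mean, then standardize every value."""
    scale = sigma if sigma != 0 else 1.0  # avoid division by zero for constant columns
    filled = [mu if v is None else v for v in values]
    return [(v - mu) / scale for v in filled]

# Statistics come from the training data only and are reused at prediction time.
mu, sigma = fit_numeric([1.0, 3.0, None, 5.0])
train_features = transform_numeric([1.0, 3.0, None, 5.0], mu, sigma)
pred_features = transform_numeric([None, 3.0], mu, sigma)
```

Under these assumptions, a null that is replaced with the mean ends up as 0 after standardization, so it adds nothing to a linear model's prediction beyond the intercept.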
For one-hot encoded columns, an additional category is added that all null values map to, for both training and prediction. A value not seen during training is effectively assigned a weight of 0 at prediction.
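The categorical case can be sketched the same way. This is a hedged illustration rather than BigQuery ML's code: the helper names are hypothetical, and the extra category is labeled "_null_filler" only because that is the name reported by ML.WEIGHTS in the question.

```python
NULL_CATEGORY = "_null_filler"  # label taken from the ML.WEIGHTS output in the question

def fit_categories(train_values):
    """Collect the categories seen in training and append the extra null category."""
    seen = sorted({v for v in train_values if v is not None})
    return seen + [NULL_CATEGORY]

def one_hot(value, categories):
    """Encode a value; nulls map to the null category, unseen values to all zeros."""
    key = NULL_CATEGORY if value is None else value
    return [1 if c == key else 0 for c in categories]

categories = fit_categories(["red", "blue", None, "red"])
null_row = one_hot(None, categories)       # hits the dedicated null column
unseen_row = one_hot("green", categories)  # matches no column at all
```

Because an unseen value produces an all-zero row, no learned weight is multiplied by a 1, which matches the description of unseen data receiving a weight of 0 at prediction.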
We're missing this information in our public documents. We're working on adding that right now. Thanks for bringing this up.