I want to infer a schema with TensorFlow Data Validation (tfdv) from a pandas dataframe of the training data. The dataframe contains a column with a multivalent feature, where a single row can hold several values of the feature at once (or None).
Given the following dataframe:
df = pd.DataFrame([{'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
{'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
{'feat_1': 7, 'feat_2': None, 'feat_3': None}])
inferring and displaying the schema results in:
Thus, tfdv treats each 'feat_2' value as a single string instead of splitting it at the ',' to produce a domain of 'AA', 'BB'.
If I instead store the values of the feature as a list, e.g. ['AA', 'BB'], the schema inference throws an error:
ArrowTypeError: ("Expected bytes, got a 'list' object", 'Conversion failed for column feat_2 with type object')
Is there any way to achieve this with tfdv?
A string will be interpreted as a string. Regarding your issue with the list, it might be related to this issue:
Currently only pandas columns of primitive types are supported.
I could not find anything more recent. Here is a workaround that splits the strings and gives each value its own row:
import pandas as pd
import tensorflow_data_validation as tfdv

df = pd.DataFrame([{'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
                   {'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
                   {'feat_1': 7, 'feat_2': None, 'feat_3': None}])

# Split the comma-separated string into a list, then give each value its own row.
df['feat_2'] = df['feat_2'].str.split(',')
df = df.explode('feat_2').reset_index(drop=True)
# Strip the space left after the comma so that ' BB' becomes 'BB'.
df['feat_2'] = df['feat_2'].str.strip()

train_stats = tfdv.generate_statistics_from_dataframe(df)
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
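One caveat with the explode approach: every extra value in 'feat_2' duplicates the whole row, so the statistics of the other columns are computed over more rows than the original data has. The pandas-only sketch below (no tfdv involved; the names exploded and domain are just illustrative) shows the resulting domain of 'feat_2' and how the mean of 'feat_1' shifts on the example dataframe.

```python
import pandas as pd

df = pd.DataFrame([
    {'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
    {'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
    {'feat_1': 7, 'feat_2': None, 'feat_3': None},
])

# Split on commas and give each value its own row; missing values stay as one row.
exploded = (
    df.assign(feat_2=df['feat_2'].str.split(','))
      .explode('feat_2')
      .reset_index(drop=True)
)
# Remove the whitespace that followed the comma in the original string.
exploded['feat_2'] = exploded['feat_2'].str.strip()

# Distinct non-null values of feat_2 after the split.
domain = sorted(exploded['feat_2'].dropna().unique().tolist())
print(domain)

# The row with two values now appears twice, which shifts feat_1's statistics.
print(len(df), '->', len(exploded))
print(df['feat_1'].mean(), '->', exploded['feat_1'].mean())
```

If the statistics of the other features matter, it may be safer to use the exploded dataframe only to derive the 'feat_2' domain and to compute the remaining statistics from the original dataframe.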