
Infer multivalent features with tfdv from pandas dataframe

I want to infer a schema with TensorFlow Data Validation (tfdv) from a pandas dataframe of the training data. The dataframe contains a column with a multivalent feature: a row can hold several values of the feature at once, or none (None).

Given the following dataframe:

import pandas as pd

df = pd.DataFrame([{'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
                   {'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
                   {'feat_1': 7, 'feat_2': None, 'feat_3': None}])

inferring and displaying the schema results in:

[image: the inferred schema as displayed by tfdv]

Thus, tfdv treats the 'feat_2' values as a single string instead of splitting them at the ',' to produce a domain of 'AA', 'BB':

[image: the domain tfdv inferred for 'feat_2']

If I save the values of the feature as a list instead, e.g. ['AA', 'BB'], the schema inference throws an error:

ArrowTypeError: ("Expected bytes, got a 'list' object", 'Conversion failed for column feat_2 with type object')

Is there any way to achieve this with tfdv?

ppmt asked Sep 17 '25 20:09


1 Answer

tfdv interprets a string as one string value, so 'AA, BB' is not split at the comma. Regarding the error you get with a list, it might be related to this issue:

Currently only pandas columns of primitive types are supported.

I could not find anything more recent. As a workaround, you can split the strings at the comma and explode the column so that each row holds a single value:

import pandas as pd
import tensorflow_data_validation as tfdv

df = pd.DataFrame([{'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
                   {'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
                   {'feat_1': 7, 'feat_2': None, 'feat_3': None}])

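# Split at the comma so each row holds a list of values; None stays missing
df['feat_2'] = df['feat_2'].str.split(',')
# One row per value; this repeats feat_1 and feat_3 for rows with several values
df = df.explode('feat_2').reset_index(drop=True)
# Strip the space left after each comma so that ' BB' becomes 'BB'
df['feat_2'] = df['feat_2'].str.strip()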

train_stats = tfdv.generate_statistics_from_dataframe(df)
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

[image: the schema displayed after applying the workaround]
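One side effect of exploding the whole dataframe is that rows with several 'feat_2' values are repeated, so the row with feat_1 = 13 is counted twice in any statistics computed afterwards. A minimal pandas-only sketch of one way around this is to explode just the multivalent column into its own frame. The helper name explode_multivalent is hypothetical (it is not part of tfdv or pandas), and whether a separate frame fits your pipeline is an assumption:

```python
import pandas as pd


def explode_multivalent(df, column, sep=","):
    """Return a one-column frame with one row per individual value of `column`.

    Hypothetical helper for illustration; not part of tfdv or pandas.
    """
    values = (
        df[column]
        .str.split(sep)          # 'AA, BB' -> ['AA', ' BB']; None stays missing
        .explode()               # one row per list element
        .str.strip()             # drop the space left after the separator
        .reset_index(drop=True)
    )
    return values.to_frame(name=column)


df = pd.DataFrame([{'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
                   {'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
                   {'feat_1': 7, 'feat_2': None, 'feat_3': None}])

feat_2_only = explode_multivalent(df, 'feat_2')

# Distinct non-missing values of the multivalent feature
domain = sorted(feat_2_only['feat_2'].dropna().unique())
print(domain)

# The original frame keeps its 3 rows; only the extra frame grows to 4
print(len(df), len(feat_2_only))
```

The frame feat_2_only could then be passed to tfdv.generate_statistics_from_dataframe in the same way as above, while the statistics for 'feat_1' and 'feat_3' are generated from the untouched df. How best to combine the two results into one schema is left open here.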

AloneTogether answered Sep 19 '25 11:09