
pandas schema validation with specific columns

I have a pandas dataframe with 56 columns and about 120,000 rows.

I would like to validate only some of the columns, not all of them.

I followed the article at https://tmiguelt.github.io/PandasSchema/

When I run something like the function below, it reports this error:

"Invalid number of columns. The schema specifies 2, but the data frame has 56"

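import pandas as pd
import pandas_schema
from pandas_schema import Column
from pandas_schema.validation import CustomElementValidation

def DoValidation(self, df):
    # pd.notna also catches NaN values that are not the np.nan object itself
    null_validation = [CustomElementValidation(lambda d: pd.notna(d), 'this field cannot be null')]

    # Schema takes a single list of Column objects
    schema = pandas_schema.Schema([Column('ItemId', null_validation),
                                   Column('ItemName', null_validation)])
    errors = schema.validate(df)
    if len(errors) > 0:
        for error in errors:
            print(error)
        return False
    return True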

Am I doing something wrong?

What is the correct way to validate specific columns in a dataframe?

Note: I have to implement different types of validation (decimal, length, null checks, etc.) on different columns, not just the null check shown in the function above.

user1957116 asked Oct 26 '25 07:10


1 Answer

As Yuki Ho mentioned in his answer, by default the schema must define as many columns as the dataframe contains.

However, schema.validate() also accepts a columns parameter that specifies which columns to check. Combine it with schema.get_column_names() to validate only the columns defined in the schema, which avoids your issue:

schema.validate(df, columns=schema.get_column_names())
elevendollar answered Oct 28 '25 19:10