
Checking for Blank Values in Specific Columns with Conditional Exceptions in pandas

Question 1: How do I check for empty values given additional conditions?

[sample data screenshot]

I have a CSV file containing 100 columns. Out of these, I want to check for blank values in the following columns:

bank
trade code
book value
business unit
COE value
corporate product id
counterparty legal entity
currency
cusip
face amount
legal entity
origination date
QRM book value
QRM face value

If any of these columns contain blank values, I want to highlight the particular column in the print statement. However, there's a special condition for the "origination date" column: if it is blank but the corresponding "source system" column contains a value like "post-close adjustment" or "GL-SDI gap", the blank is acceptable and should not be flagged. The code I tried (below) is not working as intended.

Question 2: How do I check if there is scientific notation in the raw data?


I also want to check if any values in the following columns are in scientific notation:

book value
face amount
QRM book value
QRM face amount

If any of these columns contain values in scientific notation, I want to highlight this in the print statement. However, the code below prints all four column names even when I deliberately change only the first cell of the "book value" column to scientific notation.

What I have tried

To solve the first question, I tried this code:

import pandas as pd

# Read the CSV file
df = pd.read_csv('c:/user/file.csv')

# Columns to check for blank values
columns_to_check = ['bank', 'trade code', 'book value', 'business unit', 'COE value', 'corporate product id',
                    'counterparty legal entity', 'currency', 'cusip', 'face amount', 'legal entity',
                    'origination date', 'qrm book value', 'qrm face value', 'source system']

# Function to check for blank values and print column names with blanks
def check_for_blank_values(df):
    for col in columns_to_check:
        blank_values = df[df[col].isna()]
        if not blank_values.empty and not (col == 'origination date' and ~blank_values['source system'].isin(['post-close adjustment', 'GL-SDI gap']).all()):
            print(f"Column '{col}' has blank values.")

# Check for blank values
check_for_blank_values(df)

For the second question, I tried the following:

import pandas as pd


# Read the CSV file
df = pd.read_csv('c:/user/file.csv')

# Function to check if any value in the column is in scientific notation
def check_scientific_values(df, column_names):
    for column_name in column_names:
        df[column_name] = pd.to_numeric(df[column_name], errors='coerce')
        if df[column_name].dtype == 'float64':
            print(f"The values in column '{column_name}' are in scientific notation.")

# Columns to check
columns_to_check = ['book value', 'face amount', 'QRM face amount', 'QRM book value']


check_scientific_values(df, columns_to_check)
asked Oct 12 '25 by Peter Parker

2 Answers

How to check for null values with additional conditions

In general, the answer to the first question might look like this:

df[columns_to_check].fillna(mask).isna().any()

That is, you first mask the blank values that should be skipped, then check which columns still contain blanks. For example:

exceptions = {
    'Origination_date': {
        'source': 'Source_system'
        , 'values': ['Post_clos_adj', 'SDI_Gap']
        , 'default': '01/01/2000'
    }
}

mask = {
    column: pd.Series(
        exception['default'], 
        index=df.index[df[exception['source']].isin(exception['values'])]
    ) for column, exception in exceptions.items()
}
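To make this concrete, here is a toy frame with made-up values: the mask covers only the exempt row, so it alone receives the placeholder date, while the genuine gap still surfaces in the final isna() check.

```python
import pandas as pd

# Made-up sample: row 0 is exempt (Post_clos_adj), row 1 is a genuine gap
df = pd.DataFrame({
    'Origination_date': [None, None, '10/27/2023'],
    'Source_system': ['Post_clos_adj', 'HMUS', 'HMUS'],
})

exceptions = {
    'Origination_date': {
        'source': 'Source_system',
        'values': ['Post_clos_adj', 'SDI_Gap'],
        'default': '01/01/2000',
    }
}

mask = {
    column: pd.Series(
        exception['default'],
        index=df.index[df[exception['source']].isin(exception['values'])],
    )
    for column, exception in exceptions.items()
}

# The mask covers only row 0, so after fillna() row 1 is still NaN
result = df[['Origination_date']].fillna(mask).isna().any()
print(result)
```

Note that fillna() with a dict of Series fills per column, aligning each Series on the index, which is what restricts the fill to the exempt rows.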

How to check for scientific notation in data

As for the second question, the format of the input numbers can only be recognized in the source text. Once the data is loaded, no numeric format is attached to the values until you print them, and only you decide in which format the numbers are printed. Therefore, to check the format of the input numbers in the given column_names, you must first load these columns with dtype=str and then match them against a scientific-notation pattern, for example:

df = pd.read_csv('data.csv', dtype=dict.fromkeys(column_names, str))
scinot = re.compile(r'[+-]?(?:0|[1-9]\d*)(?:\.\d*)?(?:[eE][+\-]?\d+)')
df[column_names].map(scinot.fullmatch).any()
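For instance, applied to individual strings (sample values only), the pattern requires an exponent for a match, so plain integers and decimals are rejected:

```python
import re

# Optional sign, integer part, optional fraction, mandatory exponent
scinot = re.compile(r'[+-]?(?:0|[1-9]\d*)(?:\.\d*)?(?:[eE][+\-]?\d+)')

samples = ['1.53E+08', '-2e10', '67517137', '0.5']
flags = {s: bool(scinot.fullmatch(s)) for s in samples}
print(flags)  # only the first two count as scientific notation
```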

Code for experiments

import pandas as pd
from io import StringIO
import re

raw_data = '''Origination_date,Reporting_date,Source_system,Book_value,Face_amount
12/11/2023,05/23/2024,Post_clos_adj,67517137,122126548
,05/23/2024,Post_clos_adj,1.53E+08,63810384
10/27/2023,05/23/2024,HMUS,182991335,187668072
,05/23/2024,HMUS,107402963,89933347
02/04/2024,05/23/2024,,24650754,222669942
,05/23/2024,SDI_Gap,131167066,213262751
'''

column_to_check_scinotation = ['Book_value', 'Face_amount']
df = pd.read_csv(StringIO(raw_data), 
                 dtype=dict.fromkeys(column_to_check_scinotation, str))

# Check scientific notation
scinot = re.compile(r'[+-]?(?:0|[1-9]\d*)(?:\.\d*)?(?:[eE][+\-]?\d+)')
has_scientific_notation = df[column_to_check_scinotation].map(scinot.fullmatch).any()

# Check null values
column_to_check_null = ['Origination_date', 'Reporting_date', 'Source_system']
exceptions = {
    'Origination_date': {
        'source': 'Source_system'
        , 'values': ['Post_clos_adj', 'SDI_Gap']
        , 'default': '01/01/2000'
    }
}
mask = {
    column: pd.Series(
        exception['default'], 
        index=df.index[df[exception['source']].isin(exception['values'])]
    ) for column, exception in exceptions.items()
}
has_null_values = df[column_to_check_null].fillna(mask).isna().any()

print('Null values:'.upper(),
      has_null_values,
      '-------------------',
      'Scientific notation:'.upper(),
      has_scientific_notation,
      sep='\n')

Output:
NULL VALUES:
Origination_date     True
Reporting_date      False
Source_system        True
dtype: bool
-------------------
SCIENTIFIC NOTATION:
Book_value      True
Face_amount    False
dtype: bool
answered Oct 14 '25 by Vitalizzare

I don't think you're handling the origination date properly. You can try this for your first question:

import pandas as pd

df = pd.read_csv('/mnt/data/your_file.csv')

columns_to_check = ['bank', 'trade code', 'book value', 'business unit', 'COE value', 'corporate product id',
                    'counterparty legal entity', 'currency', 'cusip', 'face amount', 'legal entity',
                    'origination date', 'QRM book value', 'QRM face value']

def check_for_blank_values(df):
    for col in columns_to_check:
        if col == 'origination date':
            blank_values = df[df[col].isna() & ~df['source system'].isin(['post-close adjustment', 'GL-SDI gap'])]
        else:
            blank_values = df[df[col].isna()]
        if not blank_values.empty:
            print(f"Column '{col}' has blank values.")

check_for_blank_values(df)
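As a quick sanity check on the conditional logic, here is the same function run against a small inline frame (hypothetical sample data, shortened column list; the exempt values follow the question's wording). The exempt blank date in row 0 is skipped, while the non-exempt blank date and the blank currency are flagged:

```python
import pandas as pd
from io import StringIO

# Hypothetical sample: row 0's blank date is exempt, row 1's is not
raw = '''origination date,source system,currency
,post-close adjustment,USD
,HMUS,
10/27/2023,HMUS,EUR
'''
df = pd.read_csv(StringIO(raw))

columns_to_check = ['origination date', 'currency']

def check_for_blank_values(df):
    flagged = []
    for col in columns_to_check:
        if col == 'origination date':
            # a blank date is acceptable when the source system is exempt
            blanks = df[df[col].isna()
                        & ~df['source system'].isin(['post-close adjustment',
                                                     'GL-SDI gap'])]
        else:
            blanks = df[df[col].isna()]
        if not blanks.empty:
            flagged.append(col)
            print(f"Column '{col}' has blank values.")
    return flagged

flagged = check_for_blank_values(df)
print(flagged)  # ['origination date', 'currency']
```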

You can then use a regular expression to identify scientific-notation values, as the first answer suggests. Try this for your second question:

import pandas as pd

columns_to_check = ['book value', 'face amount', 'QRM book value', 'QRM face amount']

# Read the candidate columns as strings so the original text (e.g. '1.53E+08')
# survives instead of being parsed into a float, which would lose the notation
df = pd.read_csv('/mnt/data/your_file.csv',
                 dtype=dict.fromkeys(columns_to_check, str))

def check_scientific_values(df, column_names):
    for column_name in column_names:
        # Flag the column only when some raw value matches a scientific-notation pattern
        if df[column_name].str.fullmatch(r'[+-]?\d+(?:\.\d*)?[eE][+-]?\d+', na=False).any():
            print(f"The values in column '{column_name}' are in scientific notation.")

check_scientific_values(df, columns_to_check)
answered Oct 14 '25 by Hosea