I have a CSV file containing 100 columns. Out of these, I want to check for blank values in the following columns:
bank and trade code
book value
business unit
COE value
corporate product id
counterparty legal entity
currency
cusip
face amount
legal entity
origination date
QRM book value
QRM face value
If any of these columns contain blank values, I want to highlight the particular column in the print statement. However, there is a special condition for the "origination date" column: if it contains blank values but the corresponding "source system" column has values like "post-close adjustment" or "GL-SDI gap", those blank values are acceptable and should not be flagged. I have tried the code below, but it is not working as intended.
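For example, with hypothetical rows like these (not my real data), only the last row should be flagged:
origination date,source system
12/11/2023,HMUS                  <- not blank, fine
,post-close adjustment           <- blank but acceptable
,HMUS                            <- blank and should be flagged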
I also want to check if any values in the following columns are in scientific notation:
book value
face amount
QRM book value
QRM face amount
If any of these columns contain values in scientific notation, I want to highlight this in the print statement with the code below. However, it prints all four column names even when I deliberately change only the first cell under the "book value" column to scientific notation.
To solve the first question, I tried this code:
import pandas as pd

# Read the CSV file
df = pd.read_csv('c:/user/file.csv')

# Columns to check for blank values
columns_to_check = ['bank', 'trade code', 'book value', 'business unit', 'COE value', 'corporate product id',
                    'counterparty legal entity', 'currency', 'qsip', 'face amount', 'legal entity',
                    'origination date', 'qrm book value', 'qrm face value', 'source system']

# Function to check for blank values and print column names with blanks
def check_for_blank_values(df):
    for col in columns_to_check:
        blank_values = df[df[col].isna()]
        if not blank_values.empty and not (col == 'origination date' and ~blank_values['source system'].isin(['post-close adjustment', 'GL-SDI gap']).all()):
            print(f"Column '{col}' has blank values.")

# Check for blank values
check_for_blank_values(df)
For the second question, I tried the following:
import pandas as pd

# Read the CSV file
df = pd.read_csv('c:/user/file.csv')

# Function to check if any value in the column is in scientific notation
def check_scientific_values(df, column_names):
    for column_name in column_names:
        df[column_name] = pd.to_numeric(df[column_name], errors='coerce')
        if df[column_name].dtype == 'float64':
            print(f"The values in column '{column_name}' are in scientific notation.")

# Columns to check
columns_to_check = ['book value', 'face amount', 'QRM face amount', 'QRM book value']
check_scientific_values(df, columns_to_check)
In general, the answer to the first question might look like this:
df[columns_to_check].fillna(mask).isna().any()
The idea is to first fill (mask) the blank values that are allowed to be blank, so that only the unexpected blanks remain as NaN. For example:
exceptions = {
    'Origination_date': {
        'source': 'Source_system',
        'values': ['Post_clos_adj', 'SDI_Gap'],
        'default': '01/01/2000'
    }
}
mask = {
    column: pd.Series(
        exception['default'],
        index=df.index[df[exception['source']].isin(exception['values'])]
    ) for column, exception in exceptions.items()
}
As for the second question, the format of the input numbers can only be recognized in the source file. Once the data is loaded, no numeric format is attached to the values; a format is only applied when you print them, and you decide what that format is. Therefore, to check the format of the input numbers in the given column_names, you must first load these columns with dtype=str and then match them against a scientific-notation pattern, for example:
df = pd.read_csv('data.csv', dtype=dict.fromkeys(column_names, str))
scinot = re.compile(r'[+-]?(?:0|[1-9]\d*)(?:\.\d*)?(?:[eE][+\-]?\d+)')
df[column_names].map(scinot.fullmatch).any()
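One compatibility note: DataFrame.map only exists from pandas 2.1 onward; on older versions the same element-wise call is applymap. A hedged equivalent, assuming the columns were loaded as strings and have no missing values:
df[column_names].applymap(scinot.fullmatch).any()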
Here is a complete, self-contained example combining both checks on some sample data:
import pandas as pd
from io import StringIO
import re
raw_data = '''Origination_date,Reporting_date,Source_system,Book_value,Face_amount
12/11/2023,05/23/2024,Post_clos_adj,67517137,122126548
,05/23/2024,Post_clos_adj,1.53E+08,63810384
10/27/2023,05/23/2024,HMUS,182991335,187668072
,05/23/2024,HMUS,107402963,89933347
02/04/2024,05/23/2024,,24650754,222669942
,05/23/2024,SDI_Gap,131167066,213262751
'''
column_to_check_scinotation = ['Book_value', 'Face_amount']
df = pd.read_csv(StringIO(raw_data),
                 dtype=dict.fromkeys(column_to_check_scinotation, str))
# Check scientific notation
scinot = re.compile(r'[+-]?(?:0|[1-9]\d*)(?:\.\d*)?(?:[eE][+\-]?\d+)')
has_scientific_notation = df[column_to_check_scinotation].map(scinot.fullmatch).any()
# Check null values
column_to_check_null = ['Origination_date', 'Reporting_date', 'Source_system']
exceptions = {
    'Origination_date': {
        'source': 'Source_system',
        'values': ['Post_clos_adj', 'SDI_Gap'],
        'default': '01/01/2000'
    }
}
mask = {
    column: pd.Series(
        exception['default'],
        index=df.index[df[exception['source']].isin(exception['values'])]
    ) for column, exception in exceptions.items()
}
has_null_values = df[column_to_check_null].fillna(mask).isna().any()
print('Null values:'.upper(),
      has_null_values,
      '-------------------',
      'Scientific notation:'.upper(),
      has_scientific_notation,
      sep='\n')
NULL VALUES:
Origination_date True
Reporting_date False
Source_system True
dtype: bool
-------------------
SCIENTIFIC NOTATION:
Book_value True
Face_amount False
dtype: bool
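For reference, with the sample data above mask['Origination_date'] is a Series of the default date indexed only by the rows whose Source_system is one of the accepted values (rows 0, 1 and 5). fillna therefore fills the blanks in rows 1 and 5, while row 3 stays NaN and produces the True shown for Origination_date:
print(mask['Origination_date'])
# 0    01/01/2000
# 1    01/01/2000
# 5    01/01/2000
# dtype: object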
I don't think you're handling the origination date properly. You can try this for your first query example:
import pandas as pd

df = pd.read_csv('/mnt/data/your_file.csv')

columns_to_check = ['bank', 'trade code', 'book value', 'business unit', 'COE value', 'corporate product id',
                    'counterparty legal entity', 'currency', 'cusip', 'face amount', 'legal entity',
                    'origination date', 'QRM book value', 'QRM face value']

def check_for_blank_values(df):
    for col in columns_to_check:
        if col == 'origination date':
            # Only flag blanks whose 'source system' is NOT one of the accepted values
            blank_values = df[df[col].isna() & ~df['source system'].isin(['post-close adjustment', 'GL-SDI gap'])]
        else:
            blank_values = df[df[col].isna()]
        if not blank_values.empty:
            print(f"Column '{col}' has blank values.")

check_for_blank_values(df)
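To see the behaviour, here is a minimal run of the same logic on a tiny in-memory frame (hypothetical data, only two of the checked columns plus "source system"): the first blank origination date is excused by "post-close adjustment", the second is not.
import pandas as pd

sample = pd.DataFrame({
    'origination date': ['12/11/2023', None, None],
    'source system': ['HMUS', 'post-close adjustment', 'HMUS'],
    'book value': [100.0, None, 300.0],
})
for col in ['origination date', 'book value']:
    if col == 'origination date':
        # Keep only blanks whose source system is not one of the accepted values
        blanks = sample[sample[col].isna()
                        & ~sample['source system'].isin(['post-close adjustment', 'GL-SDI gap'])]
    else:
        blanks = sample[sample[col].isna()]
    if not blanks.empty:
        print(f"Column '{col}' has blank values.")
# Prints both column names: row 2's blank origination date comes from 'HMUS' so it is flagged,
# row 1's blank is excused, and 'book value' is blank in row 1.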
And then you can implement some regex to identify scientific notation values. Try this for your second query example:
import pandas as pd
import re

columns_to_check = ['book value', 'face amount', 'QRM book value', 'QRM face amount']
# Read the checked columns as strings so scientific notation from the file is preserved
df = pd.read_csv('/mnt/data/your_file.csv', dtype=dict.fromkeys(columns_to_check, str))

# Regex for numbers written in scientific notation, e.g. 1.53E+08
sci_pattern = re.compile(r'[+-]?\d+(?:\.\d*)?[eE][+-]?\d+')

def check_scientific_values(df, column_names):
    for column_name in column_names:
        # Flag the column if any raw value matches the scientific notation pattern
        found = df[column_name].apply(lambda x: isinstance(x, str) and bool(sci_pattern.fullmatch(x)))
        if found.any():
            print(f"The values in column '{column_name}' are in scientific notation.")

check_scientific_values(df, columns_to_check)