I need to validate the dates in a dataframe (https://pastebin.com/kNqLtUWu). If a date is invalid, i.e. pd.to_datetime cannot parse it (for example 0107-01-06), I need to populate the Fail column with Yes.
I subset the columns containing dates and was able to identify the columns that contain invalid dates and add them to a dict, but haven't figured out how to return the specific row.
I am open to other approaches, but I need to use pandas and end up with a Fail column to indicate the row, which I plan to filter the final dataframe on (one dataframe containing rows with bad dates and the other containing no errors).
See the pastebin link for the full code.
import numpy as np
import pandas as pd

# insert empty Fail column to identify date errors
df.insert(loc=0, column='Fail', value="")
# replace all blank (empty or whitespace-only) cells with np.nan
df.replace(r"^\s*$", np.nan, regex=True, inplace=True)
# get list of date columns (everything after Fail and patient_ID)
cols = list(df)
date_cols = cols[2:]
# create empty dict
dfs = {}
# iterate over date columns to identify which columns contain invalid dates & add to dfs
for col in df[date_cols]:
    try:
        df[col] = df[col].apply(pd.to_datetime, errors='raise')
    except Exception:
        print("%s column contains invalid date" % col)
        dfs[col] = df[col]
Your problem as described can be solved with errors='coerce' and a little logic:
# original non_null
notnull = df[col].notnull()
# where to_datetime fails
not_datetime = pd.to_datetime(df[col], errors='coerce').isna()
not_datetime = not_datetime & notnull
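To get from the per-column mask to the Fail column the question asks for, one option is to OR the masks of all date columns together. This is only a sketch: the sample data, the `fail_mask` name, and the use of `np.where` are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names mirror the question's layout.
df = pd.DataFrame({
    "patient_ID": ["A001", "A002", "A003"],
    "date_1": ["2010-01-01", "not-a-date", np.nan],
    "date_2": ["2011-05-17", "2012-03-04", "2013-07-09"],
})
date_cols = ["date_1", "date_2"]

# Start with "no failure" for every row, then OR in each column's mask.
fail_mask = pd.Series(False, index=df.index)
for col in date_cols:
    # original non-null entries
    notnull = df[col].notnull()
    # entries that to_datetime could not parse (or that were already missing)
    not_datetime = pd.to_datetime(df[col], errors="coerce").isna()
    # a failure is a value that was present but did not parse
    fail_mask |= not_datetime & notnull

# Mark failing rows with "Yes", as requested in the question.
df["Fail"] = np.where(fail_mask, "Yes", "")
```

The missing value in the third row is not flagged, because it was null before parsing rather than present and unparseable.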
IIUC, your concern is creating the Fail column, so I will focus on that.
I think you can achieve it by applying a custom lambda to the date columns row by row (axis=1). The lambda filters out NaN before passing each row slice to pd.to_datetime with errors='coerce', then checks whether the output contains any NaT.
df['Fail'] = (df[date_cols].apply(lambda x: pd.to_datetime(x[x.notna()], errors='coerce')
                                  .isna().any(), axis=1).replace({True: 'Fail', False: ''}))
Out[869]:
   Fail patient_ID DateOfBirth  ...    date_10    date_11    date_12
 0            A001  1950-03-02  ...        NaT        NaT        NaN
 1            A001  1950-03-02  ...        NaT        NaT        NaN
 2            A001  1950-03-02  ...        NaT        NaT        NaN
 3            A001  1950-03-02  ...        NaT        NaT        NaN
 4            A001  1950-03-02  ... 2010-01-01        NaT        NaN
 5            A001  1950-03-02  ...        NaT 2010-01-01        NaN
 6            A001  1950-03-02  ...        NaT        NaT   1/1/2010
 7            A001  1950-03-02  ...        NaT        NaT   1/1/2010
 8            A001  1950-03-02  ...        NaT        NaT   1/1/2010
 9            A001  1950-03-02  ...        NaT        NaT   1/1/2010
10            A001  1950-03-02  ...        NaT        NaT   1/1/2010
11            A001  1950-03-02  ...        NaT        NaT   1/1/2010
12            A001  1950-03-02  ...        NaT        NaT   1/1/2010
13            A001  1950-03-02  ...        NaT        NaT   1/1/2010
14            A001  1950-03-02  ...        NaT        NaT   1/1/2010
15 Fail       A002  1950-03-02  ...        NaT        NaT        NaN
16            A002  1950-03-02  ...        NaT        NaT        NaN
17            A002  1950-03-02  ...        NaT        NaT        NaN
18            A002  1950-03-02  ...        NaT        NaT        NaN
19            A002  1950-03-02  ... 2010-01-01        NaT        NaN
20            A002  1950-03-02  ...        NaT 2010-01-01        NaN
21            A002  1950-03-02  ...        NaT        NaT   1/1/2010
22            A002  1950-03-02  ...        NaT        NaT   1/1/2010
23            A002  1950-03-02  ...        NaT        NaT   1/1/2010
24            A002  1950-03-02  ...        NaT        NaT   1/1/2010
25            A002  1950-03-02  ...        NaT        NaT   1/1/2010
26            A002  1950-03-02  ...        NaT        NaT   1/1/2010
27            A002  1950-03-02  ...        NaT        NaT   1/1/2010
28            A002  1950-03-02  ...        NaT        NaT   1/1/2010
29 Fail       A002  1950-03-02  ...        NaT        NaT 0107-01-06
[30 rows x 15 columns]
Note:
The code above only creates the Fail column; it does not convert the date columns to datetime. To convert them, call pd.to_datetime separately.
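A separate conversion step might look like the sketch below. The sample frame is made up for illustration, and a plain per-column loop is just one way to do it. Note that the conversion should run after the Fail column has been filled in, since coercion turns invalid values into NaT and they can no longer be told apart from missing ones.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names mirror the question's layout.
df = pd.DataFrame({
    "patient_ID": ["A001", "A002"],
    "date_1": ["2010-01-01", "not-a-date"],
    "date_2": ["2011-05-17", np.nan],
})
date_cols = ["date_1", "date_2"]

# Convert each date column in turn; unparseable entries become NaT.
for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors="coerce")
```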
Below are the values of the two rows where Fail is set:
In [870]: df.loc[15]
Out[870]:
Fail Fail
patient_ID A002
DateOfBirth 1950-03-02 00:00:00
date_1 0107-01-06
date_2 2010-01-01 00:00:00
date_3 NaT
date_4 NaT
date_5 NaT
date_6 NaT
date_7 NaT
date_8 NaT
date_9 NaT
date_10 NaT
date_11 NaT
date_12 NaN
Name: 15, dtype: object
In [871]: df.loc[29]
Out[871]:
Fail Fail
patient_ID A002
DateOfBirth 1950-03-02 00:00:00
date_1 NaN
date_2 NaT
date_3 NaT
date_4 NaT
date_5 NaT
date_6 NaT
date_7 NaT
date_8 NaT
date_9 NaT
date_10 NaT
date_11 NaT
date_12 0107-01-06
Name: 29, dtype: object
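With the Fail column in place, the split the question describes (one dataframe with the bad-date rows, one without errors) could be sketched as follows. The frame below is a made-up stand-in, and `is_bad`, `bad_rows`, and `good_rows` are hypothetical names; the marker value 'Fail' matches the code above and would need to be 'Yes' if you fill the column that way instead.

```python
import pandas as pd

# Hypothetical stand-in for the flagged dataframe.
df = pd.DataFrame({
    "Fail": ["", "Fail", ""],
    "patient_ID": ["A001", "A002", "A003"],
})

# Boolean mask of rows that were flagged as containing a bad date.
is_bad = df["Fail"].eq("Fail")

# .copy() so later edits to either piece do not touch the original frame.
bad_rows = df[is_bad].copy()
good_rows = df[~is_bad].copy()
```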