
pandas select rows where datetime error occurs

I need to validate the dates in a dataframe (https://pastebin.com/kNqLtUWu). If a date is invalid (i.e. pd.to_datetime cannot parse it, e.g. 0107-01-06), I need to populate the Fail column of that row with Yes.

I subset the columns containing dates and was able to identify the columns that contain invalid dates and add them to a dict, but haven't figured out how to return the specific row.

I am open to other approaches, but I need to use pandas and end up with a Fail column to indicate the row, which I plan to filter the final dataframe on (one dataframe containing rows with bad dates and the other containing no errors).

See pastebin link for full code

# insert empty Fail column to identify date errors
df.insert(loc=0, column='Fail', value="")

# replace all blank (empty or whitespace-only) cells with np.nan
df.replace(r"^\s*$", np.nan, regex=True, inplace=True)

# get list of date columns
cols = list(df)
date_cols = cols[2:]

# create empty dict
dfs = {}

# iterate over date columns to identify which columns contain invalid dates & add to dfs
for col in df[date_cols]:
    try:
        df[col] = df[col].apply(pd.to_datetime, errors='raise')
    except:
        print("%s column contains invalid date" % col)
        dfs[col] = df[col]

n8-da-gr8 asked Oct 30 '25 14:10


2 Answers

Your problem as described can be solved with errors='coerce' and a little logic:

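# values that were present (non-null) in the original column
notnull = df[col].notnull()

# where to_datetime fails (this also includes the originally null values)
not_datetime = pd.to_datetime(df[col], errors='coerce').isna()

# invalid dates: present in the original but not parseable
not_datetime = not_datetime & notnull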
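
To get from the per-column mask above to the Fail column the question asks for, the same idea can be applied to all date columns at once. This is only a minimal sketch, not the answerer's code: the sample frame and its values, and the names present, unparsed, bad, bad_rows and good_rows, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the asker's dataframe
df = pd.DataFrame({
    "patient_ID": ["A001", "A002", "A003", "A004"],
    "date_1": ["2010-01-01", "2010-13-45", np.nan, np.nan],
    "date_2": ["2011-05-17", np.nan, "not a date", np.nan],
})
date_cols = ["date_1", "date_2"]

# True where the original cell held a value
present = df[date_cols].notna()

# True where parsing produced NaT (covers both missing and unparseable cells)
unparsed = df[date_cols].apply(lambda s: pd.to_datetime(s, errors="coerce")).isna()

# A cell is bad only if it had a value that could not be parsed
bad = present & unparsed

# Flag every row that has at least one bad cell
df["Fail"] = np.where(bad.any(axis=1), "Yes", "")

# Split into the two dataframes the question mentions
bad_rows = df[df["Fail"] == "Yes"]
good_rows = df[df["Fail"] != "Yes"]
```

Missing values are not flagged, because a cell only counts as bad when it was present in the original data.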

Quang Hoang answered Nov 02 '25 07:11


IIUC, your concern is creating the Fail column, so I focus on that. I think you can achieve it by calling apply on the date columns with axis=1 and a custom lambda. The lambda drops the NaN values from each row before passing it to pd.to_datetime with errors='coerce', then checks whether the output contains any NaT.

df['Fail'] = (df[date_cols].apply(lambda x: pd.to_datetime(x[x.notna()], errors='coerce')
                          .isna().any(), axis=1).replace({True: 'Fail', False: ''}))

Out[869]:
    Fail patient_ID DateOfBirth  ...    date_10    date_11     date_12
0              A001  1950-03-02  ...        NaT        NaT         NaN
1              A001  1950-03-02  ...        NaT        NaT         NaN
2              A001  1950-03-02  ...        NaT        NaT         NaN
3              A001  1950-03-02  ...        NaT        NaT         NaN
4              A001  1950-03-02  ... 2010-01-01        NaT         NaN
5              A001  1950-03-02  ...        NaT 2010-01-01         NaN
6              A001  1950-03-02  ...        NaT        NaT    1/1/2010
7              A001  1950-03-02  ...        NaT        NaT    1/1/2010
8              A001  1950-03-02  ...        NaT        NaT    1/1/2010
9              A001  1950-03-02  ...        NaT        NaT    1/1/2010
10             A001  1950-03-02  ...        NaT        NaT    1/1/2010
11             A001  1950-03-02  ...        NaT        NaT    1/1/2010
12             A001  1950-03-02  ...        NaT        NaT    1/1/2010
13             A001  1950-03-02  ...        NaT        NaT    1/1/2010
14             A001  1950-03-02  ...        NaT        NaT    1/1/2010
15  Fail       A002  1950-03-02  ...        NaT        NaT         NaN
16             A002  1950-03-02  ...        NaT        NaT         NaN
17             A002  1950-03-02  ...        NaT        NaT         NaN
18             A002  1950-03-02  ...        NaT        NaT         NaN
19             A002  1950-03-02  ... 2010-01-01        NaT         NaN
20             A002  1950-03-02  ...        NaT 2010-01-01         NaN
21             A002  1950-03-02  ...        NaT        NaT    1/1/2010
22             A002  1950-03-02  ...        NaT        NaT    1/1/2010
23             A002  1950-03-02  ...        NaT        NaT    1/1/2010
24             A002  1950-03-02  ...        NaT        NaT    1/1/2010
25             A002  1950-03-02  ...        NaT        NaT    1/1/2010
26             A002  1950-03-02  ...        NaT        NaT    1/1/2010
27             A002  1950-03-02  ...        NaT        NaT    1/1/2010
28             A002  1950-03-02  ...        NaT        NaT    1/1/2010
29  Fail       A002  1950-03-02  ...        NaT        NaT  0107-01-06

[30 rows x 15 columns]

Note:
The code above only creates the Fail column. It does not convert the date columns to datetime. To convert them, call pd.to_datetime separately.
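
A minimal sketch of that separate conversion step, assuming the Fail column has already been filled in. The sample frame and its values are hypothetical, and errors='coerce' is one possible choice here: it turns unparseable values into NaT instead of raising.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame; Fail is assumed to have been populated beforehand
df = pd.DataFrame({
    "Fail": ["", "Fail"],
    "date_1": ["2010-01-01", "not a date"],
    "date_2": [np.nan, "2012-03-04"],
})
date_cols = ["date_1", "date_2"]

# Convert each date column; values that cannot be parsed become NaT
df[date_cols] = df[date_cols].apply(lambda s: pd.to_datetime(s, errors="coerce"))
```

Because the invalid values are replaced by NaT at this point, the Fail column has to be computed before the conversion, not after it.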


Below are the values of the two rows flagged as Fail:

In [870]: df.loc[15]
Out[870]:
Fail                          Fail
patient_ID                    A002
DateOfBirth    1950-03-02 00:00:00
date_1                  0107-01-06
date_2         2010-01-01 00:00:00
date_3                         NaT
date_4                         NaT
date_5                         NaT
date_6                         NaT
date_7                         NaT
date_8                         NaT
date_9                         NaT
date_10                        NaT
date_11                        NaT
date_12                        NaN
Name: 15, dtype: object

In [871]: df.loc[29]
Out[871]:
Fail                          Fail
patient_ID                    A002
DateOfBirth    1950-03-02 00:00:00
date_1                         NaN
date_2                         NaT
date_3                         NaT
date_4                         NaT
date_5                         NaT
date_6                         NaT
date_7                         NaT
date_8                         NaT
date_9                         NaT
date_10                        NaT
date_11                        NaT
date_12                 0107-01-06
Name: 29, dtype: object

Andy L. answered Nov 02 '25 07:11


