
How to convert string to datetime with nulls - python, pandas?

I have a series with some datetimes (as strings) and some nulls as 'nan':

import pandas as pd, numpy as np, datetime as dt
df = pd.DataFrame({'Date':['2014-10-20 10:44:31', '2014-10-23 09:33:46', 'nan', '2014-10-01 09:38:45']})

I'm trying to convert these to datetime:

df['Date'] = df['Date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')) 

but I get the error:

time data 'nan' does not match format '%Y-%m-%d %H:%M:%S' 

So I try to turn these into actual nulls:

df.loc[df['Date'] == 'nan', 'Date'] = np.nan

and repeat:

df['Date'] = df['Date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')) 

but then I get the error:

must be string, not float

What is the quickest way to solve this problem?

Asked by Colin O'Brien, Mar 27 '15 at 10:03



1 Answer

Just use pd.to_datetime and set errors='coerce', so that any value that cannot be parsed becomes NaT:

In [321]:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df

Out[321]:
                 Date
0 2014-10-20 10:44:31
1 2014-10-23 09:33:46
2                 NaT
3 2014-10-01 09:38:45

In [322]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 1 columns):
Date    3 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 64.0 bytes
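For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch of the same idea. The explicit format= argument and the variable name parsed are my additions, not part of the session above. Passing the format is optional, and whether it helps depends on your data and pandas version.

```python
import pandas as pd

# Same sample data as in the question, including the literal string 'nan'.
df = pd.DataFrame({'Date': ['2014-10-20 10:44:31',
                            '2014-10-23 09:33:46',
                            'nan',
                            '2014-10-01 09:38:45']})

# errors='coerce' turns anything that cannot be parsed into NaT instead of raising.
# format= is an optional addition here; it pins the expected layout of the strings.
parsed = pd.to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

print(parsed)
print(parsed.isna().tolist())  # one flag per row; True marks the coerced 'nan'
```

Rows that come back as NaT can then be found with isna() or dropped with dropna(), exactly like NaN in a numeric column.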

The problem with calling strptime directly is that it raises an error whenever the value does not match the format (the string 'nan') or is not a string at all (a float NaN).

If you caught those errors yourself, then it would work:

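In [324]:
def func(x):
    try:
        return dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
    except (ValueError, TypeError):
        # ValueError: string does not match the format
        # TypeError: value is not a string (e.g. a float NaN)
        return pd.NaT

df['Date'].apply(func)

Out[324]:
0   2014-10-20 10:44:31
1   2014-10-23 09:33:46
2                   NaT
3   2014-10-01 09:38:45
Name: Date, dtype: datetime64[ns]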

However, it is faster to use the built-in to_datetime than to call apply, which essentially just loops over your series in Python.

Timings

In [326]:
%timeit pd.to_datetime(df['Date'], errors='coerce')
%timeit df['Date'].apply(func)

10000 loops, best of 3: 65.8 µs per loop
10000 loops, best of 3: 186 µs per loop

We see here that to_datetime is roughly 3x faster (65.8 µs vs. 186 µs) even on this four-row frame.
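A four-row frame is a very small benchmark, so you may want to repeat the comparison on something closer to your real data. The sketch below is one way to do that with the standard timeit module instead of the %timeit magic. The 10,000-row series, the helper name parse_or_nat, and the explicit format are my own assumptions for illustration, and the numbers you get will depend on your machine and pandas version.

```python
import datetime as dt
import timeit

import pandas as pd

FMT = '%Y-%m-%d %H:%M:%S'


def parse_or_nat(x):
    """Row-wise fallback: parse with strptime, return NaT on bad input."""
    try:
        return dt.datetime.strptime(x, FMT)
    except (ValueError, TypeError):
        return pd.NaT


# Hypothetical larger input: the question's four values repeated 2,500 times.
values = ['2014-10-20 10:44:31', '2014-10-23 09:33:46',
          'nan', '2014-10-01 09:38:45'] * 2500
s = pd.Series(values)

# Both approaches should agree on which rows are missing.
vectorized = pd.to_datetime(s, format=FMT, errors='coerce')
looped = s.apply(parse_or_nat)

# Time five runs of each approach.
t_vec = timeit.timeit(lambda: pd.to_datetime(s, format=FMT, errors='coerce'), number=5)
t_loop = timeit.timeit(lambda: s.apply(parse_or_nat), number=5)

print(f'to_datetime: {t_vec / 5:.4f} s per run')
print(f'apply:       {t_loop / 5:.4f} s per run')
```

The gap between the two usually widens as the series grows, since apply calls a Python function once per row.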

Answered by EdChum, Sep 21 '22 at 22:09