I have a series with some datetimes (as strings) and some nulls as 'nan':
import pandas as pd
import numpy as np
import datetime as dt

df = pd.DataFrame({'Date': ['2014-10-20 10:44:31', '2014-10-23 09:33:46', 'nan', '2014-10-01 09:38:45']})
I'm trying to convert these to datetime:
df['Date'] = df['Date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
but I get the error:
time data 'nan' does not match format '%Y-%m-%d %H:%M:%S'
So I try to turn these into actual nulls:
df.loc[df['Date'] == 'nan', 'Date'] = np.nan
and repeat:
df['Date'] = df['Date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
but then I get the error:
must be string, not float
What is the quickest way to solve this problem?
When a DataFrame is built from a CSV file, blank fields are imported as null values, which can later cause problems when operating on that DataFrame. Pandas' isnull() and notnull() methods are used to detect and manage NULL values in a DataFrame.
notnull is a pandas function that examines one or more values and confirms that they are not null. In pandas, null values are represented as NaN (not a number) or None to signify missing data. notnull returns False if either NaN or None is detected, and True otherwise.
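As a small illustration (a throwaway Series, not the question's data), isnull/notnull only flag real NaN or None values; the literal string 'nan' used in the question would not count as null here:

import pandas as pd
import numpy as np

s = pd.Series(['2014-10-20 10:44:31', np.nan, None])
print(s.isnull())   # True for the NaN and None entries, False for the string
print(s.notnull())  # the inverse mask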
We can convert a string to a datetime using the strptime() function. This function is available in the datetime and time modules to parse a string into datetime and time objects respectively.
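A minimal strptime example using the same format string as the question:

import datetime as dt

parsed = dt.datetime.strptime('2014-10-20 10:44:31', '%Y-%m-%d %H:%M:%S')
print(parsed)        # 2014-10-20 10:44:31
print(type(parsed))  # <class 'datetime.datetime'>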
Just use to_datetime and set errors='coerce' to handle duff data:
In [321]:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df

Out[321]:
                 Date
0 2014-10-20 10:44:31
1 2014-10-23 09:33:46
2                 NaT
3 2014-10-01 09:38:45

In [322]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 1 columns):
Date    3 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 64.0 bytes
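Once the column is datetime64[ns], the unparseable 'nan' row shows up as NaT, and you can drop or keep those rows with the isnull/notnull methods mentioned above (or dropna), for example:

df[df['Date'].notnull()]    # keep only rows with a real timestamp
df.dropna(subset=['Date'])  # equivalent using dropna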
The problem with calling strptime is that it will raise an error if the string or dtype is incorrect. If you did this then it would work:
In [324]:
def func(x):
    try:
        return dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
    except:
        return pd.NaT

df['Date'].apply(func)

Out[324]:
0   2014-10-20 10:44:31
1   2014-10-23 09:33:46
2                   NaT
3   2014-10-01 09:38:45
Name: Date, dtype: datetime64[ns]
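As a side note, the bare except above swallows every exception; a slightly tighter variant (just a sketch, not part of the original answer) catches only the two errors you would expect from strptime:

def func(x):
    try:
        return dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
    # ValueError for strings in the wrong format,
    # TypeError for non-string input such as np.nan (a float)
    except (ValueError, TypeError):
        return pd.NaT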
But it will be faster to use the built-in to_datetime rather than call apply, which essentially just loops over your Series.
timings
In [326]:
%timeit pd.to_datetime(df['Date'], errors='coerce')
%timeit df['Date'].apply(func)

10000 loops, best of 3: 65.8 µs per loop
10000 loops, best of 3: 186 µs per loop
We see here that using to_datetime is roughly 3x faster.
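If you know the format up front, passing it explicitly usually speeds to_datetime up further by avoiding format inference, and it still combines with errors='coerce' (actual timings will vary by pandas version and data):

pd.to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S', errors='coerce')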