I think this should be easy but I'm hitting a bit of a wall. I have a dataset that was imported into a pandas dataframe from a Stata .dta file. Several of the columns contain date data. The dataframe contains 100,000+ rows but a sample is given:
   cat  event_date  total
0   G2  2006-03-08     16
1   G2         NaT    NaN
2   G2         NaT    NaN
3   G3  2006-03-10     16
4   G3  2006-08-04     12
5   G3  2006-12-28     13
6   G3  2007-05-25     10
7   G4  2006-03-10     13
8   G4  2006-08-06     19
9   G4  2006-12-30     16
The dates are stored in datetime64 format:
>>> mydata[['cat','event_date','total']].dtypes
cat                   object
event_date    datetime64[ns]
total                float64
dtype: object
All I would like to do is create a new column which gives the difference in days (rather than 'us' or 'ns'!!!) between the event_date and a start date, say 2006-01-01. I've tried the following:
>>> mydata['new'] = mydata['event_date'] - np.datetime64('2006-01-01')
… but I get the message:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I've also tried a lambda function but that doesn't work either.
However, if I wanted to simply add on one day to each date I can successfully use:
>>> mydata['plusone'] = mydata['event_date'] + np.timedelta64(1,'D')
That works fine.
Am I missing something straightforward here?
Thanks in advance for any help.
The to_datetime() function converts date-like values into datetime objects; here it is used to create a new column, Date-Time, in the data frame that holds the converted values.
Pandas has a built-in function called to_datetime() that converts dates and times in string format to datetime objects. If a 'date' column in the DataFrame holds strings (object dtype), to_datetime() converts the column to a series of the appropriate datetime64 dtype.
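As a minimal sketch of that conversion, the snippet below parses a string column with pd.to_datetime(). The column names 'date' and 'date_parsed' and the sample values are made up for illustration; errors="coerce" is one possible choice that turns unparseable strings into NaT instead of raising.

```python
import pandas as pd

# Hypothetical sample: a column of date strings, one of them invalid.
df = pd.DataFrame({"date": ["2006-03-08", "2006-08-04", "not a date"]})

# Parse the strings; with errors="coerce", values that cannot be parsed become NaT.
df["date_parsed"] = pd.to_datetime(df["date"], errors="coerce")

print(df.dtypes)
print(df)
```

The original string column is left untouched, so the parsed result can be checked against it before the old column is dropped.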
A Timestamp object in pandas is the equivalent of Python's datetime object. It combines date and time fields. To combine a date and a time into a Timestamp object, use the Timestamp.combine() function in pandas.
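A short sketch of Timestamp.combine(), assuming the date and time parts are available as standard-library date and time objects. The values here are illustrative only.

```python
import datetime as dt

import pandas as pd

# Illustrative date and time parts.
day = dt.date(2006, 3, 8)
clock = dt.time(14, 30)

# Combine them into a single pandas Timestamp.
ts = pd.Timestamp.combine(day, clock)
print(ts)
```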
For example, you can display the output date as MM/DD/YYYY by calling dt.strftime('%m/%d/%Y') on a datetime column.
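The formatting step might look like the following sketch. The series and its values are invented for the example; note that the result is a series of strings, not dates, so it is best applied only for display.

```python
import pandas as pd

# Illustrative datetime series.
s = pd.Series(pd.to_datetime(["2006-03-08", "2006-12-28"]))

# Format each date as MM/DD/YYYY; the result holds strings, not datetimes.
formatted = s.dt.strftime("%m/%d/%Y")
print(formatted.tolist())
```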
Not sure why the numpy datetime64 is incompatible with pandas dtypes, but using datetime objects worked fine for me:
In [39]:
import datetime as dt
mydata['new'] = mydata['event_date'] - dt.datetime(2006,1,1)
mydata
Out[39]:
       cat  event_date  total       new
Index
0       G2  2006-03-08     16   66 days
1       G2         NaT    NaN       NaT
2       G2         NaT    NaN       NaT
3       G3  2006-03-10     16   68 days
4       G3  2006-08-04     12  215 days
5       G3  2006-12-28     13  361 days
6       G3  2007-05-25     10  509 days
7       G4  2006-03-10     13   68 days
8       G4  2006-08-06     19  217 days
9       G4  2006-12-30     16  363 days
Ensure you have an up-to-date version of pandas, and numpy >= 1.7:
In [11]: df.event_date - pd.Timestamp('2006-01-01')
Out[11]:
0    66 days
1        NaT
2        NaT
3    68 days
4   215 days
5   361 days
6   509 days
7    68 days
8   217 days
9   363 days
Name: event_date, dtype: timedelta64[ns]
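Since the question asks for the difference as a number of days rather than a timedelta, one possible follow-up is to take the .dt.days accessor of the result. This is a sketch on a small made-up frame that mirrors the question's data; the column name days_since_start is hypothetical.

```python
import pandas as pd

# Small sample mirroring the question's data, including a missing date.
df = pd.DataFrame({
    "cat": ["G2", "G2", "G3", "G3"],
    "event_date": pd.to_datetime(["2006-03-08", None, "2006-03-10", "2007-05-25"]),
})

start = pd.Timestamp("2006-01-01")

# Subtracting a Timestamp gives a timedelta column; rows with NaT stay NaT.
delta = df["event_date"] - start

# .dt.days extracts whole days as numbers. Because of the missing row,
# the result contains NaN and is therefore float rather than integer.
df["days_since_start"] = delta.dt.days
print(df)
```

If the NaN rows are dropped or filled first, the column can then be cast to an integer dtype with astype(int).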