Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas DatetimeIndex from MongoDB ISODate

I am having some difficulty working with times/timezones. I have raw JSON data of the form

{
  "Date": "28 Sep 2009 00:00:00",
  ....
}

This data is then loaded into MongoDB, and this string representation of the date is transformed into a JavaScript Date object. This conversion to UTC time results in the following date

{
  "_id": ObjectId("577a788f4439e17afd4e21f7"),
  "Date": ISODate("2009-09-27T23:00:00Z")
}

It "looks" as though the date has actually been moved forward a day, I'm assuming (perhaps incorrectly) that this is because my machine is set to Irish Standard Time.

I then read this data from MongoDB and use it to create a pandas DatetimeIndex

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')

which gives me

enter image description here

which is incorrect since the time has not been converted back correctly from UTC to local time. So I followed the solution given in this answer

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')
idx = idx.tz_localize(tz=tz.tzutc())
idx = idx.tz_convert(tz=tz.tzlocal())
frame = DataFrame(test_docs, index=idx)
frame = frame.drop('Date', 1)

which gives me the right day back

enter image description here

I then normalize the DatetimeIndex so the hours are removed, allowing me to group all entries by day.

frame.groupby(idx).sum()

At this point, however, something strange happens. The dates end up getting grouped as follows

enter image description here

but this doesn't reflect the dates in the frame

enter image description here

Can anyone shed some light on where I might be going wrong?


Response to @ptrj

Explicitly using my timezone as a string

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')
idx = idx.tz_localize(tz=tz.tzutc())
idx = idx.tz_convert(tz='Europe/Dublin')
idx = idx.normalize()
frame = DataFrame(test_docs, index=idx)
...
...
aggregate = frame.groupby(idx).sum()
aggregate.plot()

this doesn't work for me, it results in the following plot

enter image description here

For some reason the groupby is not properly grouping for 2014, as shown below

enter image description here

If instead, I use

idx = idx.tz_convert(tz.gettz('Europe/Dublin'))

I get the same problem

Convert to an object

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')
idx = idx.tz_localize(tz=tz.tzutc())
idx = idx.tz_convert(tz=tz.tzlocal())
idx = idx.normalize()
frame = DataFrame(test_docs, index=idx)
aggregate = frame.groupby(idx.astype(object)).sum()

This approach seems to work correctly for me

enter image description here

like image 358
Philip O'Brien Avatar asked Jul 05 '16 10:07

Philip O'Brien


People also ask

How can I get DatetimeIndex in pandas?

date attribute outputs an Index object containing the date values present in each of the entries of the DatetimeIndex object. Example #1: Use DatetimeIndex. date attribute to find the date part of the DatetimeIndex object.

How can I find the difference between two Datetimes in pandas?

Use df. dates1-df. dates2 to find the difference between the two dates and then convert the result in the form of months.

What is ISODate in MongoDB?

ISODate() is a helper function that's built into to MongoDB and wraps the native JavaScript Date object. When you use the ISODate() constructor from the Mongo shell, it actually returns a JavaScript Date object.


1 Answers

I was able to reproduce the error with the following data:

idx0 = pd.date_range('2011-11-11', periods=4)
idx1 = idx0.tz_localize(tz.tzutc())
idx2 = idx1.tz_convert(tz.tzlocal())
df = pd.DataFrame([1, 2, 3, 4])

df.groupby(idx2).sum()
Out[20]: 
                           0
1970-01-01 00:00:00-05:00  9
2011-11-10 19:00:00-05:00  1

It's a bug deep in the pandas code, related exclusively to tz.tzlocal(). It manifests itself also in:

idx2.tz_localize(None)
Out[27]: 
DatetimeIndex(['2011-11-10 19:00:00', '1970-01-01 00:00:00',
               '1970-01-01 00:00:00', '1970-01-01 00:00:00'],
              dtype='datetime64[ns]', freq='D')

You can use any of the following solutions:

  • use explicitly your timezone as a string:

    idx2 = idx1.tz_convert(tz='Europe/Dublin')
    df.groupby(idx2).sum()
    Out[29]: 
                               0
    2011-11-11 00:00:00+00:00  1
    2011-11-12 00:00:00+00:00  2
    2011-11-13 00:00:00+00:00  3
    2011-11-14 00:00:00+00:00  4
    

    or if it doesn't work:

    idx2 = idx1.tz_convert(tz.gettz('Europe/Dublin'))
    
  • convert it to an object:

    df.groupby(idx2.astype(object)).sum()
    Out[32]: 
                               0
    2011-11-10 19:00:00-05:00  1
    2011-11-11 19:00:00-05:00  2
    2011-11-12 19:00:00-05:00  3
    2011-11-13 19:00:00-05:00  4
    

Basically, converting to anything else than DatetimeIndex with tz=tz.local() should work.


EDIT: This bug has been just fixed on pandas github. The fix will be available in pandas 0.19 release.

like image 63
ptrj Avatar answered Sep 29 '22 23:09

ptrj