How to Partition Pandas DataFrame using DateTime

Question

I am writing a Python script to import pictures from my digital cameras, and I'm using Pandas to help with the bookkeeping of the incoming images. I am using the EXIF data to tag individual images with information, such as the Camera Model, image mode, image format and the timestamp of when the image was acquired by the camera. These data are used to segregate the images into a directory structure. What I'm struggling with is how to use Pandas to group images based on a clump of timestamps that are, for example, all within half-an-hour of each other. As an example, let's say I have six images, three of which were taken within nine minutes of each other and then the other three, also within nine minutes of each other, but an hour later.

import pandas
import datetime    
rawdata = [{'filename': 'image_1.jpg',
  'timestamp': datetime.datetime(2014, 11, 13, 19, 14, 16, 152847)},
 {'filename': 'image_2.jpg',
  'timestamp': datetime.datetime(2014, 11, 13, 19, 17, 16, 152847)},
 {'filename': 'image_3.jpg',
  'timestamp': datetime.datetime(2014, 11, 13, 19, 20, 16, 152847)},
 {'filename': 'image_4.jpg',
  'timestamp': datetime.datetime(2014, 11, 13, 20, 14, 16, 152847)},
 {'filename': 'image_5.jpg',
  'timestamp': datetime.datetime(2014, 11, 13, 20, 17, 16, 152847)},
 {'filename': 'image_6.jpg',
  'timestamp': datetime.datetime(2014, 11, 13, 20, 20, 16, 152847)}]
df = pandas.DataFrame(rawdata)

Is there an automatic way to partition this DataFrame with a half-an-hour threshold such that I would have image_1, image_2 and image_3 in one DataFrame and image_4, image_5 and image_6 in a second DataFrame?

DSM · Accepted Answer

IIUC, one way would be to use the diff-compare-cumsum idiom to get cluster numbers which you can then use to groupby:

>>> df = df.sort("timestamp")
>>> cluster = (df["timestamp"].diff() > pd.Timedelta(minutes=30)).cumsum()
>>> dfs = [v for k,v in df.groupby(cluster)]
>>> for clust in dfs:
...     print(clust)
...     
      filename                  timestamp
0  image_1.jpg 2014-11-13 19:14:16.152847
1  image_2.jpg 2014-11-13 19:17:16.152847
2  image_3.jpg 2014-11-13 19:20:16.152847
      filename                  timestamp
3  image_4.jpg 2014-11-13 20:14:16.152847
4  image_5.jpg 2014-11-13 20:17:16.152847
5  image_6.jpg 2014-11-13 20:20:16.152847

This works because False ~ 0 and True ~ 1, so if we take the cumulative sum of something which becomes 1 every time a new cluster is found, we'll get the ids we want:

>>> df["timestamp"].diff()
0        NaT
1   00:03:00
2   00:03:00
3   00:54:00
4   00:03:00
5   00:03:00
Name: timestamp, dtype: timedelta64[ns]
>>> df["timestamp"].diff() > pd.Timedelta(minutes=30)
0    False
1    False
2    False
3     True
4    False
5    False
Name: timestamp, dtype: bool
>>> (df["timestamp"].diff() > pd.Timedelta(minutes=30)).cumsum()
0    0
1    0
2    0
3    1
4    1
5    1
Name: timestamp, dtype: int64

How to Partition Pandas DataFrame using DateTime

Tags:

python

pandas

vishnubob

1 Answers

DSM

Recent Activity

Donate For Us

How to Partition Pandas DataFrame using DateTime

Tags:

python

pandas

vishnubob

1 Answers

DSM

Related questions

Recent Activity

Donate For Us