I am writing a Python script to import pictures from my digital cameras, and I'm using Pandas to help with the bookkeeping of the incoming images. I am using the EXIF data to tag individual images with information, such as the Camera Model, image mode, image format and the timestamp of when the image was acquired by the camera. These data are used to segregate the images into a directory structure. What I'm struggling with is how to use Pandas to group images based on a clump of timestamps that are, for example, all within half-an-hour of each other. As an example, let's say I have six images, three of which were taken within nine minutes of each other and then the other three, also within nine minutes of each other, but an hour later.
import pandas
import datetime
rawdata = [{'filename': 'image_1.jpg',
'timestamp': datetime.datetime(2014, 11, 13, 19, 14, 16, 152847)},
{'filename': 'image_2.jpg',
'timestamp': datetime.datetime(2014, 11, 13, 19, 17, 16, 152847)},
{'filename': 'image_3.jpg',
'timestamp': datetime.datetime(2014, 11, 13, 19, 20, 16, 152847)},
{'filename': 'image_4.jpg',
'timestamp': datetime.datetime(2014, 11, 13, 20, 14, 16, 152847)},
{'filename': 'image_5.jpg',
'timestamp': datetime.datetime(2014, 11, 13, 20, 17, 16, 152847)},
{'filename': 'image_6.jpg',
'timestamp': datetime.datetime(2014, 11, 13, 20, 20, 16, 152847)}]
df = pandas.DataFrame(rawdata)
Is there an automatic way to partition this DataFrame with a half-an-hour threshold such that I would have image_1, image_2 and image_3 in one DataFrame and image_4, image_5 and image_6 in a second DataFrame?
IIUC, one way would be to use the diff-compare-cumsum idiom to get cluster numbers which you can then use to groupby:
>>> df = df.sort("timestamp")
>>> cluster = (df["timestamp"].diff() > pd.Timedelta(minutes=30)).cumsum()
>>> dfs = [v for k,v in df.groupby(cluster)]
>>> for clust in dfs:
... print(clust)
...
filename timestamp
0 image_1.jpg 2014-11-13 19:14:16.152847
1 image_2.jpg 2014-11-13 19:17:16.152847
2 image_3.jpg 2014-11-13 19:20:16.152847
filename timestamp
3 image_4.jpg 2014-11-13 20:14:16.152847
4 image_5.jpg 2014-11-13 20:17:16.152847
5 image_6.jpg 2014-11-13 20:20:16.152847
This works because False ~ 0 and True ~ 1, so if we take the cumulative sum of something which becomes 1 every time a new cluster is found, we'll get the ids we want:
>>> df["timestamp"].diff()
0 NaT
1 00:03:00
2 00:03:00
3 00:54:00
4 00:03:00
5 00:03:00
Name: timestamp, dtype: timedelta64[ns]
>>> df["timestamp"].diff() > pd.Timedelta(minutes=30)
0 False
1 False
2 False
3 True
4 False
5 False
Name: timestamp, dtype: bool
>>> (df["timestamp"].diff() > pd.Timedelta(minutes=30)).cumsum()
0 0
1 0
2 0
3 1
4 1
5 1
Name: timestamp, dtype: int64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With