I'm trying to transform a pandas dataframe with three columns (Date, Start, End) into a frequency matrix. My input dataframe look like this:
Date, Start, End
2016-09-02 09:16:00 18 16
2016-09-02 16:14:10 16 1
2016-09-02 06:17:21 18 17
2016-09-02 05:51:07 23 17
2016-09-02 18:34:44 18 17
2016-09-02 05:44:44 20 4
2016-09-02 09:25:22 18 17
2016-09-02 22:27:44 18 17
2016-09-02 16:02:46 0 18
2016-09-02 15:35:07 17 17
2016-09-02 16:06:42 8 17
2016-09-02 14:47:04 16 23
2016-09-02 07:47:24 20 1
...
The values of 'Start' and 'End' are integers between 0
and 23
inclusive. The 'Date' is a datetime. The frequency matrix I'm trying to create is a 24 by 24 csv, where row i
and column j
is the number of times 'End'=i
and 'Start'=j
occurs in the input. For example, the above data would create:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0
2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0
5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
17, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 0, 0, 0, 0, 1
18, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0
For extra help, could this be done in a way that creates a separate matrix for every 15 minutes? That would be 672 matrices as this date range is one week. I'm a self taught beginner, and I really can't think of how to solve this in a pythonic way, any solutions or advice would be greatly appreciated.
Create your matrix with a simple count and unstack one of one column:
mat = df.groupby(['Start', 'End']).count().unstack(level=0)
Clean up the Date level:
mat.columns = mat.columns.droplevel(0)
Now reindex rows and columns and cast into integers:
mat.reindex(*[range(0,24)]*2).fillna(0)
Detailed explanations
First, you count the number of occurences a given (start,end) couple appears. The result of groupby against these two columns actually brings back a multiindex.
df.groupby(['Start', 'End']).count()
Out[134]:
Date
Start End
0 18 1
8 17 1
16 1 1
23 1
17 17 1
18 16 1
17 4
20 1 1
4 1
23 17 1
What we want from that result is to get the Start index in columns. unstack does this:
df.groupby(['Start', 'End']).count().unstack(level=0)
Out[135]:
Date
Start 0 8 16 17 18 20 23
End
1 NaN NaN 1.0 NaN NaN 1.0 NaN
4 NaN NaN NaN NaN NaN 1.0 NaN
16 NaN NaN NaN NaN 1.0 NaN NaN
17 NaN 1.0 NaN 1.0 4.0 NaN 1.0
18 1.0 NaN NaN NaN NaN NaN NaN
23 NaN NaN 1.0 NaN NaN NaN NaN
The result of unstack is the Start column being moved as an additional column index level on top of the current Date column index (see below). That's why we drop the level 0 afterwards. Another way - depending on your current source code - could be to filter out the Date column upfront, then unstack would bring one level.
_.columns
Out[136]:
MultiIndex(levels=[['Date'], [0, 8, 16, 17, 18, 20, 23]],
labels=[[0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6]],
names=[None, 'Start'])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With