Update: starting with version 0.20.0, pandas cut/qcut DOES handle date fields. See What's New for more.
pd.cut and pd.qcut now support datetime64 and timedelta64 dtypes (GH14714, GH14798)
Original question: Pandas cut and qcut functions are great for 'bucketing' continuous data for use in pivot tables and so forth, but I can't see an easy way to get datetime axes in the mix. Frustrating since pandas is so great at all the time-related stuff!
Here's a simple example:
def randomDates(size, start=134e7, end=137e7):
return np.array(np.random.randint(start, end, size), dtype='datetime64[s]')
df = pd.DataFrame({'ship' : randomDates(10), 'recd' : randomDates(10),
'qty' : np.random.randint(0,10,10), 'price' : 100*np.random.random(10)})
df
price qty recd ship
0 14.723510 3 2012-11-30 19:32:27 2013-03-08 23:10:12
1 53.535143 2 2012-07-25 14:26:45 2012-10-01 11:06:39
2 85.278743 7 2012-12-07 22:24:20 2013-02-26 10:23:20
3 35.940935 8 2013-04-18 13:49:43 2013-03-29 21:19:26
4 54.218896 8 2013-01-03 09:00:15 2012-08-08 12:50:41
5 61.404931 9 2013-02-10 19:36:54 2013-02-23 13:14:42
6 28.917693 1 2012-12-13 02:56:40 2012-09-08 21:14:45
7 88.440408 8 2013-04-04 22:54:55 2012-07-31 18:11:35
8 77.329931 7 2012-11-23 00:49:26 2012-12-09 19:27:40
9 46.540859 5 2013-03-13 11:37:59 2013-03-17 20:09:09
To bin by groups of price or quantity, I can use cut/qcut to bucket them:
df.groupby([pd.cut(df['qty'], bins=[0,1,5,10]), pd.qcut(df['price'],q=3)]).count()
price qty recd ship
qty price
(0, 1] [14.724, 46.541] 1 1 1 1
(1, 5] [14.724, 46.541] 2 2 2 2
(46.541, 61.405] 1 1 1 1
(5, 10] [14.724, 46.541] 1 1 1 1
(46.541, 61.405] 2 2 2 2
(61.405, 88.44] 3 3 3 3
But I can't see any easy way of doing the same thing with my 'recd' or 'ship' date fields. For example, generate a similar table of counts broken down by (say) monthly buckets of recd and ship. It seems like resample() has all of the machinery to bucket into periods, but I can't figure out how to apply it here. The buckets (or levels) in the 'date cut' would be equivalent to a pandas.PeriodIndex, and then I want to label each value of df['recd'] with the period it falls into?
So the kind of output I'm looking for would be something like:
ship recv count
2011-01 2011-01 1
2011-02 3
... ...
2011-02 2011-01 2
2011-02 6
... ... ...
More generally, I'd like to be able to mix and match continuous or categorical variables in the output. Imagine df also contains a 'status' column with red/yellow/green values, then maybe I want to summarize counts by status, price bucket, ship and recd buckets, so:
ship recv price status count
2011-01 2011-01 [0-10) green 1
red 4
[10-20) yellow 2
... ... ...
2011-02 [0-10) yellow 3
... ... ... ...
As a bonus question, what's the simplest way to modify the groupby() result above to just contain a single output column called 'count'?
Qcut (quantile-cut) differs from cut in the sense that, in qcut, the number of elements in each bin will be roughly the same, but this will come at the cost of differently sized interval widths.
To delete rows and columns from DataFrames, Pandas uses the “drop” function. To delete a column, or multiple columns, use the name of the column(s), and specify the “axis” as 1. Alternatively, as in the example below, the 'columns' parameter has been added in Pandas which cuts out the need for 'axis'.
Quantile-based discretization function. Discretize variable into equal-sized buckets based on rank or based on sample quantiles. For example 1000 values for 10 quantiles would produce a Categorical object indicating quantile membership for each data point.
Use pd. cut() for binning data based on the range of possible values. Use pd. qcut() for binning data based on the actual distribution of values.
Here's a solution using pandas.PeriodIndex (caveat: PeriodIndex doesn't
seem to support time rules with a multiple > 1, such as '4M'). I think
the answer to your bonus question is .size()
.
In [49]: df.groupby([pd.PeriodIndex(df.recd, freq='Q'),
....: pd.PeriodIndex(df.ship, freq='Q'),
....: pd.cut(df['qty'], bins=[0,5,10]),
....: pd.qcut(df['price'],q=2),
....: ]).size()
Out[49]:
qty price
2012Q2 2013Q1 (0, 5] [2, 5] 1
2012Q3 2013Q1 (5, 10] [2, 5] 1
2012Q4 2012Q3 (5, 10] [2, 5] 1
2013Q1 (0, 5] [2, 5] 1
(5, 10] [2, 5] 1
2013Q1 2012Q3 (0, 5] (5, 8] 1
2013Q1 (5, 10] (5, 8] 2
2013Q2 2012Q4 (0, 5] (5, 8] 1
2013Q2 (0, 5] [2, 5] 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With