
What's the equivalent of cut/qcut for pandas date fields?

Tags: python, pandas

Update: starting with version 0.20.0, pandas cut/qcut DOES handle date fields. See What's New for more.

pd.cut and pd.qcut now support datetime64 and timedelta64 dtypes (GH14714, GH14798)
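As a minimal sketch of what that support looks like (assuming pandas 0.20.0 or later; the sample dates and the names `dates`, `edges` and `buckets` are made up for illustration), month-start bin edges can be passed straight to pd.cut:

```python
import pandas as pd

# Illustrative dates only; they are not taken from the question's data.
dates = pd.Series(pd.to_datetime(
    ["2013-01-15", "2013-02-03", "2013-02-20", "2013-03-31"]))

# Month-start edges; right=False makes each bin [start, next_start).
edges = pd.date_range("2013-01-01", "2013-04-01", freq="MS")
buckets = pd.cut(dates, bins=edges, right=False)

# One row per monthly bin, in bin order.
print(buckets.value_counts(sort=False))
```

The result is an ordinary categorical Series of intervals, so it can be used as a groupby key alongside numeric cut/qcut buckets.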

Original question: Pandas cut and qcut functions are great for 'bucketing' continuous data for use in pivot tables and so forth, but I can't see an easy way to get datetime axes in the mix. Frustrating since pandas is so great at all the time-related stuff!

Here's a simple example:

import numpy as np
import pandas as pd

def randomDates(size, start=134e7, end=137e7):
    # Random Unix timestamps (whole seconds), interpreted as datetime64[s]
    return np.array(np.random.randint(int(start), int(end), size), dtype='datetime64[s]')

df = pd.DataFrame({'ship' : randomDates(10), 'recd' : randomDates(10),
                   'qty' : np.random.randint(0,10,10), 'price' : 100*np.random.random(10)})
df

     price      qty recd                ship
0    14.723510   3  2012-11-30 19:32:27 2013-03-08 23:10:12
1    53.535143   2  2012-07-25 14:26:45 2012-10-01 11:06:39
2    85.278743   7  2012-12-07 22:24:20 2013-02-26 10:23:20
3    35.940935   8  2013-04-18 13:49:43 2013-03-29 21:19:26
4    54.218896   8  2013-01-03 09:00:15 2012-08-08 12:50:41
5    61.404931   9  2013-02-10 19:36:54 2013-02-23 13:14:42
6    28.917693   1  2012-12-13 02:56:40 2012-09-08 21:14:45
7    88.440408   8  2013-04-04 22:54:55 2012-07-31 18:11:35
8    77.329931   7  2012-11-23 00:49:26 2012-12-09 19:27:40
9    46.540859   5  2013-03-13 11:37:59 2013-03-17 20:09:09

To bin by groups of price or quantity, I can use cut/qcut to bucket them:

df.groupby([pd.cut(df['qty'], bins=[0,1,5,10]), pd.qcut(df['price'],q=3)]).count()

                       price  qty recd ship
qty     price               
(0, 1]  [14.724, 46.541]   1   1   1   1
(1, 5]  [14.724, 46.541]   2   2   2   2
        (46.541, 61.405]   1   1   1   1
(5, 10] [14.724, 46.541]   1   1   1   1
        (46.541, 61.405]   2   2   2   2
         (61.405, 88.44]   3   3   3   3

But I can't see any easy way of doing the same thing with my 'recd' or 'ship' date fields: for example, generating a similar table of counts broken down by (say) monthly buckets of recd and ship. It seems like resample() has all of the machinery to bucket into periods, but I can't figure out how to apply it here. The buckets (or levels) in the 'date cut' would be equivalent to a pandas.PeriodIndex, and I want to label each value of df['recd'] with the period it falls into.

So the kind of output I'm looking for would be something like:

ship    recd     count
2011-01 2011-01  1
        2011-02  3
        ...      ...
2011-02 2011-01  2
        2011-02  6
...     ...      ...
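One way to sketch that labeling step, assuming a pandas version that has the `.dt.to_period` accessor (the four sample rows and the names `ship_month`, `recd_month` and `counts` are illustrative assumptions, not part of the data above):

```python
import pandas as pd

# Illustrative rows only, shaped like the 'ship' and 'recd' columns above.
df = pd.DataFrame({
    "ship": pd.to_datetime(["2011-01-05", "2011-01-20", "2011-02-11", "2011-02-25"]),
    "recd": pd.to_datetime(["2011-01-09", "2011-02-02", "2011-02-14", "2011-02-27"]),
})

# Label each timestamp with the monthly period it falls into.
ship_month = df["ship"].dt.to_period("M").rename("ship")
recd_month = df["recd"].dt.to_period("M").rename("recd")

# Count rows per (ship month, recd month) pair.
counts = df.groupby([ship_month, recd_month]).size()
print(counts)
```

Each key is a Series of Period values aligned with df, so it plays the same role for dates that the output of pd.cut plays for numbers.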

More generally, I'd like to be able to mix and match continuous or categorical variables in the output. Imagine df also contains a 'status' column with red/yellow/green values, then maybe I want to summarize counts by status, price bucket, ship and recd buckets, so:

ship    recd     price   status count
2011-01 2011-01  [0-10)   green     1
                            red     4
                 [10-20) yellow     2
                  ...      ...    ...
        2011-02  [0-10)  yellow     3
        ...      ...       ...    ...

As a bonus question, what's the simplest way to modify the groupby() result above to just contain a single output column called 'count'?

patricksurry asked May 01 '13 13:05



1 Answer

Here's a solution using pandas.PeriodIndex (caveat: PeriodIndex doesn't seem to support time rules with a multiple > 1, such as '4M'). I think the answer to your bonus question is .size(), which returns one count per group rather than one count per column.

In [49]: df.groupby([pd.PeriodIndex(df.recd, freq='Q'),
   ....:             pd.PeriodIndex(df.ship, freq='Q'),
   ....:             pd.cut(df['qty'], bins=[0,5,10]),
   ....:             pd.qcut(df['price'],q=2),
   ....:            ]).size()
Out[49]: 
                qty      price 
2012Q2  2013Q1  (0, 5]   [2, 5]    1
2012Q3  2013Q1  (5, 10]  [2, 5]    1
2012Q4  2012Q3  (5, 10]  [2, 5]    1
        2013Q1  (0, 5]   [2, 5]    1
                (5, 10]  [2, 5]    1
2013Q1  2012Q3  (0, 5]   (5, 8]    1
        2013Q1  (5, 10]  (5, 8]    2
2013Q2  2012Q4  (0, 5]   (5, 8]    1
        2013Q2  (0, 5]   [2, 5]    1
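To get a flat table with a single column named 'count', and to mix in a plain categorical column, something like the following may work. This is a sketch under assumptions: the 'status' column and the sample rows are hypothetical (the question only imagines them), and `observed=True` requires a pandas version that supports that groupby argument.

```python
import pandas as pd

# Hypothetical rows, including the 'status' column imagined in the question.
df = pd.DataFrame({
    "recd": pd.to_datetime(["2013-01-10", "2013-01-25", "2013-01-28",
                            "2013-02-03", "2013-02-18"]),
    "price": [5.0, 15.0, 12.0, 5.0, 7.0],
    "status": ["green", "red", "red", "green", "green"],
})

keys = [
    df["recd"].dt.to_period("M"),                        # monthly date bucket
    pd.cut(df["price"], bins=[0, 10, 20], right=False),  # price bucket
    df["status"],                                        # categorical as-is
]

# size() yields one number per group; reset_index names that column 'count'.
# observed=True keeps only bucket combinations that actually occur.
counts = df.groupby(keys, observed=True).size().reset_index(name="count")
print(counts)
```

The group keys become regular columns named after their source columns, with 'count' as the only value column.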
Garrett answered Oct 27 '22 04:10