Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Grouping daily data by month in python/pandas and then normalizing

Tags:

python

pandas

I have the table below in a Pandas DataFrame:

    q_string    q_visits    q_date
0   nucleus         1790        2012-10-02 00:00:00
1   neuron          364         2012-10-02 00:00:00
2   current         280         2012-10-02 00:00:00
3   molecular       259         2012-10-02 00:00:00
4   stem            201         2012-10-02 00:00:00

The table contains query volume from a server log, by day. I would like to do 2 things:

  1. I would like to group queries by month summing the query volume of a query for the whole month e.g. if 'molecular' was present on the 2012-10-02 with volume 1000 and on the 2012-10-03 with volume 500, then it should have an entry in the new table of 1500 (volume) with date 2012-10-31 (end of the month end-point representing the month – all dates in the transformed table will be month ends representing the whole month to which they relate).
  2. I want to add a 5th column which contains the month-normalized q_visits. I.e., a term's monthly query volume divided by the total query volume for the month across all terms.

What is the best way of doing this?

like image 280
user7289 Avatar asked Sep 07 '13 19:09

user7289


People also ask

How can I group by month from a date field using Python pandas?

Output: In the above example, the dataframe is groupby by the Date column. As we have provided freq = 'M' which means month, so the data is grouped month-wise till the last date of every month and provided sum of price column.

What are the three phases of the pandas Groupby () function?

(1) Splitting the data into groups. (2). Applying a function to each group independently, (3) Combining the results into a data structure.

How to group pandas Dataframe by month in Python?

Python - How to Group Pandas DataFrame by Month? We will group Pandas DataFrame using the groupby. Select the column to be used using the grouper function. We will group month-wise and calculate sum of Registration Price monthly for our example shown below for Car Sale Records.

How to group by a column in pandas groupby?

We can specify a groupby directive for an object using Pandas GroupBy. This stated instruction will choose a column using the grouper function’s key argument, the level and/or axis parameters if provided, and the target object’s or column’s index level. Using the code below, let us perform the groupby operation on our sample data frame.

How to extract the month from a date column in pandas?

Note that the dt.month () function extracts the month from a date column in pandas. The following example shows how to use this syntax in practice. Suppose we have the following pandas DataFrame that shows the sales made by some company on various dates:

How to group Dataframe by month in grouper?

The frequency freq is set ‘M’ to group by month-wise − print(" Group Dataframe by month... ", dataFrame. groupby ( pd. Grouper ( key ='Date_of_Purchase', axis =0, freq ='M')).sum())


1 Answers

If I understand you correctly:

For (1) do this:

Make some fake data by sampling from the values you gave and some random dates and # of visits:

In [179]: string = Series(np.random.choice(df.string.values, size=100), name='string')

In [180]: visits = Series(poisson(1000, size=100), name='date')

In [181]: date = Series(np.random.choice([df.date[0], now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')

In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})

In [183]: df.head()
Out[183]:
                 date   string  visits
0 2001-11-15 00:00:00  current     997
1 2001-11-15 00:00:00  current     974
2 2012-10-02 00:00:00     stem     982
3 2001-12-01 00:00:00     stem     984
4 2001-01-01 00:00:00  current     989

In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')

In [187]: resamp.head()
Out[187]:
                    visits
string  date
current 2001-01-31    2996
        2001-02-28     NaN
        2001-03-31     NaN
        2001-04-30     NaN
        2001-05-31    3016

NaN is there because there were no visits with that query string in those months.

For (2), group by the dates and then divide by the sum:

In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())

In [189]: g.head()
Out[189]:
                    visits
string  date
current 2001-01-31   0.177
        2001-02-28     NaN
        2001-03-31     NaN
        2001-04-30     NaN
        2001-05-31   0.188

Just to convince you that (2) is doing what you want:

In [176]: h = g.sortlevel('date').head()

In [177]: h
Out[177]:
                      visits
string    date
current   2001-01-31   0.077
molecular 2001-01-31   0.228
neuron    2001-01-31   0.073
nucleus   2001-01-31   0.234
stem      2001-01-31   0.388

In [178]: h.sum()
Out[178]:
visits    1
dtype: float64

If you want to convert resamp into a DataFrame and remove the NaNs do:

In [196]: resamp.dropna()
Out[196]:
                      visits
string    date
current   2001-01-31    2996
          2001-05-31    3016
          2001-11-30    5959
          2001-12-31    3998
          2013-09-30    1077
molecular 2001-01-31    3984
          2001-05-31    1911
          2001-11-30    3054
          2001-12-31    1020
          2012-10-31     977
          2013-09-30    1947
neuron    2001-01-31    3961
          2001-05-31    2069
          2001-11-30    5010
          2001-12-31    2065
          2012-10-31    6973
          2013-09-30     994
nucleus   2001-01-31    3060
          2001-05-31    3035
          2001-11-30    2924
          2001-12-31    4144
          2012-10-31    2004
          2013-09-30    7881
stem      2001-01-31    2911
          2001-05-31    5994
          2001-11-30    6072
          2001-12-31    4916
          2012-10-31    1991
          2013-09-30    3977

In [197]: resamp.dropna().reset_index()
Out[197]:
       string                date  visits
0     current 2001-01-31 00:00:00    2996
1     current 2001-05-31 00:00:00    3016
2     current 2001-11-30 00:00:00    5959
3     current 2001-12-31 00:00:00    3998
4     current 2013-09-30 00:00:00    1077
5   molecular 2001-01-31 00:00:00    3984
6   molecular 2001-05-31 00:00:00    1911
7   molecular 2001-11-30 00:00:00    3054
8   molecular 2001-12-31 00:00:00    1020
9   molecular 2012-10-31 00:00:00     977
10  molecular 2013-09-30 00:00:00    1947
11     neuron 2001-01-31 00:00:00    3961
12     neuron 2001-05-31 00:00:00    2069
13     neuron 2001-11-30 00:00:00    5010
14     neuron 2001-12-31 00:00:00    2065
15     neuron 2012-10-31 00:00:00    6973
16     neuron 2013-09-30 00:00:00     994
17    nucleus 2001-01-31 00:00:00    3060
18    nucleus 2001-05-31 00:00:00    3035
19    nucleus 2001-11-30 00:00:00    2924
20    nucleus 2001-12-31 00:00:00    4144
21    nucleus 2012-10-31 00:00:00    2004
22    nucleus 2013-09-30 00:00:00    7881
23       stem 2001-01-31 00:00:00    2911
24       stem 2001-05-31 00:00:00    5994
25       stem 2001-11-30 00:00:00    6072
26       stem 2001-12-31 00:00:00    4916
27       stem 2012-10-31 00:00:00    1991
28       stem 2013-09-30 00:00:00    3977

You can of course do this for g as well:

In [198]: g.dropna()
Out[198]:
                      visits
string    date
current   2001-01-31   0.177
          2001-05-31   0.188
          2001-11-30   0.259
          2001-12-31   0.248
          2013-09-30   0.068
molecular 2001-01-31   0.236
          2001-05-31   0.119
          2001-11-30   0.133
          2001-12-31   0.063
          2012-10-31   0.082
          2013-09-30   0.123
neuron    2001-01-31   0.234
          2001-05-31   0.129
          2001-11-30   0.218
          2001-12-31   0.128
          2012-10-31   0.584
          2013-09-30   0.063
nucleus   2001-01-31   0.181
          2001-05-31   0.189
          2001-11-30   0.127
          2001-12-31   0.257
          2012-10-31   0.168
          2013-09-30   0.496
stem      2001-01-31   0.172
          2001-05-31   0.374
          2001-11-30   0.264
          2001-12-31   0.305
          2012-10-31   0.167
          2013-09-30   0.251

In [199]: g.dropna().reset_index()
Out[199]:
       string                date  visits
0     current 2001-01-31 00:00:00   0.177
1     current 2001-05-31 00:00:00   0.188
2     current 2001-11-30 00:00:00   0.259
3     current 2001-12-31 00:00:00   0.248
4     current 2013-09-30 00:00:00   0.068
5   molecular 2001-01-31 00:00:00   0.236
6   molecular 2001-05-31 00:00:00   0.119
7   molecular 2001-11-30 00:00:00   0.133
8   molecular 2001-12-31 00:00:00   0.063
9   molecular 2012-10-31 00:00:00   0.082
10  molecular 2013-09-30 00:00:00   0.123
11     neuron 2001-01-31 00:00:00   0.234
12     neuron 2001-05-31 00:00:00   0.129
13     neuron 2001-11-30 00:00:00   0.218
14     neuron 2001-12-31 00:00:00   0.128
15     neuron 2012-10-31 00:00:00   0.584
16     neuron 2013-09-30 00:00:00   0.063
17    nucleus 2001-01-31 00:00:00   0.181
18    nucleus 2001-05-31 00:00:00   0.189
19    nucleus 2001-11-30 00:00:00   0.127
20    nucleus 2001-12-31 00:00:00   0.257
21    nucleus 2012-10-31 00:00:00   0.168
22    nucleus 2013-09-30 00:00:00   0.496
23       stem 2001-01-31 00:00:00   0.172
24       stem 2001-05-31 00:00:00   0.374
25       stem 2001-11-30 00:00:00   0.264
26       stem 2001-12-31 00:00:00   0.305
27       stem 2012-10-31 00:00:00   0.167
28       stem 2013-09-30 00:00:00   0.251

Lastly, if you want to put your columns in a different order, use reindex:

In [210]: g.dropna().reset_index().reindex(columns=['visits', 'string', 'date'])
Out[210]:
    visits     string                date
0    0.177    current 2001-01-31 00:00:00
1    0.188    current 2001-05-31 00:00:00
2    0.259    current 2001-11-30 00:00:00
3    0.248    current 2001-12-31 00:00:00
4    0.068    current 2013-09-30 00:00:00
5    0.236  molecular 2001-01-31 00:00:00
6    0.119  molecular 2001-05-31 00:00:00
7    0.133  molecular 2001-11-30 00:00:00
8    0.063  molecular 2001-12-31 00:00:00
9    0.082  molecular 2012-10-31 00:00:00
10   0.123  molecular 2013-09-30 00:00:00
11   0.234     neuron 2001-01-31 00:00:00
12   0.129     neuron 2001-05-31 00:00:00
13   0.218     neuron 2001-11-30 00:00:00
14   0.128     neuron 2001-12-31 00:00:00
15   0.584     neuron 2012-10-31 00:00:00
16   0.063     neuron 2013-09-30 00:00:00
17   0.181    nucleus 2001-01-31 00:00:00
18   0.189    nucleus 2001-05-31 00:00:00
19   0.127    nucleus 2001-11-30 00:00:00
20   0.257    nucleus 2001-12-31 00:00:00
21   0.168    nucleus 2012-10-31 00:00:00
22   0.496    nucleus 2013-09-30 00:00:00
23   0.172       stem 2001-01-31 00:00:00
24   0.374       stem 2001-05-31 00:00:00
25   0.264       stem 2001-11-30 00:00:00
26   0.305       stem 2001-12-31 00:00:00
27   0.167       stem 2012-10-31 00:00:00
28   0.251       stem 2013-09-30 00:00:00
like image 51
Phillip Cloud Avatar answered Oct 12 '22 23:10

Phillip Cloud