I have the table below in a Pandas DataFrame:
q_string q_visits q_date
0 nucleus 1790 2012-10-02 00:00:00
1 neuron 364 2012-10-02 00:00:00
2 current 280 2012-10-02 00:00:00
3 molecular 259 2012-10-02 00:00:00
4 stem 201 2012-10-02 00:00:00
The table contains query volume from a server log, by day. I would like to do 2 things:
(1) Group the data by month and sum up each term's q_visits for that month.
(2) Compute each term's share of the monthly total q_visits. I.e., a term's monthly query volume divided by the total query volume for the month across all terms.
What is the best way of doing this?
If I understand you correctly:
For (1) do this:
Make some fake data by sampling from the query strings you gave, along with some random dates and visit counts:
In [179]: string = Series(np.random.choice(df.string.values, size=100), name='string')
In [180]: visits = Series(poisson(1000, size=100), name='visits')
In [181]: date = Series(np.random.choice([df.date[0], now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')
In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})
In [183]: df.head()
Out[183]:
date string visits
0 2001-11-15 00:00:00 current 997
1 2001-11-15 00:00:00 current 974
2 2012-10-02 00:00:00 stem 982
3 2001-12-01 00:00:00 stem 984
4 2001-01-01 00:00:00 current 989
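For reference, here is a self-contained version of the same fake-data setup. The imports and the now alias are my assumptions about what the interactive session above had in scope (the transcript appears to rely on a star-import or pylab-style session), and I substitute the question's query strings directly since df.string above refers to the asker's frame under shorter column names:

import numpy as np
from numpy.random import poisson
from pandas import DataFrame, Series, Timestamp

now = Timestamp.now  # alias assumed for the bare now() call in the transcript

# sample 100 query strings from the ones given in the question
string = Series(np.random.choice(['nucleus', 'neuron', 'current', 'molecular', 'stem'],
                                 size=100), name='string')
# Poisson-distributed visit counts
visits = Series(poisson(1000, size=100), name='visits')
# a handful of dates to sample from
date = Series(np.random.choice([Timestamp('10/2/2012'), now(), Timestamp('1/1/2001'),
                                Timestamp('11/15/2001'), Timestamp('12/1/01'),
                                Timestamp('5/1/01')], size=100),
              dtype='datetime64[ns]', name='date')
df = DataFrame({'string': string, 'visits': visits, 'date': date})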
In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')
In [187]: resamp.head()
Out[187]:
visits
string date
current 2001-01-31 2996
2001-02-28 NaN
2001-03-31 NaN
2001-04-30 NaN
2001-05-31 3016
The NaN is there because there were no visits with that query string in those months.
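As an aside, the resample('M', how='sum') form above is from older pandas; the how keyword has since been removed. On recent versions something like the following should be equivalent (a sketch, not tested against every release; very new versions prefer the 'ME' alias over 'M'):

# group by query string, then resample each group's dates to month end and sum the visits
resamp = (df.set_index('date')
            .groupby('string')['visits']
            .resample('M')
            .sum())
# note: this yields a Series named 'visits'; call .to_frame() if you want the
# one-column DataFrame shown in the transcript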
For (2), group by the dates and then divide by the sum:
In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())
In [189]: g.head()
Out[189]:
visits
string date
current 2001-01-31 0.177
2001-02-28 NaN
2001-03-31 NaN
2001-04-30 NaN
2001-05-31 0.188
Just to convince you that (2) is doing what you want:
In [176]: h = g.sortlevel('date').head()
In [177]: h
Out[177]:
visits
string date
current 2001-01-31 0.077
molecular 2001-01-31 0.228
neuron 2001-01-31 0.073
nucleus 2001-01-31 0.234
stem 2001-01-31 0.388
In [178]: h.sum()
Out[178]:
visits 1
dtype: float64
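If you prefer to avoid apply (and on newer pandas, where sortlevel has been replaced by sort_index), a transform-based version of step (2) would look roughly like this; it divides each month's entries by that month's total in one vectorized pass:

# total visits per month, broadcast back to the original shape
month_totals = resamp.groupby(level='date').transform('sum')
g = resamp / month_totals

# sanity check: shares within any single month sum to 1
g.groupby(level='date').sum().head()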
If you want to convert resamp into a DataFrame and remove the NaNs, do:
In [196]: resamp.dropna()
Out[196]:
visits
string date
current 2001-01-31 2996
2001-05-31 3016
2001-11-30 5959
2001-12-31 3998
2013-09-30 1077
molecular 2001-01-31 3984
2001-05-31 1911
2001-11-30 3054
2001-12-31 1020
2012-10-31 977
2013-09-30 1947
neuron 2001-01-31 3961
2001-05-31 2069
2001-11-30 5010
2001-12-31 2065
2012-10-31 6973
2013-09-30 994
nucleus 2001-01-31 3060
2001-05-31 3035
2001-11-30 2924
2001-12-31 4144
2012-10-31 2004
2013-09-30 7881
stem 2001-01-31 2911
2001-05-31 5994
2001-11-30 6072
2001-12-31 4916
2012-10-31 1991
2013-09-30 3977
In [197]: resamp.dropna().reset_index()
Out[197]:
string date visits
0 current 2001-01-31 00:00:00 2996
1 current 2001-05-31 00:00:00 3016
2 current 2001-11-30 00:00:00 5959
3 current 2001-12-31 00:00:00 3998
4 current 2013-09-30 00:00:00 1077
5 molecular 2001-01-31 00:00:00 3984
6 molecular 2001-05-31 00:00:00 1911
7 molecular 2001-11-30 00:00:00 3054
8 molecular 2001-12-31 00:00:00 1020
9 molecular 2012-10-31 00:00:00 977
10 molecular 2013-09-30 00:00:00 1947
11 neuron 2001-01-31 00:00:00 3961
12 neuron 2001-05-31 00:00:00 2069
13 neuron 2001-11-30 00:00:00 5010
14 neuron 2001-12-31 00:00:00 2065
15 neuron 2012-10-31 00:00:00 6973
16 neuron 2013-09-30 00:00:00 994
17 nucleus 2001-01-31 00:00:00 3060
18 nucleus 2001-05-31 00:00:00 3035
19 nucleus 2001-11-30 00:00:00 2924
20 nucleus 2001-12-31 00:00:00 4144
21 nucleus 2012-10-31 00:00:00 2004
22 nucleus 2013-09-30 00:00:00 7881
23 stem 2001-01-31 00:00:00 2911
24 stem 2001-05-31 00:00:00 5994
25 stem 2001-11-30 00:00:00 6072
26 stem 2001-12-31 00:00:00 4916
27 stem 2012-10-31 00:00:00 1991
28 stem 2013-09-30 00:00:00 3977
You can of course do this for g as well:
In [198]: g.dropna()
Out[198]:
visits
string date
current 2001-01-31 0.177
2001-05-31 0.188
2001-11-30 0.259
2001-12-31 0.248
2013-09-30 0.068
molecular 2001-01-31 0.236
2001-05-31 0.119
2001-11-30 0.133
2001-12-31 0.063
2012-10-31 0.082
2013-09-30 0.123
neuron 2001-01-31 0.234
2001-05-31 0.129
2001-11-30 0.218
2001-12-31 0.128
2012-10-31 0.584
2013-09-30 0.063
nucleus 2001-01-31 0.181
2001-05-31 0.189
2001-11-30 0.127
2001-12-31 0.257
2012-10-31 0.168
2013-09-30 0.496
stem 2001-01-31 0.172
2001-05-31 0.374
2001-11-30 0.264
2001-12-31 0.305
2012-10-31 0.167
2013-09-30 0.251
In [199]: g.dropna().reset_index()
Out[199]:
string date visits
0 current 2001-01-31 00:00:00 0.177
1 current 2001-05-31 00:00:00 0.188
2 current 2001-11-30 00:00:00 0.259
3 current 2001-12-31 00:00:00 0.248
4 current 2013-09-30 00:00:00 0.068
5 molecular 2001-01-31 00:00:00 0.236
6 molecular 2001-05-31 00:00:00 0.119
7 molecular 2001-11-30 00:00:00 0.133
8 molecular 2001-12-31 00:00:00 0.063
9 molecular 2012-10-31 00:00:00 0.082
10 molecular 2013-09-30 00:00:00 0.123
11 neuron 2001-01-31 00:00:00 0.234
12 neuron 2001-05-31 00:00:00 0.129
13 neuron 2001-11-30 00:00:00 0.218
14 neuron 2001-12-31 00:00:00 0.128
15 neuron 2012-10-31 00:00:00 0.584
16 neuron 2013-09-30 00:00:00 0.063
17 nucleus 2001-01-31 00:00:00 0.181
18 nucleus 2001-05-31 00:00:00 0.189
19 nucleus 2001-11-30 00:00:00 0.127
20 nucleus 2001-12-31 00:00:00 0.257
21 nucleus 2012-10-31 00:00:00 0.168
22 nucleus 2013-09-30 00:00:00 0.496
23 stem 2001-01-31 00:00:00 0.172
24 stem 2001-05-31 00:00:00 0.374
25 stem 2001-11-30 00:00:00 0.264
26 stem 2001-12-31 00:00:00 0.305
27 stem 2012-10-31 00:00:00 0.167
28 stem 2013-09-30 00:00:00 0.251
Lastly, if you want to put your columns in a different order, use reindex:
In [210]: g.dropna().reset_index().reindex(columns=['visits', 'string', 'date'])
Out[210]:
visits string date
0 0.177 current 2001-01-31 00:00:00
1 0.188 current 2001-05-31 00:00:00
2 0.259 current 2001-11-30 00:00:00
3 0.248 current 2001-12-31 00:00:00
4 0.068 current 2013-09-30 00:00:00
5 0.236 molecular 2001-01-31 00:00:00
6 0.119 molecular 2001-05-31 00:00:00
7 0.133 molecular 2001-11-30 00:00:00
8 0.063 molecular 2001-12-31 00:00:00
9 0.082 molecular 2012-10-31 00:00:00
10 0.123 molecular 2013-09-30 00:00:00
11 0.234 neuron 2001-01-31 00:00:00
12 0.129 neuron 2001-05-31 00:00:00
13 0.218 neuron 2001-11-30 00:00:00
14 0.128 neuron 2001-12-31 00:00:00
15 0.584 neuron 2012-10-31 00:00:00
16 0.063 neuron 2013-09-30 00:00:00
17 0.181 nucleus 2001-01-31 00:00:00
18 0.189 nucleus 2001-05-31 00:00:00
19 0.127 nucleus 2001-11-30 00:00:00
20 0.257 nucleus 2001-12-31 00:00:00
21 0.168 nucleus 2012-10-31 00:00:00
22 0.496 nucleus 2013-09-30 00:00:00
23 0.172 stem 2001-01-31 00:00:00
24 0.374 stem 2001-05-31 00:00:00
25 0.264 stem 2001-11-30 00:00:00
26 0.305 stem 2001-12-31 00:00:00
27 0.167 stem 2012-10-31 00:00:00
28 0.251 stem 2013-09-30 00:00:00
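Note that reindex(columns=...) here is only reordering columns; plain column selection does the same thing and is what you will more often see:

g.dropna().reset_index()[['visits', 'string', 'date']]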