I have a data frame that looks like this:
company  tool  category      year  month
Amazon   A     productivity  2014  9
Amazon   B     productivity  2014  8
Apple    A     productivity  2014  6
Apple    C     CRM           2015  4
Apple    D     CRM           2015  3
Google   C     CRM           2015  6
Google   E     HR            2014  9
Google   F     productivity  2014  11
Google   G     productivity  2014  12
The first column shows the purchaser of the tool, the tool column is the name of the tool, category is the tool's purpose, and year and month give the date of purchase.
For each tool, I would like to create the following data:
tool  monthlydate  cumulative_sales  no_companies_comp  year  month
A     2014/06      1                 0                  2014  6
A     2014/07      1                 0                  2014  7
A     2014/08      1                 1                  2014  8
A     2014/09      2                 1                  2014  9
A     2014/10      2                 1                  2014  10
A     2014/11      2                 2                  2014  11
A     2014/12      2                 2                  2014  12
where cumulative_sales is the cumulative number of sales of the tool at hand up to a given year and month, and no_companies_comp is the cumulative number of companies that have purchased a competitor tool by that year and month (note that a company might purchase multiple competitor tools, but only its first purchase should count, as we are interested in the number of companies). How could I achieve this?
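To make the snippets below reproducible, the sample data from the question can be reconstructed like this (values copied verbatim from the question's table):

```python
import pandas as pd

# Reconstruct the sample data from the question
df = pd.DataFrame({
    'company': ['Amazon', 'Amazon', 'Apple', 'Apple', 'Apple',
                'Google', 'Google', 'Google', 'Google'],
    'tool': ['A', 'B', 'A', 'C', 'D', 'C', 'E', 'F', 'G'],
    'category': ['productivity', 'productivity', 'productivity', 'CRM',
                 'CRM', 'CRM', 'HR', 'productivity', 'productivity'],
    'year': [2014, 2014, 2014, 2015, 2015, 2015, 2014, 2014, 2014],
    'month': [9, 8, 6, 4, 3, 6, 9, 11, 12],
})
```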
With a simple groupby we can obtain, per month, the number of sales and the number of companies buying each tool:
>>> sales = df.groupby(['tool', 'year', 'month']).size()
>>> sales
tool year month
A 2014 6 1
9 1
B 2014 8 1
C 2015 4 1
6 1
D 2015 3 1
E 2014 9 1
F 2014 11 1
G 2014 12 1
dtype: int64
>>> companies = df.groupby(['tool', 'year', 'month'])['company'].nunique()
>>> companies
tool year month
A 2014 6 1
9 1
B 2014 8 1
C 2015 4 1
6 1
D 2015 3 1
E 2014 9 1
F 2014 11 1
G 2014 12 1
Name: company, dtype: int64
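In this sample every (tool, year, month) group contains a single row, so nunique is always 1; it matters when the same company buys the same tool more than once. A toy illustration (the rows here are made up, not from the question's data):

```python
import pandas as pd

# Hypothetical example: Apple buys tool C twice
d = pd.DataFrame({'company': ['Apple', 'Apple', 'Google'],
                  'tool': ['C', 'C', 'C']})
# nunique collapses the duplicate Apple rows into one distinct company
n = d.groupby('tool')['company'].nunique()
```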
Then cumulative sales is easy:
>>> sales.groupby('tool').cumsum()
tool year month
A 2014 6 1
9 2
B 2014 8 1
C 2015 4 1
6 2
D 2015 3 1
E 2014 9 1
F 2014 11 1
G 2014 12 1
dtype: int64
Note that a number of months are missing, so we should reindex:
>>> dates = [(2014 + (n - 1) // 12, (n - 1) % 12 + 1) for n in range(6, 19)]
>>> idx = pd.MultiIndex.from_tuples([
...     (tool, year, month) for tool in df['tool'].unique() for year, month in dates
... ], names=['tool', 'year', 'month'])
>>> cum_sales = sales.reindex(idx, fill_value=0).groupby('tool').cumsum()
>>> cum_sales.unstack('tool')
tool        A  B  C  D  E  F  G
year month
2014 6      1  0  0  0  0  0  0
     7      1  0  0  0  0  0  0
     8      1  1  0  0  0  0  0
     9      2  1  0  0  1  0  0
     10     2  1  0  0  1  0  0
     11     2  1  0  0  1  1  0
     12     2  1  0  0  1  1  1
2015 1      2  1  0  0  1  1  1
     2      2  1  0  0  1  1  1
     3      2  1  0  1  1  1  1
     4      2  1  1  1  1  1  1
     5      2  1  1  1  1  1  1
     6      2  1  2  1  1  1  1
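If you prefer not to do the year/month arithmetic by hand, `pd.period_range` can generate the same monthly grid; a sketch:

```python
import pandas as pd

# Build (year, month) pairs for 2014-06 through 2015-06 inclusive
periods = pd.period_range('2014-06', '2015-06', freq='M')
dates = [(p.year, p.month) for p in periods]
```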
You can of course change the range of dates as needed.
The number of companies buying a competitor tool is the number of companies buying any tool, minus the number of companies buying the tool at hand. We can do that with transform, but as above, we need to reindex first:
>>> companies = companies.reindex(idx, fill_value=0)
>>> total_companies = companies.groupby(['year', 'month']).transform('sum')
>>> cum_compet_companies = (total_companies - companies).groupby('tool').cumsum()
>>> cum_compet_companies.unstack('tool')
tool        A  B  C  D  E  F  G
year month
2014 6      0  1  1  1  1  1  1
     7      0  1  1  1  1  1  1
     8      1  1  2  2  2  2  2
     9      2  3  4  4  3  4  4
     10     2  3  4  4  3  4  4
     11     3  4  5  5  4  4  5
     12     4  5  6  6  5  5  5
2015 1      4  5  6  6  5  5  5
     2      4  5  6  6  5  5  5
     3      5  6  7  6  6  6  6
     4      6  7  7  7  7  7  7
     5      6  7  7  7  7  7  7
     6      7  8  7  8  8  8  8
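As an aside, `transform('sum')` differs from a plain aggregation in that it broadcasts each group's total back to every row of the group, keeping the original index, which is what makes the subtraction line up. A minimal illustration (with made-up values):

```python
import pandas as pd

# Series indexed by (tool, year); values are company counts (made up)
s = pd.Series([1, 2, 5],
              index=pd.MultiIndex.from_tuples(
                  [('A', 2014), ('B', 2014), ('A', 2015)],
                  names=['tool', 'year']))
# transform('sum') returns a Series aligned with s: each row gets its
# year group's total, so subtracting s gives "everyone else's" count
totals = s.groupby('year').transform('sum')
others = totals - s
```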
The rest is just joining the data and adding monthlydate, possibly playing with the indexes:
>>> res = cum_sales.to_frame('cumulative_sales').join(
... cum_compet_companies.to_frame('no_companies_comp')
... ).reset_index()
>>> res['monthlydate'] = res['year'].combine(res['month'], lambda y, m: f'{y}/{m:02}')
>>> res.set_index(['tool', 'monthlydate']).loc['A'] # just tool A
             year  month  cumulative_sales  no_companies_comp
monthlydate
2014/06      2014      6                 1                  0
2014/07      2014      7                 1                  0
2014/08      2014      8                 1                  1
2014/09      2014      9                 2                  2
2014/10      2014     10                 2                  2
2014/11      2014     11                 2                  3
2014/12      2014     12                 2                  4
2015/01      2015      1                 2                  4
2015/02      2015      2                 2                  4
2015/03      2015      3                 2                  5
2015/04      2015      4                 2                  6
2015/05      2015      5                 2                  6
2015/06      2015      6                 2                  7
>>> res.set_index(['tool', 'monthlydate']) # all tools
                  year  month  cumulative_sales  no_companies_comp
tool monthlydate
A    2014/06      2014      6                 1                  0
     2014/07      2014      7                 1                  0
     2014/08      2014      8                 1                  1
     2014/09      2014      9                 2                  2
     2014/10      2014     10                 2                  2
...                ...    ...               ...                ...
G    2015/02      2015      2                 1                  5
     2015/03      2015      3                 1                  6
     2015/04      2015      4                 1                  7
     2015/05      2015      5                 1                  7
     2015/06      2015      6                 1                  8

[91 rows x 4 columns]
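For convenience, here is the whole pipeline assembled into one script (using `pd.period_range` for the date grid instead of manual tuple arithmetic, and a string-concatenation variant for monthlydate; category is omitted since it is not needed for the computation):

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['Amazon', 'Amazon', 'Apple', 'Apple', 'Apple',
                'Google', 'Google', 'Google', 'Google'],
    'tool': ['A', 'B', 'A', 'C', 'D', 'C', 'E', 'F', 'G'],
    'year': [2014, 2014, 2014, 2015, 2015, 2015, 2014, 2014, 2014],
    'month': [9, 8, 6, 4, 3, 6, 9, 11, 12],
})

# Per-month sales and distinct buying companies, per tool
sales = df.groupby(['tool', 'year', 'month']).size()
companies = df.groupby(['tool', 'year', 'month'])['company'].nunique()

# Full (tool, year, month) grid covering 2014-06 through 2015-06
periods = pd.period_range('2014-06', '2015-06', freq='M')
idx = pd.MultiIndex.from_tuples(
    [(t, p.year, p.month) for t in df['tool'].unique() for p in periods],
    names=['tool', 'year', 'month'])

# Cumulative sales per tool over the full grid
cum_sales = sales.reindex(idx, fill_value=0).groupby('tool').cumsum()

# Cumulative companies buying any competitor tool
companies = companies.reindex(idx, fill_value=0)
total = companies.groupby(['year', 'month']).transform('sum')
cum_comp = (total - companies).groupby('tool').cumsum()

res = cum_sales.to_frame('cumulative_sales').join(
    cum_comp.to_frame('no_companies_comp')).reset_index()
res['monthlydate'] = (res['year'].astype(str) + '/'
                      + res['month'].astype(str).str.zfill(2))
```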