I have a data frame that looks like this:
company  tool  category      year  month
Amazon   A     productivity  2014  9
Amazon   B     productivity  2014  8
Apple    A     productivity  2014  6
Apple    C     CRM           2015  4
Apple    D     CRM           2015  3
Google   C     CRM           2015  6
Google   E     HR            2014  9
Google   F     productivity  2014  11
Google   G     productivity  2014  12
The first column shows the purchaser of the tool, the tool column is the name of the tool, category is the tool's purpose, and year and month give the date of purchase.
For each tool, I would like to create the following data:
tool  monthlydate  cumulative_sales  no_companies_comp  year  month
A     2014/06      1                 0                  2014  6
A     2014/07      1                 0                  2014  7
A     2014/08      1                 1                  2014  8
A     2014/09      2                 1                  2014  9
A     2014/10      2                 1                  2014  10
A     2014/11      2                 2                  2014  11
A     2014/12      2                 2                  2014  12
where cumulative_sales is the cumulative number of sales of the tool at hand up to a given year and month, and no_companies_comp is the cumulative number of companies that have purchased a competitor tool by that year and month (note that a company might purchase multiple competitor tools, but only its first purchase should count, as we are interested in the number of companies). How could I achieve this?
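To make the snippets below reproducible, the sample data from the question can be reconstructed like this (values copied verbatim from the question's table):

```python
import pandas as pd

# Reconstruct the sample data from the question
df = pd.DataFrame({
    'company': ['Amazon', 'Amazon', 'Apple', 'Apple', 'Apple',
                'Google', 'Google', 'Google', 'Google'],
    'tool': ['A', 'B', 'A', 'C', 'D', 'C', 'E', 'F', 'G'],
    'category': ['productivity', 'productivity', 'productivity', 'CRM',
                 'CRM', 'CRM', 'HR', 'productivity', 'productivity'],
    'year': [2014, 2014, 2014, 2015, 2015, 2015, 2014, 2014, 2014],
    'month': [9, 8, 6, 4, 3, 6, 9, 11, 12],
})
```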
With a simple groupby we can obtain, per month, the number of sales and the number of companies buying each tool:
>>> sales = df.groupby(['tool', 'year', 'month']).size()
>>> sales
tool year month
A 2014 6 1
9 1
B 2014 8 1
C 2015 4 1
6 1
D 2015 3 1
E 2014 9 1
F 2014 11 1
G 2014 12 1
dtype: int64
>>> companies = df.groupby(['tool', 'year', 'month'])['company'].nunique()
>>> companies
tool year month
A 2014 6 1
9 1
B 2014 8 1
C 2015 4 1
6 1
D 2015 3 1
E 2014 9 1
F 2014 11 1
G 2014 12 1
Name: company, dtype: int64
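In this sample every (tool, year, month) group contains a single row, so nunique is always 1; it matters when the same company buys the same tool more than once. A toy illustration (the rows here are made up, not from the question's data):

```python
import pandas as pd

# Hypothetical example: Apple buys tool C twice
d = pd.DataFrame({'company': ['Apple', 'Apple', 'Google'],
                  'tool': ['C', 'C', 'C']})
# nunique collapses the duplicate Apple rows into one distinct company
n = d.groupby('tool')['company'].nunique()
```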
Then cumulative sales is easy:
>>> sales.groupby('tool').cumsum()
tool year month
A 2014 6 1
9 2
B 2014 8 1
C 2015 4 1
6 2
D 2015 3 1
E 2014 9 1
F 2014 11 1
G 2014 12 1
dtype: int64
Note that a number of months are missing, so we should reindex:
>>> dates = [(2014 + (n - 1) // 12, (n - 1) % 12 + 1) for n in range(6, 19)]
>>> idx = pd.MultiIndex.from_tuples([
...     (tool, year, month) for tool in df['tool'].unique() for year, month in dates
... ], names=['tool', 'year', 'month'])
>>> cum_sales = sales.reindex(idx, fill_value=0).groupby('tool').cumsum()
>>> cum_sales.unstack('tool')
tool        A  B  C  D  E  F  G
year month
2014 6      1  0  0  0  0  0  0
     7      1  0  0  0  0  0  0
     8      1  1  0  0  0  0  0
     9      2  1  0  0  1  0  0
     10     2  1  0  0  1  0  0
     11     2  1  0  0  1  1  0
     12     2  1  0  0  1  1  1
2015 1      2  1  0  0  1  1  1
     2      2  1  0  0  1  1  1
     3      2  1  0  1  1  1  1
     4      2  1  1  1  1  1  1
     5      2  1  1  1  1  1  1
     6      2  1  2  1  1  1  1
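If you prefer not to do the year/month arithmetic by hand, `pd.period_range` can generate the same monthly grid; a sketch:

```python
import pandas as pd

# Build (year, month) pairs for 2014-06 through 2015-06 inclusive
periods = pd.period_range('2014-06', '2015-06', freq='M')
dates = [(p.year, p.month) for p in periods]
```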
You can of course change the range of dates as needed.
The number of companies buying a competitor tool is the number of companies buying any tool, minus the number of companies buying the tool at hand. We can do that with transform, but as above, we need to reindex first:
>>> companies = companies.reindex(idx, fill_value=0)
>>> total_companies = companies.groupby(['year', 'month']).transform('sum')
>>> cum_compet_companies = (total_companies - companies).groupby('tool').cumsum()
>>> cum_compet_companies.unstack('tool')
tool        A  B  C  D  E  F  G
year month
2014 6      0  1  1  1  1  1  1
     7      0  1  1  1  1  1  1
     8      1  1  2  2  2  2  2
     9      2  3  4  4  3  4  4
     10     2  3  4  4  3  4  4
     11     3  4  5  5  4  4  5
     12     4  5  6  6  5  5  5
2015 1      4  5  6  6  5  5  5
     2      4  5  6  6  5  5  5
     3      5  6  7  6  6  6  6
     4      6  7  7  7  7  7  7
     5      6  7  7  7  7  7  7
     6      7  8  7  8  8  8  8
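As an aside, `transform('sum')` differs from a plain aggregation in that it broadcasts each group's total back to every row of the group, keeping the original index, which is what makes the subtraction line up. A minimal illustration (with made-up values):

```python
import pandas as pd

# Series indexed by (tool, year); values are company counts (made up)
s = pd.Series([1, 2, 5],
              index=pd.MultiIndex.from_tuples(
                  [('A', 2014), ('B', 2014), ('A', 2015)],
                  names=['tool', 'year']))
# transform('sum') returns a Series aligned with s: each row gets its
# year group's total, so subtracting s gives "everyone else's" count
totals = s.groupby('year').transform('sum')
others = totals - s
```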
The rest is just joining the data and adding monthlydate, possibly playing with the indexes:
>>> res = cum_sales.to_frame('cumulative_sales').join(
... cum_compet_companies.to_frame('no_companies_comp')
... ).reset_index()
>>> res['monthlydate'] = res['year'].combine(res['month'], lambda y, m: f'{y}/{m:02}')
>>> res.set_index(['tool', 'monthlydate']).loc['A'] # just tool A
             year  month  cumulative_sales  no_companies_comp
monthlydate
2014/06      2014      6                 1                  0
2014/07      2014      7                 1                  0
2014/08      2014      8                 1                  1
2014/09      2014      9                 2                  2
2014/10      2014     10                 2                  2
2014/11      2014     11                 2                  3
2014/12      2014     12                 2                  4
2015/01      2015      1                 2                  4
2015/02      2015      2                 2                  4
2015/03      2015      3                 2                  5
2015/04      2015      4                 2                  6
2015/05      2015      5                 2                  6
2015/06      2015      6                 2                  7
>>> res.set_index(['tool', 'monthlydate']) # all tools
                  year  month  cumulative_sales  no_companies_comp
tool monthlydate
A    2014/06      2014      6                 1                  0
     2014/07      2014      7                 1                  0
     2014/08      2014      8                 1                  1
     2014/09      2014      9                 2                  2
     2014/10      2014     10                 2                  2
...                ...    ...               ...                ...
G    2015/02      2015      2                 1                  5
     2015/03      2015      3                 1                  6
     2015/04      2015      4                 1                  7
     2015/05      2015      5                 1                  7
     2015/06      2015      6                 1                  8

[91 rows x 4 columns]
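For convenience, here is the whole pipeline assembled into one script (using `pd.period_range` for the date grid instead of manual tuple arithmetic, and a string-concatenation variant for monthlydate; category is omitted since it is not needed for the computation):

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['Amazon', 'Amazon', 'Apple', 'Apple', 'Apple',
                'Google', 'Google', 'Google', 'Google'],
    'tool': ['A', 'B', 'A', 'C', 'D', 'C', 'E', 'F', 'G'],
    'year': [2014, 2014, 2014, 2015, 2015, 2015, 2014, 2014, 2014],
    'month': [9, 8, 6, 4, 3, 6, 9, 11, 12],
})

# Per-month sales and distinct buying companies, per tool
sales = df.groupby(['tool', 'year', 'month']).size()
companies = df.groupby(['tool', 'year', 'month'])['company'].nunique()

# Full (tool, year, month) grid covering 2014-06 through 2015-06
periods = pd.period_range('2014-06', '2015-06', freq='M')
idx = pd.MultiIndex.from_tuples(
    [(t, p.year, p.month) for t in df['tool'].unique() for p in periods],
    names=['tool', 'year', 'month'])

# Cumulative sales per tool over the full grid
cum_sales = sales.reindex(idx, fill_value=0).groupby('tool').cumsum()

# Cumulative companies buying any competitor tool
companies = companies.reindex(idx, fill_value=0)
total = companies.groupby(['year', 'month']).transform('sum')
cum_comp = (total - companies).groupby('tool').cumsum()

res = cum_sales.to_frame('cumulative_sales').join(
    cum_comp.to_frame('no_companies_comp')).reset_index()
res['monthlydate'] = (res['year'].astype(str) + '/'
                      + res['month'].astype(str).str.zfill(2))
```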