Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas counting the number of group elements excluding the focal element

Tags:

python

pandas

I have data frame looks like this

company   tool   category         year  month
Amazon    A      productivity     2014     9
Amazon    B      productivity     2014     8
Apple     A      productivity     2014     6
Apple     C      CRM              2015     4 
Apple     D      CRM              2015     3
Google    C      CRM              2015     6
Google    E      HR               2014     9 
Google    F      productivity     2014     11
Google    G      productivity     2014     12    

The first column shows the purchaser of the tool, tool column corresponds to the name of the tool, category shows a tool's purpose, year and month are the dates of purchase.

For each tool, I would like to create a following data:

tool   monthlydate    cumulative_sales no_companies_comp year month
A      2014/06              1                 0          2014  6
A      2014/07              1                 0          2014  7
A      2014/08              1                 1          2014  8
A      2014/09              2                 1          2014  9
A      2014/10              2                 1          2014  10
A      2014/11              2                 2          2014  11
A      2014/12              2                 2          2014  12

where cumulative_sales corresponds to the cumulative sales of the tool at hand in a given year month, no_companies_comp corresponds to the cumulative number of companies that purchased a competitor tool in a given year month (note that a company might purchase multiple competitor tools, but we will only count its first purchase as we are interested in number of companies). How could I achieve this?

like image 684
edyvedy13 Avatar asked Sep 19 '21 19:09

edyvedy13


People also ask

How do you count groups in pandas?

You can use pandas DataFrame. groupby(). count() to group columns and compute the count or size aggregate, this calculates a rows count for each group combination.

How do you count the number of elements in a panda?

Pandas DataFrame count() Method The count() method counts the number of not empty values for each row, or column if you specify the axis parameter as axis='columns' , and returns a Series object with the result for each row (or column).

What does unstack do pandas?

Pandas DataFrame: unstack() function Pivot a level of the (necessarily hierarchical) index labels, returning a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels.

How do I get the Count of a group in pandas?

You can also use the pandas groupby count () function which gives the “count” of values in each column for each group. For example, let’s group the dataframe df on the “Team” column and apply the count () function. We get a dataframe of counts of values for each group and each column. Note that counts are similar to the row sizes we got above.

How to count the number of occurrences in a Dataframe in pandas?

Using the size () or count () method with pandas.DataFrame.groupby () will generate the count of a number of occurrences of data present in a particular column of the dataframe. However, this operation can also be performed using pandas.Series.value_counts () and, pandas.Index.value_counts ().

How to groupby two columns in pandas Dataframe?

In this article, we will GroupBy two columns and count the occurrences of each combination in Pandas. DataFrame.groupby () method is used to separate the DataFrame into groups. It will generate the number of similar data counts present in a particular column of the data frame. by: mapping, function, string, label, or iterable to group elements.

How do you count rows in a group in a Dataframe?

You can see that we get the count of rows for each group. There are two rows for team A, three rows for team B, and one row for team C in the dataframe df. You can also use the pandas groupby count () function which gives the “count” of values in each column for each group.


1 Answers

With a simple groupby we can obtain the number of sales and companies buying each tool:

>>> sales = df.groupby(['tool', 'year', 'month']).size()
>>> sales
tool  year  month
A     2014  6        1
            9        1
B     2014  8        1
C     2015  4        1
            6        1
D     2015  3        1
E     2014  9        1
F     2014  11       1
G     2014  12       1
dtype: int64
>>> companies = df.groupby(['tool', 'year', 'month'])['company'].nunique()
>>> companies
tool  year  month
A     2014  6        1
            9        1
B     2014  8        1
C     2015  4        1
            6        1
D     2015  3        1
E     2014  9        1
F     2014  11       1
G     2014  12       1
Name: company, dtype: int64

Then cumulative sales is easy:

>>> sales.groupby('tool').cumsum()
tool  year  month
A     2014  6        1
            9        2
B     2014  8        1
C     2015  4        1
            6        2
D     2015  3        1
E     2014  9        1
F     2014  11       1
G     2014  12       1
dtype: int64

Note that a number of months are missing, so we should reindex:

>>> dates = [(2014 + n // 12, (n - 1) % 12 + 1) for n in range(6, 19)]
>>> idx = pd.MultiIndex.from_tuples([
...     (tool, year, month) for tool in df['tool'].unique() for year, month in dates
... ], names=['tool', 'year', 'month'])
>>> cum_sales = sales.reindex(idx, fill_value=0).groupby('tool').cumsum()
>>> cum_sales.unstack('tool')
tool        A  B  C  D  E  F  G
year month                     
2014 6      1  0  0  0  0  0  0
     7      1  0  0  0  0  0  0
     8      1  1  0  0  0  0  0
     9      2  1  0  0  1  0  0
     10     2  1  0  0  1  0  0
     11     2  1  0  0  1  1  0
2015 1      2  1  0  0  1  1  0
     2      2  1  0  0  1  1  0
     3      2  1  0  1  1  1  0
     4      2  1  1  1  1  1  0
     5      2  1  1  1  1  1  0
     6      2  1  2  1  1  1  0
     12     2  1  0  0  1  1  0

You can of course change the range of dates as needed.

Number of companies buying a competitor tool is the number of companies buying any tool, minus the number of companies buying each tool. We can do that with transform but similarly to above, we need to reindex first:

>>> companies = companies.reindex(idx, fill_value=0)
>>> total_companies = companies.groupby(['year', 'month']).transform('sum')
>>> cum_compet_companies = (total_companies - companies).groupby('tool').cumsum()
>>> cum_compet_companies.unstack('tool')
tool        A  B  C  D  E  F  G
year month                     
2014 6      0  1  1  1  1  1  1
     7      0  1  1  1  1  1  1
     8      1  1  2  2  2  2  2
     9      2  3  4  4  3  4  4
     10     2  3  4  4  3  4  4
     11     3  4  5  5  4  4  5
2015 1      3  4  5  5  4  4  5
     2      3  4  5  5  4  4  5
     3      4  5  6  5  5  5  6
     4      5  6  6  6  6  6  7
     5      5  6  6  6  6  6  7
     6      6  7  6  7  7  7  8
     12     3  4  5  5  4  4  5

The rest is just simply joining the data and adding monthlydate, possibly playing with indexes:

>>> res = cum_sales.to_frame('cumulative_sales').join(
...     cum_compet_companies.to_frame('no_companies_comp')
... ).reset_index()
>>> res['monthlydate'] = res['year'].combine(res['month'], lambda y, m: f'{y}/{m:02}')
>>> res.set_index(['tool', 'monthlydate']).loc['A']  # just tool A
             year  month  cumulative_sales  no_companies_comp
monthlydate                                                  
2014/06      2014      6                 1                  0
2014/07      2014      7                 1                  0
2014/08      2014      8                 1                  1
2014/09      2014      9                 2                  2
2014/10      2014     10                 2                  2
2014/11      2014     11                 2                  3
2015/12      2015     12                 2                  3
2015/01      2015      1                 2                  3
2015/02      2015      2                 2                  3
2015/03      2015      3                 2                  4
2015/04      2015      4                 2                  5
2015/05      2015      5                 2                  5
2015/06      2015      6                 2                  6
>>> res.set_index(['tool', 'monthlydate'])  # all tools
                  year  month  cumulative_sales  no_companies_comp
tool monthlydate                                                  
A    2014/06      2014      6                 1                  0
     2014/07      2014      7                 1                  0
     2014/08      2014      8                 1                  1
     2014/09      2014      9                 2                  2
     2014/10      2014     10                 2                  2
...                ...    ...               ...                ...
G    2015/02      2015      2                 0                  5
     2015/03      2015      3                 0                  6
     2015/04      2015      4                 0                  7
     2015/05      2015      5                 0                  7
     2015/06      2015      6                 0                  8

[91 rows x 4 columns]
like image 184
Cimbali Avatar answered Sep 20 '22 09:09

Cimbali