How to merge pandas value_counts() to dataframe or use it to subset a dataframe

Tags:

python

pandas

I used pandas df.value_counts() to find the number of occurrences of particular brands. I want to merge those value counts with the respective brands in the initial dataframe.

 # df has many columns, including one named 'brands'
 brands = df.brands.value_counts()

 brand1    143
 brand2     21
 brand3    101
 etc.

How do I merge the value counts with the original dataframe such that each brand's corresponding count is in a new column, say "brand_count"?

Is it possible to assign headers to these columns? The names function won't work with a Series, and I was unable to convert it to a DataFrame to merge the data that way. value_counts outputs a Series of dtype int64 (the brand names should be strings), which means I cannot do the following:

 df2 = pd.DataFrame({'brands': list(brands_all[0]),
                     'brand_count': list(brands_all[1])})
 (merge with df)
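For reference, one way around the Series-to-DataFrame conversion problem described above (a sketch with made-up data) is reset_index(), which turns the value_counts() index into an ordinary column that merge() can join on:

```python
import pandas as pd

# Hypothetical data standing in for the original df.
df = pd.DataFrame({'brands': ['brand1', 'brand2', 'brand1', 'brand3', 'brand1']})

# value_counts() returns a Series indexed by brand name;
# reset_index() turns that index into a regular column.
counts = df['brands'].value_counts().reset_index()
counts.columns = ['brands', 'brand_count']  # name the columns explicitly

# A left merge attaches each brand's count to every matching row.
merged = df.merge(counts, on='brands', how='left')
print(merged)
```

Setting `counts.columns` explicitly keeps this version-agnostic, since the column names produced by `reset_index()` here differ between pandas versions.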

Ultimately, I want to obtain this:

 col1  col2  col3  brands  brand_count ... col150
                   A        30
                   C        140
                   A        30
                   B        111 
user2476665 asked Mar 05 '16

People also ask

What does the value_counts() function do in pandas?

The value_counts() function returns a Series containing counts of unique values. The result is in descending order, so the first element is the most frequently occurring value.
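A minimal illustration with toy data:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a', 'c'])

# Counts per unique value, most frequent first.
counts = s.value_counts()
print(counts)
```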

How do I merge one DataFrame to another?

The concat() function can be used to concatenate two DataFrames by adding the rows of one to the other. The merge() function is equivalent to the SQL JOIN clause: 'left', 'right' and 'inner' joins are all possible.

How do I merge two DataFrames in pandas?

Pandas DataFrame merge() function is used to merge two DataFrame objects with a database-style join operation. The joining is performed on columns or indexes. If the joining is done on columns, indexes are ignored. This function returns a new DataFrame and the source DataFrame objects are unchanged.
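A small sketch with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [10, 20, 40]})

# Inner join keeps only keys present in both frames;
# neither source DataFrame is modified.
joined = left.merge(right, on='key', how='inner')
print(joined)
```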


2 Answers

I think the best way is to use map (note the question's column is named 'brands'):

 df['brand_count'] = df.brands.map(df.brands.value_counts())

This is much faster than the groupby method (roughly a factor of 500 on a 15,000-row df) and takes only one line.
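A quick sanity check of this approach with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'brands': ['brand1', 'brand2', 'brand1', 'brand1']})

# map() looks up each row's brand in the value_counts() Series,
# broadcasting the per-brand count back to every row.
df['brand_count'] = df['brands'].map(df['brands'].value_counts())
print(df)
```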

Egos answered Oct 21 '22

You want to use transform.

import numpy as np
import pandas as pd

np.random.seed(0)

# Create dummy data.
df = pd.DataFrame({'brands': ['brand{0}'.format(n)
                   for n in np.random.randint(0, 6, 10)]})

# transform('count') broadcasts each group's size back to the original rows.
df['brand_count'] = df.groupby('brands')['brands'].transform('count')

>>> df
   brands brand_count
0  brand4           1
1  brand5           2
2  brand0           1
3  brand3           4
4  brand3           4
5  brand3           4
6  brand1           1
7  brand3           4
8  brand5           2
9  brand2           1

For reference:

>>> df.brands.value_counts()
brand3    4
brand5    2
brand4    1
brand0    1
brand1    1
brand2    1
Name: brands, dtype: int64
Alexander answered Oct 21 '22