Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding count of duplicate values and ordering in a Pandas dataframe

Tags:

python

pandas

I have a Pandas dataframe with many columns, two of them are "movie title" and "age", I want to find the top 5 movies with the lowest average age of the people that rated them, but only include movies which have at least 100 ratings (so appear in at least 100 rows).

For example:

movie title      age

Title 1          10
Title 2          12
Title 2          12
Title 3          13
Title 3          13
Title 3          13

Should become:

movie title     # of ratings     avg age

Title 1         1                    10
Title 2         2                    12
Title 3         3                    13

It can be in the same or a new dataframe. Thanks for your help!

like image 248
user2453297 Avatar asked Mar 12 '23 17:03

user2453297


2 Answers

Say you do

agg = df.age.groupby([df['movie title']]).agg({'ave_age': 'mean', 'size': 'size'})

You'll get a DataFrame with columns ave_age and size.

agg[agg['size'] > 100]

will give you only those that have more than 100 users. From there, sort by agg.ave_age and take the top 5. It should look something like this:

agg[agg['size'] > 100].sort_values(by='ave_age', ascending=True).head(5)
like image 154
Ami Tavory Avatar answered Mar 15 '23 06:03

Ami Tavory


The filter creates a flag for each movie that is set to True if the movie title count is more than one hundred and False otherwise.

n = 100
filter = (df.groupby(['movie title'])['age']
          .transform(lambda group: group.count()) >= n)

Given the small size of your sample data, I will set n to be 2 and create my filter.

Now I just filter on movies with a count exceeding n, calculate the average age per group, and then take the five smallest (i.e. lowest age).

>>> df[filter.values].groupby('movie title').age.mean().nsmallest(5)
movie title
Title 2    12
Title 3    13
Name: age, dtype: int64
like image 32
Alexander Avatar answered Mar 15 '23 07:03

Alexander