I have a Pandas dataframe with many columns, two of them are "movie title" and "age", I want to find the top 5 movies with the lowest average age of the people that rated them, but only include movies which have at least 100 ratings (so appear in at least 100 rows).
For example:
movie title age
Title 1 10
Title 2 12
Title 2 12
Title 3 13
Title 3 13
Title 3 13
Should become:
movie title # of ratings avg age
Title 1 1 10
Title 2 2 12
Title 3 3 13
It can be in the same or a new dataframe. Thanks for your help!
Say you do
agg = df.age.groupby([df['movie title']]).agg({'ave_age': 'mean', 'size': 'size'})
You'll get a DataFrame with columns ave_age
and size
.
agg[agg['size'] > 100]
will give you only those that have more than 100 users. From there, sort by agg.ave_age
and take the top 5. It should look something like this:
agg[agg['size'] > 100].sort_values(by='ave_age', ascending=True).head(5)
The filter creates a flag for each movie that is set to True if the movie title count is more than one hundred and False otherwise.
n = 100
filter = (df.groupby(['movie title'])['age']
.transform(lambda group: group.count()) >= n)
Given the small size of your sample data, I will set n
to be 2 and create my filter.
Now I just filter on movies with a count exceeding n
, calculate the average age per group, and then take the five smallest (i.e. lowest age).
>>> df[filter.values].groupby('movie title').age.mean().nsmallest(5)
movie title
Title 2 12
Title 3 13
Name: age, dtype: int64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With