I have a ratings dataframe that has rows of userId, movieId, rating. I want to find the user with the highest number of ratings.
Here is the code I've written:
import pandas as pd
ratings = pd.read_csv('ratings.csv') # userId,movieId,rating
user_rating_counts = ratings[['userId','movieId']].groupby('userId')['movieId'].agg(['count'])
top_rator = user_rating_counts[user_rating_counts['count']==user_rating_counts['count'].max()]
Here is how the file looks like:
userId,movieId,rating
1,1,4.0
1,3,4.0
1,6,4.0
1,47,5.0
1,50,5.0
1,70,3.0
1,101,5.0
1,110,4.0
When I view top_rator in jupyter notebook it looks like this:
count
userId
414 2698
What I want to get from it is a tuple like:
(414, 2698)
How can I do that?
P.S. any comments on how I can do this better/faster/shorter would be appreciated.
Use groupby with size and then Series.agg with max and idxmax in list:
tup = tuple(ratings.groupby('userId').size().agg(['idxmax','max']))
print (tup)
(1, 8)
Explanation:
First aggregate size per groups:
#changed data - multiple groups
print (df)
userId movieId rating
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 2 47 5.0
4 2 50 5.0
5 2 70 3.0
6 2 101 5.0
7 3 110 4.0
print (df.groupby('userId').size())
userId
1 3
2 4
3 1
dtype: int64
Output is Series, so added Series.agg with list of functions idxmax and max for index and value of Series for maximum:
print (df.groupby('userId').size().agg(['idxmax','max']))
idxmax 2
max 4
dtype: int64
And last convert to tuple:
print (tuple(df.groupby('userId').size().agg(['idxmax','max'])))
(2, 4)
Solution if multiple groups with same maximum sizes:
print (ratings)
userId movieId rating
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 2 47 5.0
4 2 50 5.0
5 2 70 3.0
6 3 101 5.0
7 3 110 4.0
First aggregate size per groups, but there are 2 groups with max 3 value:
user_rating_counts = ratings.groupby('userId')['movieId'].size()
print (user_rating_counts)
userId
1 3
2 3
3 2
Name: movieId, dtype: int64
So use boolean indexing first:
top_rator = (user_rating_counts[user_rating_counts == user_rating_counts.max()])
print (top_rator)
userId
1 3
2 3
Name: movieId, dtype: int64
Create DataFrame and convert to list of tuples:
tup = list(map(tuple, top_rator.reset_index().values.tolist()))
print (tup)
[(1, 3), (2, 3)]
You can do:
sizes = df.groupby(['userId']).size()
(sizes.idxmax(), sizes.max())
#(1, 8)
Details:
Groupby userId and get the size of each group.
sizes = df.groupby(['userId']).size()
#userId
#1 8
#2 1
Create the tuple of the user with the highest number of ratings using idxmax and max:
(sizes.idxmax(), sizes.max())
#(1, 8)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With