I have a dataframe with 100K+ rows like this:
   user  document
0  john      book
1  jane   article
2  jane      book
3  jane      book
4   jim   article
5  john      book
6   jim  blogpost
7  jane  blogpost
8  jane  blogpost
9  jane  blogpost
I need the dataframe like this:
      blogpost  article  book
john         0        0     2
jane         3        1     2
jim          1        1     0
That is, I need the number of downloads for every (user, document) combination.
I'm doing .groupby(['user', 'document']) and then using df.loc to set the download count:
df = pd.DataFrame(index=users, columns=documents)
df.fillna(0, inplace=True)
grouped = records.groupby(['user', 'document'])
for elem in grouped:
    user, document = elem[0]
    downloads = len(elem[1])
    df.loc[user, document] = downloads
However, df.loc is very slow when used this way. I commented out the df.loc line and found that the loop completes quickly, so it's almost certainly the df.loc access that is slow.
How can I get this result faster?
Minimal working example:
import pandas as pd

records = pd.DataFrame([
    ('john', 'book'),
    ('jane', 'article'),
    ('jane', 'book'),
    ('jane', 'book'),
    ('jim', 'article'),
    ('john', 'book'),
    ('jim', 'blogpost'),
    ('jane', 'blogpost'),
    ('jane', 'blogpost'),
    ('jane', 'blogpost'),
], columns=['user', 'document'])
print(records)
users = list(set(records['user']))
users.sort()
documents = list(set(records['document']))
documents.sort()
print(users)
print(documents)
df = pd.DataFrame(index=users, columns=documents)
df.fillna(0, inplace=True)
print(df)
grouped = records.groupby(['user', 'document'])
for elem in grouped:
    user, document = elem[0]
    downloads = len(elem[1])
    df.loc[user, document] = downloads
There are plenty of ways to achieve this without a loop: pivot, pivot_table, crosstab, or groupby + count. For example, with crosstab:
pd.crosstab(df.user,df.document)
Out[1283]:
document article blogpost book
user
jane 1 3 2
jim 1 1 0
john 0 0 2
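An equivalent groupby-based approach, if you want to stay close to your original code: count each (user, document) pair with size() and pivot the documents into columns with unstack. A minimal sketch, reusing the records frame from the question:

```python
import pandas as pd

records = pd.DataFrame([
    ('john', 'book'), ('jane', 'article'), ('jane', 'book'),
    ('jane', 'book'), ('jim', 'article'), ('john', 'book'),
    ('jim', 'blogpost'), ('jane', 'blogpost'), ('jane', 'blogpost'),
    ('jane', 'blogpost'),
], columns=['user', 'document'])

# Count each (user, document) pair, then pivot documents into columns;
# fill_value=0 puts 0 where a user never downloaded that document type.
counts = (records.groupby(['user', 'document'])
                 .size()
                 .unstack(fill_value=0))
print(counts)
```

This is vectorized, so it avoids the per-cell df.loc assignments entirely; on 100K+ rows it should be dramatically faster than the loop.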