Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas dataframe.loc method too slow

I have a dataframe with +100K rows like this:

   user  document
0  john      book
1  jane   article
2  jane      book
3  jane      book
4   jim   article
5  john      book
6   jim  blogpost
7  jane  blogpost
8  jane  blogpost
9  jane  blogpost

I need the dataframe like this:

      blogpost  article  book
john         1        3     0
jane         0        0     1
jim          4        0     2

That is, I need numbers of downloads for every user, document combination.

I'm doing .groupby(['user', 'document']) and then use df.loc to set download number:

df = pd.DataFrame(index=users, columns=documents)
df.fillna(0, inplace=True)

grouped = records.groupby(['user', 'document'])
for elem in grouped:
    user, document = elem[0]
    downloads = len(elem[1])
    df.loc[user, document] = downloads

However, df.loc is very slow when used this way... I've commented out df.loc.. line and found that the loop completes fast, so almost certainly it's df.loc access that is slow.

How can I get this result faster?

Minimum working example:

records = pd.DataFrame([
    ('john', 'book'), 
    ('jane', 'article'),
    ('jane','book'),
    ('jane','book'),
    ('jim', 'article'), 
    ('john', 'book'),
    ('jim', 'blogpost'), 
    ('jane', 'blogpost'),
    ('jane', 'blogpost'),
    ('jane', 'blogpost')
    ], columns=['user', 'document'])
print(records)

users = list(set(records['user']))
users.sort()
documents = list(set(records['document']))
documents.sort()

print(users)
print(documents)

df = pd.DataFrame(index=users, columns=documents)
df.fillna(0, inplace=True)
print(df)

grouped = records.groupby(['user', 'document'])
for elem in grouped:
    user, document = elem[0]
    downloads = len(elem[1])
    df.loc[user, document] = downloads
like image 891
LetMeSOThat4U Avatar asked Feb 21 '18 18:02

LetMeSOThat4U


1 Answers

There are plenty of ways achieves this without loop, pivot , pivot_table, crosstab, groupby count

pd.crosstab(df.user,df.document)
Out[1283]: 
document  article  blogpost  book
user                             
jane            1         3     2
jim             1         1     0
john            0         0     2
like image 96
BENY Avatar answered Oct 07 '22 01:10

BENY