I'm trying to build an item-based recommendation system from the Yelp dataset. So far I have processed the data to the point where I have the ratings given by every user who reviewed a restaurant in a given state. Eventually I want a ratings matrix with restaurants on one axis, users on the other, and the ratings (1-5) as the values (zero for missing reviews).
Right now the DF looks like this:
               user_id               review_id             business_id  stars
0  Xqd0DzHaiyRqVH3WRG7  15SdjuK7DmYqUAj6rjGowg  vcNAWiLM4dR7D2nwwJ7nCA      5
1  Xqd0DzHaiyRqVH3WRG7  15SdjuK7DmYqUAj6rjGowg  vcNAWiLM4dR7D2nwwJ7nCA      5
2  H1kH6QZV7Le4zqTRNxo  RF6UnRTtG7tWMcrO2GEoAg  vcNAWiLM4dR7D2nwwJ7nCA      2
3  zvJCcrpm2yOZrxKffwG  -TsVN230RCkLYKBeLsuz7A  vcNAWiLM4dR7D2nwwJ7nCA      4
4  KBLW4wJA_fwoWmMhiHR  dNocEAyUucjT371NNND41Q  vcNAWiLM4dR7D2nwwJ7nCA      4
5  zvJCcrpm2yOZrxKffwG  ebcN2aqmNUuYNoyvQErgnA  vcNAWiLM4dR7D2nwwJ7nCA      4
6  Qrs3EICADUKNFoUq2iH  _ePLBPrkrf4bhyiKWEn4Qg  vcNAWiLM4dR7D2nwwJ7nCA      1
but I would like it to look a little bit more like this:
(4 Restaurants x 5 Users)
0 4 3 4 5
3 3 3 2 1
1 2 3 4 5
0 5 3 3 4
To convert a pandas DataFrame to a NumPy array, call DataFrame.to_numpy(). The method returns a NumPy ndarray, which for a DataFrame is 2-dimensional.
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
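As a minimal sketch of that conversion, assuming the ratings are already arranged with restaurants as rows and users as columns: the IDs (`b1`, `u1`, ...) and the values below are hypothetical placeholders, not rows from the question's data.

```python
import pandas as pd

# Hypothetical ratings table: rows are restaurants, columns are users,
# 0 stands for "no review".
ratings = pd.DataFrame(
    [[0, 4, 3], [3, 3, 2]],
    index=["b1", "b2"],          # made-up restaurant IDs
    columns=["u1", "u2", "u3"],  # made-up user IDs
)

# to_numpy() drops the row/column labels and returns a plain 2-D ndarray.
matrix = ratings.to_numpy()
print(matrix.shape)
```

Because the labels are dropped, keep `ratings.index` and `ratings.columns` around if you need to map matrix positions back to restaurants and users.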
I think you need pivot with fillna:
print(df.pivot(index='business_id', columns='user_id', values='stars').fillna(0))
If you get:
ValueError: Index contains duplicate entries, cannot reshape
then use pivot_table:
print(df.pivot_table(index='business_id', columns='user_id', values='stars').fillna(0))
user_id H1kH6QZV7Le4zqTRNxo KBLW4wJA_fwoWmMhiHR \
business_id
vcNAWiLM4dR7D2nwwJ7nCA 2 4
user_id Qrs3EICADUKNFoUq2iH Xqd0DzHaiyRqVH3WRG7 \
business_id
vcNAWiLM4dR7D2nwwJ7nCA 1 5
user_id zvJCcrpm2yOZrxKffwG
business_id
vcNAWiLM4dR7D2nwwJ7nCA 4
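But note that pivot_table aggregates with aggfunc, and the default is aggfunc=np.mean, so if a (business_id, user_id) pair appears more than once, its ratings are averaged. A better explanation with a sample is here and in the docs.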
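To make the difference concrete, here is a small sketch on toy data shaped like the question's DataFrame. The IDs and star values are made up for illustration, and passing `fill_value=0` is just one alternative to chaining `.fillna(0)`.

```python
import pandas as pd

# Toy data: user u1 reviewed restaurant b1 twice (hypothetical values).
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3"],
    "business_id": ["b1", "b1", "b1", "b1", "b2"],
    "stars": [5, 3, 2, 4, 1],
})

# pivot() cannot reshape when a (business_id, user_id) pair is duplicated.
try:
    df.pivot(index="business_id", columns="user_id", values="stars")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead; "mean" is the default
# and is spelled out here only for clarity. Missing pairs are filled with 0.
ratings = df.pivot_table(
    index="business_id",
    columns="user_id",
    values="stars",
    aggfunc="mean",
    fill_value=0,
)
print(ratings)
```

If averaging is not what you want for repeated reviews, another aggfunc such as "max" or "last" can be passed instead, or the duplicates can be dropped before pivoting.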