
Rearrange a pandas data frame to create a 2d ratings matrix

I'm trying to build an item-based recommendation system from the Yelp data set. I have processed the data to the point where I have the ratings given by every user who reviewed a restaurant in a given state. Eventually I want a ratings matrix with restaurants on one axis, users on the other, and the ratings (1-5) as the values, with zero for missing reviews.

Right now the DF looks like this:

               user_id               review_id             business_id  stars
0  Xqd0DzHaiyRqVH3WRG7  15SdjuK7DmYqUAj6rjGowg  vcNAWiLM4dR7D2nwwJ7nCA      5
1  Xqd0DzHaiyRqVH3WRG7  15SdjuK7DmYqUAj6rjGowg  vcNAWiLM4dR7D2nwwJ7nCA      5
2  H1kH6QZV7Le4zqTRNxo  RF6UnRTtG7tWMcrO2GEoAg  vcNAWiLM4dR7D2nwwJ7nCA      2
3  zvJCcrpm2yOZrxKffwG  -TsVN230RCkLYKBeLsuz7A  vcNAWiLM4dR7D2nwwJ7nCA      4
4  KBLW4wJA_fwoWmMhiHR  dNocEAyUucjT371NNND41Q  vcNAWiLM4dR7D2nwwJ7nCA      4
5  zvJCcrpm2yOZrxKffwG  ebcN2aqmNUuYNoyvQErgnA  vcNAWiLM4dR7D2nwwJ7nCA      4
6  Qrs3EICADUKNFoUq2iH  _ePLBPrkrf4bhyiKWEn4Qg  vcNAWiLM4dR7D2nwwJ7nCA      1

but I would like it to look a little bit more like this:

(4 Restaurants x 5 Users)

0 4 3 4 5
3 3 3 2 1 
1 2 3 4 5
0 5 3 3 4 
asked Jun 01 '16 18:06 by mmera

1 Answer

I think you need pivot with fillna:

print(df.pivot(index='business_id', columns='user_id', values='stars').fillna(0))
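
As a minimal sketch of what this call produces, assume a small frame in which each (business_id, user_id) pair appears only once. The ids u1..u3 and b1..b2 are hypothetical placeholders, not values from the Yelp data:

```python
import pandas as pd

# Hypothetical sample: every (business_id, user_id) pair is unique.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "business_id": ["b1", "b1", "b2", "b2"],
    "stars": [5, 2, 4, 1],
})

# Restaurants become rows, users become columns, ratings become values.
# Pairs with no review come out as NaN, which fillna(0) turns into 0.
ratings = df.pivot(index="business_id", columns="user_id", values="stars").fillna(0)
print(ratings)
```

Because the missing entries start out as NaN, the resulting columns are floats rather than integers.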

If pivot raises:

ValueError: Index contains duplicate entries, cannot reshape

then use pivot_table:

print(df.pivot_table(index='business_id', columns='user_id', values='stars').fillna(0))
user_id                 H1kH6QZV7Le4zqTRNxo  KBLW4wJA_fwoWmMhiHR  \
business_id                                                        
vcNAWiLM4dR7D2nwwJ7nCA                    2                    4   

user_id                 Qrs3EICADUKNFoUq2iH  Xqd0DzHaiyRqVH3WRG7  \
business_id                                                        
vcNAWiLM4dR7D2nwwJ7nCA                    1                    5   

user_id                 zvJCcrpm2yOZrxKffwG  
business_id                                  
vcNAWiLM4dR7D2nwwJ7nCA                    4  

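Note that pivot_table aggregates duplicate entries with aggfunc. The default is aggfunc=np.mean, so if the same user has several ratings for the same business, they are averaged. A better explanation with a sample is here and in the docs.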
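
To illustrate the aggregation, here is a small sketch with a hypothetical frame in which user u1 rated business b1 twice (the ids are placeholders, not Yelp data). It assumes a reasonably recent pandas, since to_numpy is not available in very old releases. It checks that pivot fails on the duplicate, then compares the mean with max as an alternative aggfunc:

```python
import pandas as pd

# Hypothetical sample: u1 reviewed b1 twice, with 5 and 3 stars.
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3"],
    "business_id": ["b1", "b1", "b1", "b1", "b2"],
    "stars": [5, 3, 2, 4, 1],
})

# pivot cannot decide which of the two u1/b1 ratings to keep.
try:
    df.pivot(index="business_id", columns="user_id", values="stars")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves duplicates with aggfunc; fill_value replaces missing cells.
mean_ratings = df.pivot_table(index="business_id", columns="user_id",
                              values="stars", aggfunc="mean", fill_value=0)
max_ratings = df.pivot_table(index="business_id", columns="user_id",
                             values="stars", aggfunc="max", fill_value=0)

# Plain 2D NumPy array (restaurants x users) if the labels are not needed.
matrix = mean_ratings.to_numpy()
print(mean_ratings)
print(matrix.shape)
```

Whether to average, take the maximum, or keep only the latest review is a modelling choice for the recommender. If the duplicates are exact copies of the same review row, dropping them with df.drop_duplicates() before pivoting may be the simpler option.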

answered Oct 06 '22 00:10 by jezrael
