
multithreading for data from dataframe pandas

I'm struggling to use multithreading to calculate the relatedness between customers who have different shopping items in their baskets. I have a pandas DataFrame covering 1,000 customers, which means I have to calculate the relatedness 1 million times (once for every pair of customers), and this takes too long to process.

An example of the data frame looks like this:

  ID     Item       
    1    Banana    
    1    Apple     
    2    Orange    
    2    Banana    
    2    Tomato    
    3    Apple     
    3    Tomato    
    3    Orange    

Here is a simplified version of the code:

import pandas as pd

def relatedness(customer1, customer2):
    # do some calculations to measure the relation between the customers
    ...

data = pd.read_csv(data_file)
customers_list = list(set(data['ID']))

# Pass the list itself (not a list wrapping it) so rows and columns get a flat index
relatedness_matrix = pd.DataFrame(index=customers_list, columns=customers_list)
for i in customers_list:
    for j in customers_list:
        relatedness_matrix.loc[i, j] = relatedness(i, j)
goodX asked May 19 '16 00:05



2 Answers

I was looking into the same problem of running heavy calculations on a pandas DataFrame and found Dask:

http://dask.pydata.org/en/latest/

(via this Data Science Stack Exchange question: https://datascience.stackexchange.com/questions/172/is-there-a-straightforward-way-to-run-pandas-dataframe-isin-in-parallel)
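As a rough illustration of how the question's loop might be handed to Dask, here is a minimal sketch using `dask.delayed`. Several things in it are assumptions rather than anything from the question: the body of `relatedness` is a hypothetical stand-in (Jaccard similarity of the two baskets), the measure is assumed to be symmetric so each pair is computed only once, and the helper name `relatedness_matrix` and its `scheduler` parameter are made up for the example.

```python
import dask
import pandas as pd


def relatedness(basket1, basket2):
    # Hypothetical stand-in for the real calculation:
    # Jaccard similarity between two sets of items.
    union = basket1 | basket2
    return len(basket1 & basket2) / len(union) if union else 0.0


def relatedness_matrix(data, scheduler="threads"):
    # Build one set of items per customer ID.
    baskets = data.groupby("ID")["Item"].apply(set).to_dict()
    ids = sorted(baskets)

    def row(k):
        # Relatedness of customer ids[k] to itself and to every later customer.
        return [relatedness(baskets[ids[k]], baskets[j]) for j in ids[k:]]

    # One lazy task per customer; nothing runs until dask.compute is called.
    tasks = [dask.delayed(row)(k) for k in range(len(ids))]
    rows = dask.compute(*tasks, scheduler=scheduler)

    # Assumes symmetry: fill both (i, j) and (j, i) from a single result.
    matrix = pd.DataFrame(index=ids, columns=ids, dtype=float)
    for k, values in enumerate(rows):
        for j, value in zip(ids[k:], values):
            matrix.at[ids[k], j] = value
            matrix.at[j, ids[k]] = value
    return matrix


# The sample data from the question.
data = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 3, 3, 3],
    "Item": ["Banana", "Apple", "Orange", "Banana",
             "Tomato", "Apple", "Tomato", "Orange"],
})
matrix = relatedness_matrix(data)
print(matrix)
```

If the real `relatedness` is CPU-bound pure Python, the threaded scheduler is unlikely to speed it up much because of the GIL; passing `scheduler="processes"` may work better, at the cost of some startup and serialization overhead. Computing only half of the matrix should already cut the work roughly in two, whichever scheduler is used.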

Hope this helps

GBrian answered Oct 12 '22 14:10



Check out Modin: "Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical." https://modin.readthedocs.io/en/latest/

CyberPlayerOne answered Oct 12 '22 15:10
