I'm struggling to use multithreading to calculate the relatedness between customers who have different shopping items in their baskets. I have a pandas DataFrame covering 1,000 customers, which means I have to calculate the relatedness 1 million times (once per ordered pair of customers), and this takes too long to process.
An example of the data frame looks like this:
ID Item
1 Banana
1 Apple
2 Orange
2 Banana
2 Tomato
3 Apple
3 Tomato
3 Orange
Here is a simplified version of the code:
import pandas as pd

def relatedness(customer1, customer2):
    # do some calculations to measure the relation between the customers
    pass  # placeholder: the actual calculation is omitted here

data = pd.read_csv(data_file)  # data_file: path to the CSV file (not shown)
customers_list = list(set(data['ID']))
# Pass the list itself (not wrapped in another list) to get a flat index
relatedness_matrix = pd.DataFrame(index=customers_list, columns=customers_list)
for i in customers_list:
    for j in customers_list:
        relatedness_matrix.loc[i, j] = relatedness(i, j)
I was looking into the same problem of running heavy calculations on a pandas DataFrame and found
Dask: http://dask.pydata.org/en/latest/
(via this Stack Exchange question: https://datascience.stackexchange.com/questions/172/is-there-a-straightforward-way-to-run-pandas-dataframe-isin-in-parallel)
Hope this helps.
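For reference, here is a minimal sketch of how the pairwise loop from the question might be expressed with `dask.delayed`. The `relatedness` function below is a hypothetical stand-in (Jaccard similarity of the two baskets), since the question does not say how relatedness is actually computed; the `baskets` dict and the `pairwise_relatedness` helper are illustrative names too. It also assumes relatedness is symmetric, so each unordered pair is computed only once.

```python
from itertools import combinations

import dask
import pandas as pd

# Sample data from the question.
data = pd.DataFrame({
    "ID":   [1, 1, 2, 2, 2, 3, 3, 3],
    "Item": ["Banana", "Apple", "Orange", "Banana", "Tomato",
             "Apple", "Tomato", "Orange"],
})

# One set of items per customer, so each comparison is a cheap set operation.
baskets = data.groupby("ID")["Item"].apply(set).to_dict()


def relatedness(items1, items2):
    # Hypothetical stand-in: Jaccard similarity of the two baskets.
    return len(items1 & items2) / len(items1 | items2)


def pairwise_relatedness(baskets, scheduler="threads"):
    ids = sorted(baskets)
    # Assumes relatedness(a, b) == relatedness(b, a): only unordered pairs.
    pairs = list(combinations(ids, 2))
    # Build lazy tasks first; nothing runs until dask.compute is called.
    tasks = [dask.delayed(relatedness)(baskets[a], baskets[b])
             for a, b in pairs]
    scores = dask.compute(*tasks, scheduler=scheduler)
    # The diagonal is 1.0 because a basket is identical to itself.
    matrix = pd.DataFrame(1.0, index=ids, columns=ids)
    for (a, b), score in zip(pairs, scores):
        matrix.loc[a, b] = score
        matrix.loc[b, a] = score
    return matrix


relatedness_matrix = pairwise_relatedness(baskets)
print(relatedness_matrix)
```

Two caveats: the threaded scheduler used here gives little speed-up for pure-Python, CPU-bound functions, so `scheduler="processes"` is probably the one to try on real data. And with 1,000 customers there are roughly 500,000 pairs, so one task per pair carries noticeable scheduling overhead; grouping pairs into larger chunks per task is likely needed in practice.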
Check out Modin: "Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical." https://modin.readthedocs.io/en/latest/
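Whether Modin or Dask helps much depends on what `relatedness` actually does. If it can be written in terms of shared items, the whole matrix may be computable without a Python loop at all. A rough sketch, assuming (and this is only an assumption, the question does not define it) that relatedness is the Jaccard similarity between two customers' baskets:

```python
import pandas as pd

# Sample data from the question.
data = pd.DataFrame({
    "ID":   [1, 1, 2, 2, 2, 3, 3, 3],
    "Item": ["Banana", "Apple", "Orange", "Banana", "Tomato",
             "Apple", "Tomato", "Orange"],
})

# Rows are customers, columns are items; 1 where the customer has the item.
# The "> 0" guards against an item appearing twice in the same basket.
indicator = (pd.crosstab(data["ID"], data["Item"]) > 0).astype(int)
values = indicator.to_numpy()

# Entry (i, j) is the number of items customers i and j have in common.
shared = values @ values.T

# |A union B| = |A| + |B| - |A intersect B|, broadcast over all pairs.
sizes = values.sum(axis=1)
union = sizes[:, None] + sizes[None, :] - shared

# Assumed definition of relatedness: Jaccard similarity.
relatedness_matrix = pd.DataFrame(
    shared / union, index=indicator.index, columns=indicator.index
)
print(relatedness_matrix)
```

For 1,000 customers this is a single matrix product instead of a million Python-level calls. If the real calculation cannot be expressed this way, the sketch does not apply and a parallel approach is the better route.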