Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merge a large Dask dataframe with a small Pandas dataframe

Tags:

Following the example here: YouTube: Dask-Pandas Dataframe Join I attempting to merge a ~70GB Dask data frame with a ~24MB that I loaded as a Pandas dataframe.

The merge is on two columns A and B, and I did not set any as indices:

import dask.dataframe as dd from dask.diagnostics import ProgressBar  small_df = pd.read_csv(dataframe1) # as pandas large_df = dd.read_csv(dataframe2) #as dask.dataframe  df2 = large_df.merge(small_df, how='left', left_on=leftcolumns, right_on=rightcolumns) #do the merge  A = df2[df2['some column'] == 'somevalue'] #do a reduction that would fit on my computer  pbar = ProgressBar() pbar.register()  result = A.compute() 

I'm using a Windows computer with 16GB of RAM and 4 cores. I use the progress bar to assess how far along the merging process it is. I left it all night last night. I restarted it this morning and so far its about half an hour in and 0% progress.

Thank you and I appreciate your help,

Update

I tried it on my Mac with 8GB of RAM and worked pretty well. I have the Dask distribution that comes with Anaconda I believe. I don't think I did anything different in any case.

I share my results and time following the above coding (21 minutes):

In [26]: C = result1.compute() [########################################] | 100% Completed | 21min 13.4s [########################################] | 100% Completed | 21min 13.5s [########################################] | 100% Completed | 21min 13.6s [########################################] | 100% Completed | 21min 13.6s 

Update 2

I updated to the latest version of Dask on my Windows computer and it worked well.

like image 691
dleal Avatar asked Sep 13 '16 12:09

dleal


1 Answers

you can iterate over unique equal values and assign other columns with loop:

unioun_set = list(set(small_df['common_column']) & set(large_df['common_column'])) for el in union_set:     for column in small_df.columns:         if column not in large_df.columns:             large_df.loc[large_df['common_column'] == el,column] = small_df.loc[small_df['common_column'] ==  el,column]   
like image 194
Barış Can Tayiz Avatar answered Oct 12 '22 20:10

Barış Can Tayiz