
Optimizing cartesian product between two Pandas Dataframe

I have two dataframes with the same columns:

Dataframe 1:

          attr_1  attr_77 ... attr_8
userID                              
John      1.2501  2.4196  ... 1.7610
Charles   0.0000  1.0618  ... 1.4813
Genarito  2.7037  4.6707  ... 5.3583
Mark      9.2775  6.7638  ... 6.0071

Dataframe 2:

          attr_1  attr_77 ... attr_8
petID                              
Firulais  1.2501  2.4196  ... 1.7610
Connie    0.0000  1.0618  ... 1.4813
PopCorn   2.7037  4.6707  ... 5.3583

I want to generate a dataframe with the correlation and p-value of every possible combination; this would be the result:

   userID   petID      Correlation    p-value
0  John     Firulais   0.091447       1.222927e-02
1  John     Connie     0.101687       5.313359e-03
2  John     PopCorn    0.178965       8.103919e-07
3  Charles  Firulais   -0.078460      3.167896e-02

The problem is that the cartesian product generates more than 3 million tuples, and computing them takes minutes to finish. This is my code; I've written two alternatives:

First of all, initial DataFrames:

import pandas as pd

df1 = pd.DataFrame({
    'userID': ['John', 'Charles', 'Genarito', 'Mark'],
    'attr_1': [1.2501, 0.0, 2.7037, 9.2775],
    'attr_77': [2.4196, 1.0618, 4.6707, 6.7638],
    'attr_8': [1.7610, 1.4813, 5.3583, 6.0071]
}).set_index('userID')

df2 = pd.DataFrame({
    'petID': ['Firulais', 'Connie', 'PopCorn'],
    'attr_1': [1.2501, 0.0, 2.7037],
    'attr_77': [2.4196, 1.0618, 4.6707],
    'attr_8': [1.7610, 1.4813, 5.3583]
}).set_index('petID')

Option 1:

from scipy.stats import pearsonr
import numpy as np

# Pre-allocate space
df1_keys = df1.index
res_row_count = len(df1_keys) * df2.values.shape[0]
users = np.empty(res_row_count, dtype='object')
pets = np.empty(res_row_count, dtype='object')
coff = np.empty(res_row_count)
p_value = np.empty(res_row_count)

i = 0
for df1_key in df1_keys:
    df1_values = df1.loc[df1_key, :].values
    for df2_key in df2.index:
        df2_values = df2.loc[df2_key, :]
        pearson_res = pearsonr(df1_values, df2_values)

        users[i] = df1_key
        pets[i] = df2_key
        coff[i] = pearson_res[0]
        p_value[i] = pearson_res[1]
        i += 1

# After the loop, build the resulting DataFrame
result = pd.DataFrame(data={
    'userID': users,
    'petID': pets,
    'Correlation': coff,
    'p-value': p_value
})

Option 2 (slower), from here:

# Cross join: merges every row of df1 with every row of df2
def df_crossjoin(df1, df2):
    df1['_tmpkey'] = 1
    df2['_tmpkey'] = 1

    res = pd.merge(df1, df2, on='_tmpkey').drop('_tmpkey', axis=1)
    res.index = pd.MultiIndex.from_product((df1.index, df2.index))

    df1.drop('_tmpkey', axis=1, inplace=True)
    df2.drop('_tmpkey', axis=1, inplace=True)

    return res

# Computes the Pearson coefficient for every tuple
def compute_pearson(row):
    values = np.split(row.values, 2)
    return pearsonr(values[0], values[1])

result = df_crossjoin(df1, df2).apply(compute_pearson, axis=1)

Is there a faster way to solve such a problem with Pandas? Or will I have no option but to parallelize the iterations?

Edit:

As the size of the dataframes increases, the second option results in a better runtime, but it still takes several seconds to finish.

Thanks in advance

Asked by Genarito, Jan 21 '20

2 Answers

Of all the alternatives tested, the one that gave me the best results was the following:

  1. The cartesian product of the rows was generated with itertools.product().

  2. All the iterations over both iterrows() were performed on a Pool of parallel processes (using a map function).

To squeeze out a little more performance, the compute_row_cython function was compiled with Cython, as advised in this section of the Pandas documentation:

In the cython_modules.pyx file:

from scipy.stats import pearsonr
import numpy as np

def compute_row_cython(row):
    (df1_key, df1_values), (df2_key, df2_values) = row
    cdef (double, double) pearsonr_res = pearsonr(df1_values.values, df2_values.values)
    return df1_key, df2_key, pearsonr_res[0], pearsonr_res[1]

Then I set up the setup.py:

from distutils.core import setup
from Cython.Build import cythonize

setup(name='Compiled Pearson',
      ext_modules=cythonize("cython_modules.pyx"))

Finally I compiled it with: python setup.py build_ext --inplace

The final code then looked like this:

import itertools
import multiprocessing
from cython_modules import compute_row_cython

NUM_CORES = multiprocessing.cpu_count() - 1

pool = multiprocessing.Pool(NUM_CORES)
# Calls to Cython function defined in cython_modules.pyx
res = zip(*pool.map(compute_row_cython, itertools.product(df1.iterrows(), df2.iterrows())))
pool.close()
end_values = list(res)
pool.join()
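
For completeness, here is a minimal sketch of how end_values could be turned back into the result DataFrame. It assumes compute_row_cython returns (userID, petID, correlation, p-value) tuples as defined above, so the zip(*...) call transposes them into four columns:

import pandas as pd

# end_values holds the four transposed columns produced by zip(*...):
# user IDs, pet IDs, correlations and p-values, in that order.
users, pets, correlations, p_values = end_values
result = pd.DataFrame({
    'userID': users,
    'petID': pets,
    'Correlation': correlations,
    'p-value': p_values
})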

Neither Dask nor the merge-with-apply approach gave me better results, not even after optimizing the apply with Cython. In fact, both of those alternatives gave me memory errors: when implementing the solution with Dask I had to generate several partitions, which degraded performance because of the many I/O operations it had to perform.

The solution with Dask can be found in my other question.

Answered by Genarito


Here's another method using the same cross join, but with the built-in pandas method DataFrame.corrwith and scipy.stats.ttest_ind. Since this is a less "loopy" implementation, it should perform better.

from scipy.stats import ttest_ind

mrg = (df1.reset_index().assign(key=1)
          .merge(df2.reset_index().assign(key=1), on='key')
          .drop(columns='key'))

x = mrg.filter(like='_x').rename(columns=lambda x: x.rsplit('_', 1)[0])
y = mrg.filter(like='_y').rename(columns=lambda x: x.rsplit('_', 1)[0])

df = mrg[['userID', 'petID']].join(x.corrwith(y, axis=1).rename('Correlation'))

df['p_value'] = ttest_ind(x, y, axis=1)[1]

      userID     petID  Correlation   p_value
0       John  Firulais     1.000000  1.000000
1       John    Connie     0.641240  0.158341
2       John   PopCorn     0.661040  0.048041
3    Charles  Firulais     0.641240  0.158341
4    Charles    Connie     1.000000  1.000000
5    Charles   PopCorn     0.999660  0.020211
6   Genarito  Firulais     0.661040  0.048041
7   Genarito    Connie     0.999660  0.020211
8   Genarito   PopCorn     1.000000  1.000000
9       Mark  Firulais    -0.682794  0.006080
10      Mark    Connie    -0.998462  0.003865
11      Mark   PopCorn    -0.999569  0.070639
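
As a side note on the cross-join step: on pandas 1.2 or newer, the temporary key column can be dropped in favour of a native cross join. A minimal sketch, assuming the same df1 and df2 from the question:

# how='cross' (pandas >= 1.2) builds the cartesian product directly,
# so no dummy 'key' column is needed.
mrg = df1.reset_index().merge(df2.reset_index(), how='cross')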
Answered by Erfan