
Progress bar for pandas .corr() method

I am trying to show a progress bar, using tqdm or some other library, for the following line of code:

corrmatrix = adjClose.corr('spearman')

where adjClose is a dataframe that has numerous stock tickers as columns and multiple years of closing prices indexed by date. The output is ultimately a correlation matrix.

This code takes rapidly more time as more tickers are added to the dataframe (the number of ticker pairs grows quadratically with the number of columns), and I would like some visual indication of progress to show that the code is still running. Google didn't turn up much in this regard, unless I grossly overlooked something.

asked May 11 '20 03:05 by Danqest


1 Answer

Note: This is not a really feasible answer because it increases the computation time. From what I have measured, the slowdown is dramatic for small dataframes (up to a factor of 40), whereas for large dataframes it is around a factor of 2 to 3.

Maybe someone can find a more efficient implementation of the custom function calc_corr_coefs.

I have managed to use Python's tqdm module to show the progress; however, this required me to use the df.progress_apply() function that tqdm adds to pandas. Here is some sample code:

import time
from tqdm import tqdm
import numpy as np
import pandas as pd


def calc_corr_coefs(s: pd.Series, df_all: pd.DataFrame) -> pd.Series:
    """
    Calculate the Pearson correlation coefficient between one series and every column of the dataframe.

    :param s:       pd.Series; the column to correlate against all columns of df_all
    :param df_all:  pd.DataFrame; the complete dataframe

    :return:    a series with all the correlation coefficients
    """

    corr_coef = {}
    for col in df_all:
        # corr_coef[col] = s.corr(df_all[col])
        corr_coef[col] = np.corrcoef(s.values, df_all[col].values)[0, 1]

    return pd.Series(data=corr_coef)


df = pd.DataFrame(np.random.randint(0, 1000, (10000, 200)))

t0 = time.perf_counter()

# first use the basic df.corr()
df_corr_pd = df.corr()

t1 = time.perf_counter()
print(f'base df.corr(): {t1 - t0} s')

# compare to df.progress_apply()
tqdm.pandas(ncols=100)
df_corr_cust = df.progress_apply(calc_corr_coefs, axis=0, args=(df,))

t2 = time.perf_counter()
print(f'with progress bar: {t2 - t1} s')

print(f'factor: {(t2 - t1) / (t1 - t0)}')
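
Note that the sample above computes Pearson coefficients (np.corrcoef is Pearson, and df.corr() defaults to method='pearson'), whereas the question asks for 'spearman'. Below is a minimal sketch of one way to adapt it, assuming the data contains no missing values: Spearman correlation is the Pearson correlation of the ranks, so the columns can be ranked once and the same per-column approach reused. The names pearson_against_all, df_demo and ranked are made up for this illustration.

```python
import numpy as np
import pandas as pd
from tqdm import tqdm


def pearson_against_all(s: pd.Series, df_all: pd.DataFrame) -> pd.Series:
    """Pearson correlation between one column and every column of df_all."""
    return pd.Series(
        {col: np.corrcoef(s.values, df_all[col].values)[0, 1] for col in df_all}
    )


# Small random demo data (hypothetical stand-in for the real price dataframe).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(500, 8)))

# Rank every column once; ties receive average ranks (the pandas default).
ranked = df_demo.rank()

# Register progress_apply on pandas objects, then apply column by column.
tqdm.pandas(ncols=100)
spearman_progress = ranked.progress_apply(pearson_against_all, axis=0, args=(ranked,))

# With no missing values this should agree with the built-in Spearman result.
assert np.allclose(spearman_progress.values, df_demo.corr("spearman").values)
```

If the dataframe does contain NaNs, pandas excludes them pair by pair, which this sketch does not reproduce.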

I hope this helps and someone will be able to speed up the implementation.
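
One possible direction for a faster variant, offered as a sketch rather than a tested drop-in replacement: standardise all columns once, then fill the correlation matrix block by block with matrix products while tqdm reports progress per block. It assumes numeric data with no missing values and no constant columns; the function name corr_with_progress and its block parameter are hypothetical.

```python
import numpy as np
import pandas as pd
from tqdm import tqdm


def corr_with_progress(df: pd.DataFrame, method: str = "pearson", block: int = 50) -> pd.DataFrame:
    """Correlation matrix computed in column blocks with a tqdm progress bar.

    Assumes numeric data without missing values and without constant columns.
    """
    if method not in ("pearson", "spearman"):
        raise ValueError("method must be 'pearson' or 'spearman'")

    # Spearman is Pearson applied to the ranks, so rank once up front.
    data = df.rank() if method == "spearman" else df
    x = data.to_numpy(dtype=float)
    n_cols = x.shape[1]

    # Standardise each column once: zero mean and unit Euclidean norm.
    x = x - x.mean(axis=0)
    x = x / np.linalg.norm(x, axis=0)

    out = np.empty((n_cols, n_cols))
    for start in tqdm(range(0, n_cols, block), ncols=100):
        stop = min(start + block, n_cols)
        # Dot products of centred, unit-norm columns are correlation coefficients.
        out[start:stop] = x[:, start:stop].T @ x

    return pd.DataFrame(out, index=df.columns, columns=df.columns)
```

With the dataframe from the question this would be called as corrmatrix = corr_with_progress(adjClose, method='spearman'). A smaller block gives a finer-grained progress bar at the cost of more loop iterations. Price data often has gaps, so missing values would need to be handled first (for example by dropping or filling them), which changes the result compared with the pairwise NaN handling in df.corr().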

answered Oct 07 '22 11:10 by N. Maks