I am trying to show a progress bar, using tqdm or some other library, for the following line of code:
corrmatrix = adjClose.corr('spearman')
where adjClose is a dataframe that has numerous stock tickers as columns and multiple years of closing prices indexed by date. The output is ultimately a correlation matrix.
This code takes much longer as more tickers are added to the dataframe (every pair of columns is compared, so the work grows roughly quadratically with the number of tickers), and I would like some sort of visual representation of the progress to indicate the code is still running. Google didn't turn up much in this regard unless I grossly overlooked something.
Note: This is not really a practical answer because of the increased computation time. In my measurements the slowdown is dramatic for small dataframes (up to a factor of 40), but for large dataframes it is only around a factor of 2 to 3.
Maybe someone can find a more efficient implementation of the custom function calc_corr_coefs.
I have managed to use Python's tqdm module to show the progress; however, this required me to use its df.progress_apply() function. Here is some sample code:
import time
from tqdm import tqdm
import numpy as np
import pandas as pd
def calc_corr_coefs(s: pd.Series, df_all: pd.DataFrame) -> pd.Series:
    """
    calculates the correlation coefficient between one series and all columns in the dataframe
    :param s: pd.Series; the column from which you want to calculate the correlation with all other columns
    :param df_all: pd.DataFrame; the complete dataframe
    :return: a series with all the correlation coefficients
    """
    corr_coef = {}
    for col in df_all:
        # corr_coef[col] = s.corr(df_all[col])
        # np.corrcoef computes the Pearson coefficient (same as the default df.corr());
        # for Spearman, the data would have to be ranked first
        corr_coef[col] = np.corrcoef(s.values, df_all[col].values)[0, 1]
    return pd.Series(data=corr_coef)
df = pd.DataFrame(np.random.randint(0, 1000, (10000, 200)))
t0 = time.perf_counter()
# first use the basic df.corr()
df_corr_pd = df.corr()
t1 = time.perf_counter()
print(f'base df.corr(): {t1 - t0} s')
# compare to df.progress_apply()
tqdm.pandas(ncols=100)
df_corr_cust = df.progress_apply(calc_corr_coefs, axis=0, args=(df,))
t2 = time.perf_counter()
print(f'with progress bar: {t2 - t1} s')
print(f'factor: {(t2 - t1) / (t1 - t0)}')
I hope this helps and someone will be able to speed up the implementation.
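One possible speed-up, offered as a sketch rather than a tested drop-in: the Spearman coefficient is the Pearson coefficient of the ranked data, so the columns can be ranked once and the matrix filled in blocks of columns with a plain matrix product, updating a tqdm bar per block. The function name spearman_corr_with_progress and the chunk_size parameter are made up for this example, and the sketch assumes the dataframe has no missing values (adjClose.corr('spearman') handles NaNs pairwise, this does not).

```python
import numpy as np
import pandas as pd
from tqdm import tqdm


def spearman_corr_with_progress(df: pd.DataFrame, chunk_size: int = 50) -> pd.DataFrame:
    """Spearman correlation matrix, computed in column chunks with a tqdm bar.

    Hypothetical helper; assumes df contains no missing values.
    """
    # Spearman correlation is the Pearson correlation of the ranks,
    # so rank every column once up front (ties get the average rank).
    ranks = df.rank(method="average").to_numpy(dtype=float)

    # Centre each column and scale it to unit length; the dot product of two
    # such columns is then their Pearson correlation coefficient.
    # A constant column has zero length and yields NaN here.
    ranks -= ranks.mean(axis=0)
    ranks /= np.sqrt((ranks ** 2).sum(axis=0))

    n_cols = ranks.shape[1]
    out = np.empty((n_cols, n_cols))

    # Fill the matrix one block of rows at a time; each block is one
    # matrix product, and the progress bar advances once per block.
    for start in tqdm(range(0, n_cols, chunk_size), desc="spearman"):
        stop = min(start + chunk_size, n_cols)
        out[start:stop, :] = ranks[:, start:stop].T @ ranks

    return pd.DataFrame(out, index=df.columns, columns=df.columns)
```

With the dataframe from the question this would be called as corrmatrix = spearman_corr_with_progress(adjClose). A smaller chunk_size gives a finer-grained bar at the cost of more, smaller matrix products.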