
Speed up Pandas on Multi-core machine

I have a pandas data frame that fits comfortably in memory. I run several maps on the data frame, but each map is time-consuming because of the complexity of the callback functions passed to it. I have an AWS C4 instance with 8 cores and 16 GB of RAM. I ran the Python script on the machine and found that the CPU is idle more than 80% of the time. So I think (correct me if I am wrong) the Python script is single-threaded and only uses 1 core. Is there a way to speed up pandas on a multi-core machine? Here is the snippet with the two time-consuming maps:

 tfidf_features = df.apply(lambda r: compute_tfidf_features(r.q1_tfidf_bow, r.q2_tfidf_bow), axis=1)
 bin_features = df.apply(lambda r: compute_bin_features(r.q1_bin_bow, r.q2_bin_bow), axis=1)

Here is the compute_tfidf_features function

def compute_tfidf_features(sparse1, sparse2):
    nparray1 = sparse1.toarray()[0]
    nparray2 = sparse2.toarray()[0]

    features = pd.Series({
        'bow_tfidf_sum1': np.sum(sparse1),
        'bow_tfidf_sum2': np.sum(sparse2),
        'bow_tfidf_mean1': np.mean(sparse1),
        'bow_tfidf_mean2': np.mean(sparse2),
        'bow_tfidf_cosine': cosine(nparray1, nparray2),
        'bow_tfidf_jaccard': real_jaccard(nparray1, nparray2),
        'bow_tfidf_sym_kl_divergence': sym_kl_div(nparray1, nparray2),
        'bow_tfidf_pearson': pearsonr(nparray1, nparray2)[0]
    })

    return features

I am aware of a python library called dask, but it says that it’s not intended for a data frame that can comfortably fit in memory.

Allen asked Apr 15 '17 07:04


People also ask

Does pandas run on multiple cores?

Pandas is the most widely used library for data analysis, but it can be slow because most pandas operations run on a single CPU core. To use all cores, split the data frame into smaller chunks, ideally as many chunks as there are available CPU cores, and process the chunks in parallel.
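A minimal sketch of that chunk-per-core idea, using only the standard multiprocessing module alongside pandas and NumPy. The names slow_row_features, apply_to_chunk and parallel_apply, and the columns a and b, are hypothetical stand-ins for illustration; in the question they would correspond to something like compute_tfidf_features and the bag-of-words columns.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def slow_row_features(row):
    # Hypothetical stand-in for an expensive per-row callback.
    return pd.Series({"total": row["a"] + row["b"],
                      "product": row["a"] * row["b"]})


def apply_to_chunk(chunk):
    # Runs inside a worker process: an ordinary single-core apply on one chunk.
    return chunk.apply(slow_row_features, axis=1)


def parallel_apply(df, n_workers=None):
    """Split df into contiguous row chunks and apply to them in parallel."""
    n_workers = n_workers or mp.cpu_count()
    # Chunk boundaries by row position, roughly equal in size.
    bounds = np.linspace(0, len(df), n_workers + 1, dtype=int)
    chunks = [df.iloc[start:stop]
              for start, stop in zip(bounds[:-1], bounds[1:])
              if stop > start]
    with mp.Pool(processes=n_workers) as pool:
        parts = pool.map(apply_to_chunk, chunks)
    # pool.map preserves chunk order, so the original row order is kept.
    return pd.concat(parts)


if __name__ == "__main__":
    df = pd.DataFrame({"a": range(1000), "b": range(1000, 2000)})
    features = parallel_apply(df, n_workers=4)
    print(features.head())
```

Each chunk is pickled and sent to a worker, so this only pays off when the per-row work is expensive compared with the cost of copying the data, which seems to be the case in the question. The worker functions have to live at module level so that they can be pickled.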

Does pandas run faster on GPU?

Pandas runs on the CPU, and most of its operations use only one core at a time, which limits its speed. GPU acceleration of data science and machine learning workflows is one way around this limit and can speed up some operations considerably.

Does pandas apply use all cores?

Pandas' apply() method runs in a single thread on a single core. If your machine has multiple cores, you can split the data frame into chunks and run apply() on each chunk in a separate process, so that the work executes in parallel.


1 Answer

Pandas does not support this out of the box. Dask DataFrames are mostly API-compatible with pandas and support parallel execution for apply.

You might also consider some bleeding edge solutions such as this new tool

Eilif Mikkelsen answered Sep 21 '22 00:09