I have 5,000,000 rows in my dataframe. In my code, I am using iterrows(), which is taking too much time. To get the required output, I have to iterate through all the rows, so I wanted to know whether I can parallelize the code in pandas.
A Dask DataFrame consists of multiple pandas DataFrames, and each pandas DataFrame is called a partition. This mechanism allows you to work with larger-than-memory data because your computations are distributed across these partitions and can be executed in parallel.
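As a minimal sketch of what that can look like (assuming dask is installed, and using an Age column and an add_one helper purely for illustration; neither comes from the question):

import pandas as pd
import dask.dataframe as dd

def add_one(partition: pd.DataFrame) -> pd.DataFrame:
    # runs independently on each pandas partition
    partition["Age"] = partition["Age"] + 1
    return partition

df = pd.DataFrame({"Age": range(1_000_000)})
ddf = dd.from_pandas(df, npartitions=8)          # split into 8 pandas DataFrames
result = ddf.map_partitions(add_one).compute()   # apply in parallel, then gather back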
There are several common ways to parallelize Python code. You can launch several application instances or a script to perform jobs in parallel. This approach is great when you don't need to exchange data between parallel jobs.
One way to achieve parallelism in Python is by using the multiprocessing module. The multiprocessing module allows you to create multiple processes, each of them with its own Python interpreter. For this reason, Python multiprocessing accomplishes process-based parallelism.
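A bare-bones sketch of process-based parallelism with a Pool (the square function is just a placeholder for your per-item work):

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == '__main__':
    with mp.Pool(4) as pool:                  # 4 worker processes
        print(pool.map(square, range(10)))    # work is divided across the workers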
Both multithreading and multiprocessing allow Python code to run concurrently. Only multiprocessing will allow your code to be truly parallel. However, if your code is IO-heavy (like HTTP requests), then multithreading will still probably speed up your code.
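For IO-bound work such as fetching URLs, a thread pool usually helps even under the GIL; here is a small sketch using concurrent.futures and urllib (the URLs are placeholders):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    with urlopen(url) as response:            # threads overlap while waiting on IO
        return len(response.read())

urls = ["https://example.com"] * 10           # placeholder URLs
with ThreadPoolExecutor(max_workers=8) as executor:
    sizes = list(executor.map(fetch, urls))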
Here's a webpage I found that might help: http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html
And here's the code for multiprocessing found on that page:
import pandas as pd
import multiprocessing as mp

LARGE_FILE = "D:\\my_large_file.txt"
CHUNKSIZE = 100000  # process 100,000 rows at a time

def process_frame(df):
    # process data frame
    return len(df)

if __name__ == '__main__':
    reader = pd.read_table(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 processes

    funclist = []
    for df in reader:
        # process each chunk asynchronously in the pool
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)

    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # timeout in 10 seconds

    pool.close()
    pool.join()

    print("There are %d rows of data" % result)
This code shows how you might break up a large dataframe into smaller dataframes of N_ROWS rows each (except possibly for the last one) and then pass each piece to a process pool (of whatever size you want, although there is no point in using more workers than the number of processors you have). Each worker process returns its modified dataframe to the main process, which then reassembles the final result by concatenating the workers' return values:
import pandas as pd
import multiprocessing as mp

def process_frame(df):
    # process data frame
    # create new index starting at 0:
    df.reset_index(inplace=True, drop=True)
    # increment everybody's age:
    for i in range(len(df.index)):
        df.at[i, 'Age'] += 1
    return df

def divide_and_conquer(df):
    N_ROWS = 2  # number of rows in each dataframe
    with mp.Pool(3) as pool:  # use 3 processes
        # break up dataframe into smaller dataframes of N_ROWS rows each
        cnt = len(df.index)
        n, remainder = divmod(cnt, N_ROWS)
        results = []
        start_index = 0
        for i in range(n):
            results.append(pool.apply_async(process_frame, args=(df.loc[start_index:start_index+N_ROWS-1, :],)))
            start_index += N_ROWS
        if remainder:
            results.append(pool.apply_async(process_frame, args=(df.loc[start_index:start_index+remainder-1, :],)))
        new_dfs = [result.get() for result in results]
        # reassemble final dataframe:
        df = pd.concat(new_dfs, ignore_index=True)
    return df

if __name__ == '__main__':
    df = pd.DataFrame({
        "Name": ['Tom', 'Dick', 'Harry', 'Jane', 'June', 'Sally', 'Mary'],
        "Age": [10, 20, 30, 40, 40, 60, 70],
        "Sex": ['M', 'M', 'M', 'F', 'F', 'F', 'F']
    })
    print(df)
    df = divide_and_conquer(df)
    print(df)
Prints:
Name Age Sex
0 Tom 10 M
1 Dick 20 M
2 Harry 30 M
3 Jane 40 F
4 June 40 F
5 Sally 60 F
6 Mary 70 F
Name Age Sex
0 Tom 11 M
1 Dick 21 M
2 Harry 31 M
3 Jane 41 F
4 June 41 F
5 Sally 61 F
6 Mary 71 F
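As a side note, the chunking loop above can also be written as a one-line slice comprehension handed to pool.map; a sketch of the same idea, using the same process_frame work on an Age column (the last chunk may simply be shorter):

import pandas as pd
import multiprocessing as mp

def process_frame(df):
    # same per-chunk work as above: bump every age by 1
    df = df.reset_index(drop=True)
    df['Age'] += 1
    return df

if __name__ == '__main__':
    df = pd.DataFrame({"Age": [10, 20, 30, 40, 40, 60, 70]})
    N_ROWS = 2
    # slice into chunks of N_ROWS rows each
    chunks = [df.iloc[i:i + N_ROWS] for i in range(0, len(df), N_ROWS)]
    with mp.Pool(3) as pool:
        df = pd.concat(pool.map(process_frame, chunks), ignore_index=True)
    print(df)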