
Parallel loading of Input Files in Pandas Dataframe

I have a requirement with three input files: each needs to be loaded into a pandas DataFrame, and then two of them are merged into a single DataFrame.

The file extensions change from run to run: a file might be .txt one time and .xlsx or .csv another time.
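To cope with the changing extension, I picture a small helper that picks the pandas reader from the file suffix. This is only a sketch of the idea, not part of my current script (the .txt branch assumes tab-delimited text):

import os
import pandas as pd


def load_file(path):
    # Choose the pandas reader based on the file extension
    ext = os.path.splitext(path)[1].lower()
    if ext == ".xlsx":
        return pd.read_excel(path)
    if ext == ".csv":
        return pd.read_csv(path)
    if ext == ".txt":
        return pd.read_csv(path, sep="\t")  # assuming tab-delimited text
    raise ValueError("Unsupported file type: " + ext)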

How can I run this process in parallel, in order to save the waiting/loading time?

This is my code at the moment:

from time import time   # to measure the time taken to run the code
import pandas as pd     # to work with the data frames

start_time = time()

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"

Primary_df = pd.read_excel(Primary_File)
Secondary_1_df = pd.read_csv(Secondary_File_1)
Secondary_2_df = pd.read_csv(Secondary_File_2)

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()

print(end_time - start_time)

It takes around 20 minutes to load primary_df and secondary_df, so I am looking for an efficient solution, possibly using parallel processing, to save time. I timed my read operations and they account for most of that, approximately 18 minutes 45 seconds.

Hardware config: Intel i5 processor, 16 GB RAM, 64-bit OS

Question made eligible for bounty: I am looking for working code with detailed steps, using a package available in the anaconda environment, that loads my input files in parallel and stores each one in its own pandas DataFrame. This should eventually save time.
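One direction I have looked at (untested so far) is the standard-library concurrent.futures module, which ships with anaconda; a sketch reusing the file paths from my script above:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=3) as executor:
    # Submit all three reads up front so they run concurrently
    primary_future = executor.submit(pd.read_excel, Primary_File)
    secondary_1_future = executor.submit(pd.read_csv, Secondary_File_1)
    secondary_2_future = executor.submit(pd.read_csv, Secondary_File_2)

    # .result() blocks until the corresponding read has finished
    Primary_df = primary_future.result()
    Secondary_1_df = secondary_1_future.result()
    Secondary_2_df = secondary_2_future.result()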

asked Jan 22 '19 by Siddharth Thanga Mariappan


People also ask

Does pandas work in parallel?

A Dask DataFrame consists of multiple pandas DataFrames, and each pandas DataFrame is called a partition. This mechanism allows you to work with larger-than-memory data because your computations are distributed across these pandas DataFrames and can be executed in parallel.
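A minimal sketch of that idea, assuming Dask is installed (it is not part of the question's setup) and reusing the CSV paths from the question:

import dask.dataframe as dd

# Each CSV becomes a partitioned Dask DataFrame whose pieces are read in parallel
ddf_1 = dd.read_csv("//ServerA/Testing Folder File Open/Report2.csv")
ddf_2 = dd.read_csv("//ServerA/Testing Folder File Open/Report2.csv")

# Nothing is computed until .compute(), which returns an ordinary pandas DataFrame
Secondary_df = dd.merge(ddf_1, ddf_2, how="inner", on="ID").compute()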

Is PyArrow faster than pandas?

There's a better way. It's called PyArrow — an amazing Python binding for the Apache Arrow project. It introduces faster data read/write times and doesn't otherwise interfere with your data analysis pipeline. It's the best of both worlds, as you can still use Pandas for further calculations.
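A minimal sketch of that combination, again assuming pyarrow is installed and reusing a CSV path from the question:

from pyarrow import csv

# pyarrow parses the CSV with multiple threads by default,
# then hands the table back to pandas for the rest of the pipeline
table = csv.read_csv("//ServerA/Testing Folder File Open/Report2.csv")
Secondary_1_df = table.to_pandas()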


1 Answer

Try this:

from time import time 
import pandas as pd
from multiprocessing.pool import ThreadPool


start_time = time()

pool = ThreadPool(processes=3)

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"


# Define one loader function per file type
def import_xlsx(file_name):
    df_xlsx = pd.read_excel(file_name)
    # print(df_xlsx.head())
    return df_xlsx


def import_csv(file_name):
    df_csv = pd.read_csv(file_name)
    # print(df_csv.head())
    return df_csv


# Submit all three reads first so they run concurrently;
# calling .get() right after each apply_async would block
# and serialise the work again.
primary_job = pool.apply_async(import_xlsx, (Primary_File, ))
secondary_1_job = pool.apply_async(import_csv, (Secondary_File_1, ))
secondary_2_job = pool.apply_async(import_csv, (Secondary_File_2, ))

# Collect the results once all jobs have been submitted
Primary_df = primary_job.get()
Secondary_1_df = secondary_1_job.get()
Secondary_2_df = secondary_2_job.get()

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()

print(end_time - start_time)
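If parsing the files (CPU-bound) turns out to dominate over reading them off the network share (I/O-bound), threads are limited by the GIL. As a sketch of an alternative, the same loader functions can run in separate processes with multiprocessing.Pool, assuming the loading code above is moved under the __main__ guard so child processes can import the module safely:

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool(processes=3) as proc_pool:
        primary_job = proc_pool.apply_async(import_xlsx, (Primary_File, ))
        secondary_1_job = proc_pool.apply_async(import_csv, (Secondary_File_1, ))
        secondary_2_job = proc_pool.apply_async(import_csv, (Secondary_File_2, ))

        Primary_df = primary_job.get()
        Secondary_1_df = secondary_1_job.get()
        Secondary_2_df = secondary_2_job.get()

The trade-off is that each DataFrame has to be pickled back to the parent process, which adds overhead for large results.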
answered Oct 02 '22 by CezarySzulc