
Parallel loading of Input Files in Pandas Dataframe

I have a requirement with three input files: each needs to be loaded into a pandas DataFrame, and then two of them are merged into a single DataFrame.

The file extensions change from run to run: a file might be .txt one time and .xlsx or .csv another time.
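To cope with the changing extension, I picture a small helper that picks the pandas reader from the file suffix. This is only a sketch of the idea, not part of my current script (the .txt branch assumes tab-delimited text):

import os
import pandas as pd


def load_file(path):
    # Choose the pandas reader based on the file extension
    ext = os.path.splitext(path)[1].lower()
    if ext == ".xlsx":
        return pd.read_excel(path)
    if ext == ".csv":
        return pd.read_csv(path)
    if ext == ".txt":
        return pd.read_csv(path, sep="\t")  # assuming tab-delimited text
    raise ValueError("Unsupported file type: " + ext)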

How can I run this process in parallel, in order to save the waiting/loading time?

This is my code at the moment:

from time import time   # to measure the time taken to run the code
import pandas as pd     # to work with the data frames

start_time = time()

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"

Primary_df = pd.read_excel(Primary_File)
Secondary_1_df = pd.read_csv(Secondary_File_1)
Secondary_2_df = pd.read_csv(Secondary_File_2)

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()

print(end_time - start_time)

It takes around 20 minutes to load primary_df and secondary_df, so I am looking for an efficient solution, possibly using parallel processing, to save time. I timed my read operations and they account for most of that, approximately 18 minutes 45 seconds.

Hardware config: Intel i5 processor, 16 GB RAM, 64-bit OS

Question made eligible for bounty: I am looking for working code with detailed steps, using a package available in the anaconda environment, that loads my input files in parallel and stores each one in its own pandas DataFrame. This should eventually save time.
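One direction I have looked at (untested so far) is the standard-library concurrent.futures module, which ships with anaconda; a sketch reusing the file paths from my script above:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=3) as executor:
    # Submit all three reads up front so they run concurrently
    primary_future = executor.submit(pd.read_excel, Primary_File)
    secondary_1_future = executor.submit(pd.read_csv, Secondary_File_1)
    secondary_2_future = executor.submit(pd.read_csv, Secondary_File_2)

    # .result() blocks until the corresponding read has finished
    Primary_df = primary_future.result()
    Secondary_1_df = secondary_1_future.result()
    Secondary_2_df = secondary_2_future.result()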

asked Jan 22 '19 by Siddharth Thanga Mariappan


People also ask

Does pandas work in parallel?

A Dask DataFrame consists of multiple pandas DataFrames, and each pandas DataFrame is called a partition. This mechanism allows you to work with larger-than-memory data because your computations are distributed across these pandas DataFrames and can be executed in parallel.
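A minimal sketch of that idea, assuming Dask is installed (it is not part of the question's setup) and reusing the CSV paths from the question:

import dask.dataframe as dd

# Each CSV becomes a partitioned Dask DataFrame whose pieces are read in parallel
ddf_1 = dd.read_csv("//ServerA/Testing Folder File Open/Report2.csv")
ddf_2 = dd.read_csv("//ServerA/Testing Folder File Open/Report2.csv")

# Nothing is computed until .compute(), which returns an ordinary pandas DataFrame
Secondary_df = dd.merge(ddf_1, ddf_2, how="inner", on="ID").compute()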

Is PyArrow faster than pandas?

There's a better way. It's called PyArrow — an amazing Python binding for the Apache Arrow project. It introduces faster data read/write times and doesn't otherwise interfere with your data analysis pipeline. It's the best of both worlds, as you can still use Pandas for further calculations.
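A minimal sketch of that combination, again assuming pyarrow is installed and reusing a CSV path from the question:

from pyarrow import csv

# pyarrow parses the CSV with multiple threads by default,
# then hands the table back to pandas for the rest of the pipeline
table = csv.read_csv("//ServerA/Testing Folder File Open/Report2.csv")
Secondary_1_df = table.to_pandas()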


1 Answer

Try this:

from time import time 
import pandas as pd
from multiprocessing.pool import ThreadPool


start_time = time()

pool = ThreadPool(processes=3)

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"


# Define one loader function per file type
def import_xlsx(file_name):
    df_xlsx = pd.read_excel(file_name)
    # print(df_xlsx.head())
    return df_xlsx


def import_csv(file_name):
    df_csv = pd.read_csv(file_name)
    # print(df_csv.head())
    return df_csv


# Submit all three reads first so they run concurrently;
# calling .get() right after each apply_async would block
# and serialise the work again.
primary_job = pool.apply_async(import_xlsx, (Primary_File, ))
secondary_1_job = pool.apply_async(import_csv, (Secondary_File_1, ))
secondary_2_job = pool.apply_async(import_csv, (Secondary_File_2, ))

# Collect the results once all jobs have been submitted
Primary_df = primary_job.get()
Secondary_1_df = secondary_1_job.get()
Secondary_2_df = secondary_2_job.get()

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()

print(end_time - start_time)
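If parsing the files (CPU-bound) turns out to dominate over reading them off the network share (I/O-bound), threads are limited by the GIL. As a sketch of an alternative, the same loader functions can run in separate processes with multiprocessing.Pool, assuming the loading code above is moved under the __main__ guard so child processes can import the module safely:

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool(processes=3) as proc_pool:
        primary_job = proc_pool.apply_async(import_xlsx, (Primary_File, ))
        secondary_1_job = proc_pool.apply_async(import_csv, (Secondary_File_1, ))
        secondary_2_job = proc_pool.apply_async(import_csv, (Secondary_File_2, ))

        Primary_df = primary_job.get()
        Secondary_1_df = secondary_1_job.get()
        Secondary_2_df = secondary_2_job.get()

The trade-off is that each DataFrame has to be pickled back to the parent process, which adds overhead for large results.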
answered Oct 02 '22 by CezarySzulc