I would like to read a large .xls file in parallel using pandas. Currently I am using this:

import multiprocessing as mp
import pandas as pd

LARGE_FILE = "LARGEFILE.xlsx"
CHUNKSIZE = 100000  # processing 100,000 rows at a time

def process_frame(df):
    # process data frame
    return len(df)

if __name__ == '__main__':
    reader = pd.read_excel(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 processes

    funclist = []
    for df in reader:
        # process each data frame asynchronously
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)

    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # timeout in 10 seconds
While this runs, I don't think it actually speeds up reading the file. Is there a more efficient way of achieving this?
Just for your information: I'm reading a 13 MB, 29,000-line CSV file in about 4 seconds (not using parallel processing). Arch Linux, AMD Phenom II X2, Python 3.4, python-pandas 0.16.2.
How big is your file and how long does it take to read it? That would help to understand the problem better. Is your Excel sheet very complex? Maybe read_excel has difficulty processing that complexity?
Suggestion: install Gnumeric and use its helper tool ssconvert to translate the file to CSV, then change your program to use read_csv (see the sketch below). Check the time used by ssconvert and the time taken by read_csv. By the way, python-pandas saw major improvements going from version 0.13 to 0.16, so it is useful to check that you have a recent version.
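For what it's worth, here is a minimal sketch of that workflow, assuming ssconvert is on the PATH and keeping the file name, chunk size, and process_frame from the question; CSV_FILE is a hypothetical intermediate file name:

import subprocess
import multiprocessing as mp
import pandas as pd

LARGE_FILE = "LARGEFILE.xlsx"   # name taken from the question
CSV_FILE = "LARGEFILE.csv"      # hypothetical intermediate file
CHUNKSIZE = 100000

def process_frame(df):
    # placeholder processing, as in the question
    return len(df)

if __name__ == '__main__':
    # one-off conversion: ssconvert infers input/output formats from the extensions
    subprocess.check_call(["ssconvert", LARGE_FILE, CSV_FILE])

    # read_csv supports chunksize and yields DataFrames lazily
    reader = pd.read_csv(CSV_FILE, chunksize=CHUNKSIZE)

    pool = mp.Pool(4)  # use 4 processes
    funclist = [pool.apply_async(process_frame, [df]) for df in reader]
    result = sum(f.get(timeout=10) for f in funclist)
    print(result)

Timing the subprocess call separately from the read loop will tell you whether the conversion or the parsing dominates.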