How to read .xls in parallel using pandas?

I would like to read a large .xls file in parallel using pandas. Currently I am using this:

import multiprocessing as mp

import pandas as pd

LARGE_FILE = "LARGEFILE.xlsx"
CHUNKSIZE = 100000  # processing 100,000 rows at a time

def process_frame(df):
    # process data frame
    return len(df)

if __name__ == '__main__':
    reader = pd.read_excel(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 processes

    funclist = []
    for df in reader:
        # process each data frame asynchronously
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)

    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # timeout in 10 seconds

While this runs, I don't think it actually speeds up reading the file. Is there a more efficient way of achieving this?

asked by Gman


1 Answer

Just for your information: I'm reading a 13 MB, 29,000-line CSV file in about 4 seconds, without parallel processing, on Arch Linux with an AMD Phenom II X2, Python 3.4, and python-pandas 0.16.2.

How big is your file, and how long does it take to read? That would help in understanding the problem. Is your Excel sheet very complex? Maybe read_excel has difficulty processing that complexity.

Suggestion: install Gnumeric and use its command-line tool ssconvert to convert the file to CSV. In your program, switch to read_csv. Compare the time used by ssconvert plus the time taken by read_csv against your current read_excel time. By the way, python-pandas had major improvements as it went from version 0.13 to 0.16, so it is useful to check that you have a recent version.
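For concreteness, here is a minimal sketch of that pipeline, reusing LARGE_FILE, CHUNKSIZE, and process_frame from the question (the CSV_FILE name is just a placeholder; ssconvert infers the output format from the file extension, and read_csv, unlike read_excel, can stream the file in chunks):

import multiprocessing as mp
import subprocess

import pandas as pd

LARGE_FILE = "LARGEFILE.xlsx"
CSV_FILE = "LARGEFILE.csv"  # placeholder name for the converted file
CHUNKSIZE = 100000

def process_frame(df):
    # process one chunk; placeholder from the question
    return len(df)

if __name__ == '__main__':
    # one-time conversion; ssconvert picks the output format from the extension
    subprocess.run(["ssconvert", LARGE_FILE, CSV_FILE], check=True)

    # read_csv streams the file in chunks, so each chunk can be handed
    # to the pool as soon as it is parsed
    with mp.Pool(4) as pool:
        reader = pd.read_csv(CSV_FILE, chunksize=CHUNKSIZE)
        results = [pool.apply_async(process_frame, [df]) for df in reader]
        total = sum(r.get(timeout=10) for r in results)
    print(total)

Timing the subprocess.run call and the read loop separately, as suggested above, will show whether the conversion or the parsing dominates.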

answered by henkidefix