
Python: Reading Large Excel Worksheets using openpyxl

Tags: python, csv, excel

I have an Excel file containing about 400 worksheets, 375 of which I need to save out as CSV files. I've tried a VBA solution, but Excel has issues just opening this workbook.

I've created a Python script to do just that. However, it rapidly consumes all available memory and essentially stops working after 25 sheets have been exported. Does anybody have a suggestion on how I might improve this code?

import openpyxl
import csv
import time

print(time.ctime())

importedfile = openpyxl.load_workbook(filename="C:/Users/User/Desktop/Giant Workbook.xlsm", data_only=True, keep_vba=False)

tabnames = importedfile.get_sheet_names()

substring = "Keyword"

for num in tabnames:
    if num.find(substring) > -1:
        sheet = importedfile.get_sheet_by_name(num)
        name = "C:/Users/User/Desktop/Test/" + num + ".csv"

        with open(name, 'w', newline='') as file:
            savefile = csv.writer(file)
            for i in sheet.rows:
                savefile.writerow([cell.value for cell in i])

        file.close()

print(time.ctime())

Any help would be appreciated.

Thanks

EDIT: I'm using Windows 7 and Python 3.4.3. I'm also open to solutions in R, VBA, or SPSS.

Asked Jul 02 '15 by David

1 Answer

Try passing read_only=True to load_workbook(). This causes the worksheets you get back to be read-only (IterableWorksheet) objects, meaning you can only iterate over them: you cannot directly access cell values by row/column number. According to the documentation, this provides near-constant memory consumption.
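For instance, here is a minimal sketch (not part of the original answer; the sheet name "Keyword 1" is made up for illustration, and it assumes a reasonably recent openpyxl): if you only need a specific block of cells from a read-only worksheet, iterate over a bounded range with iter_rows() instead of indexing cells directly.

import openpyxl

wb = openpyxl.load_workbook("C:/Users/User/Desktop/Giant Workbook.xlsm", read_only=True)
sheet = wb.get_sheet_by_name("Keyword 1")  # hypothetical sheet name

# Read-only worksheets are streamed; a bounded iter_rows() call stands in
# for random access such as sheet.cell(row=2, column=3).
for row in sheet.iter_rows(min_row=1, max_row=10, max_col=5):
    print([cell.value for cell in row])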

Also, you do not need to close the file explicitly; the with statement handles that for you.

Example -

import openpyxl
import csv
import time

print(time.ctime())

# read_only=True streams rows instead of loading whole worksheets into memory;
# data_only=True is kept from the original script so formula cells yield values.
importedfile = openpyxl.load_workbook(filename="C:/Users/User/Desktop/Giant Workbook.xlsm", read_only=True, data_only=True, keep_vba=False)

tabnames = importedfile.get_sheet_names()

substring = "Keyword"

for num in tabnames:
    if num.find(substring) > -1:
        sheet = importedfile.get_sheet_by_name(num)
        name = "C:/Users/User/Desktop/Test/" + num + ".csv"

        with open(name, 'w', newline='') as file:
            savefile = csv.writer(file)
            for i in sheet.rows:
                savefile.writerow([cell.value for cell in i])

print(time.ctime())

From the documentation -

Sometimes, you will need to open or write extremely large XLSX files, and the common routines in openpyxl won’t be able to handle that load. Fortunately, there are two modes that enable you to read and write unlimited amounts of data with (near) constant memory consumption.
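On the write side, here is a minimal sketch (not part of the original answer; the output path and sheet title are hypothetical) of the streaming write-only mode that the documentation mentions alongside read-only mode:

import openpyxl

# A write-only workbook streams rows to disk instead of keeping the whole
# sheet in memory; it starts with no sheets, so create one explicitly.
out = openpyxl.Workbook(write_only=True)
ws = out.create_sheet(title="Results")
for row in ([1, 2, 3], [4, 5, 6]):
    ws.append(row)  # rows can only be appended, not revisited
out.save("C:/Users/User/Desktop/Test/streamed_output.xlsx")  # hypothetical path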

Answered by Anand S Kumar