Memory error using openpyxl and large data excels

Tags:

I have written a script which has to read lot of excel files from a folder (around 10,000). This script loads the excel file (some of them has more than 2,000 rows) and reads one column to count the number of rows (checking stuff). If the number of rows is not equal to a given number, it writes the warning in a log.

The problem comes when the script reads more than 1,000 excel files. It's then when it throws memory error, and I don't know where could be the problem. Previously, the script read two csv file with 14,000 rows and stores it in a list. These lists contain an identificator for the excel file and its respective number of rows. If this number of rows is not equal to the number of rows of the excel file, it writes the warning. Could be the problem the reading of these lists?

I'm using openpyxl to load the workbooks, do I need to close them before open the next?

This is my code:

# -*- coding: utf-8 -*-

import os
from openpyxl import Workbook
import glob
import time
import csv
from time import gmtime,strftime
from openpyxl import load_workbook

folder = ''
conditions = 0
a = 0
flight_error = 0
condition_error = 0
typical_flight_error = 0
SP_error = 0


cond_numbers = []
with open('Conditions.csv','rb') as csv_name:           # Abre el fichero csv donde estarán las equivalencias   
    csv_read = csv.reader(csv_name,delimiter='\t')

    for reads in csv_read:
        cond_numbers.append(reads)

flight_TF = []
with open('vuelo-TF.csv','rb') as vuelo_TF:
    csv_read = csv.reader(vuelo_TF,delimiter=';')

    for reads in csv_read:
        flight_TF.append(reads)


excel_files = glob.glob('*.xlsx')

for excel in excel_files:
    print "Leyendo excel: "+excel

    wb = load_workbook(excel)
    ws = wb.get_sheet_by_name('Control System')
    flight = ws.cell('A7').value
    typical_flight = ws.cell('B7').value
    a = 0

    for row in range(6,ws.get_highest_row()):
        conditions = conditions + 1


        value_flight = int(ws.cell(row=row,column=0).value)
        value_TF = ws.cell(row=row,column=1).value
        value_SP = int(ws.cell(row=row,column=4).value)

        if value_flight == '':
            break

        if value_flight != flight:
            flight_error = 1                # Si no todos los flight numbers dentro del vuelo son iguales

        if value_TF != typical_flight:
            typical_flight_error = 2            # Si no todos los typical flight dentro del vuelo son iguales

        if value_SP != 100:
            SP_error = 1



    for cond in cond_numbers:
        if int(flight) == int(cond[0]):
            conds = int(cond[1])
            if conds != int(conditions):
                condition_error = 1         # Si el número de condiciones no se corresponde con el esperado

    for vuelo_TF in flight_TF:
        if int(vuelo_TF[0]) == int(flight):
            TF = vuelo_TF[1]
            if typical_flight != TF:
                typical_flight_error = 1        # Si el vuelo no coincide con el respectivo typical flight

    if flight_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+':  Los flight numbers del vuelo '+str(flight)+' no coinciden.\n'
        log.write(message)
        log.close()
        flight_error = 0

    if condition_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': El número de condiciones del vuelo '+str(flight)+' no coincide. Condiciones esperadas: '+str(int(conds))+'. Condiciones obtenidas: '+str(int(conditions))+'.\n'
        log.write(message)
        log.close()
        condition_error = 0

    if typical_flight_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': El vuelo '+str(flight)+' no coincide con el typical flight. Typical flight respectivo: '+TF+'. Typical flight obtenido: '+typical_flight+'.\n'
        log.write(message)
        log.close() 
        typical_flight_error = 0

    if typical_flight_error == 2:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': Los typical flight del vuelo '+str(flight)+' no son todos iguales.\n'
        log.write(message)
        log.close()
        typical_flight_error = 0

    if SP_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': Hay algún Step Percentage del vuelo '+str(flight)+' menor que 100.\n'
        log.write(message)
        log.close()
        SP_error = 0

    conditions = 0

The if statements of the end are for checking and writing warning logs.

I'm using windows xp with 8 gb RAM and intel xeon w3505 (two cores, 2,53 GHz).

538

asked Feb 19 '14 08:02

DavidRguez

1 Answers

The default implementation of openpyxl will store all the accessed cells into memory. I will suggest you to use the Optimized reader (link - https://openpyxl.readthedocs.org/en/latest/optimized.html) instead

In code:-

wb = load_workbook(file_path, use_iterators = True)

While loading a workbook pass use_iterators = True. Then access the sheet and cells like:

for row in sheet.iter_rows():
    for cell in row:
        cell_text = cell.value

This will reduce the memory footprint to 5-10%

UPDATE: In version 2.4.0 use_iterators = True option is removed. In newer versions openpyxl.writer.write_only.WriteOnlyWorksheet is introduced for dumping large amounts of data.

from openpyxl import Workbook
wb = Workbook(write_only=True)
ws = wb.create_sheet()

# now we'll fill it with 100 rows x 200 columns
for irow in range(100):
    ws.append(['%d' % i for i in range(200)])

# save the file
wb.save('new_big_file.xlsx')

Not tested the below code just copied from the above link.

Thanks @SdaliM for the information.

165

answered Oct 10 '22 19:10

anuragal

Related questions
                            
                                Pandas Plots: Separate color for weekends, pretty printing times on x axis
                            
                                python plot and powerlaw fit
                            
                                How can i get a file extension from a filetype?
                            
                                Bar chart in pylab from a dictionary
                            
                                Celery chain breaks if one of the tasks fail
                            
                                How do I make shelve file empty in python?
                            
                                orthogonal projection with numpy
                            
                                Print out message only once from the for loop
                            
                                imshow when you are plotting data, not images. Realtion between aspect and extent?
                            
                                How to strip whitespace from before but not after punctuation in python
                            
                                Python Twisted's DeferredLock
                            
                                Add "nan" to numpy array 20 times without loop
                            
                                One-line raise if
                            
                                Stop the thread until the celery task finishes
                            
                                How can I get the traceback object ( sys.exc_info()[2] , same as sys.exc_traceback ) as a string?
                            
                                how to speed up a vector cross product calculation
                            
                                How to tell Jenkins to use a particular virtualenv python
                            
                                SWIG in typemap works, but argout does not
                            
                                Python: ttk: disable/enable a button
                            
                                Django after @login_required redirect to next

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Memory error using openpyxl and large data excels

Tags:

python

csv

openpyxl

DavidRguez

People also ask

1 Answers

anuragal

Recent Activity

Donate For Us