
Processing a large XLSX file in Python

I have a large XLSX file (56 MB, 550k rows) and I am trying to read only the first 10 rows. I tried xlrd, openpyxl, and pyexcel-xlsx, but each takes more than 35 minutes because it loads the whole file into memory.

I unzipped the Excel file and found that the XML containing the data I need is 800 MB uncompressed.

Loading the same file in Excel takes 30 seconds. Why does it take so much longer in Python?

Amine asked Jul 05 '16 16:07




2 Answers

Use openpyxl's read-only mode to do this.

You'll be able to work with the relevant worksheet instantly.
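A minimal sketch of what that could look like, assuming a reasonably recent openpyxl (the `values_only` argument needs version 2.6 or later). The function name `read_first_rows` and its parameters are made up for illustration; `load_workbook(read_only=True)` and `iter_rows` are the openpyxl calls doing the work.

```python
from itertools import islice

from openpyxl import load_workbook


def read_first_rows(path, sheet_name=None, n=10):
    """Return the first n rows of a sheet as tuples of cell values.

    read_only=True makes openpyxl stream rows from the sheet XML
    instead of building every cell object in memory up front.
    """
    wb = load_workbook(path, read_only=True)
    try:
        # Fall back to the first worksheet when no name is given.
        ws = wb[sheet_name] if sheet_name else wb.worksheets[0]
        # iter_rows is lazy here, so islice stops after n rows.
        return list(islice(ws.iter_rows(values_only=True), n))
    finally:
        # Read-only workbooks keep the file open until closed explicitly.
        wb.close()
```

Calling `read_first_rows('xlfile.xlsx', 'Sheet Name')` should then return the first 10 rows without reading the rest of the sheet.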

Charlie Clark answered Oct 04 '22 21:10


Here it is, I found a solution. This is the fastest way I have found to read an XLSX sheet.

A 56 MB file with over 500k rows and 4 sheets took 6 s to process.

import zipfile
from bs4 import BeautifulSoup

my_sheet = 'Sheet Name'
filename = 'xlfile.xlsx'

with zipfile.ZipFile(filename, 'r') as archive:
    # xl/workbook.xml lists every sheet with its name and sheetId.
    workbook = BeautifulSoup(archive.read('xl/workbook.xml'), 'html.parser')
    # html.parser lowercases attribute names, hence 'sheetid'.
    # Assumes the part is named sheet<sheetId>.xml, which is common but not
    # guaranteed; the authoritative mapping is in xl/_rels/workbook.xml.rels.
    paths = {
        sheet.get('name'): 'xl/worksheets/sheet' + sheet.get('sheetid') + '.xml'
        for sheet in workbook.find_all('sheet')
    }

    # Stream the sheet XML from the archive instead of reading it all at once.
    with archive.open(paths[my_sheet]) as reader:
        for line in reader:
            print(line)  # raw XML bytes; do whatever you want with your data
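Two caveats with the snippet above: the sheetId does not always match the file name inside the archive, and iterating over the opened file yields chunks of raw XML rather than spreadsheet rows. A possible standard-library variant is sketched below, assuming the usual namespaces written by Excel (not the "strict" OOXML ones). The names `sheet_part` and `iter_raw_rows` are invented for this sketch. It returns the raw `<v>` text of each cell, so shared-string cells show up as indices into xl/sharedStrings.xml and cells without a `<v>` come back as None.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "main": "http://schemas.openxmlformats.org/spreadsheetml/2006/main",
    "rel": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
    "pkg": "http://schemas.openxmlformats.org/package/2006/relationships",
}


def sheet_part(archive, sheet_name):
    """Return the zip member path of the named sheet, resolved via its r:id."""
    workbook = ET.fromstring(archive.read("xl/workbook.xml"))
    rels = ET.fromstring(archive.read("xl/_rels/workbook.xml.rels"))
    targets = {
        r.get("Id"): r.get("Target")
        for r in rels.findall("pkg:Relationship", NS)
    }
    for sheet in workbook.findall("main:sheets/main:sheet", NS):
        if sheet.get("name") == sheet_name:
            target = targets[sheet.get("{%s}id" % NS["rel"])]
            # Targets are usually relative to xl/, occasionally absolute.
            if target.startswith("/"):
                return target.lstrip("/")
            return "xl/" + target
    raise KeyError(sheet_name)


def iter_raw_rows(filename, sheet_name):
    """Yield each row as a list of raw <v> texts, one row at a time."""
    row_tag = "{%s}row" % NS["main"]
    with zipfile.ZipFile(filename) as archive:
        with archive.open(sheet_part(archive, sheet_name)) as stream:
            # iterparse reports each element once its closing tag is read.
            for _, elem in ET.iterparse(stream):
                if elem.tag == row_tag:
                    yield [
                        cell.findtext("main:v", namespaces=NS)
                        for cell in elem.findall("main:c", NS)
                    ]
                    elem.clear()  # drop the cells of the row just yielded
```

To read only the first 10 rows, something like `list(itertools.islice(iter_raw_rows('xlfile.xlsx', 'Sheet Name'), 10))` stops parsing as soon as the tenth row has been produced.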

Enjoy and happy coding.

Amine answered Oct 04 '22 20:10