I have a large Excel xlsx file (56 MB, 550k rows) from which I tried to read just the first 10 rows. I tried xlrd, openpyxl, and pyexcel-xlsx, but each takes more than 35 minutes because it loads the whole file into memory.
I unzipped the Excel file and found that the XML containing the data I need is 800 MB uncompressed.
Loading the same file in Excel takes 30 seconds. Why does it take so much time in Python?
Importing CSV files in Python is roughly 100x faster than importing Excel files. After converting, these files load in 0.63 seconds instead of taking minutes.
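A rough sketch of that workflow, assuming pandas is installed (the file and sheet names below are the placeholders used elsewhere in this thread):

import pandas as pd

# One-time conversion: parse the slow xlsx once and cache it as CSV.
df = pd.read_excel('xlfile.xlsx', sheet_name='Sheet Name')
df.to_csv('xlfile.csv', index=False)

# Every later load reads the CSV instead; here, only the first 10 rows.
df = pd.read_csv('xlfile.csv', nrows=10)
print(df)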
You could also use io.BytesIO to create the file in memory, write the Excel data into that buffer, and then write the buffer to disk as gzip.
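A minimal sketch of that idea, assuming pandas with openpyxl as its writer engine; the DataFrame and output path are placeholders:

import gzip
import io

import pandas as pd

df = pd.DataFrame({'value': range(10)})  # placeholder data

# Write the workbook into an in-memory buffer rather than to disk.
buffer = io.BytesIO()
df.to_excel(buffer, index=False)

# Then compress the buffer's contents with gzip on the way to disk.
with gzip.open('report.xlsx.gz', 'wb') as f:
    f.write(buffer.getvalue())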
All it takes is Ctrl+Down or Ctrl+Right, then typing something: a single space in cell XFD1 or A1048576 is enough to generate a needlessly large file, because Excel then treats the entire sheet as the used range.
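One way to spot that kind of bloat (a sketch assuming openpyxl and the file name from the question; Excel-written files record a dimension, files from other writers may not):

from openpyxl import load_workbook

# read_only=True avoids loading the whole workbook just to inspect it.
wb = load_workbook('xlfile.xlsx', read_only=True)
for ws in wb.worksheets:
    # A bloated sheet reports something like 'A1:XFD1048576' here.
    print(ws.title, ws.calculate_dimension())
wb.close()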
Use openpyxl's read-only mode to read just the rows you need.
You'll be able to work with the relevant worksheet instantly.
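For example, a sketch using the file and sheet names from this question, reading just the first 10 rows:

from openpyxl import load_workbook

# read_only=True streams rows lazily instead of building the full in-memory model.
wb = load_workbook('xlfile.xlsx', read_only=True)
ws = wb['Sheet Name']

for i, row in enumerate(ws.iter_rows(values_only=True)):
    print(row)
    if i == 9:  # stop after the first 10 rows
        break

wb.close()  # in read-only mode this also releases the file handle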
Here it is: I found a solution, and it's the fastest way I've found to read an xlsx sheet.
A 56 MB file with over 500k rows and 4 sheets took 6 seconds to process.
import zipfile
from bs4 import BeautifulSoup

mySheet = 'Sheet Name'
filename = 'xlfile.xlsx'

# An xlsx file is just a zip archive, so open it directly.
file = zipfile.ZipFile(filename, "r")

# Read xl/workbook.xml to map each sheet name to its worksheet XML path.
paths = []
for name in file.namelist():
    if name == 'xl/workbook.xml':
        data = BeautifulSoup(file.read(name), 'html.parser')
        sheets = data.find_all('sheet')
        for sheet in sheets:
            paths.append([sheet.get('name'),
                          'xl/worksheets/sheet' + str(sheet.get('sheetid')) + '.xml'])

# Stream the raw XML of the target sheet, line by line, without
# parsing the whole workbook.
for path in paths:
    if path[0] == mySheet:
        with file.open(path[1]) as reader:
            for row in reader:
                print(row)  # do whatever you want with your data

file.close()
Enjoy and happy coding.