Loading Excel file chunk by chunk with Python instead of loading full file into memory

I want to read just the first 10 rows of an Excel file (xlsx) without loading the whole file at once, because one of my machines does not have enough memory for that.

I tried using

import xlrd
import pandas as pd
def open_file(path):
    xl = pd.ExcelFile(path)
    reader = xl.parse(chunksize=1000)
    for chunk in reader:
        print(chunk)

It seems like the file is loaded first and only then divided into parts.

How can I read only the first few rows?

934

asked Nov 23 '17 12:11

Kornel

1 Answer

Due to the nature of xlsx files (which are essentially a bunch of XML files zipped together), you can't seek to an arbitrary byte in the file and expect it to be the beginning of the Nth row of the sheet you are interested in.
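To make the "zipped XML files" point concrete, here is a minimal sketch that lists the parts stored inside an archive using only the standard library. The helper names list_xlsx_parts and build_demo_archive are hypothetical, and the demo archive is a tiny stand-in built in memory, not a valid workbook; with a real file you would pass its path instead.

```python
import io
import zipfile


def list_xlsx_parts(source):
    """Return the names of the files stored inside an .xlsx archive.

    `source` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def build_demo_archive():
    """Build a tiny stand-in archive in memory (hypothetical, not a valid workbook)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("xl/workbook.xml", "<workbook/>")
        archive.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Each sheet lives in its own compressed XML part, so row N has no fixed byte offset.
    print(list_xlsx_parts(build_demo_archive()))
```

Because every part is compressed, the position of a given row can only be found by decompressing and parsing the sheet's XML from the start.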

The best you can do is use pandas.read_excel with the skiprows argument (skips rows from the top of the sheet) and the skipfooter argument (skips rows from the bottom; older pandas versions spelled it skip_footer). This, however, will load the whole file into memory first and only then keep the requested rows.

import pandas as pd

# if the sheet contains 300 rows, this will read the middle 100
df = pd.read_excel('/path/excel.xlsx', skiprows=100, skipfooter=100,
                   header=None, names=['col_a', 'col_b'])

Note that you have to set the column names manually with the names argument and pass header=None; otherwise the first row after the skipped ones is consumed as the header.

If you can use CSV instead, this becomes a straightforward task, since CSV files are plain text and can be read line by line.
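As a rough illustration of the CSV route, the sketch below reads only the first few rows with the standard library and stops without touching the rest of the file. The function name read_first_rows and the UTF-8 encoding are assumptions made for the example.

```python
import csv
from itertools import islice


def read_first_rows(path, n=10):
    """Return at most the first `n` rows of a CSV file as lists of strings."""
    # newline="" is what the csv module expects when it is given a text file.
    with open(path, newline="", encoding="utf-8") as handle:
        # islice stops pulling rows from the reader once n rows have been read.
        return list(islice(csv.reader(handle), n))
```

If you are already using pandas, pd.read_csv accepts nrows and chunksize arguments that serve the same purpose.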

But, and it's a big but, if you are really desperate you can extract the relevant sheet's xml file from the xlsx archive and parse that. It's not going to be an easy task though.

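Here is an example XML file that represents a sheet with a single table of 2 columns and 3 rows. The <v> tags hold the cells' values; for cells marked t="s", the value is an index into the workbook's shared strings table (xl/sharedStrings.xml) rather than the text itself.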

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
    <dimension ref="A1:B3"/>
    <sheetViews>
        <sheetView tabSelected="1" workbookViewId="0">
            <selection activeCell="C10" sqref="C10"/>
        </sheetView>
    </sheetViews>
    <sheetFormatPr defaultColWidth="11" defaultRowHeight="14.25" x14ac:dyDescent="0.2"/>
    <sheetData>
        <row r="1" spans="1:2" ht="15.75" x14ac:dyDescent="0.2">
            <c r="A1" t="s">
                <v>1</v>
            </c><c r="B1" s="1" t="s">
                <v>0</v>
            </c>
        </row>
        <row r="2" spans="1:2" ht="15" x14ac:dyDescent="0.2">
            <c r="A2" s="2">
                <v>1</v>
            </c><c r="B2" s="2">
                <v>4</v>
            </c>
        </row>
        <row r="3" spans="1:2" ht="15" x14ac:dyDescent="0.2">
            <c r="A3" s="2">
                <v>2</v>
            </c><c r="B3" s="2">
                <v>5</v>
            </c>
        </row>
    </sheetData>
    <pageMargins left="0.75" right="0.75" top="1" bottom="1" header="0.5" footer="0.5"/>
</worksheet>
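For what it's worth, parsing a sheet like the one above can be sketched with the standard library alone: open the sheet's part inside the archive and stream it with iterparse, stopping after the first few rows. This is only a sketch under assumptions: the function name read_first_sheet_rows is hypothetical, the part name xl/worksheets/sheet1.xml is the usual default but is not guaranteed, and cells with t="s" are returned as raw indices, not resolved against the shared strings table.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the worksheet XML shown above.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def read_first_sheet_rows(source, n=10, sheet_part="xl/worksheets/sheet1.xml"):
    """Return the first `n` rows of a sheet as lists of (ref, type, raw value).

    `source` is a path or binary file-like object for the .xlsx archive.
    `sheet_part` is an assumed part name; check the archive for the real one.
    """
    rows = []
    with zipfile.ZipFile(source) as archive:
        # The part is decompressed incrementally as the parser reads from it.
        with archive.open(sheet_part) as sheet:
            for _event, elem in ET.iterparse(sheet, events=("end",)):
                if elem.tag != NS + "row":
                    continue
                rows.append([
                    (cell.get("r"), cell.get("t"), cell.findtext(NS + "v"))
                    for cell in elem.iter(NS + "c")
                ])
                elem.clear()  # drop the parsed cells to keep memory flat
                if len(rows) >= n:
                    break  # stop before reading the rest of the sheet
    return rows
```

Dates, styles, formulas, and shared strings all need extra handling on top of this, which is why it is not an easy task in general.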

162

answered Oct 23 '22 04:10

DeepSpace

Tags: python, file, excel, xlsx
