I want to read just 10 lines from Excel files (xlsx) without loading the whole file at once, as it can't be done on one of my machines (low memory).
I tried using:
import xlrd
import pandas as pd

def open_file(path):
    xl = pd.ExcelFile(path)
    reader = xl.parse(chunksize=1000)
    for chunk in reader:
        print(chunk)
It seems the whole file is loaded first and only then divided into parts.
How can I read only the first few lines?
Due to the nature of xlsx files (which are essentially a bunch of xml files zipped together) you can't poke the file at an arbitrary byte and hope for it to be the beginning of the Nth row of the table in the sheet you are interested in.
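To see the "zipped xml files" structure for yourself, something like the following should work. It is a minimal sketch using only the standard library; the function name list_xlsx_parts is made up for illustration.

```python
import zipfile


def list_xlsx_parts(path):
    """Return the names of the files stored inside an .xlsx archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

On a typical workbook this usually lists parts such as xl/workbook.xml, xl/sharedStrings.xml and xl/worksheets/sheet1.xml, although the exact names depend on the program that wrote the file.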
The best you can do is use pandas.read_excel with the skiprows (skips rows from the top of the file) and skipfooter (skips rows from the bottom; spelled skip_footer in older pandas versions) arguments. This, however, will load the whole file into memory first and then parse only the required rows.
# if the file contains 300 rows, this will read the middle 100
# (skipfooter was spelled skip_footer in older pandas versions)
df = pd.read_excel('/path/excel.xlsx', skiprows=100, skipfooter=100,
                   names=['col_a', 'col_b'])
Note that you have to set the headers manually with the names argument, otherwise the column names will be taken from the first row after the skipped ones.
If you wish to use csv instead then it is a straightforward task, since csv files are plain-text files.
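For the csv case, a sketch along these lines reads only the first rows and never touches the rest of the file. The helper name read_first_csv_rows and the UTF-8 encoding are assumptions; adjust them to your data.

```python
import csv
from itertools import islice


def read_first_csv_rows(path, n_rows=10):
    """Return at most n_rows rows from the top of a CSV file."""
    # Assumption: the file is UTF-8 encoded and uses the default comma dialect.
    with open(path, newline="", encoding="utf-8") as handle:
        # csv.reader is lazy, so islice stops reading after n_rows rows.
        return list(islice(csv.reader(handle), n_rows))
```

Each row comes back as a list of strings, with the header line (if there is one) counted as the first row.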
But, and it's a big but, if you are really desperate you can extract the relevant sheet's xml file from the xlsx archive and parse that. It's not going to be an easy task though.
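An example xml file that represents a sheet with a single table of 3 rows by 2 columns (A1:B3). The <v> tags hold the cells' values; for cells marked t="s", that value is an index into the workbook's shared strings table rather than the text itself.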
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
  <dimension ref="A1:B3"/>
  <sheetViews>
    <sheetView tabSelected="1" workbookViewId="0">
      <selection activeCell="C10" sqref="C10"/>
    </sheetView>
  </sheetViews>
  <sheetFormatPr defaultColWidth="11" defaultRowHeight="14.25" x14ac:dyDescent="0.2"/>
  <sheetData>
    <row r="1" spans="1:2" ht="15.75" x14ac:dyDescent="0.2">
      <c r="A1" t="s">
        <v>1</v>
      </c>
      <c r="B1" s="1" t="s">
        <v>0</v>
      </c>
    </row>
    <row r="2" spans="1:2" ht="15" x14ac:dyDescent="0.2">
      <c r="A2" s="2">
        <v>1</v>
      </c>
      <c r="B2" s="2">
        <v>4</v>
      </c>
    </row>
    <row r="3" spans="1:2" ht="15" x14ac:dyDescent="0.2">
      <c r="A3" s="2">
        <v>2</v>
      </c>
      <c r="B3" s="2">
        <v>5</v>
      </c>
    </row>
  </sheetData>
  <pageMargins left="0.75" right="0.75" top="1" bottom="1" header="0.5" footer="0.5"/>
</worksheet>
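If you do go down that road, a sheet like the one above can be streamed row by row and abandoned after the first few rows. The following is only a sketch under several assumptions: the function names are invented for illustration, the sheet is assumed to live at xl/worksheets/sheet1.xml (the usual location, but not guaranteed), values are returned as raw strings, and cells that are missing from the xml are simply absent from the row rather than padded. The shared strings table is still read in full, so a workbook with a huge number of distinct strings will still cost memory.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the worksheet and shared-strings parts.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def load_shared_strings(archive):
    """Return the shared strings as a list (empty if the part is absent)."""
    name = "xl/sharedStrings.xml"
    if name not in archive.namelist():
        return []
    strings = []
    with archive.open(name) as part:
        for _, elem in ET.iterparse(part):
            if elem.tag == NS + "si":
                # One <si> entry may contain several <t> text runs.
                strings.append("".join(t.text or "" for t in elem.iter(NS + "t")))
                elem.clear()
    return strings


def read_first_rows(path, n_rows=10, sheet_part="xl/worksheets/sheet1.xml"):
    """Return the first n_rows rows of a sheet as lists of strings."""
    rows = []
    with zipfile.ZipFile(path) as archive:
        shared = load_shared_strings(archive)
        with archive.open(sheet_part) as part:
            # iterparse yields each element once its closing tag is read,
            # so the rest of the sheet is never parsed after we break.
            for _, elem in ET.iterparse(part):
                if elem.tag != NS + "row":
                    continue
                values = []
                for cell in elem.iter(NS + "c"):
                    v = cell.find(NS + "v")
                    text = v.text if v is not None else None
                    if cell.get("t") == "s" and text is not None:
                        # t="s" means <v> is an index into the shared strings.
                        text = shared[int(text)]
                    values.append(text)
                rows.append(values)
                elem.clear()
                if len(rows) >= n_rows:
                    break
    return rows
```

Calling read_first_rows('/path/excel.xlsx', n_rows=10) would then return up to ten rows, which you can hand to pandas.DataFrame if you want a frame.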