Efficiently extract sheet names and column names from a large .xlsx with Python 3

Question

What are the Python 3 options for efficiently (in terms of performance and memory) extracting the sheet names and, for a given sheet, the column names from a very large .xlsx file?

I've tried using pandas:

For sheet names using pd.ExcelFile:

    import pandas as pd

    xl = pd.ExcelFile(filename)
    xl.sheet_names

For column names using pd.ExcelFile:

    xl = pd.ExcelFile(filename)
    df = xl.parse(sheetname, nrows=2, **kwargs)
    df.columns

For column names using pd.read_excel, with and without nrows (pandas > v0.23):

    df = pd.read_excel(io=filename, sheet_name=sheetname, nrows=2)
    df.columns

However, both pd.ExcelFile and pd.read_excel seem to read the entire .xlsx into memory and are therefore slow.

Thanks a lot!

Jade Cacho · Accepted Answer

Here is the easiest way I can share with you:

    # read the sheet file
    import pandas as pd
    my_sheets = pd.ExcelFile('sheet_filename.xlsx')
    my_sheets.sheet_names

Qusai Alothman · Answer

According to this SO question, reading Excel files in chunks is not supported (see this issue on GitHub), and using nrows will always read the whole file into memory first.

Possible solutions:

Convert the sheet to CSV and read that in chunks.
Use something other than pandas. See this page for a list of alternative libraries.
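
As one illustration of the "something other than pandas" route, here is a minimal sketch that uses only the standard library. It relies on the fact that an .xlsx file is a zip archive of XML parts: the sheet names live in xl/workbook.xml, and the worksheet XML can be streamed so that parsing stops after the first row. The function names (sheet_targets, first_row, shared_strings, sheet_names_and_columns) are made up for this example. It assumes the header is the first row stored in the sheet and that shared strings sit at the conventional xl/sharedStrings.xml path; empty header cells may be missing from the XML altogether, so treat this as a starting point rather than a complete reader.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the Office Open XML (SpreadsheetML) parts.
MAIN = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"
DOC_REL = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
PKG_REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def sheet_targets(zf):
    """Map each sheet name to the archive path of its worksheet XML."""
    with zf.open("xl/_rels/workbook.xml.rels") as fh:
        rels = ET.parse(fh).getroot()
    targets = {r.get("Id"): r.get("Target") for r in rels.iter(PKG_REL + "Relationship")}
    with zf.open("xl/workbook.xml") as fh:
        workbook = ET.parse(fh).getroot()
    result = {}
    for sheet in workbook.iter(MAIN + "sheet"):
        target = targets[sheet.get(DOC_REL + "id")]
        # Targets are usually relative to xl/, but may be absolute paths.
        path = target.lstrip("/") if target.startswith("/") else "xl/" + target
        result[sheet.get("name")] = path
    return result


def first_row(zf, sheet_path):
    """Stream the worksheet and return (cell type, raw value) for its first row."""
    cells = []
    with zf.open(sheet_path) as fh:
        for _, elem in ET.iterparse(fh):
            if elem.tag == MAIN + "row":
                for c in elem.iter(MAIN + "c"):
                    if c.get("t") == "inlineStr":
                        text = "".join(t.text or "" for t in c.iter(MAIN + "t"))
                        cells.append(("inlineStr", text))
                    else:
                        cells.append((c.get("t"), c.findtext(MAIN + "v")))
                break  # stop parsing as soon as the first row is complete
    return cells


def shared_strings(zf, wanted):
    """Stream the shared-string table and return {index: text} for wanted indexes."""
    found = {}
    # Assumption: the table is stored at its conventional location.
    if not wanted or "xl/sharedStrings.xml" not in zf.namelist():
        return found
    with zf.open("xl/sharedStrings.xml") as fh:
        index = 0
        for _, elem in ET.iterparse(fh):
            if elem.tag == MAIN + "si":
                if index in wanted:
                    # Simplification: all text runs of the entry are concatenated.
                    found[index] = "".join(t.text or "" for t in elem.iter(MAIN + "t"))
                    if len(found) == len(wanted):
                        break
                index += 1
                elem.clear()  # release the parsed entry to keep memory low
    return found


def sheet_names_and_columns(path, sheet_name=None):
    """Return (sheet names, header row of sheet_name or None)."""
    with zipfile.ZipFile(path) as zf:
        targets = sheet_targets(zf)
        names = list(targets)
        if sheet_name is None:
            return names, None
        cells = first_row(zf, targets[sheet_name])
        wanted = {int(v) for t, v in cells if t == "s" and v is not None}
        strings = shared_strings(zf, wanted)
        columns = [strings[int(v)] if t == "s" and v is not None else v
                   for t, v in cells]
        return names, columns
```

Calling sheet_names_and_columns(filename) returns only the sheet names; passing a sheet name as the second argument also returns the values of that sheet's first row. Only the small workbook parts and the first row of the chosen sheet are parsed, though a header that refers to late entries in the shared-string table still forces a scan of that table up to those entries.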

Tags:

performance

memory

python-3.x

pandas

excel

elke
