I'm currently using pandas to read an Excel file and present its sheet names to the user, so he can select which sheet he would like to use. The problem is that the files are really big (70 columns x 65k rows), taking up to 14s to load on a notebook (the same data in a CSV file is taking 3s). My code in panda goes like this: <pre class="prettyprint"><code>xls = pandas.ExcelFile(path) sheets = xls.sheet_names </code></pre> I tried xlrd before, but obtained similar results. This was my code with xlrd: <pre class="prettyprint"><code>xls = xlrd.open_workbook(path) sheets = xls.sheet_names </code></pre> So, can anybody suggest a faster way to retrieve the sheet names from an Excel file than reading the whole file?

you can use the xlrd library and open the workbook with the "on_demand=True" flag, so that the sheets won't be loaded automaticaly. Than you can retrieve the sheet names in a similar way to pandas: <pre class="prettyprint"><code>import xlrd xls = xlrd.open_workbook(r'<path_to_your_excel_file>', on_demand=True) print xls.sheet_names() # <- remeber: xlrd sheet_names is a function, not a property </code></pre>

How to obtain sheet names from XLS files without loading the whole file?

Tags:

python

pandas

excel

xlrd

I'm currently using pandas to read an Excel file and present its sheet names to the user, so he can select which sheet he would like to use. The problem is that the files are really big (70 columns x 65k rows), taking up to 14s to load on a notebook (the same data in a CSV file is taking 3s).

My code in panda goes like this:

xls = pandas.ExcelFile(path) sheets = xls.sheet_names

I tried xlrd before, but obtained similar results. This was my code with xlrd:

xls = xlrd.open_workbook(path) sheets = xls.sheet_names

So, can anybody suggest a faster way to retrieve the sheet names from an Excel file than reading the whole file?

455

asked Sep 03 '12 14:09

pcarvalho

2 Answers

you can use the xlrd library and open the workbook with the "on_demand=True" flag, so that the sheets won't be loaded automaticaly.

Than you can retrieve the sheet names in a similar way to pandas:

import xlrd xls = xlrd.open_workbook(r'<path_to_your_excel_file>', on_demand=True) print xls.sheet_names() # <- remeber: xlrd sheet_names is a function, not a property

181

answered Sep 25 '22 05:09

Colin O'Coal

I have tried xlrd, pandas, openpyxl and other such libraries and all of them seem to take exponential time as the file size increase as it reads the entire file. The other solutions mentioned above where they used 'on_demand' did not work for me. The following function works for xlsx files.

def get_sheet_details(file_path):     sheets = []     file_name = os.path.splitext(os.path.split(file_path)[-1])[0]     # Make a temporary directory with the file name     directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)     os.mkdir(directory_to_extract_to)      # Extract the xlsx file as it is just a zip file     zip_ref = zipfile.ZipFile(file_path, 'r')     zip_ref.extractall(directory_to_extract_to)     zip_ref.close()      # Open the workbook.xml which is very light and only has meta data, get sheets from it     path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')     with open(path_to_workbook, 'r') as f:         xml = f.read()         dictionary = xmltodict.parse(xml)         for sheet in dictionary['workbook']['sheets']['sheet']:             sheet_details = {                 'id': sheet['sheetId'], # can be @sheetId for some versions                 'name': sheet['name'] # can be @name             }             sheets.append(sheet_details)      # Delete the extracted files directory     shutil.rmtree(directory_to_extract_to)     return sheets

Since all xlsx are basically zipped files, we extract the underlying xml data and read sheet names from the workbook directly which takes a fraction of a second as compared to the library functions.

Benchmarking: (On a 6mb xlsx file with 4 sheets)
Pandas, xlrd: 12 seconds
openpyxl: 24 seconds
Proposed method: 0.4 seconds

answered Sep 26 '22 05:09

Dhwanil shah

Related questions
                            
                                only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
                            
                                Discord.py error: TypeError: __new__() got an unexpected keyword argument 'deny_new'
                            
                                Is there any reason to choose __new__ over __init__ when defining a metaclass?
                            
                                Python: make eval safe [duplicate]
                            
                                How can I run the current file in PyCharm
                            
                                Debugging python programs in emacs
                            
                                Import python module NOT on path
                            
                                How do I mock the filesystem in Python unit tests?
                            
                                How to install python3.9 with conda?
                            
                                cx_Oracle: How do I iterate over a result set?
                            
                                How to remove pip package after deleting it manually
                            
                                What is the difference between .one() and .first()
                            
                                Python multi-line with statement
                            
                                Contains of HashSet<Integer> in Python
                            
                                Printing out actual error message for ValueError
                            
                                What is the difference between sparse_categorical_crossentropy and categorical_crossentropy?
                            
                                Python: delete element from heap
                            
                                Inserting new records with one-to-many relationship in sqlalchemy
                            
                                How to set default text for a Tkinter Entry widget
                            
                                How do I attach a remote debugger to a Python process?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With