ExcelFile Vs. read_excel in pandas

Q: How do I read specific rows from an Excel sheet with pandas?

To tell pandas which row of an Excel sheet holds the column names, pass header=<0-indexed row number>; rows above that row are ignored and the data starts on the row after it. By default, header=0, so the first row is used to name the DataFrame columns. To skip rows at the end of a sheet, pass skipfooter=<number of rows to skip>.
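As a rough illustration of header and skipfooter, the sketch below writes a small workbook to a temporary directory and reads it back. The sheet layout, file name, and column names are made up for the example, and it assumes an Excel engine such as openpyxl is installed alongside pandas.

```python
import os
import tempfile

import pandas as pd  # reading/writing .xlsx also needs an engine such as openpyxl

# Hypothetical layout: a title row, the real header row, three data rows, a totals row.
rows = [
    ["Monthly report", None],
    ["name", "qty"],
    ["a", 1],
    ["b", 2],
    ["c", 3],
    ["total", 6],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.xlsx")  # hypothetical file name
    pd.DataFrame(rows).to_excel(path, header=False, index=False)

    # header=1: the second row (0-indexed) supplies the column names,
    # and the title row above it is dropped.
    # skipfooter=1: the last row of the sheet (the totals row) is ignored.
    df = pd.read_excel(path, header=1, skipfooter=1)

print(df)
```

Only the three data rows should remain, under the columns name and qty.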

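Q: What does parse do in pandas?

ExcelFile.parse reads one or more sheets of the workbook into a DataFrame. For its sheet_name argument, strings are interpreted as sheet names and integers as zero-indexed sheet positions. A list of strings/integers requests multiple sheets and returns a dict of DataFrames.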
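The three ways of selecting sheets can be sketched as follows. The workbook, sheet names, and column names are invented for the example, sheet_name is the spelling used by current pandas releases, and an Excel engine such as openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd  # an Excel engine such as openpyxl must also be installed

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.xlsx")  # hypothetical file name

    # Build a two-sheet workbook to read back.
    with pd.ExcelWriter(path) as writer:
        pd.DataFrame({"x": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
        pd.DataFrame({"y": [3, 4]}).to_excel(writer, sheet_name="second", index=False)

    with pd.ExcelFile(path) as xl:
        by_name = xl.parse(sheet_name="second")      # string: sheet name
        by_position = xl.parse(sheet_name=0)         # integer: zero-indexed position
        several = xl.parse(sheet_name=["first", 1])  # list: dict of DataFrames

print(list(several))  # the dict is keyed by the items that were requested
```

Note that the dict returned for a list is keyed by whatever was passed in, so mixing names and positions yields mixed keys.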

Tags: python, pandas, excel

I'm diving into pandas and experimenting. When it comes to reading data from an Excel file, I wonder what the difference is between using ExcelFile and read_excel. Both seem to work (albeit with slightly different syntax, as could be expected), and the documentation supports both. In both cases, the documentation describes the method in the same way: "Read an Excel table into DataFrame" and "Read an Excel table into a pandas DataFrame". (See the documentation for read_excel and for ExcelFile.)

I'm seeing answers here on SO that use either one without addressing the difference. A Google search also didn't turn up anything that discusses this issue.

In my testing, these seem equivalent:

import pandas as pd

path = "test/dummydata.xlsx"
xl = pd.ExcelFile(path)
df = xl.parse("dummydata")  # sheet name

and

path = "test/dummydata.xlsx"
df = pd.read_excel(path, sheet_name=0)  # sheet position; older pandas releases spelled this sheetname

Other than the fact that the latter saves me a line, is there a difference between the two, and is there a reason to prefer one over the other?

Thanks!

984

asked Oct 20 '14 20:10

Optimesh

2 Answers

There's no particular difference beyond the syntax. Technically, ExcelFile is a class and read_excel is a function. In either case, the actual parsing is handled by the _parse_excel method defined within ExcelFile.

In earlier versions of pandas, read_excel consisted entirely of a single statement (other than comments):

return ExcelFile(path_or_buf, kind=kind).parse(sheetname=sheetname,
                                               kind=kind, **kwds)

And ExcelFile.parse didn't do much more than call ExcelFile._parse_excel.

In recent versions of pandas, read_excel ensures that it has an ExcelFile object (and creates one if it doesn't), and then calls the _parse_excel method directly:

if not isinstance(io, ExcelFile):
    io = ExcelFile(io, engine=engine)

return io._parse_excel(...)

and with the updated (and unified) parameter handling, ExcelFile.parse really is just the single statement:

return self._parse_excel(...)

That is why the docs for ExcelFile.parse now say

Equivalent to read_excel(ExcelFile, ...) See the read_excel docstring for more info on accepted parameters

As for another answer which claims that ExcelFile.parse is faster in a loop, that really just comes down to whether you are creating the ExcelFile object from scratch every time. You could certainly create your ExcelFile once, outside the loop, and pass that to read_excel inside your loop:

xl = pd.ExcelFile(path)
for name in xl.sheet_names:
    df = pd.read_excel(xl, name)

This would be equivalent to

xl = pd.ExcelFile(path)
for name in xl.sheet_names:
    df = xl.parse(name)

If your loop involves different paths (in other words, you are reading many different workbooks, not just multiple sheets within a single workbook), then you can't get around having to create a brand-new ExcelFile instance for each path anyway, and then once again, both ExcelFile.parse and read_excel will be equivalent (and equally slow).
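One way to check the equivalence for yourself is sketched below: a single ExcelFile is opened once and each sheet is read through both entry points, then the results are compared. The workbook contents and names are made up for the example, and an Excel engine such as openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd  # an Excel engine such as openpyxl must also be installed

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.xlsx")  # hypothetical file name

    # Build a small three-sheet workbook to read back.
    with pd.ExcelWriter(path) as writer:
        for i, name in enumerate(["a", "b", "c"]):
            pd.DataFrame({"value": [i, i + 1]}).to_excel(
                writer, sheet_name=name, index=False
            )

    same = {}
    with pd.ExcelFile(path) as xl:  # the workbook is opened once, outside the loop
        for name in xl.sheet_names:
            via_function = pd.read_excel(xl, sheet_name=name)
            via_method = xl.parse(name)
            same[name] = via_function.equals(via_method)

print(same)
```

Every entry in same should be True, since both calls read the same sheet from the same already-open workbook.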

156

answered Sep 23 '22 18:09

John Y

ExcelFile.parse is faster.

Suppose you are reading DataFrames in a loop. With ExcelFile.parse you work from the ExcelFile object (xl in your case), so the workbook is loaded only once and you reuse it to get each DataFrame. With read_excel you pass the path instead of the ExcelFile object, so the workbook is loaded again on every call. That becomes costly if your workbook has many sheets and tens of thousands of rows.
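The cost described here depends on what is passed in rather than on which entry point is called. The toy model below uses only the standard library to count how often a workbook gets opened. FakeWorkbook and fake_read are hypothetical stand-ins written for this sketch, not pandas classes or functions; fake_read imitates the isinstance check quoted in the other answer.

```python
class FakeWorkbook:
    """Hypothetical stand-in for an expensive-to-open workbook (not a pandas class)."""

    loads = 0  # how many times a workbook has been opened

    def __init__(self, path):
        FakeWorkbook.loads += 1  # pretend this is the slow part
        self.path = path
        self.sheet_names = ["s1", "s2", "s3"]

    def parse(self, name):
        return f"{self.path}:{name}"


def fake_read(io, name):
    """Wrap a path in a new workbook object; reuse an existing object as-is."""
    if not isinstance(io, FakeWorkbook):
        io = FakeWorkbook(io)
    return io.parse(name)


# Passing the path on every iteration opens the workbook each time.
FakeWorkbook.loads = 0
for name in ["s1", "s2", "s3"]:
    fake_read("book.xlsx", name)
loads_with_path = FakeWorkbook.loads

# Passing one shared object opens it only once.
FakeWorkbook.loads = 0
wb = FakeWorkbook("book.xlsx")
for name in wb.sheet_names:
    fake_read(wb, name)
loads_with_object = FakeWorkbook.loads

print(loads_with_path, loads_with_object)  # 3 1
```

In this model the slowdown comes from constructing the object inside the loop, so hoisting the construction out of the loop removes it regardless of which function does the reading.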

answered Sep 23 '22 18:09

Pranav
