I am not sure why I am getting this error although sometimes my code works fine!
Excel file format cannot be determined, you must specify an engine manually.
Here below is my code with steps:
1- list of columns of customers Id:
customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]
2- The the code to find all xlsx files in a folder and read them:
l = [] #use a list and concat later, faster than append in the loop
for f in glob.glob("./*.xlsx"):
df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
df.columns = ["ID"] # to have only one column once concat
l.append(df)
all_data = pd.concat(l, ignore_index=True) # concat all data
I added the engine openpyxl
df = pd.read_excel(f, engine="openpyxl").reindex(columns = customer_id).dropna(how='all', axis=1)
Now I got a different error:
BadZipFile: File is not a zip file
pandas version: 1.3.0 python version: python3.9 os: MacOS
is there a better way to read all xlsx files from a folder ?
Pandas is mainly used for data analysis and associated manipulation of tabular data in Dataframes. Pandas allows importing data from various file formats such as comma-separated values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel.
Pandas is an open source Python package that is most widely used for data science/data analysis and machine learning tasks. It is built on top of another package named Numpy, which provides support for multi-dimensional arrays.
Panda belongs to "Media Transcoding" category of the tech stack, while Pandas can be primarily classified under "Data Science Tools". Some of the features offered by Panda are: Unlimited encoding- When we say unlimited we mean unlimited.
Why is the panda called so? The origin of the name panda is the Nepalese word 'nigalya ponya', which means 'eater of bamboo'. As used in the West, it was originally applied to the red panda, to which the giant panda was thought to be related.
home // about // get pandas // documentation // community // talks // donate. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas is a NumFOCUS sponsored project.
pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.
Here are just a few of the things that pandas does well: Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
As the PANDAS hypothesis was unconfirmed and unsupported by data, a new definition was proposed by Swedo and colleagues in 2012.
Found it. When an excel file is opened for example by MS excel a hidden temporary file is created in the same directory:
~$datasheet.xlsx
So, when I run the code to read all the files from the folder it gives me the error:
Excel file format cannot be determined, you must specify an engine manually.
When all files are closed and no hidden temporary files ~$filename.xlsx
in the same directory the code works perfectly.
Also make sure you're using the correct pd.read_*
method. I ran into this error when attempting to open a .csv
file with read_excel()
instead of read_csv()
. I found this handy snippet here to automatically select the correct method by Excel file type.
if file_extension == 'xlsx':
df = pd.read_excel(file.read(), engine='openpyxl')
elif file_extension == 'xls':
df = pd.read_excel(file.read())
elif file_extension == 'csv':
df = pd.read_csv(file.read())
In macOS, an "invisible file" named ".DS_Store" is automatically generated in each folder. For me, this was the source of the issue. I solved the problem with an if statement to bypass the "invisible file" (which is not an xlsx, so thus would trigger the error)
for file in os.scandir(test_folder):
filename = os.fsdecode(file)
if '.DS_Store' not in filename:
execute_function(file)
Looks like an easy fix for this one. Go to your excel file, whether it is xls or xlsx or any other extension, and do "save as" from file icon. When prompted with options. Save it as CSV UTF-8(Comma delimited)(*.csv)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With