Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PANDAS & glob - Excel file format cannot be determined, you must specify an engine manually

I am not sure why I am getting this error although sometimes my code works fine!

Excel file format cannot be determined, you must specify an engine manually.

Here below is my code with steps:

1- list of columns of customers Id:

customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]

2- The the code to find all xlsx files in a folder and read them:

l = [] #use a list and concat later, faster than append in the loop
for f in glob.glob("./*.xlsx"):
    df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
    df.columns = ["ID"] # to have only one column once concat
    l.append(df)
all_data  = pd.concat(l, ignore_index=True) # concat all data

I added the engine openpyxl

df = pd.read_excel(f, engine="openpyxl").reindex(columns = customer_id).dropna(how='all', axis=1)

Now I got a different error:

BadZipFile: File is not a zip file

pandas version: 1.3.0 python version: python3.9 os: MacOS

is there a better way to read all xlsx files from a folder ?

like image 424
Mtaly Avatar asked Jul 22 '21 01:07

Mtaly


People also ask

What is the use of pandas?

Pandas is mainly used for data analysis and associated manipulation of tabular data in Dataframes. Pandas allows importing data from various file formats such as comma-separated values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel.

What is a pandas in Python?

Pandas is an open source Python package that is most widely used for data science/data analysis and machine learning tasks. It is built on top of another package named Numpy, which provides support for multi-dimensional arrays.

What is difference between Panda and pandas?

Panda belongs to "Media Transcoding" category of the tech stack, while Pandas can be primarily classified under "Data Science Tools". Some of the features offered by Panda are: Unlimited encoding- When we say unlimited we mean unlimited.

Why is it called pandas?

Why is the panda called so? The origin of the name panda is the Nepalese word 'nigalya ponya', which means 'eater of bamboo'. As used in the West, it was originally applied to the red panda, to which the giant panda was thought to be related.

What is pandas?

home // about // get pandas // documentation // community // talks // donate. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas is a NumFOCUS sponsored project.

What is the use of Panda in Python?

pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.

What can we learn from pandas?

Here are just a few of the things that pandas does well: Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations

Is the PANDAS hypothesis supported by data?

As the PANDAS hypothesis was unconfirmed and unsupported by data, a new definition was proposed by Swedo and colleagues in 2012.


Video Answer


4 Answers

Found it. When an excel file is opened for example by MS excel a hidden temporary file is created in the same directory:

~$datasheet.xlsx

So, when I run the code to read all the files from the folder it gives me the error:

Excel file format cannot be determined, you must specify an engine manually.

When all files are closed and no hidden temporary files ~$filename.xlsx in the same directory the code works perfectly.

like image 77
Mtaly Avatar answered Oct 18 '22 15:10

Mtaly


Also make sure you're using the correct pd.read_* method. I ran into this error when attempting to open a .csv file with read_excel() instead of read_csv(). I found this handy snippet here to automatically select the correct method by Excel file type.

if file_extension == 'xlsx':
    df = pd.read_excel(file.read(), engine='openpyxl')
elif file_extension == 'xls':
    df = pd.read_excel(file.read())
elif file_extension == 'csv':
    df = pd.read_csv(file.read())
like image 32
pirateofebay Avatar answered Oct 18 '22 13:10

pirateofebay


In macOS, an "invisible file" named ".DS_Store" is automatically generated in each folder. For me, this was the source of the issue. I solved the problem with an if statement to bypass the "invisible file" (which is not an xlsx, so thus would trigger the error)

for file in os.scandir(test_folder):
    filename = os.fsdecode(file)
    if '.DS_Store' not in filename:
        execute_function(file)
like image 41
tbullock Avatar answered Oct 18 '22 14:10

tbullock


Looks like an easy fix for this one. Go to your excel file, whether it is xls or xlsx or any other extension, and do "save as" from file icon. When prompted with options. Save it as CSV UTF-8(Comma delimited)(*.csv)

like image 35
Mohammed Avatar answered Oct 18 '22 14:10

Mohammed