PANDAS & glob - Excel file format cannot be determined, you must specify an engine manually

I am not sure why I get this error, especially since my code sometimes works fine:

Excel file format cannot be determined, you must specify an engine manually.

Here is my code, step by step:

1- The list of possible column names for the customer ID:

customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]

2- The code to find all xlsx files in a folder and read them:

import glob
import pandas as pd

l = []  # collect the frames in a list and concat once at the end, faster than appending in the loop
for f in glob.glob("./*.xlsx"):
    # keep only the candidate ID columns, then drop the ones that are entirely empty
    df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
    df.columns = ["ID"]  # rename so every frame has the same single column for concat
    l.append(df)
all_data = pd.concat(l, ignore_index=True)  # concat all data

I then specified the openpyxl engine:

df = pd.read_excel(f, engine="openpyxl").reindex(columns = customer_id).dropna(how='all', axis=1)

Now I get a different error:

BadZipFile: File is not a zip file

pandas version: 1.3.0, Python version: 3.9, OS: macOS

Is there a better way to read all xlsx files from a folder?

asked Jul 22 '21 01:07

4 Answers

Found it. When an Excel file is open, for example in MS Excel, a hidden temporary file is created in the same directory:

~$datasheet.xlsx

Its name also ends in .xlsx, so the glob pattern picks it up even though it is not a real workbook. When I run the code to read all the files from the folder, it gives me the error:

Excel file format cannot be determined, you must specify an engine manually.

When all files are closed and there are no hidden temporary ~$filename.xlsx files in the directory, the code works perfectly.
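
If closing every workbook first is not practical, the loop can skip those temporary files instead. This is only a minimal sketch: list_excel_files is a hypothetical helper name I made up, and it assumes the temporary files always start with "~$" as in the example above.

```python
import glob
import os


def list_excel_files(folder="."):
    """Return the .xlsx paths in `folder`, skipping Excel's ~$ temporary files."""
    paths = glob.glob(os.path.join(folder, "*.xlsx"))
    # "~$name.xlsx" is created while a workbook is open; it is not a real workbook.
    return sorted(p for p in paths if not os.path.basename(p).startswith("~$"))
```

In the question's loop, glob.glob("./*.xlsx") would then become list_excel_files("."), and the rest of the code stays the same.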

answered Oct 18 '22 15:10

Also make sure you're using the correct pd.read_* method. I ran into this error when attempting to open a .csv file with read_excel() instead of read_csv(). I found this handy snippet to automatically select the correct method by file extension:

import pandas as pd

# file_extension and file (an open binary file object) come from the surrounding code
if file_extension == 'xlsx':
    df = pd.read_excel(file, engine='openpyxl')
elif file_extension == 'xls':
    df = pd.read_excel(file)  # let pandas pick its default engine for .xls
elif file_extension == 'csv':
    df = pd.read_csv(file)  # pass the file object itself, not file.read()
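
The extension can lie, though (a CSV renamed to .xlsx is what produces "BadZipFile: File is not a zip file"), so it may help to look at the content as well. A rough sketch, where guess_format is a hypothetical helper of mine and not part of pandas: .xlsx files are zip archives, and legacy .xls files start with the OLE2 signature.

```python
import zipfile

# First 8 bytes of an OLE2 compound file, the container used by legacy .xls
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def guess_format(path):
    """Guess a file's real format from its content rather than its extension."""
    if zipfile.is_zipfile(path):
        return "xlsx"  # zip container; could still be another zip-based format
    with open(path, "rb") as fh:
        if fh.read(8) == OLE2_MAGIC:
            return "xls"
    return "other"  # e.g. a CSV or a temporary lock file
```

A result of "other" for a file named something.xlsx is a hint to try pd.read_csv or to skip the file. Note that "xlsx" here only means "is a zip archive", which a .docx would also satisfy.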

answered Oct 18 '22 13:10

pirateofebay

On macOS, an "invisible file" named ".DS_Store" is automatically generated in each folder. For me, this was the source of the issue. I solved the problem with an if statement that skips the "invisible file" (it is not an xlsx, so reading it triggers the error):

import os

# test_folder and execute_function are defined elsewhere in my code
for file in os.scandir(test_folder):
    filename = os.fsdecode(file)
    if '.DS_Store' not in filename:  # skip the macOS metadata file
        execute_function(file)
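
A slightly more general variant of the same idea, as a sketch only: skip every hidden entry (any name starting with a dot) and keep just the .xlsx files. visible_xlsx_files is a hypothetical name, not something from my original code.

```python
from pathlib import Path


def visible_xlsx_files(folder):
    """Yield .xlsx files in `folder`, ignoring hidden entries such as .DS_Store."""
    for entry in sorted(Path(folder).iterdir()):
        if entry.name.startswith("."):
            continue  # hidden file, e.g. .DS_Store
        if entry.is_file() and entry.suffix.lower() == ".xlsx":
            yield entry
```

The loop above would then iterate over visible_xlsx_files(test_folder) and call execute_function on each path.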

answered Oct 18 '22 14:10

tbullock

Looks like an easy fix for this one. Open your Excel file, whether it is xls, xlsx or any other extension, and choose "Save As" from the File menu. When prompted for a format, save it as CSV UTF-8 (Comma delimited) (*.csv). The result is a CSV file, so read it with pd.read_csv() rather than pd.read_excel().

answered Oct 18 '22 14:10

Mohammed

Tags: python, python-3.x, pandas, dataframe

Mtaly
