Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas read_excel function ignoring dtype

I'm trying to read an excel file with pd.read_excel(). The excel file has 2 columns Date and Time and I want to read both columns as str not the excel dtype.

Example of the excel file

Example of the excel file

I've tried to specify the dtype or the converters arguments to no avail.

df = pd.read_excel('xls_test.xlsx',
                   dtype={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object
df.head()
Date    Time
0   2020-03-08 00:00:00 10:00:00
1   2020-03-09 00:00:00 11:00:00
2   2020-03-10 00:00:00 12:00:00
3   2020-03-11 00:00:00 13:00:00
4   2020-03-12 00:00:00 14:00:00

As you can see the Date column is not treated as str...

Same thing when using converters

df = pd.read_excel('xls_test.xlsx',
                   converters={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object
df.head()
Date    Time
0   2020-03-08 00:00:00 10:00:00
1   2020-03-09 00:00:00 11:00:00
2   2020-03-10 00:00:00 12:00:00
3   2020-03-11 00:00:00 13:00:00
4   2020-03-12 00:00:00 14:00:00

I have also tried to use other engine but the result is always the same.

The dtype argument seems to work as expected when reading a csv though

What am I doing wrong here ??

Edit: I forgot to mention, I'm using the last version of pandas 1.2.2 but had the same problem before updating from 1.1.2.

like image 641
TooLateMate Avatar asked Feb 18 '21 12:02

TooLateMate


People also ask

What is the use of Index_col parameter in read_excel function?

The read_excel method takes argument sheet_name and index_col where we can specify the sheet of which the data frame should be made of and index_col specifies the title column. This head() and tail() method also take arguments as numbers for the number of columns to show.

What is read_excel in Pandas?

pandas. read_excel() function is used to read excel sheet with extension xlsx into pandas DataFrame. By reading a single sheet it returns a pandas DataFrame object, but reading two sheets it returns a Dict of DataFrame.

What does PD read_excel return?

data = pd. read_excel(target_file, sheet_name = None) returns a Dict . When you loop over data you get the keys of the Dict .

Can Pandas handle XLS?

Read an Excel file into a pandas DataFrame. Supports xls , xlsx , xlsm , xlsb , odf , ods and odt file extensions read from a local filesystem or URL. Supports an option to read a single sheet or a list of sheets. Any valid string path is acceptable.


Video Answer


4 Answers

The problem you’re having is that cells in excel have datatypes. So here the data type is a date or a time, and it’s formatted for display only. Loading it “directly” means loading a datetime type*.

This means that, whatever you do with the dtype= argument, the data will be loaded as a date, and then converted to string, giving you the result you see:

>>> pd.read_excel('test.xlsx').head()
        Date      Time            Datetime
0 2020-03-08  10:00:00 2020-03-08 10:00:00
1 2020-03-09  11:00:00 2020-03-09 11:00:00
2 2020-03-10  12:00:00 2020-03-10 12:00:00
3 2020-03-11  13:00:00 2020-03-11 13:00:00
4 2020-03-12  14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx').dtypes
Date        datetime64[ns]
Time                object
Datetime    datetime64[ns]
dtype: object
>>> pd.read_excel('test.xlsx', dtype='string').head()
                  Date      Time             Datetime
0  2020-03-08 00:00:00  10:00:00  2020-03-08 10:00:00
1  2020-03-09 00:00:00  11:00:00  2020-03-09 11:00:00
2  2020-03-10 00:00:00  12:00:00  2020-03-10 12:00:00
3  2020-03-11 00:00:00  13:00:00  2020-03-11 13:00:00
4  2020-03-12 00:00:00  14:00:00  2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx', dtype='string').dtypes
Date        string
Time        string
Datetime    string
dtype: object

Only in csv files are datetime data stored as string in the file. There, loading it “directly” as a string makes sense. In an excel file, you may as well load it as a date and format it with .dt.strftime()


That’s not to say that you can’t load the data as it is formatted, but you’ll need 2 steps:

  1. load data
  2. re-apply formatting

There is some translation to be done between formatting types, and you can’t use pandas directly − however you can use the engine that pandas uses as a backend:

import datetime
import openpyxl
import re

date_corresp = {
    'dd': '%d',
    'mm': '%m',
    'yy': '%y',
    'yyyy': '%Y',
}

time_corresp = {
    'hh': '%h',
    'mm': '%M',
    'ss': '%S',
}

def datecell_as_formatted(cell):
    if isinstance(cell.value, datetime.time):
        dfmt, tfmt = '', cell.number_format
    elif isinstance(cell.value, (datetime.date, datetime.datetime)):
        dfmt, tfmt, *_ = cell.number_format.split('\\', 1) + ['']
    else:
        raise ValueError('Not a datetime cell')

    for fmt in re.split(r'\W', dfmt):
        if fmt:
            dfmt = re.sub(f'\\b{fmt}\\b', date_corresp.get(fmt, fmt), dfmt)

    for fmt in re.split(r'\W', tfmt):
        if fmt:
            tfmt = re.sub(f'\\b{fmt}\\b', time_corresp.get(fmt, fmt), tfmt)

    return cell.value.strftime(dfmt + tfmt)

Which you can then use as follows:

>>> wb = openpyxl.load_workbook('test.xlsx')
>>> ws = wb.worksheets[0]
>>> datecell_as_formatted(ws.cell(row=2, column=1))
'08/03/20'

(You can also complete the _corresp dictionaries with more date/time formatting items if they are incomplete)


* It is stored as a floating-point number, which is the number of days since 1/1/1900, as you can see by formatting a date as number or on this excelcampus page.

like image 76
Cimbali Avatar answered Oct 17 '22 23:10

Cimbali


The issue just like the other comments say is most likely a bug

Although not ideal, but you could always do something like this?

import pandas as pd
#df = pd.read_excel('test.xlsx',dtype={'Date':str,'Time':str}) 
# this line can be then simplified to : 
df = pd.read_excel('test.xlsx')
df['Date'] = df['Date'].apply(lambda x: '"' + str(x) + '"')
df['Time'] = df['Time'].apply(lambda x: '"' + str(x) + '"')
print (df)
print(df['Date'].dtype)
print(df['Time'].dtype)
                     Date        Time
0   "2020-03-08 00:00:00"  "10:00:00"
1   "2020-03-09 00:00:00"  "11:00:00"
2   "2020-03-10 00:00:00"  "12:00:00"
3   "2020-03-11 00:00:00"  "13:00:00"
4   "2020-03-12 00:00:00"  "14:00:00"
5   "2020-03-13 00:00:00"  "15:00:00"
6   "2020-03-14 00:00:00"  "16:00:00"
7   "2020-03-15 00:00:00"  "17:00:00"
8   "2020-03-16 00:00:00"  "18:00:00"
9   "2020-03-17 00:00:00"  "19:00:00"
10  "2020-03-18 00:00:00"  "20:00:00"
11  "2020-03-19 00:00:00"  "21:00:00"
object
object
like image 24
Ossi H. Avatar answered Oct 17 '22 21:10

Ossi H.


here is a simple solution, even if you apply the "str" in a dtype it will return as an object only. Use the below code to read the columns as string Dtype.

df= pd.read_excel("xls_test.xlsx",dtype={'Date':'string','Time':'string'})

To understand more about the Pandas String Dtype use the link below,

https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

Let me know if you have any issues on that !!

like image 1
Tamil Selvan Avatar answered Oct 17 '22 23:10

Tamil Selvan


Since version 1.0.0, there are two ways to store text data in pandas: object or StringDtype (source).

And since version 1.1.0, StringDtype now works in all situations where astype(str) or dtype=str work (source).

All dtypes can now be converted to StringDtype

You just need to specify dtype="string" when loading your data with pandas:

>>df = pd.read_excel('xls_test.xlsx', dtype="string")
>>df.dtypes
Date    string
Time    string
dtype: object
like image 1
R. Marolahy Avatar answered Oct 17 '22 21:10

R. Marolahy