I'm trying to read an excel file with pd.read_excel(). The excel file has 2 columns Date and Time and I want to read both columns as str not the excel dtype. Example of the excel file <img src="https://i.stack.imgur.com/KD2WK.png" alt="Example of the excel file"> I've tried to specify the dtype or the converters arguments to no avail. <pre class="prettyprint"><code>df = pd.read_excel('xls_test.xlsx', dtype={'Date':str,'Time':str}) df.dtypes Date object Time object dtype: object </code></pre> <pre class="prettyprint"><code>df.head() Date Time 0 2020-03-08 00:00:00 10:00:00 1 2020-03-09 00:00:00 11:00:00 2 2020-03-10 00:00:00 12:00:00 3 2020-03-11 00:00:00 13:00:00 4 2020-03-12 00:00:00 14:00:00 </code></pre> As you can see the Date column is not treated as str... Same thing when using converters <pre class="prettyprint"><code>df = pd.read_excel('xls_test.xlsx', converters={'Date':str,'Time':str}) df.dtypes Date object Time object dtype: object </code></pre> <pre class="prettyprint"><code>df.head() Date Time 0 2020-03-08 00:00:00 10:00:00 1 2020-03-09 00:00:00 11:00:00 2 2020-03-10 00:00:00 12:00:00 3 2020-03-11 00:00:00 13:00:00 4 2020-03-12 00:00:00 14:00:00 </code></pre> I have also tried to use other engine but the result is always the same. The dtype argument seems to work as expected when reading a csv though What am I doing wrong here ?? Edit: I forgot to mention, I'm using the last version of pandas 1.2.2 but had the same problem before updating from 1.1.2.

The problem you’re having is that cells in excel have datatypes. So here the data type is a date or a time, and it’s formatted for display only. Loading it “directly” means loading a datetime type*. This means that, whatever you do with the <code>dtype=</code> argument, the data will be loaded as a date, and then converted to string, giving you the result you see: <pre class="prettyprint"><code>>>> pd.read_excel('test.xlsx').head() Date Time Datetime 0 2020-03-08 10:00:00 2020-03-08 10:00:00 1 2020-03-09 11:00:00 2020-03-09 11:00:00 2 2020-03-10 12:00:00 2020-03-10 12:00:00 3 2020-03-11 13:00:00 2020-03-11 13:00:00 4 2020-03-12 14:00:00 2020-03-12 14:00:00 >>> pd.read_excel('test.xlsx').dtypes Date datetime64[ns] Time object Datetime datetime64[ns] dtype: object >>> pd.read_excel('test.xlsx', dtype='string').head() Date Time Datetime 0 2020-03-08 00:00:00 10:00:00 2020-03-08 10:00:00 1 2020-03-09 00:00:00 11:00:00 2020-03-09 11:00:00 2 2020-03-10 00:00:00 12:00:00 2020-03-10 12:00:00 3 2020-03-11 00:00:00 13:00:00 2020-03-11 13:00:00 4 2020-03-12 00:00:00 14:00:00 2020-03-12 14:00:00 >>> pd.read_excel('test.xlsx', dtype='string').dtypes Date string Time string Datetime string dtype: object </code></pre> Only in csv files are datetime data stored as string in the file. There, loading it “directly” as a string makes sense. In an excel file, you may as well load it as a date and format it with <code>.dt.strftime()</code> <hr> That’s not to say that you can’t load the data as it is formatted, but you’ll need 2 steps: <ol> <li>load data</li> <li>re-apply formatting</li> </ol> There is some translation to be done between formatting types, and you can’t use pandas directly − however you can use the engine that pandas uses as a backend: <pre class="prettyprint"><code>import datetime import openpyxl import re date_corresp = { 'dd': '%d', 'mm': '%m', 'yy': '%y', 'yyyy': '%Y', } time_corresp = { 'hh': '%h', 'mm': '%M', 'ss': '%S', } def datecell_as_formatted(cell): if isinstance(cell.value, datetime.time): dfmt, tfmt = '', cell.number_format elif isinstance(cell.value, (datetime.date, datetime.datetime)): dfmt, tfmt, *_ = cell.number_format.split('\\', 1) + [''] else: raise ValueError('Not a datetime cell') for fmt in re.split(r'\W', dfmt): if fmt: dfmt = re.sub(f'\\b{fmt}\\b', date_corresp.get(fmt, fmt), dfmt) for fmt in re.split(r'\W', tfmt): if fmt: tfmt = re.sub(f'\\b{fmt}\\b', time_corresp.get(fmt, fmt), tfmt) return cell.value.strftime(dfmt + tfmt) </code></pre> Which you can then use as follows: <pre class="prettyprint"><code>>>> wb = openpyxl.load_workbook('test.xlsx') >>> ws = wb.worksheets[0] >>> datecell_as_formatted(ws.cell(row=2, column=1)) '08/03/20' </code></pre> (You can also complete the <code>_corresp</code> dictionaries with more date/time formatting items if they are incomplete) <hr> * It is stored as a floating-point number, which is the number of days since 1/1/1900, as you can see by formatting a date as number or on this excelcampus page.

here is a simple solution, even if you apply the "str" in a dtype it will return as an object only. Use the below code to read the columns as string Dtype. <pre class="prettyprint"><code>df= pd.read_excel("xls_test.xlsx",dtype={'Date':'string','Time':'string'}) </code></pre> To understand more about the Pandas String Dtype use the link below, https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html Let me know if you have any issues on that !!

Since version <code>1.0.0</code>, there are two ways to store text data in pandas: <code>object</code> or <code>StringDtype</code> (source). And since version <code>1.1.0</code>, StringDtype now works in all situations where <code>astype(str)</code> or <code>dtype=str</code> work (source). <blockquote> All dtypes can now be converted to StringDtype </blockquote> You just need to specify <code>dtype="string"</code> when loading your data with pandas: <pre class="prettyprint"><code>>>df = pd.read_excel('xls_test.xlsx', dtype="string") >>df.dtypes Date string Time string dtype: object </code></pre>

Pandas read_excel function ignoring dtype

Tags:

python

pandas

excel

I'm trying to read an excel file with pd.read_excel(). The excel file has 2 columns Date and Time and I want to read both columns as str not the excel dtype.

Example of the excel file

I've tried to specify the dtype or the converters arguments to no avail.

df = pd.read_excel('xls_test.xlsx',
                   dtype={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object

df.head()
Date    Time
0   2020-03-08 00:00:00 10:00:00
1   2020-03-09 00:00:00 11:00:00
2   2020-03-10 00:00:00 12:00:00
3   2020-03-11 00:00:00 13:00:00
4   2020-03-12 00:00:00 14:00:00

As you can see the Date column is not treated as str...

Same thing when using converters

df = pd.read_excel('xls_test.xlsx',
                   converters={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object

df.head()
Date    Time
0   2020-03-08 00:00:00 10:00:00
1   2020-03-09 00:00:00 11:00:00
2   2020-03-10 00:00:00 12:00:00
3   2020-03-11 00:00:00 13:00:00
4   2020-03-12 00:00:00 14:00:00

I have also tried to use other engine but the result is always the same.

The dtype argument seems to work as expected when reading a csv though

What am I doing wrong here ??

Edit: I forgot to mention, I'm using the last version of pandas 1.2.2 but had the same problem before updating from 1.1.2.

641

asked Feb 18 '21 12:02

TooLateMate

Video Answer

4 Answers

The problem you’re having is that cells in excel have datatypes. So here the data type is a date or a time, and it’s formatted for display only. Loading it “directly” means loading a datetime type*.

This means that, whatever you do with the dtype= argument, the data will be loaded as a date, and then converted to string, giving you the result you see:

>>> pd.read_excel('test.xlsx').head()
        Date      Time            Datetime
0 2020-03-08  10:00:00 2020-03-08 10:00:00
1 2020-03-09  11:00:00 2020-03-09 11:00:00
2 2020-03-10  12:00:00 2020-03-10 12:00:00
3 2020-03-11  13:00:00 2020-03-11 13:00:00
4 2020-03-12  14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx').dtypes
Date        datetime64[ns]
Time                object
Datetime    datetime64[ns]
dtype: object
>>> pd.read_excel('test.xlsx', dtype='string').head()
                  Date      Time             Datetime
0  2020-03-08 00:00:00  10:00:00  2020-03-08 10:00:00
1  2020-03-09 00:00:00  11:00:00  2020-03-09 11:00:00
2  2020-03-10 00:00:00  12:00:00  2020-03-10 12:00:00
3  2020-03-11 00:00:00  13:00:00  2020-03-11 13:00:00
4  2020-03-12 00:00:00  14:00:00  2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx', dtype='string').dtypes
Date        string
Time        string
Datetime    string
dtype: object

Only in csv files are datetime data stored as string in the file. There, loading it “directly” as a string makes sense. In an excel file, you may as well load it as a date and format it with .dt.strftime()

That’s not to say that you can’t load the data as it is formatted, but you’ll need 2 steps:

load data
re-apply formatting

There is some translation to be done between formatting types, and you can’t use pandas directly − however you can use the engine that pandas uses as a backend:

import datetime
import openpyxl
import re

date_corresp = {
    'dd': '%d',
    'mm': '%m',
    'yy': '%y',
    'yyyy': '%Y',
}

time_corresp = {
    'hh': '%h',
    'mm': '%M',
    'ss': '%S',
}

def datecell_as_formatted(cell):
    if isinstance(cell.value, datetime.time):
        dfmt, tfmt = '', cell.number_format
    elif isinstance(cell.value, (datetime.date, datetime.datetime)):
        dfmt, tfmt, *_ = cell.number_format.split('\\', 1) + ['']
    else:
        raise ValueError('Not a datetime cell')

    for fmt in re.split(r'\W', dfmt):
        if fmt:
            dfmt = re.sub(f'\\b{fmt}\\b', date_corresp.get(fmt, fmt), dfmt)

    for fmt in re.split(r'\W', tfmt):
        if fmt:
            tfmt = re.sub(f'\\b{fmt}\\b', time_corresp.get(fmt, fmt), tfmt)

    return cell.value.strftime(dfmt + tfmt)

Which you can then use as follows:

>>> wb = openpyxl.load_workbook('test.xlsx')
>>> ws = wb.worksheets[0]
>>> datecell_as_formatted(ws.cell(row=2, column=1))
'08/03/20'

(You can also complete the _corresp dictionaries with more date/time formatting items if they are incomplete)

_{* It is stored as a floating-point number, which is the number of days since 1/1/1900, as you can see by formatting a date as number or on this excelcampus page.}

answered Oct 17 '22 23:10

Cimbali

The issue just like the other comments say is most likely a bug

Although not ideal, but you could always do something like this?

import pandas as pd
#df = pd.read_excel('test.xlsx',dtype={'Date':str,'Time':str}) 
# this line can be then simplified to : 
df = pd.read_excel('test.xlsx')
df['Date'] = df['Date'].apply(lambda x: '"' + str(x) + '"')
df['Time'] = df['Time'].apply(lambda x: '"' + str(x) + '"')
print (df)
print(df['Date'].dtype)
print(df['Time'].dtype)

                     Date        Time
0   "2020-03-08 00:00:00"  "10:00:00"
1   "2020-03-09 00:00:00"  "11:00:00"
2   "2020-03-10 00:00:00"  "12:00:00"
3   "2020-03-11 00:00:00"  "13:00:00"
4   "2020-03-12 00:00:00"  "14:00:00"
5   "2020-03-13 00:00:00"  "15:00:00"
6   "2020-03-14 00:00:00"  "16:00:00"
7   "2020-03-15 00:00:00"  "17:00:00"
8   "2020-03-16 00:00:00"  "18:00:00"
9   "2020-03-17 00:00:00"  "19:00:00"
10  "2020-03-18 00:00:00"  "20:00:00"
11  "2020-03-19 00:00:00"  "21:00:00"
object
object

answered Oct 17 '22 21:10

Ossi H.

here is a simple solution, even if you apply the "str" in a dtype it will return as an object only. Use the below code to read the columns as string Dtype.

df= pd.read_excel("xls_test.xlsx",dtype={'Date':'string','Time':'string'})

To understand more about the Pandas String Dtype use the link below,

https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

Let me know if you have any issues on that !!

answered Oct 17 '22 23:10

Tamil Selvan

Since version 1.0.0, there are two ways to store text data in pandas: object or StringDtype (source).

And since version 1.1.0, StringDtype now works in all situations where astype(str) or dtype=str work (source).

All dtypes can now be converted to StringDtype

You just need to specify dtype="string" when loading your data with pandas:

>>df = pd.read_excel('xls_test.xlsx', dtype="string")
>>df.dtypes
Date    string
Time    string
dtype: object