pandas.read_csv() can apply different date formats within the same column! Is it a known bug? How to fix it?

I have realised that, unless the format of a date column is declared explicitly or semi-explicitly (with dayfirst), pandas can apply different date formats to the same column, when reading a csv file! One row could be dd/mm/yyyy and another row in the same column mm/dd/yyyy! Insane doesn't even come close to describing it! Is it a known bug?

To demonstrate: the script below creates a very simple table with the dates from January 1st to the 31st, in the dd/mm/yyyy format, saves it to a csv file, then reads back the csv.

I then use pandas.DatetimeIndex to extract the day. The extracted day is 1 for the first 12 rows (where day and month are both < 13, so the value was read as mm/dd/yyyy), and 13, 14, etc. afterwards (where only dd/mm/yyyy is possible). How on earth is this possible?

The only way I have found to fix this is to declare the date format, either explicitly or just with dayfirst=True. But it's a pain, because it means I must declare the date format even when I import a CSV with the best-formatted dates ever! Is there a simpler way?

This happens to me with pandas 0.23.4 and Python 3.7.1 on Windows 10

import numpy as np
import pandas as pd

# Build a table with the dates from 01/01/2018 to 31/01/2018 as dd/mm/yyyy strings
df = pd.DataFrame()
df['day'] = np.arange(1, 32)
df['day'] = df['day'].apply(lambda x: "{:0>2d}".format(x))
df['month'] = '01'
df['year'] = '2018'
df['date'] = df['day'] + '/' + df['month'] + '/' + df['year']
df.to_csv('mydates.csv', index=False)

# Read the file back; same results whether you use parse_dates or not
imp = pd.read_csv('mydates.csv', parse_dates=['date'])
imp['day extracted'] = pd.DatetimeIndex(imp['date']).day
print(imp['day extracted'])
Pythonista anonymous asked Mar 22 '19 23:03



1 Answer

By default, pandas assumes the American (month-first) date format and, if that fails for a value, it does indeed switch to day-first mid-column without raising an error. That breaks the Zen of Python's "Errors should never pass silently", but the Zen also says "Explicit is better than implicit". So if you know your data uses the international (day-first) format, you can use dayfirst:

imp = pd.read_csv('mydates.csv', parse_dates=['date'], dayfirst=True)
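If you would rather have a mismatch fail loudly than be guessed at, one option is to read the column as plain strings and parse it yourself with an explicit format. This is only a minimal sketch: the inline CSV text and the names csv_text, raw, days and months are made up for illustration, and it assumes every value in the column really is dd/mm/yyyy.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for mydates.csv (dd/mm/yyyy)
csv_text = "date\n01/01/2018\n12/01/2018\n13/01/2018\n31/01/2018\n"

# Read the column as strings so that nothing is inferred at load time
raw = pd.read_csv(io.StringIO(csv_text), dtype={"date": str})

# Parse with an explicit format; a value that does not match raises ValueError
raw["date"] = pd.to_datetime(raw["date"], format="%d/%m/%Y")

days = raw["date"].dt.day.tolist()
months = raw["date"].dt.month.tolist()
print(days)    # [1, 12, 13, 31]
print(months)  # [1, 1, 1, 1]
```

Compared with dayfirst=True, which is still a hint to the parser, an explicit format string leaves no room for a per-row fallback.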

With files you produce, be unambiguous by using an ISO 8601 format with a timezone designator.
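As a rough sketch of that advice, assuming your timestamps are in UTC: the dates can be written out as ISO 8601 strings with an offset and read back with the same format. The in-memory buffer and the names iso_format, buf, back and days are placeholders for illustration, not part of the original script.

```python
import io
import pandas as pd

# Assumed format: ISO 8601 date and time with a numeric UTC offset
iso_format = "%Y-%m-%dT%H:%M:%S%z"

# 31 timezone-aware days in January 2018 (UTC is an assumption here)
dates = pd.date_range("2018-01-01", periods=31, freq="D", tz="UTC")
df = pd.DataFrame({"date": dates.strftime(iso_format)})

# Write to an in-memory buffer instead of a file on disk
buf = io.StringIO()
df.to_csv(buf, index=False)
first_row = buf.getvalue().splitlines()[1]
print(first_row)  # 2018-01-01T00:00:00+0000

# Read back as strings, then parse with the same explicit format
buf.seek(0)
back = pd.read_csv(buf, dtype={"date": str})
back["date"] = pd.to_datetime(back["date"], format=iso_format, utc=True)

days = back["date"].dt.day.tolist()
print(days[:3], days[-1])  # [1, 2, 3] 31
```

Because the year comes first and the offset is spelled out, there is no day/month ordering left to guess.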

Chris Wesseling answered Sep 19 '22 12:09
