pandas.read_csv() can apply different date formats within the same column! Is it a known bug? How to fix it?

Q: When a CSV file is read with pandas read_csv(), what does the function return?

read_csv() returns either a DataFrame or, when chunksize or iterator=True is given, a TextFileReader (called TextParser in older documentation). The CSV file is turned into a 2D data structure with labeled columns. If names is supplied, the column labels are taken from it.

Q: What does pd.read_csv() do?

Read a comma-separated values (csv) file into DataFrame. Also supports optionally iterating or breaking of the file into chunks.
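As a minimal sketch of the two modes, the snippet below reads the same made-up CSV text once in full and once in chunks. The sample data and variable names are illustrative assumptions, not part of the question.

```python
import io

import pandas as pd

# Hypothetical three-row CSV used only for illustration.
csv_text = "a,b\n1,x\n2,y\n3,z\n"

# Without chunksize, read_csv returns a single DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# With chunksize, it returns an iterator that yields DataFrames
# of at most `chunksize` rows each.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
chunk_lengths = [len(chunk) for chunk in reader]
```

Iterating over the reader is the usual way to process files that do not fit comfortably in memory.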

Q: Does pandas recognize dates when importing?

pandas tries to infer the data type of each column when you import a dataset into a DataFrame. For read_csv(), however, date-like columns are left as strings unless you ask for them to be parsed, for example with parse_dates.
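A small sketch of what type inference does and does not cover in read_csv(): numeric columns are inferred, while a date column only becomes a datetime column when parse_dates is passed. The sample data is made up, and the dates use the unambiguous yyyy-mm-dd form so the result does not depend on day/month guessing.

```python
import io

import pandas as pd

# Hypothetical data: one integer column and one ISO-formatted date column.
csv_text = "n,date\n1,2018-01-13\n2,2018-01-14\n"

# Default import: 'n' is inferred as integer, 'date' stays text.
plain = pd.read_csv(io.StringIO(csv_text))

# Asking for date parsing turns 'date' into a datetime column.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

plain_is_datetime = pd.api.types.is_datetime64_any_dtype(plain["date"])
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(parsed["date"])
```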

Q: What output type does pandas read_csv() return?

The pandas read_csv() function returns a new DataFrame with the data and labels from the file (for example data.csv) that you specify with the first argument. This string can be any valid path, including URLs.

Tags: python, date, pandas, csv

I have realised that, unless the format of a date column is declared explicitly or semi-explicitly (with dayfirst), pandas can apply different date formats to the same column, when reading a csv file! One row could be dd/mm/yyyy and another row in the same column mm/dd/yyyy! Insane doesn't even come close to describing it! Is it a known bug?

To demonstrate: the script below creates a very simple table with the dates from January 1st to the 31st, in the dd/mm/yyyy format, saves it to a csv file, then reads back the csv.

I then use pandas.DatetimeIndex to extract the day. Well, the extracted day is 1 for the first 12 rows (where day and month are both < 13, so the date was read as mm/dd/yyyy), and 13, 14, etc. afterwards. How on earth is this possible?

The only way I have found to fix this is to declare the date format, either explicitly or just with dayfirst=True. But it's a pain because it means I must declare the date format even when I import a csv with the best-formatted dates ever! Is there a simpler way?

This happens to me with pandas 0.23.4 and Python 3.7.1 on Windows 10.

import numpy as np
import pandas as pd

# Build 31 dates, 01/01/2018 to 31/01/2018, as dd/mm/yyyy strings
df = pd.DataFrame()
df['day'] = np.arange(1, 32)
df['day'] = df['day'].apply(lambda x: "{:0>2d}".format(x))
df['month'] = '01'
df['year'] = '2018'
df['date'] = df['day'] + '/' + df['month'] + '/' + df['year']
df.to_csv('mydates.csv', index=False)

# Same results whether you use parse_dates or not
imp = pd.read_csv('mydates.csv', parse_dates=['date'])
imp['day extracted'] = pd.DatetimeIndex(imp['date']).day
print(imp['day extracted'])


asked Mar 22 '19 23:03

Pythonista anonymous

1 Answer

By default it assumes the American date format (month first), and if that fails for a given value it does indeed switch to day first mid-column without raising an error. That breaks the Zen of Python by letting an error pass silently, but the Zen also says "Explicit is better than implicit". So if you know your data uses an international (day-first) format, pass dayfirst=True:

imp = pd.read_csv('mydates.csv', parse_dates=['date'], dayfirst=True)
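A self-contained sketch of the ambiguity and of two ways to pin the format down. The dateutil calls only illustrate per-value guessing. That this is the mechanism behind the behaviour in the question is an assumption, not something stated above. The in-memory CSV mirrors the question's 31 dates, and the variable names are hypothetical.

```python
import io

import pandas as pd
from dateutil import parser

# Per-value guessing: month-first is tried, day-first is the fallback.
guess_low = parser.parse("12/01/2018")    # read as 1 December 2018
guess_high = parser.parse("13/01/2018")   # month 13 is invalid, so 13 January 2018
hinted = parser.parse("12/01/2018", dayfirst=True)  # 12 January 2018

# Same data as in the question: 01/01/2018 .. 31/01/2018 as dd/mm/yyyy.
csv_text = "date\n" + "\n".join(
    "{:02d}/01/2018".format(d) for d in range(1, 32)
) + "\n"

# Option 1: give the parser a day-first hint.
imp = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], dayfirst=True)
days_dayfirst = imp["date"].dt.day.tolist()

# Option 2: read the column as text, then convert with an explicit format.
# A value that does not match the format raises instead of being guessed.
raw = pd.read_csv(io.StringIO(csv_text))
raw["date"] = pd.to_datetime(raw["date"], format="%d/%m/%Y")
days_explicit = raw["date"].dt.day.tolist()
```

The explicit format is the stricter of the two: dayfirst=True is a hint, while format="%d/%m/%Y" leaves nothing to infer.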

With files you produce, be unambiguous by using an ISO 8601 format with a timezone designator.
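One way this could look, as a sketch under assumptions: the timestamps are taken to be UTC, written with Timestamp.isoformat(), and read back with an explicit pd.to_datetime(..., utc=True). The in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# 31 daily timestamps, assumed to be in UTC for this example.
stamps = pd.date_range("2018-01-01", periods=31, freq="D", tz="UTC")

# Write ISO 8601 strings that include the UTC offset.
out = pd.DataFrame({"date": [ts.isoformat() for ts in stamps]})
first_written = out["date"].iloc[0]

buf = io.StringIO()
out.to_csv(buf, index=False)
buf.seek(0)

# Read back as text, then convert; year-month-day order leaves
# no room for a day/month mix-up.
back = pd.read_csv(buf)
back["date"] = pd.to_datetime(back["date"], utc=True)
days = back["date"].dt.day.tolist()
```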


answered Sep 19 '22 12:09

Chris Wesseling
