Parsing dates in pandas.read_csv with null-value handling?

Consider the following made-up CSV:

from io import StringIO

data = """value,date
7,null
7,10/18/2008
621,(null)"""

fake_file = StringIO(data)

I want to read this file using pandas.read_csv, handling nulls with the na_values parameter and dates with parse_dates and date_parser:

import pandas as pd

DATE_FMT = '%m/%d/%Y'
date_parser = lambda c: pd.datetime.strptime(c, DATE_FMT)

df = pd.read_csv(fake_file,
                 parse_dates=['date'],
                 date_parser=date_parser,
                 na_values=['null', '(null)'])

Running this code in Python 3.5 gives me this:

  File "<ipython-input-11-aa5bcf0858b7>", line 1, in <lambda>
    date_parser = lambda c: pd.datetime.strptime(c, DATE_FMT)

TypeError: strptime() argument 1 must be str, not float

So it seems the nulls are converted to NaN first, and only then is the date parser applied, which means it receives a float NaN instead of a string.
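
The failure can be reproduced outside read_csv. This is a minimal sketch, assuming (as the traceback suggests) that each null cell reaches the parser as a float NaN; the variable names are illustrative only:

```python
from datetime import datetime

import numpy as np
import pandas as pd

missing = np.nan  # float NaN, assumed to be what na_values substitutes for 'null'

# strptime only accepts str, so a NaN cell raises a TypeError.
try:
    datetime.strptime(missing, '%m/%d/%Y')
    strptime_error = None
except TypeError as exc:
    strptime_error = str(exc)

# pd.to_datetime, by contrast, maps missing values to NaT.
cells = pd.Series([np.nan, '10/18/2008'], dtype=object)
converted = pd.to_datetime(cells, format='%m/%d/%Y')
```

Here strptime_error holds the TypeError message, while converted contains NaT for the missing cell and a Timestamp for the valid one.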

I know I can do this:

fake_file.seek(0)  # rewind: the failed read above may have consumed the buffer
df = pd.read_csv(fake_file,
                 na_values=['null', '(null)'])
df['date'] = pd.to_datetime(df['date'],
                            format='%m/%d/%Y')

But my real question is how to handle both the date format and the null values in a single read_csv call.

blacksite asked Oct 03 '17 13:10



1 Answer

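Use to_datetime with format and errors='coerce'. With errors='coerce', any value that does not match the format, including 'null' and '(null)', becomes NaT, so na_values is not needed for the date column:

date_parser = lambda c: pd.to_datetime(c, format='%m/%d/%Y', errors='coerce')
df = pd.read_csv(fake_file, parse_dates=['date'], date_parser=date_parser)
print(df)

   value       date
0      7        NaT
1      7 2008-10-18
2    621        NaT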
jezrael answered Sep 28 '22 00:09
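
On recent pandas releases the date_parser argument is deprecated (since 2.0), and the documentation suggests either the date_format argument or reading the column as text and calling to_datetime afterwards. Below is a minimal sketch of the second route, wrapped in one helper so the call site stays a single line. The helper read_csv_with_dates and its parameter names are hypothetical, not part of pandas:

```python
from io import StringIO

import pandas as pd

data = """value,date
7,null
7,10/18/2008
621,(null)"""


def read_csv_with_dates(source, date_col='date', fmt='%m/%d/%Y',
                        nulls=('null', '(null)')):
    """Hypothetical helper: read a CSV, then parse one date column."""
    # Step 1: let read_csv turn the null markers into NaN.
    df = pd.read_csv(source, na_values=list(nulls))
    # Step 2: vectorized conversion; NaN becomes NaT, and errors='coerce'
    # also maps any other unparseable value to NaT.
    df[date_col] = pd.to_datetime(df[date_col], format=fmt, errors='coerce')
    return df


df = read_csv_with_dates(StringIO(data))
```

Because na_values has already turned the null markers into NaN, errors='coerce' only matters here for cells that are malformed in some other way; drop it if such cells should raise instead.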