 

skip rows with bad dates while using pd.read_csv

Tags:

python

pandas

csv

I'm reading in csv files from an external data source using pd.read_csv, as in the code below:

from io import BytesIO

import numpy as np
import pandas as pd

pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=np.datetime64,
)

However, somewhere in the csv that's being sent, there is a misformatted date, resulting in the following error:

ValueError: Error parsing datetime string "2015-08-2" at position 8

This causes the entire application to crash. Of course, I can handle this case with a try/except, but then I will lose all the other data in that particular csv. I need pandas to keep and parse that other data.

I have no way of predicting when/where this data (which changes daily) will have badly formatted dates. Is there some way to get pd.read_csv to skip only the rows with bad dates but to still parse all the other rows in the csv?

asked Dec 24 '15 by LateCoder

1 Answer

somewhere in the csv that's being sent, there is a misformatted date

np.datetime64 needs ISO 8601-formatted strings to work properly. The good news is that you can wrap np.datetime64 in your own function and use that as the date_parser:

def parse_date(v):
    try:
        return np.datetime64(v)
    except ValueError:
        # apply whatever remedies you deem appropriate
        pass
    # fall back to the raw string so the row is kept, not dropped
    return v

pd.read_csv(
    ...,
    date_parser=parse_date,
)
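If you want to skip the offending rows outright rather than keep them as raw strings, a minimal sketch using pandas' own coercion (the column name and the inline sample data are hypothetical, standing in for the external feed):

```python
from io import StringIO

import pandas as pd

# Hypothetical data: the middle row has an invalid date.
raw = "dates,value\n2015-08-01,1\n2015-02-30,2\n2015-09-01,3\n"

df = pd.read_csv(StringIO(raw))
# errors='coerce' turns unparseable dates into NaT instead of raising...
df['dates'] = pd.to_datetime(df['dates'], errors='coerce')
# ...so the bad rows can be dropped while every other row survives.
df = df.dropna(subset=['dates'])
```

This sidesteps date_parser entirely: parse the column after reading, coerce failures to NaT, then drop them.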

I need pandas to keep and parse that other data.

I often find that a more flexible date parser like dateutil works better than np.datetime64 and may even work without the extra function:

import dateutil
pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=dateutil.parser.parse,
)
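As a small check, dateutil happily parses the very string from the error message above, which np.datetime64 rejects:

```python
import dateutil.parser

# The offending value from the question's traceback: a single-digit day.
d = dateutil.parser.parse("2015-08-2")
```

This is why the extra wrapper function is often unnecessary with dateutil.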
answered Oct 25 '22 by miraculixx