Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Converting pandas columns to datetime64 including missing values

Working with Pandas to work with some timeseries based data that contains dates, numbers, categories etc.

The problem I'm having is getting pandas to deal with my date/time columns correctly from a DataFrame created from a CSV. There are 18 date columns in my data, they are not continuous and unknown values in the raw CSV have a string value of "Unknown". Some columns have ALL cells with a valid datetime in it and correctly get their dtype guessed by the pandas read_csv method. There are some columns however that in a particular data sample have ALL cells as "Unknown" and these get typed as object.

My code to load the CSV is as follows:

self.datecols = ['Claim Date', 'Lock Date', 'Closed Date', 'Service Date', 'Latest_Submission', 'Statement Date 1', 'Statement Date 2', 'Statement Date 3', 'Patient Payment Date 1', 'Patient Payment Date 2', 'Patient Payment Date 3', 'Primary 1 Payment Date', 'Primary 2 Payment Date', 'Primary 3 Payment Date', 'Secondary 1 Payment Date', 'Secondary 2 Payment Date', 'Tertiary Payment Date']
self.csvbear = pd.read_csv(file_path, index_col="Claim ID", parse_dates=True, na_values=['Unknown'])
self.csvbear = pd.DataFrame.convert_objects(self.csvbear, convert_dates='coerce')
print self.csvbear.dtypes
print self.csvbear['Tertiary Payment Date'].values

The output from print self.csvbear.dtypes

Prac                            object
Doctor Name                     object
Practice Name                   object
Specialty                       object
Speciality Code                  int64
Claim Date              datetime64[ns]
Lock Date               datetime64[ns]
Progress Note Locked            object
Aging by Claim Date              int64
Aging by Lock Date               int64
Closed Date             datetime64[ns]
Service Date            datetime64[ns]
Week Number                      int64
Month                   datetime64[ns]
Current Insurance               object
...
Secondary 2 Deductible        float64
Secondary 2 Co Insurance      float64
Secondary 2 Member Balance    float64
Secondary 2 Paid              float64
Secondary 2 Witheld           float64
Secondary 2 Ins                object
Tertiary Payment Date          object
Tertiary Payment ID           float64
Tertiary Allowed              float64
Tertiary Deductible           float64
Tertiary Co Insurance         float64
Tertiary Member Balance       float64
Tertiary Paid                 float64
Tertiary Witheld              float64
Tertiary Ins                  float64
Length: 96, dtype: object
[nan nan nan ..., nan nan nan]
Press any key to continue . . .

As you can see, the Tertiary Payment Date col should be a datetime64 dtype, but it's simply a object, and the actual content of it is just NaN (put there from the read_csv function for string 'Unknown').

How can I reliably convert all of the date columns to have datetime64 as a dtype and NaT for 'Unknown' cells?

like image 915
ptpatil Avatar asked Dec 19 '13 18:12

ptpatil


1 Answers

if you have an all-nan column it won't be coerced properly by read_csv. easiest is just to do this (which if a column is already datetime64[ns] will just pass thru).

In [3]: df = DataFrame(dict(A = Timestamp('20130101'), B = np.random.randn(5), C = np.nan))

In [4]: df
Out[4]: 
                    A         B   C
0 2013-01-01 00:00:00 -0.859994 NaN
1 2013-01-01 00:00:00 -2.562136 NaN
2 2013-01-01 00:00:00  0.410673 NaN
3 2013-01-01 00:00:00  0.480578 NaN
4 2013-01-01 00:00:00  0.464771 NaN

[5 rows x 3 columns]

In [5]: df.dtypes
Out[5]: 
A    datetime64[ns]
B           float64
C           float64
dtype: object

In [6]: df['A'] = pd.to_datetime(df['A'])

In [7]: df['C'] = pd.to_datetime(df['C'])

In [8]: df
Out[8]: 
                    A         B   C
0 2013-01-01 00:00:00 -0.859994 NaT
1 2013-01-01 00:00:00 -2.562136 NaT
2 2013-01-01 00:00:00  0.410673 NaT
3 2013-01-01 00:00:00  0.480578 NaT
4 2013-01-01 00:00:00  0.464771 NaT

[5 rows x 3 columns]

In [9]: df.dtypes
Out[9]: 
A    datetime64[ns]
B           float64
C    datetime64[ns]
dtype: object

convert_objects won't forcibly convert a column to datetime unless it has a least 1 non-nan thing that is a date (that why your example fails). to_datetime can be more aggressive because it 'knows' that you really want to convert it.

like image 151
Jeff Avatar answered Oct 20 '22 12:10

Jeff