Working with Pandas to work with some timeseries based data that contains dates, numbers, categories etc.
The problem I'm having is getting pandas to deal with my date/time columns correctly from a DataFrame created from a CSV. There are 18 date columns in my data, they are not continuous and unknown values in the raw CSV have a string value of "Unknown". Some columns have ALL cells with a valid datetime in it and correctly get their dtype guessed by the pandas read_csv method. There are some columns however that in a particular data sample have ALL cells as "Unknown" and these get typed as object.
My code to load the CSV is as follows:
self.datecols = ['Claim Date', 'Lock Date', 'Closed Date', 'Service Date', 'Latest_Submission', 'Statement Date 1', 'Statement Date 2', 'Statement Date 3', 'Patient Payment Date 1', 'Patient Payment Date 2', 'Patient Payment Date 3', 'Primary 1 Payment Date', 'Primary 2 Payment Date', 'Primary 3 Payment Date', 'Secondary 1 Payment Date', 'Secondary 2 Payment Date', 'Tertiary Payment Date']
self.csvbear = pd.read_csv(file_path, index_col="Claim ID", parse_dates=True, na_values=['Unknown'])
self.csvbear = pd.DataFrame.convert_objects(self.csvbear, convert_dates='coerce')
print self.csvbear.dtypes
print self.csvbear['Tertiary Payment Date'].values
The output from print self.csvbear.dtypes
Prac object
Doctor Name object
Practice Name object
Specialty object
Speciality Code int64
Claim Date datetime64[ns]
Lock Date datetime64[ns]
Progress Note Locked object
Aging by Claim Date int64
Aging by Lock Date int64
Closed Date datetime64[ns]
Service Date datetime64[ns]
Week Number int64
Month datetime64[ns]
Current Insurance object
...
Secondary 2 Deductible float64
Secondary 2 Co Insurance float64
Secondary 2 Member Balance float64
Secondary 2 Paid float64
Secondary 2 Witheld float64
Secondary 2 Ins object
Tertiary Payment Date object
Tertiary Payment ID float64
Tertiary Allowed float64
Tertiary Deductible float64
Tertiary Co Insurance float64
Tertiary Member Balance float64
Tertiary Paid float64
Tertiary Witheld float64
Tertiary Ins float64
Length: 96, dtype: object
[nan nan nan ..., nan nan nan]
Press any key to continue . . .
As you can see, the Tertiary Payment Date col should be a datetime64 dtype, but it's simply a object, and the actual content of it is just NaN (put there from the read_csv function for string 'Unknown').
How can I reliably convert all of the date columns to have datetime64 as a dtype and NaT for 'Unknown' cells?
if you have an all-nan column it won't be coerced properly by read_csv
. easiest is just to do this (which if a column is already datetime64[ns] will just pass thru).
In [3]: df = DataFrame(dict(A = Timestamp('20130101'), B = np.random.randn(5), C = np.nan))
In [4]: df
Out[4]:
A B C
0 2013-01-01 00:00:00 -0.859994 NaN
1 2013-01-01 00:00:00 -2.562136 NaN
2 2013-01-01 00:00:00 0.410673 NaN
3 2013-01-01 00:00:00 0.480578 NaN
4 2013-01-01 00:00:00 0.464771 NaN
[5 rows x 3 columns]
In [5]: df.dtypes
Out[5]:
A datetime64[ns]
B float64
C float64
dtype: object
In [6]: df['A'] = pd.to_datetime(df['A'])
In [7]: df['C'] = pd.to_datetime(df['C'])
In [8]: df
Out[8]:
A B C
0 2013-01-01 00:00:00 -0.859994 NaT
1 2013-01-01 00:00:00 -2.562136 NaT
2 2013-01-01 00:00:00 0.410673 NaT
3 2013-01-01 00:00:00 0.480578 NaT
4 2013-01-01 00:00:00 0.464771 NaT
[5 rows x 3 columns]
In [9]: df.dtypes
Out[9]:
A datetime64[ns]
B float64
C datetime64[ns]
dtype: object
convert_objects
won't forcibly convert a column to datetime unless it has a least 1 non-nan thing that is a date (that why your example fails). to_datetime
can be more aggressive because it 'knows' that you really want to convert it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With