I'm trying to identify columns which contain dates as strings in order to then convert them to a better type (DateTime or something numeric like UTC). The date format used is 27/11/2012 09:17
which I can search for using a regex of \d{2}/\d{2}/\d{4} \d{2}:\d{2}.
My current code is:
import re

def find_date_columns(df):
    date_pattern = re.compile(r'\d{2}/\d{2}/\d{4} \d{2}:\d{2}')
    date_cols = []
    for column in df:
        if any(date_pattern.search(str(item)) for item in df[column]):
            date_cols.append(column)
    return date_cols

date_cols = find_date_columns(cleaned_data)
I'm sure this is not taking advantage of the capabilities of pandas
. Is there a better way, either to identify the columns, or to convert them to DateTime or UTC timestamps directly?
You should add parse_dates=True, or parse_dates=['column name'], when reading; that's usually enough to parse it automatically.
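For example, a minimal sketch (the column name `when` and the CSV contents are made up for illustration):

```python
import io
import pandas as pd

# A small CSV with a timestamp column in dd/mm/yyyy hh:mm format
csv = io.StringIO("when,value\n27/11/2012 09:17,1\n28/11/2012 10:30,2\n")

# parse_dates names the column(s) to parse at read time;
# dayfirst=True because the format is day/month/year
df = pd.read_csv(csv, parse_dates=["when"], dayfirst=True)
print(df["when"].dtype)  # datetime64[ns]
```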
The underlying problem is that the date column is read in as an object dtype instead of a datetime type, which prevents you from using any of pandas' date-related functionality. The easy fix is to ask pandas to parse the dates for you by passing a list of the date column names to the parse_dates parameter; if you have multiple date columns, use parse_dates=["date", "another_date"]. pandas integrates powerful date parsers, so many common formats are handled automatically and setting parse_dates is often all you need. For columns already loaded as strings, pandas provides to_datetime(), which converts dates and times in string format to datetime objects, and Series.str.contains(), which tests whether a pattern or regex is contained within each string of a Series or Index.
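A short sketch of converting an already-loaded string column with to_datetime (the format string here is an assumption matching the question's dd/mm/yyyy hh:mm dates):

```python
import pandas as pd

s = pd.Series(["27/11/2012 09:17", "28/11/2012 10:30"])

# An explicit format avoids ambiguity between day-first
# and month-first interpretations
parsed = pd.to_datetime(s, format="%d/%m/%Y %H:%M")
print(parsed.dt.month.tolist())  # [11, 11]
```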
If you are looking to convert entire columns, you can use convert_objects (note: this method was deprecated and later removed; modern pandas offers pd.to_datetime and DataFrame.infer_objects instead):
df.convert_objects(convert_dates=True)
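On pandas versions where convert_objects is gone, a rough equivalent is to try parsing each object column and keep it unchanged when parsing fails (the column names and format string below are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "when": ["27/11/2012 09:17", "28/11/2012 10:30"],
    "name": ["a", "b"],
})

# Try to parse every string column; columns that fail stay as they are
for col in df.select_dtypes(include="object").columns:
    try:
        df[col] = pd.to_datetime(df[col], format="%d/%m/%Y %H:%M")
    except (ValueError, TypeError):
        pass
print(df.dtypes)
```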
To extract dates contained in columns/Series you could use findall:
In [11]: s = pd.Series(['1', '10/11/2011 11:11'])
In [12]: s.str.findall(r'\d{2}/\d{2}/\d{4} \d{2}:\d{2}')
Out[12]:
0 []
1 [10/11/2011 11:11]
dtype: object
In [13]: s.str.findall(r'\d{2}/\d{2}/\d{4} \d{2}:\d{2}').apply(pd.Series)
Out[13]:
0
0 NaN
1 10/11/2011 11:11
*and then convert to Timestamps using convert_objects (or pd.to_datetime in modern pandas)...*
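That last step can be sketched with pd.to_datetime, since convert_objects is gone from current pandas; errors="coerce" turns the row with no match into NaT:

```python
import pandas as pd

s = pd.Series(["1", "10/11/2011 11:11"])

# Pull the first regex match out of each findall list (empty list -> NaN)
extracted = s.str.findall(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}").str[0]

# errors="coerce" maps the unmatched row to NaT instead of raising
stamps = pd.to_datetime(extracted, format="%d/%m/%Y %H:%M", errors="coerce")
```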
Depending on how overzealous you want to be, to_datetime will coerce anything it thinks is a datetime into a datetime, including ints → datetimes (interpreted by default as ns since the UNIX epoch). to_datetime also gives you a lot of control over how the datetimes it finds are interpreted:
pandas.to_datetime(arg, errors='ignore', dayfirst=False, utc=None, box=True, format=None, coerce=False, unit='ns')
(This is an older signature; in later releases coerce was folded into errors='coerce' and box was removed.)
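For instance, a small sketch using the current-style parameters, with dayfirst matching the question's dd/mm/yyyy format:

```python
import pandas as pd

raw = pd.Series(["27/11/2012 09:17", "not a date"])

# dayfirst=True reads 27/11/2012 as 27 November;
# errors="coerce" maps unparseable values to NaT instead of raising
out = pd.to_datetime(raw, dayfirst=True, errors="coerce")
```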