Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas convert datetime with a separate time zone column

I have a dataframe with a column for the time zone and a column for the datetime. I would like to convert these to UTC first to join with other data, and then I'll have some calculations to convert from UTC to the viewers local time zone eventually.

datetime              time_zone
2016-09-19 01:29:13   America/Bogota 
2016-09-19 02:16:04   America/New_York
2016-09-19 01:57:54   Africa/Cairo

def create_utc(df, column, time_format='%Y-%m-%d %H:%M:%S'):
    timezone = df['TZ']
    df[column + '_utc'] = df[column].dt.tz_localize(timezone).dt.tz_convert('UTC').dt.strftime(time_format)
    df[column + '_utc'].replace('NaT', np.nan, inplace=True)
    df[column + '_utc'] = pd.to_datetime(df[column + '_utc'])
    return df

That was my flawed attempt. The error is that the truth is ambiguous which makes sense because the 'timezone' variable is referring to a column. How do I refer to the value in the same row?

Edit: here are some results from the answers below on one day of data (394,000 rows and 22 unique time zones). Edit2: I added a groupby example in case someone wants to see the results. It is the fastest, by far.

%%timeit

for tz in df['TZ'].unique():
    df.ix[df['TZ'] == tz, 'datetime_utc2'] = df.ix[df['TZ'] == tz, 'datetime'].dt.tz_localize(tz).dt.tz_convert('UTC')
df['datetime_utc2'] = df['datetime_utc2'].dt.tz_localize(None)

1 loops, best of 3: 1.27 s per loop

%%timeit

df['datetime_utc'] = [d['datetime'].tz_localize(d['TZ']).tz_convert('UTC') for i, d in df.iterrows()]
df['datetime_utc'] = df['datetime_utc'].dt.tz_localize(None)

1 loops, best of 3: 50.3 s per loop

df['datetime_utc'] = pd.concat([d['datetime'].dt.tz_localize(tz).dt.tz_convert('UTC') for tz, d in df.groupby('TZ')])



**1 loops, best of 3: 249 ms per loop**
like image 945
trench Avatar asked Sep 22 '16 18:09

trench


2 Answers

Here is a vectorized approach (it will loop df.time_zone.nunique() times):

In [2]: t
Out[2]:
             datetime         time_zone
0 2016-09-19 01:29:13    America/Bogota
1 2016-09-19 02:16:04  America/New_York
2 2016-09-19 01:57:54      Africa/Cairo
3 2016-09-19 11:00:00    America/Bogota
4 2016-09-19 12:00:00  America/New_York
5 2016-09-19 13:00:00      Africa/Cairo

In [3]: for tz in t.time_zone.unique():
   ...:         mask = (t.time_zone == tz)
   ...:         t.loc[mask, 'datetime'] = \
   ...:             t.loc[mask, 'datetime'].dt.tz_localize(tz).dt.tz_convert('UTC')
   ...:

In [4]: t
Out[4]:
             datetime         time_zone
0 2016-09-19 06:29:13    America/Bogota
1 2016-09-19 06:16:04  America/New_York
2 2016-09-18 23:57:54      Africa/Cairo
3 2016-09-19 16:00:00    America/Bogota
4 2016-09-19 16:00:00  America/New_York
5 2016-09-19 11:00:00      Africa/Cairo

UPDATE:

In [12]: df['new'] = df.groupby('time_zone')['datetime'] \
                       .transform(lambda x: x.dt.tz_localize(x.name))

In [13]: df
Out[13]:
             datetime         time_zone                 new
0 2016-09-19 01:29:13    America/Bogota 2016-09-19 06:29:13
1 2016-09-19 02:16:04  America/New_York 2016-09-19 06:16:04
2 2016-09-19 01:57:54      Africa/Cairo 2016-09-18 23:57:54
3 2016-09-19 11:00:00    America/Bogota 2016-09-19 16:00:00
4 2016-09-19 12:00:00  America/New_York 2016-09-19 16:00:00
5 2016-09-19 13:00:00      Africa/Cairo 2016-09-19 11:00:00
like image 100
MaxU - stop WAR against UA Avatar answered Nov 14 '22 21:11

MaxU - stop WAR against UA


Your issue is that tz_localize() can only take a scalar value, so we'll have to iterate through the DataFrame:

df['datetime_utc'] = [d['datetime'].tz_localize(d['time_zone']).tz_convert('UTC') for i,d in df.iterrows()]

The result is:

            datetime         time_zone              datetime_utc
0 2016-09-19 01:29:13    America/Bogota 2016-09-19 06:29:13+00:00
1 2016-09-19 02:16:04  America/New_York 2016-09-19 06:16:04+00:00
2 2016-09-19 01:57:54      Africa/Cairo 2016-09-18 23:57:54+00:00

An alternative approach is to group by the timezone and convert all matching rows in one pass:

df['datetime_utc'] = pd.concat([d['datetime'].dt.tz_localize(tz).dt.tz_convert('UTC') for tz, d in df.groupby('time_zone')])
like image 20
chrisaycock Avatar answered Nov 14 '22 23:11

chrisaycock