How does pandas treat timezone when reading from a CSV file?

In my CSV file I have the following entries:

Local time,Open,High,Low,Close,Volume
01.01.2015 00:00:00.000 GMT+0100,1.20976,1.20976,1.20976,1.20976,0
01.01.2015 00:01:00.000 GMT+0100,1.20976,1.20976,1.20976,1.20976,0
01.01.2015 00:02:00.000 GMT+0100,1.20976,1.20976,1.20976,1.20976,0
01.01.2015 00:03:00.000 GMT+0100,1.20976,1.20976,1.20976,1.20976,0

The first column contains date-times in a specific timezone (GMT+01).

I read the CSV file using the following command:

import pandas as pd
df = pd.read_csv(csv, sep=',', parse_dates=['Local time'])

As a result I get the following:

0   2015-01-01 01:00:00 1.20976 1.20976 1.20976 1.20976 0.0
1   2015-01-01 01:01:00 1.20976 1.20976 1.20976 1.20976 0.0
2   2015-01-01 01:02:00 1.20976 1.20976 1.20976 1.20976 0.0
3   2015-01-01 01:03:00 1.20976 1.20976 1.20976 1.20976 0.0
4   2015-01-01 01:04:00 1.20976 1.20976 1.20976 1.20976 0.0

As we can see, the timestamps have been modified (one hour has been added to each). My interpretation is that the time has been converted to UTC. However, I am not sure about it because, according to Google:

GMT+01 is a time offset that adds 1 hour to Greenwich Mean Time (GMT).

So the time in GMT+01 should be one hour ahead of UTC, which means the UTC time should be one hour earlier: 00:00 should become 23:00, not 01:00.
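
The expected arithmetic can be checked without pandas. This is a minimal sketch using only the standard library, assuming GMT+01 means one hour ahead of UTC (the reading quoted above); the variable names are illustrative:

```python
from datetime import datetime, timedelta, timezone

# Assumption: "GMT+01" means local time = UTC + 1 hour.
gmt_plus_1 = timezone(timedelta(hours=1))

local_midnight = datetime(2015, 1, 1, 0, 0, tzinfo=gmt_plus_1)
as_utc = local_midnight.astimezone(timezone.utc)

print(as_utc.isoformat())  # 2014-12-31T23:00:00+00:00
```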

Where is the error in my interpretation?

ADDED

I have played a bit with the pandas to_datetime function. It appears to be the cause of the behaviour described above.

If I apply it to the time given in the same format as in my CSV:

pd.to_datetime('01.01.2015 00:00:00.000 GMT+0100')

then I get the same result:

Timestamp('2015-01-01 01:00:00')

So, as you can see, 1 hour is added (as before).

However, if I apply it to a slightly modified format (which I thought was equivalent):

pd.to_datetime('01.01.2015 00:00:00.000+01:00')

Then I get another result:

Timestamp('2014-12-31 23:00:00')

To summarise, GMT+0100 and +01:00 are treated differently. Why is that? Am I misinterpreting something?

ADDED 2

So, it looks like this is about how Python treats timezones. If I execute this command:

pd.to_datetime('01.01.2015 00:00:00.000').tz_localize('Etc/GMT+5').tz_convert('GMT')

I get this:

Timestamp('2015-01-01 05:00:00+0000', tz='GMT')

I would expect the time in the GMT+5 timezone to be 5 hours ahead of GMT, so in GMT+5 it should be later. However, it looks like it is the other way around. But why?

On this site: https://time.is/GMT+5 , I do see that GMT+5 is 5 hours ahead of GMT.

ADDED 3

From the documentation on the timezones I got this:

The 'Etc/GMT*' time zones mentioned above provide fixed offset specifications, but watch out for the counter-intuitive sign convention.

So, it looks like they treat the sign counter-intuitively. It looks like I have found an explanation, but now I am not sure how 'GMT+0100' should be interpreted in my CSV (the file has nothing to do with Python; it was just downloaded from a website). Is there a standard convention on what GMT+0100 means?
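
The two conventions can be put side by side. This is a rough sketch, not library code: `etc_gmt_offset` is a hypothetical helper that mimics the POSIX-style sign used by the 'Etc/GMT*' zone names, while `%z` in the standard library follows the ISO 8601 style, where a positive offset is ahead of UTC. Which of the two the website intended for 'GMT+0100' is an assumption to verify against a known timestamp.

```python
from datetime import datetime, timedelta, timezone

PREFIX = "Etc/GMT"

def etc_gmt_offset(name: str) -> timezone:
    """Hypothetical helper: map a name like 'Etc/GMT+5' to its UTC offset.

    The tz database uses the POSIX sign here, so 'Etc/GMT+5' is 5 hours
    behind UTC (UTC-05:00) and 'Etc/GMT-1' is 1 hour ahead of UTC.
    """
    hours = int(name[len(PREFIX):])
    return timezone(timedelta(hours=-hours))

naive = datetime(2015, 1, 1, 0, 0)

# POSIX-style name: midnight in 'Etc/GMT+5' is 05:00 in UTC,
# matching the tz_localize/tz_convert result in the question.
posix_utc = naive.replace(tzinfo=etc_gmt_offset("Etc/GMT+5")).astimezone(timezone.utc)

# ISO 8601-style offset: midnight at +05:00 is 19:00 the previous day in UTC.
iso_utc = datetime.strptime(
    "2015-01-01 00:00 +0500", "%Y-%m-%d %H:%M %z"
).astimezone(timezone.utc)

print(posix_utc.isoformat())  # 2015-01-01T05:00:00+00:00
print(iso_utc.isoformat())    # 2014-12-31T19:00:00+00:00
```

The same digits after "GMT" therefore move the timestamp in opposite directions depending on which convention the parser applies.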

Roman asked Jul 25 '19 14:07


1 Answer

When no format is given, pandas guesses how to parse datetime strings using heuristics. If the datetimes come out wrong, specify the exact format:

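import pandas as pd
df = pd.read_csv(csv)  # csv holds the path to the file
# "GMT" is matched as literal text; %z parses the +0100 offset
pd.to_datetime(df['Local time'], format='%d.%m.%Y %H:%M:%S.%f GMT%z')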

0   2015-01-01 00:00:00+01:00
1   2015-01-01 00:01:00+01:00
2   2015-01-01 00:02:00+01:00
3   2015-01-01 00:03:00+01:00
Name: Local time, dtype: datetime64[ns, pytz.FixedOffset(60)]
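
The same format string also works with the standard library's `datetime.strptime`, which makes it easy to check how the `GMT%z` pattern reads the offset without pandas. A small sketch on two rows of the sample CSV (the in-memory `sample` string stands in for the real file):

```python
import csv
import io
from datetime import datetime, timezone

FMT = "%d.%m.%Y %H:%M:%S.%f GMT%z"

sample = """Local time,Open,High,Low,Close,Volume
01.01.2015 00:00:00.000 GMT+0100,1.20976,1.20976,1.20976,1.20976,0
01.01.2015 00:01:00.000 GMT+0100,1.20976,1.20976,1.20976,1.20976,0
"""

rows = list(csv.DictReader(io.StringIO(sample)))
parsed = [datetime.strptime(row["Local time"], FMT) for row in rows]

for ts in parsed:
    # The literal "GMT" is skipped; %z reads +0100 as one hour ahead of UTC.
    print(ts.isoformat(), "->", ts.astimezone(timezone.utc).isoformat())
```

Under this reading, 00:00 local time corresponds to 23:00 UTC on the previous day, which is the result the question expected.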

Many functions do not work with timezone-aware datetimes, so you may want to convert everything to a single timezone, then drop the timezone altogether:

pd.to_datetime(df['Local time'], format='%d.%m.%Y %H:%M:%S.%f GMT%z') \
    .dt.tz_convert('America/New_York') \
    .dt.tz_localize(None)
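
The same convert-then-drop idea in plain Python, as a sketch. UTC is used here instead of 'America/New_York' only so the example needs no time zone database; that choice of target zone is an assumption, not part of the answer:

```python
from datetime import datetime, timezone

FMT = "%d.%m.%Y %H:%M:%S.%f GMT%z"

aware = datetime.strptime("01.01.2015 00:00:00.000 GMT+0100", FMT)

# Step 1: convert to the single target zone (UTC here).
in_utc = aware.astimezone(timezone.utc)

# Step 2: drop the tzinfo, keeping the wall-clock time in that zone.
naive_utc = in_utc.replace(tzinfo=None)

print(naive_utc)  # 2014-12-31 23:00:00
```

The order matters: dropping the timezone before converting would keep the original local wall-clock time instead.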
Code Different answered Oct 26 '22 22:10