I'm using Python 2.7, pandas 0.14.1-2, and numpy 1.8.1-1. I have to use Python 2.7 because I'm coupling it with something that doesn't work on Python 3.
I'm trying to analyze a CSV file that stores Month, Day and Hour in separate columns, and looks something like the following:
Month Day Hour Value
1 1 1 105
1 1 2 30
1 1 3 85
1 1 4 52
1 1 5 65
I basically want to create a timestamp from those columns, using "2005" as the year, and set this new timestamp column as the index. I've read a lot of similar questions (here and here), but they all rely on doing it during read_csv(). I don't have a year column, so I don't think this applies to me (aside from loading the dataframe, inserting a column, writing it out, and redoing read_csv, which seems convoluted).
After loading the dataframe, I insert a Year column in position 0: df.insert(0, "Year", 2005)
So now I've got
Year Month Day Hour Value
2005 1 1 1 105
2005 1 1 2 30
2005 1 1 3 85
2005 1 1 4 52
2005 1 1 5 65
df.dtypes tells me that all columns are of type int64.
Then I've tried doing this:
df['Datetime'] = pd.to_datetime(df.Year*1000000 + df.Month*10000 + df.Day+100 + df.Hour, format="%Y%M%d%H")
But I'm getting "TypeError: 'long' object is unsliceable"
On the other hand, the following runs without errors.
df['Datetime'] = pd.to_datetime(df.Year*10000 + df.Month*100 + df.Day, format="%Y%M%d")
Since 2.7 doesn't seem to accept the %Y%M%d%H format, as pointed out by @EdChum, I've tried doing it in two steps: creating a datetime without hours, then adding the hours afterwards. But the output is not what I wanted:
In [1]: # Do it without hours first (otherwise doesn't work in Python 2.7)
df['Datetime'] = pd.to_datetime(df.Year*10000 + df.Month*100 + df.Day, format="%Y%M%d")
In [2]: df['Datetime']
Out [2]:
0 2005-01-01 00:01:00
1 2005-01-01 00:01:00
...
13 2005-01-01 00:01:00
14 2005-01-01 00:01:00
...
8745 2005-01-31 00:12:00
8746 2005-01-31 00:12:00
...
8758 2005-01-31 00:12:00
8759 2005-01-31 00:12:00
Row 8758, for example, is supposed to be 2005-12-31. What is wrong with that?
Once I resolve that, I'll be able to re-add the hours:
In [3]: # Then add the hours
df['Datetime'] = df['Datetime'] + pd.to_timedelta(df['Hour'], unit="h")
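A note on the output above: in strptime-style format strings, %M is the minute directive and %m is the month, which would explain why the month value lands in the minutes field while the date stays in January. Below is a minimal sketch of the two-step approach using %m. The three-row sample frame is made up for illustration (it is not the real data), and the integers are converted to strings before parsing, which is a choice made here rather than something the question states.

```python
import pandas as pd

# Illustrative sample only; the real file has one row per hour of the year.
df = pd.DataFrame({
    "Month": [1, 1, 12],
    "Day": [1, 1, 31],
    "Hour": [1, 2, 23],
    "Value": [105, 30, 65],
})
df.insert(0, "Year", 2005)

# Step 1: build a YYYYMMDD integer, turn it into a string, and parse it.
# Note the lowercase %m (month); uppercase %M would be read as minutes.
ymd = (df["Year"] * 10000 + df["Month"] * 100 + df["Day"]).astype(str)
df["Datetime"] = pd.to_datetime(ymd, format="%Y%m%d")

# Step 2: add the hours as a timedelta, then use the result as the index.
df["Datetime"] = df["Datetime"] + pd.to_timedelta(df["Hour"], unit="h")
df = df.set_index("Datetime")
```

Adding the hours as a timedelta also means an hour value of 24, if the file happens to use 1 to 24 (not confirmed here), would roll over to 00:00 of the next day rather than raise an error.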
Letting the pandas parser do the heavy lifting (as in the first answer) is obviously the best option if you are getting the data from a CSV. If you are getting or calculating the numbers in a different way, try the following (after import datetime):
df['DateTime'] = df[['Year', 'Month', 'Day', 'Hour']].apply(lambda s: datetime.datetime(*s), axis=1)
I find that it is still easy to read and very flexible.
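For completeness, here is a self-contained sketch of the row-wise approach. The sample rows and the helper name row_to_datetime are made up for illustration, and the explicit int() casts are an added precaution against NumPy integer types, not part of the one-liner above.

```python
import datetime

import pandas as pd

# Illustrative sample only, standing in for the loaded CSV.
df = pd.DataFrame({
    "Month": [1, 1, 12],
    "Day": [1, 1, 31],
    "Hour": [1, 2, 23],
    "Value": [105, 30, 65],
})
df.insert(0, "Year", 2005)


def row_to_datetime(row):
    # Hypothetical helper: build one datetime from a row's date parts.
    # datetime.datetime only accepts hours 0 to 23.
    return datetime.datetime(
        int(row["Year"]), int(row["Month"]), int(row["Day"]), int(row["Hour"])
    )


# axis=1 passes each row to the helper as a Series.
df["DateTime"] = df[["Year", "Month", "Day", "Hour"]].apply(row_to_datetime, axis=1)
df = df.set_index("DateTime")
```

This runs one Python call per row, so it will likely be slower than a vectorized conversion on large frames, but the logic per row is easy to adjust.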