I have the following dataframe read in from a .csv file with the "Date" column being the index. The days are in the rows and the columns show the values for the hours that day.
> Date h1 h2 h3 h4 ... h24
> 14.03.2013 60 50 52 49 ... 73
I would like to arrange it like this, so that there is one index column with the date/time and one column with the values in a sequence
>Date/Time Value
>14.03.2013 00:00:00 60
>14.03.2013 01:00:00 50
>14.03.2013 02:00:00 52
>14.03.2013 03:00:00 49
>.
>.
>.
>14.03.2013 23:00:00 73
I was trying it by using two loops to go through the dataframe. Is there an easier way to do this in pandas?
I'm not the best at date manipulations, but maybe something like this:
import pandas as pd
from datetime import timedelta

df = pd.read_csv("hourmelt.csv", sep=r"\s+")

# Reshape so the h1..h24 columns become a single "hour" column
df = pd.melt(df, id_vars=["Date"])
df = df.rename(columns={'variable': 'hour'})

# "h1" means 00:00, so strip the "h" and shift to a zero-based hour
df['hour'] = df['hour'].apply(lambda x: int(x.lstrip('h')) - 1)

# Build a full timestamp from the date plus the hour offset
combined = df.apply(lambda x:
    pd.to_datetime(x['Date'], dayfirst=True) +
    timedelta(hours=int(x['hour'])), axis=1)

df['Date'] = combined
del df['hour']
df = df.sort_values("Date")  # DataFrame.sort was removed in pandas 0.20
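On current pandas you could also replace the per-row `apply` and the Python-level `lstrip` with vectorized operations (`.str` accessor plus `to_timedelta`). A sketch of the same pipeline; the frame is built inline here (with the values from the example) so the snippet stands alone:

```python
import pandas as pd

# Inline stand-in for pd.read_csv("hourmelt.csv", sep=r"\s+"),
# using the values from the example above
df = pd.DataFrame({"Date": ["14.03.2013", "14.04.2013"],
                   "h1": [60, 5], "h2": [50, 6], "h24": [73, 9]})

df = pd.melt(df, id_vars=["Date"], var_name="hour")
# "h1" -> 0 hours offset, "h24" -> 23 hours, all vectorized
hours = df["hour"].str.lstrip("h").astype(int) - 1
df["Date"] = (pd.to_datetime(df["Date"], dayfirst=True)
              + pd.to_timedelta(hours, unit="h"))
df = df.drop(columns="hour").sort_values("Date").reset_index(drop=True)
```

After this, `df` has one timestamp row per original hour column, sorted chronologically.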
Some explanation follows.
Starting from
>>> import pandas as pd
>>> from datetime import timedelta
>>>
>>> df = pd.read_csv("hourmelt.csv", sep=r"\s+")
>>> df
Date h1 h2 h3 h4 h24
0 14.03.2013 60 50 52 49 73
1 14.04.2013 5 6 7 8 9
We can use pd.melt to make the hour columns into one column with that value:
>>> df = pd.melt(df, id_vars=["Date"])
>>> df = df.rename(columns={'variable': 'hour'})
>>> df
Date hour value
0 14.03.2013 h1 60
1 14.04.2013 h1 5
2 14.03.2013 h2 50
3 14.04.2013 h2 6
4 14.03.2013 h3 52
5 14.04.2013 h3 7
6 14.03.2013 h4 49
7 14.04.2013 h4 8
8 14.03.2013 h24 73
9 14.04.2013 h24 9
Get rid of those hs:
>>> df['hour'] = df['hour'].apply(lambda x: int(x.lstrip('h'))-1)
>>> df
Date hour value
0 14.03.2013 0 60
1 14.04.2013 0 5
2 14.03.2013 1 50
3 14.04.2013 1 6
4 14.03.2013 2 52
5 14.04.2013 2 7
6 14.03.2013 3 49
7 14.04.2013 3 8
8 14.03.2013 23 73
9 14.04.2013 23 9
Combine the two columns as a date:
>>> combined = df.apply(lambda x: pd.to_datetime(x['Date'], dayfirst=True) + timedelta(hours=int(x['hour'])), axis=1)
>>> combined
0 2013-03-14 00:00:00
1 2013-04-14 00:00:00
2 2013-03-14 01:00:00
3 2013-04-14 01:00:00
4 2013-03-14 02:00:00
5 2013-04-14 02:00:00
6 2013-03-14 03:00:00
7 2013-04-14 03:00:00
8 2013-03-14 23:00:00
9 2013-04-14 23:00:00
Reassemble and clean up:
>>> df['Date'] = combined
>>> del df['hour']
>>> df = df.sort_values("Date")
>>> df
Date value
0 2013-03-14 00:00:00 60
2 2013-03-14 01:00:00 50
4 2013-03-14 02:00:00 52
6 2013-03-14 03:00:00 49
8 2013-03-14 23:00:00 73
1 2013-04-14 00:00:00 5
3 2013-04-14 01:00:00 6
5 2013-04-14 02:00:00 7
7 2013-04-14 03:00:00 8
9 2013-04-14 23:00:00 9
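Since you wanted the date/time as the index, you can restore that at the end with `set_index`. A tiny sketch on a frame shaped like the result above (only the first two rows, for brevity):

```python
import pandas as pd

# A frame shaped like the final result above
df = pd.DataFrame({"Date": pd.to_datetime(["2013-03-14 00:00:00",
                                           "2013-03-14 01:00:00"]),
                   "value": [60, 50]})
df = df.set_index("Date")  # "Date" becomes the DatetimeIndex
```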
You could always grab the hourly values as an array and flatten it, then pair it with a new DatetimeIndex at hourly frequency:
df = df.asfreq('D')  # daily frequency, so missing days become NaN rows
hourly_data = df.values
new_ind = pd.date_range(start=df.index[0], freq="H", periods=len(df) * 24)
# create the Series
s = pd.Series(hourly_data.flatten(), index=new_ind)
I'm assuming that read_csv is parsing the 'Date' column and making it the index. We change the frequency to 'D' so that new_ind lines up correctly if you have missing days; those days are filled with np.nan, which you can drop afterwards with s.dropna().
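A self-contained sketch of this approach; I build a two-day frame inline instead of reading the CSV (column names and values are made up for illustration), and use the lowercase "h" frequency alias accepted by current pandas:

```python
import numpy as np
import pandas as pd

# Two consecutive days with 24 hourly columns each,
# a stand-in for the parsed CSV with its Date index
dates = pd.to_datetime(["2013-03-14", "2013-03-15"])
df = pd.DataFrame(np.arange(48).reshape(2, 24),
                  index=dates,
                  columns=[f"h{i}" for i in range(1, 25)])

df = df.asfreq("D")  # daily frequency, so gaps would become NaN rows
new_ind = pd.date_range(start=df.index[0], freq="h", periods=len(df) * 24)
s = pd.Series(df.values.flatten(), index=new_ind)
```

Flattening works row by row, so the 24 values of the first day land on the first 24 hourly timestamps, then the next day's values follow.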