I have a Pandas DataFrame of the following form
There is one row per ID per year (2008 - 2015). For the columns Max Temp, Min Temp, and Rain, each cell contains an array of values, one for each day in that year, i.e. for the frame above frame3.iloc[0]['Max Temp'][0] is the value for January 1st 2011 and frame3.iloc[0]['Max Temp'][364] is the value for December 31st 2011.
I'm aware this is badly structured, but this is the data I have to deal with. It is stored in MongoDB in this way (where one of these rows equates to a document in Mongo).
I want to split these nested arrays, so that instead of one row per ID per year, I have one row per ID per day. While splitting the array, however, I would also like to create a new column to capture the day of the year, based on the current array index. I would then use this day, plus the Year column, to create a DatetimeIndex.
I searched here for relevant answers, but only found this one which doesn't really help me.
Splitting a cell into multiple rows: for this purpose, you can use the DataFrame.explode() method, which turns each element of a list-like column into its own row.
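As a rough sketch of the explode() route, assuming pandas 1.3 or later (needed to explode several columns at once) and list columns of equal length within each row: the toy frame and the Day column name below are illustrative, not taken from the question's data.

```python
import pandas as pd

# Toy frame shaped like the question's data: one row per ID per year,
# with list-valued columns of equal length within each row.
df = pd.DataFrame({
    "ID": [11111, 11111],
    "Year": [2011, 2012],
    "A": [[0, 1, 2], [3, 4, 5, 6]],
    "B": [[20, 21, 22], [23, 24, 25, 26]],
})

# Explode both list columns together; each list element becomes a row.
exploded = df.explode(["A", "B"])

# explode() repeats the original row label, so a running count per label
# recovers each element's position in its list (a zero-based day offset).
exploded["Day"] = exploded.groupby(level=0).cumcount().to_numpy()
exploded = exploded.reset_index(drop=True)
```

The exploded columns keep object dtype, so a follow-up astype(float) or similar may be needed before doing arithmetic on them.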
You can run .apply(pd.Series) for each of your columns, then stack and concatenate the results.
For a series
import pandas as pd
s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])
s
Out[103]:
2011 [0, 1]
2012 [2, 3, 4]
dtype: object
it works as follows
s.apply(pd.Series).stack()
Out[104]:
2011  0    0.0
      1    1.0
2012  0    2.0
      1    3.0
      2    4.0
dtype: float64
The elements of the series have different lengths (this matters because 2012 was a leap year). The intermediate frame, i.e. the result of apply before stack, had a NaN value that was later dropped.
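To make the padding visible, here is a small sketch that keeps the intermediate frame around. The names wide and long are only for illustration. Whether stack drops the NaN by itself depends on the pandas version, as far as I know, so the sketch calls dropna explicitly.

```python
import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

# One column per list position; the shorter list is padded with NaN.
wide = s.apply(pd.Series)

# Stack to one row per (year, position), then drop the padding explicitly
# so the result does not depend on stack's own NaN handling.
long = wide.stack().dropna()
```

Here wide has two rows and three columns, with NaN in the last column for 2011, and long has the five original values.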
Now, let's take a frame:
a = list(range(14))
b = list(range(20, 34))
df = pd.DataFrame({'ID': [11111, 11111, 11112, 11112],
                   'Year': [2011, 2012, 2011, 2012],
                   'A': [a[:3], a[3:7], a[7:10], a[10:14]],
                   'B': [b[:3], b[3:7], b[7:10], b[10:14]]})
df
Out[108]:
                  A                 B     ID  Year
0         [0, 1, 2]      [20, 21, 22]  11111  2011
1      [3, 4, 5, 6]  [23, 24, 25, 26]  11111  2012
2         [7, 8, 9]      [27, 28, 29]  11112  2011
3  [10, 11, 12, 13]  [30, 31, 32, 33]  11112  2012
Then we can run:
# set an index (each column will inherit it)
df2 = df.set_index(['ID', 'Year'])
# the trick: unnest each list column, then stack it into long form
unnested_lst = []
for col in df2.columns:
    unnested_lst.append(df2[col].apply(pd.Series).stack())
# line the unnested columns up side by side on the shared index
result = pd.concat(unnested_lst, axis=1, keys=df2.columns)
and get:
result
Out[115]:
                 A     B
ID    Year
11111 2011 0   0.0  20.0
           1   1.0  21.0
           2   2.0  22.0
      2012 0   3.0  23.0
           1   4.0  24.0
           2   5.0  25.0
           3   6.0  26.0
11112 2011 0   7.0  27.0
           1   8.0  28.0
           2   9.0  29.0
      2012 0  10.0  30.0
           1  11.0  31.0
           2  12.0  32.0
           3  13.0  33.0
The rest (the datetime index) is more or less straightforward. For example:
# DatetimeIndex
years = pd.to_datetime(result.index.get_level_values(1).astype(str))
# TimedeltaIndex
days = pd.to_timedelta(result.index.get_level_values(2), unit='D')
# If the above line doesn't work (a bug in pandas), try this:
# days = result.index.get_level_values(2).astype('timedelta64[D]')
# the sum is again a DatetimeIndex
dates = years + days
dates.name = 'Date'
new_index = pd.MultiIndex.from_arrays([result.index.get_level_values(0), dates])
result.index = new_index
result
Out[130]:
                     A     B
ID    Date
11111 2011-01-01   0.0  20.0
      2011-01-02   1.0  21.0
      2011-01-03   2.0  22.0
      2012-01-01   3.0  23.0
      2012-01-02   4.0  24.0
      2012-01-03   5.0  25.0
      2012-01-04   6.0  26.0
11112 2011-01-01   7.0  27.0
      2011-01-02   8.0  28.0
      2011-01-03   9.0  29.0
      2012-01-01  10.0  30.0
      2012-01-02  11.0  31.0
      2012-01-03  12.0  32.0
      2012-01-04  13.0  33.0
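As a sanity check on the year-plus-offset arithmetic, here is a small sketch on a flat frame, assuming the array index is a zero-based day offset as in the question. The Year, Day and Date column names are illustrative. Passing format="%Y" is my own addition to make the year parsing explicit.

```python
import pandas as pd

long = pd.DataFrame({
    "Year": [2011, 2011, 2012, 2012],
    "Day": [0, 364, 0, 365],  # zero-based offsets into the yearly arrays
})

# January 1st of each row's year.
year_start = pd.to_datetime(long["Year"].astype(str), format="%Y")

# Adding the offset as a timedelta handles leap years without special cases.
long["Date"] = year_start + pd.to_timedelta(long["Day"], unit="D")
```

Offset 364 lands on December 31st in 2011, and offset 365 lands on December 31st in 2012, so arrays of length 365 and 366 both end on the last day of their year.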