Split nested array values from Pandas Dataframe cell over multiple rows

I have a Pandas DataFrame of the following form

[image: screenshot of the DataFrame, not reproduced here]

There is one row per ID per year (2008 - 2015). For the columns Max Temp, Min Temp, and Rain, each cell contains an array of values, one per day of that year, i.e. for the frame above

  • frame3.iloc[0]['Max Temp'][0] is the value for January 1st 2011
  • frame3.iloc[0]['Max Temp'][364] is the value for December 31st 2011.

I'm aware this is badly structured, but this is the data I have to deal with. It is stored in MongoDB in this way (where one of these rows equates to a document in Mongo).

I want to split these nested arrays so that, instead of one row per ID per year, I have one row per ID per day. While splitting the arrays, I would also like to create a new column that captures the day of the year, based on the current array index. I would then use this day, plus the Year column, to create a DatetimeIndex.

[image: screenshot of the desired output, not reproduced here]

I searched here for relevant answers, but only found this one which doesn't really help me.

asked Jul 14 '16 10:07 by Philip O'Brien



1 Answer

You can run .apply(pd.Series) for each of your columns, then stack and concatenate the results.

For a series

import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

s
Out[103]: 
2011       [0, 1]
2012    [2, 3, 4]
dtype: object

it works as follows

s.apply(pd.Series).stack()
Out[104]: 
2011  0    0.0
      1    1.0
2012  0    2.0
      1    3.0
      2    4.0
dtype: float64

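The elements of the series have different lengths (this matters because 2012 was a leap year, so its arrays hold 366 values instead of 365). The intermediate frame, i.e. the result of apply before stack, contained a NaN that padded the shorter row; stack dropped it, which is also why the integers came out as floats.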
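To make the padding visible, here is a minimal sketch of the intermediate table. It is built with pd.DataFrame(s.tolist(), index=s.index) as a stand-in for s.apply(pd.Series), on the assumption that both give the same padded layout for this input. The names wide and stacked are only illustrative, and the explicit dropna() is added so that the result does not depend on how a particular pandas version's stack() handles missing values.

```python
import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

# One row per original element, one column per list position.
# The shorter list (2011) is padded with NaN in the last column.
wide = pd.DataFrame(s.tolist(), index=s.index)
print(wide)

# Move the list positions into the index and drop the padding explicitly.
stacked = wide.stack().dropna()
print(stacked)
```

The second level of the stacked index is the position inside the original list, which is what later serves as the day offset.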

Now, let's take a frame:

a = list(range(14))
b = list(range(20, 34))

df = pd.DataFrame({'ID': [11111, 11111, 11112, 11112],
                   'Year': [2011, 2012, 2011, 2012],
                   'A': [a[:3], a[3:7], a[7:10], a[10:14]],
                   'B': [b[:3], b[3:7], b[7:10], b[10:14]]})

df
Out[108]: 
                  A                 B     ID  Year
0         [0, 1, 2]      [20, 21, 22]  11111  2011
1      [3, 4, 5, 6]  [23, 24, 25, 26]  11111  2012
2         [7, 8, 9]      [27, 28, 29]  11112  2011
3  [10, 11, 12, 13]  [30, 31, 32, 33]  11112  2012

Then we can run:

# set an index (each column will inherit it)
df2 = df.set_index(['ID', 'Year'])
# the trick
unnested_lst = []
for col in df2.columns:
    unnested_lst.append(df2[col].apply(pd.Series).stack())
result = pd.concat(unnested_lst, axis=1, keys=df2.columns)

and get:

result
Out[115]: 
                 A     B
ID    Year              
11111 2011 0   0.0  20.0
           1   1.0  21.0
           2   2.0  22.0
      2012 0   3.0  23.0
           1   4.0  24.0
           2   5.0  25.0
           3   6.0  26.0
11112 2011 0   7.0  27.0
           1   8.0  28.0
           2   9.0  29.0
      2012 0  10.0  30.0
           1  11.0  31.0
           2  12.0  32.0
           3  13.0  33.0
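As an alternative to the loop above, newer pandas versions offer DataFrame.explode, which is a different technique from the apply/stack one used in this answer. The sketch below assumes pandas 1.3 or later (needed to explode several columns at once) and that the lists in A and B have the same length within each row. The names tidy and Day are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [11111, 11111, 11112, 11112],
    'Year': [2011, 2012, 2011, 2012],
    'A': [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9], [10, 11, 12, 13]],
    'B': [[20, 21, 22], [23, 24, 25, 26], [27, 28, 29], [30, 31, 32, 33]],
})

# Explode both list columns in lockstep; each original row label is
# repeated once per list element.
tidy = df.explode(['A', 'B'])

# A running count within each repeated row label recovers the position
# of every element inside its original list.
tidy['Day'] = tidy.groupby(level=0).cumcount().to_numpy()

# explode leaves object dtype behind, so restore integers explicitly.
tidy = tidy.astype({'A': 'int64', 'B': 'int64'}).set_index(['ID', 'Year', 'Day'])
print(tidy)
```

This keeps the values as integers because no NaN padding is ever introduced, and the resulting index has the same three levels (ID, Year, position) as result above.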

The rest (the datetime index) is more or less straightforward. For example:

# DatetimeIndex
years = pd.to_datetime(result.index.get_level_values(1).astype(str), format='%Y')
# TimedeltaIndex
days = pd.to_timedelta(result.index.get_level_values(2), unit='D')
# If the above line doesn't work (a bug in some older pandas versions), try this:
# days = result.index.get_level_values(2).astype('timedelta64[D]')

# the sum is again a DatetimeIndex
dates = years + days
dates.name = 'Date'

new_index = pd.MultiIndex.from_arrays([result.index.get_level_values(0), dates])

result.index = new_index

result
Out[130]: 
                     A     B
ID    Date                  
11111 2011-01-01   0.0  20.0
      2011-01-02   1.0  21.0
      2011-01-03   2.0  22.0
      2012-01-01   3.0  23.0
      2012-01-02   4.0  24.0
      2012-01-03   5.0  25.0
      2012-01-04   6.0  26.0
11112 2011-01-01   7.0  27.0
      2011-01-02   8.0  28.0
      2011-01-03   9.0  29.0
      2012-01-01  10.0  30.0
      2012-01-02  11.0  31.0
      2012-01-03  12.0  32.0
      2012-01-04  13.0  33.0
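With the real data, the arrays hold 365 or 366 values, so it is worth checking that "January 1st plus array index" lands on December 31st in both cases. A small sketch of that check follows; day_to_date is a hypothetical helper written for this illustration, not part of pandas.

```python
import pandas as pd

def day_to_date(year, day_index):
    """Return the calendar date for a zero-based day index within a year."""
    return pd.Timestamp(year=year, month=1, day=1) + pd.Timedelta(days=day_index)

# Last element of a 365-value array (ordinary year).
print(day_to_date(2011, 364))
# Last element of a 366-value array (leap year).
print(day_to_date(2012, 365))
```

Both calls give December 31st of their year, so the same offset logic covers leap and non-leap years without special casing.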
answered Sep 22 '22 04:09 by ptrj