
How can I generalize my pandas data grouping to more than 3 dimensions?

Tags:

python

pandas

I'm using the excellent pandas package to deal with a large amount of varied meteorological diagnostic data and I'm quickly running out of dimensions as I stitch the data together. Looking at the documentation, it seems that a MultiIndex may solve my problem, but I'm not sure how to apply it to my situation - the documentation shows examples of creating MultiIndexes with random data and DataFrames, but not with Series of pre-existing timeseries data.

Background

The basic data structure I'm using contains two main fields:

  • metadata, which is a dictionary consisting of key-value pairs describing what the numbers are
  • data, which is a pandas data structure containing the numbers themselves.

The lowest common denominator is timeseries data, so the basic structure has a pandas Series object as the data entry, and the metadata field describes what those numbers actually are (e.g. vector RMS error for 10-meter wind over the Eastern Pacific for a 24-hour forecast from experiment Test1).

I'm looking at taking that lowest common denominator and gluing the various timeseries together to make the results more useful and allow for easy combinations. For instance, I may want to look at all the different lead times: I have a filter routine that takes timeseries sharing every metadata entry except lead time (i.e. the same experiment, region, etc.) and returns a new object whose metadata field consists of only the common entries (Lead Time has been removed) and whose data field is a pandas DataFrame with column labels given by the Lead Time values. I can extend this again and take the resulting frames and group them together with only one other entry varying (e.g. the Experiment) to give me a pandas Panel, where the item index is given by the Experiment metadata values from the constituent frames and the object's new metadata contains neither Lead Time nor Experiment.

When I iterate over these composite objects, I have an iterseries routine for the frame and iterframes routine for the panel that reconstruct the appropriate metadata/data pairing as I drop one dimension (i.e. the series from the frame with lead time varying across the columns will have all the metadata of its parent plus the Lead Time field restored with the value taken from the column label). This works great.
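The iteration idea above can be sketched in a few lines. This is a hypothetical standalone version, not the actual class method: the `iterseries` name and the plain-dict metadata are assumptions for illustration.

```python
import pandas as pd

def iterseries(frame, metadata, column_key):
    """Yield (metadata, series) pairs from a frame, restoring the
    dropped `column_key` field from each column label."""
    for label in frame.columns:
        child_meta = dict(metadata)      # copy the parent's metadata
        child_meta[column_key] = label   # restore the dropped field
        yield child_meta, frame[label]

idx = pd.date_range("2012-04-16", periods=3, freq="D")
frame = pd.DataFrame({24: [1, 2, 3], 48: [4, 5, 6]}, index=idx)
meta = {"experiment": "Test1", "region": "Eastern Pacific"}

pairs = list(iterseries(frame, meta, "Lead Time"))
```

Each yielded pair carries all of the parent's metadata plus the restored Lead Time value taken from the column label.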

Problem

I've run out of dimensions (up to 3-D with a Panel) and I'm also not able to use things like dropna to remove empty columns once everything is aligned in the Panel (this has led to several bugs when plotting summary statistics). Reading about using pandas with higher-dimensional data has led to reading about the MultiIndex and its use. I've tried the examples given in the documentation, but I'm still a little unclear how to apply it to my situation. Any direction would be useful. I'd like to be able to:

  • Combine my Series-based data into a multi-indexed DataFrame along an arbitrary number of dimensions (this would be great - it would eliminate one call to create the frames from the series, and then another to create the panels from the frames)
  • Iterate over the resulting multi-indexed DataFrame, dropping a single dimension so I can reset the component metadata.
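Both goals can be sketched with `pandas.concat` alone, applied once per dimension; the metadata field names here (`Lead Time`, `Experiment`) are illustrative:

```python
import pandas as pd

idx = pd.date_range("2012-04-16", periods=3, freq="D")
s = pd.Series([0, 1, 2], index=idx)

# One concat per metadata field, from the inside out; `names` labels
# each new column level so the field name travels with the data.
by_lead = pd.concat({24: s, 48: s}, axis=1, names=["Lead Time"])
by_expt = pd.concat({"Control": by_lead, "Test1": by_lead},
                    axis=1, names=["Experiment"])

# Dropping one dimension: select a label on a given level with xs;
# the selected level disappears from the result's columns.
subset = by_expt.xs(24, axis=1, level="Lead Time")
```

The column index of `by_expt` now records both level names, so iterating over `by_expt.columns.names` and `xs` calls can replace the hand-rolled frame/panel machinery.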

Edit - Add Code Sample

Wes McKinney's answer below is almost exactly what I need - the issue is in the initial translation from the Series-backed storage objects I have to work with to my DataFrame-backed objects once I start grouping elements together. The DataFrame-backed class has the following method that takes a list of the Series-based objects and the metadata field that will vary across the columns.

import itertools

import pandas

@classmethod
def from_list(cls, results_list, column_key):
    """
    Populate object from a list of results that all share the metadata except
    for the field `column_key`.

    """
    # Need two copies of the input results - one for building the object
    # data and one for building the object metadata
    for_data, for_metadata = itertools.tee(results_list)

    obj = cls()
    obj.column_key = column_key
    obj.metadata = next(for_metadata).metadata.copy()
    obj.metadata.pop(column_key, None)
    # `transform` maps a metadata value to a column label, and
    # `r[column_key]` looks the field up in the result's metadata
    obj.data = pandas.DataFrame({transform(r[column_key]): r.data
                                 for r in for_data})
    return obj

Once I have the frame given by this routine, I can easily apply the various operations suggested below - of particular utility is being able to use the names field when I call concat - this eliminates the need to store the name of the column key internally since it's stored in the MultiIndex as the name of that index dimension.

I'd like to be able to implement the solution below and just take in the list of matching Series-backed classes and a list of keys and do the grouping sequentially. However, I don't know what the columns will be representing ahead of time, so:

  • it really doesn't make sense to me to store the Series data in a 1-D DataFrame
  • I don't see how to set the name of the index and the columns from the initial Series -> Frame grouping
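A minimal sketch of that sequential grouping, assuming each result is reduced to a (metadata dict, Series) pair rather than the actual Series-backed class; `combine` is a hypothetical helper, and each listed field is assumed to take one value per series:

```python
import pandas as pd

def combine(results, keys):
    """Group (metadata, Series) pairs along several metadata fields
    at once: tuple keys become a column MultiIndex, and `names`
    records the field names on the index levels."""
    pieces = {tuple(meta[k] for k in keys): data for meta, data in results}
    return pd.concat(pieces, axis=1, names=list(keys))

idx = pd.date_range("2012-04-16", periods=3, freq="D")
results = [
    ({"Experiment": "Test1", "Lead Time": 24}, pd.Series([0, 1, 2], index=idx)),
    ({"Experiment": "Test1", "Lead Time": 48}, pd.Series([3, 4, 5], index=idx)),
]
df = combine(results, ["Experiment", "Lead Time"])
```

Because the level names live on `df.columns.names`, there is no need to store `column_key` on the object at all.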
Asked Jun 14 '12 by Tim Whitcomb



1 Answer

I might suggest using pandas.concat along with its keys argument to glue together Series or DataFrames, creating a MultiIndex in the columns:

In [20]: data
Out[20]: 
{'a': 2012-04-16    0
2012-04-17    1
2012-04-18    2
2012-04-19    3
2012-04-20    4
2012-04-21    5
2012-04-22    6
2012-04-23    7
2012-04-24    8
2012-04-25    9
Freq: D,
 'b': 2012-04-16    0
2012-04-17    1
2012-04-18    2
2012-04-19    3
2012-04-20    4
2012-04-21    5
2012-04-22    6
2012-04-23    7
2012-04-24    8
2012-04-25    9
Freq: D,
 'c': 2012-04-16    0
2012-04-17    1
2012-04-18    2
2012-04-19    3
2012-04-20    4
2012-04-21    5
2012-04-22    6
2012-04-23    7
2012-04-24    8
2012-04-25    9
Freq: D}

In [21]: df = pd.concat(data, axis=1, keys=['a', 'b', 'c'])

In [22]: df
Out[22]: 
            a  b  c
2012-04-16  0  0  0
2012-04-17  1  1  1
2012-04-18  2  2  2
2012-04-19  3  3  3
2012-04-20  4  4  4
2012-04-21  5  5  5
2012-04-22  6  6  6
2012-04-23  7  7  7
2012-04-24  8  8  8
2012-04-25  9  9  9

In [23]: df2 = pd.concat([df, df], axis=1, keys=['group1', 'group2'])

In [24]: df2
Out[24]: 
            group1        group2      
                 a  b  c       a  b  c
2012-04-16       0  0  0       0  0  0
2012-04-17       1  1  1       1  1  1
2012-04-18       2  2  2       2  2  2
2012-04-19       3  3  3       3  3  3
2012-04-20       4  4  4       4  4  4
2012-04-21       5  5  5       5  5  5
2012-04-22       6  6  6       6  6  6
2012-04-23       7  7  7       7  7  7
2012-04-24       8  8  8       8  8  8
2012-04-25       9  9  9       9  9  9

You then have:

In [25]: df2['group2']
Out[25]: 
            a  b  c
2012-04-16  0  0  0
2012-04-17  1  1  1
2012-04-18  2  2  2
2012-04-19  3  3  3
2012-04-20  4  4  4
2012-04-21  5  5  5
2012-04-22  6  6  6
2012-04-23  7  7  7
2012-04-24  8  8  8
2012-04-25  9  9  9

or even

In [27]: df2.xs('b', axis=1, level=1)
Out[27]: 
            group1  group2
2012-04-16       0       0
2012-04-17       1       1
2012-04-18       2       2
2012-04-19       3       3
2012-04-20       4       4
2012-04-21       5       5
2012-04-22       6       6
2012-04-23       7       7
2012-04-24       8       8
2012-04-25       9       9

You can have arbitrarily many levels:

In [29]: pd.concat([df2, df2], axis=1, keys=['tier1', 'tier2'])
Out[29]: 
             tier1                       tier2                    
            group1        group2        group1        group2      
                 a  b  c       a  b  c       a  b  c       a  b  c
2012-04-16       0  0  0       0  0  0       0  0  0       0  0  0
2012-04-17       1  1  1       1  1  1       1  1  1       1  1  1
2012-04-18       2  2  2       2  2  2       2  2  2       2  2  2
2012-04-19       3  3  3       3  3  3       3  3  3       3  3  3
2012-04-20       4  4  4       4  4  4       4  4  4       4  4  4
2012-04-21       5  5  5       5  5  5       5  5  5       5  5  5
2012-04-22       6  6  6       6  6  6       6  6  6       6  6  6
2012-04-23       7  7  7       7  7  7       7  7  7       7  7  7
2012-04-24       8  8  8       8  8  8       8  8  8       8  8  8
2012-04-25       9  9  9       9  9  9       9  9  9       9  9  9
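To iterate over one dimension of such a frame (the question's `iterframes` use case), walking the unique labels of the outer column level works, and `dropna` applies directly where a Panel could not help. A small sketch built on the same shapes as above:

```python
import pandas as pd

idx = pd.date_range("2012-04-16", periods=3, freq="D")
df = pd.DataFrame({"a": [0, 1, 2], "b": [0, 1, 2], "c": [0, 1, 2]},
                  index=idx)
df2 = pd.concat([df, df], axis=1, keys=["group1", "group2"])

# Iterate over the outer column level; selecting a label peels off
# that level, leaving a plain DataFrame with columns a, b, c.
for key in df2.columns.get_level_values(0).unique():
    sub = df2[key]

# dropna works on the MultiIndexed columns without any unstacking.
cleaned = df2.dropna(axis=1, how="all")
```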
Answered by Wes McKinney