I'm using the excellent pandas package to deal with a large amount of varied meteorological diagnostic data, and I'm quickly running out of dimensions as I stitch the data together. Looking at the documentation, it may be that a MultiIndex will solve my problem, but I'm not sure how to apply it to my situation: the documentation shows examples of creating MultiIndexes with random data and DataFrames, but not with Series of pre-existing timeseries data.
Background
The basic data structure I'm using contains two main fields:

- metadata, a dictionary of key-value pairs describing what the numbers are
- data, a pandas data structure containing the numbers themselves

The lowest common denominator is timeseries data, so the basic structure has a pandas Series object as the data entry, and the metadata field describes what those numbers actually are (e.g. vector RMS error for 10-meter wind over the Eastern Pacific for a 24-hour forecast from experiment Test1).
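For concreteness, here is a minimal sketch of such a container; the class name, field values, and metadata keys are hypothetical, not the actual implementation:

import pandas

class Result(object):
    """Hypothetical container mirroring the description above."""
    def __init__(self, data, metadata):
        self.data = data          # pandas Series holding the timeseries values
        self.metadata = metadata  # dict of key-value pairs describing the data

dates = pandas.date_range('2012-04-16', periods=10)
result = Result(pandas.Series(range(10), index=dates),
                {'Experiment': 'Test1', 'Region': 'Eastern Pacific',
                 'Lead Time': '24h', 'Variable': '10m wind vector RMS error'})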
I'm looking at taking that lowest common denominator and gluing the various timeseries together to make the results more useful and allow for easy combinations. For instance, I may want to look at all the different lead times: I have a filter routine that takes timeseries sharing the same metadata entries except for lead time (e.g. experiment, region, etc.) and returns a new object whose metadata field consists of only the common entries (i.e. Lead Time has been removed) and whose data field is a pandas DataFrame with the column labels given by the Lead Time values. I can extend this again and say I want to take the resulting frames and group them together with only one other entry varying (e.g. the Experiment) to give me a pandas Panel, where the item index is given by the Experiment metadata values of the constituent frames, and the object's new metadata contains neither Lead Time nor Experiment.
When I iterate over these composite objects, I have an iterseries routine for the frame and an iterframes routine for the panel that reconstruct the appropriate metadata/data pairing as I drop one dimension (i.e. the series from the frame with lead time varying across the columns will have all the metadata of its parent, plus the Lead Time field restored with the value taken from the column label). This works great.
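As a rough sketch of what iterseries might look like, assuming the hypothetical Result container above and a frame-backed object exposing data, metadata, and the varying column_key (the real routine may differ):

def iterseries(frame_result):
    # Walk the DataFrame columns, yielding one Series-backed result per
    # column with the dropped metadata entry (e.g. 'Lead Time') restored.
    for label in frame_result.data.columns:
        metadata = frame_result.metadata.copy()
        metadata[frame_result.column_key] = label
        yield Result(frame_result.data[label], metadata)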
Problem
I've run out of dimensions (up to 3-D with a Panel), and I'm also not able to use things like dropna to remove empty columns once everything is aligned in the Panel (this has led to several bugs when plotting summary statistics). Reading about using pandas with higher-dimensional data has led me to the MultiIndex and its use. I've tried the examples given in the documentation, but I'm still a little unclear how to apply it to my situation. Any direction would be useful. I'd like to be able to (see the sketch after this list):

- Combine my Series-based data into a multi-indexed DataFrame along an arbitrary number of dimensions (this would be great: it would eliminate one call to create the frames from the series, and then another to create the panels from the frames)
- Iterate over the resulting multi-indexed DataFrame, dropping a single dimension so I can reset the component metadata
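Both wishes can be sketched against a toy dict of Series keyed by (Experiment, Lead Time) tuples; the names and keys here are made up for illustration:

import pandas

dates = pandas.date_range('2012-04-16', periods=10)
series = {('Test1', '24h'): pandas.Series(range(10), index=dates),
          ('Test1', '48h'): pandas.Series(range(10), index=dates),
          ('Test2', '24h'): pandas.Series(range(10), index=dates)}

# 1. Combine along two dimensions at once: the tuple keys become a
#    two-level MultiIndex in the columns.
df = pandas.concat(series, axis=1, names=['Experiment', 'Lead Time'])

# 2. Iterate over one dimension, dropping it: selecting an outer level
#    value leaves a plain DataFrame keyed by the remaining level.
for experiment in df.columns.levels[0]:
    sub = df[experiment]  # columns are now just the Lead Time values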
Edit - Add Code Sample
Wes McKinney's answer below is almost exactly what I need; the issue is in the initial translation from the Series-backed storage objects I have to work with into my DataFrame-backed objects once I start grouping elements together. The DataFrame-backed class has the following method, which takes in a list of the Series-based objects and the metadata field that will vary across the columns.
@classmethod
def from_list(cls, results_list, column_key):
    """
    Populate object from a list of results that all share the metadata except
    for the field `column_key`.
    """
    # Need two copies of the input results - one for building the object
    # data and one for building the object metadata
    for_data, for_metadata = itertools.tee(results_list)
    self = cls()
    self.column_key = column_key
    self.metadata = next(for_metadata).metadata.copy()
    if column_key in self.metadata:
        del self.metadata[column_key]
    # `itertools` and `pandas` are imported at module level; `transform` is a
    # key-normalizing helper defined elsewhere.
    self.data = pandas.DataFrame(dict((transform(r[column_key]), r.data)
                                      for r in for_data))
    return self
Once I have the frame given by this routine, I can easily apply the various operations suggested below. Of particular utility is being able to use the names field when I call concat: this eliminates the need to store the name of the column key internally, since it's stored in the MultiIndex as the name of that index dimension. I'd like to be able to implement the solution below and just take in the list of matching Series-backed classes and a list of keys and do the grouping sequentially. However, I don't know what the columns will represent ahead of time.
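In other words, something like the following sketch, where the list of Series and the lead-time keys are placeholders:

import pandas

dates = pandas.date_range('2012-04-16', periods=10)
lead_series = [pandas.Series(range(10), index=dates) for _ in range(3)]

# `names` records what the key dimension means, so the column key no
# longer needs to be stored on the object itself.
df = pandas.concat(lead_series, axis=1,
                   keys=['24h', '48h', '72h'], names=['Lead Time'])
print(df.columns.name)  # prints: Lead Time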
I might suggest using pandas.concat along with its keys argument to glue together Series (or DataFrames) to create a MultiIndex in the columns:
In [20]: data
Out[20]:
{'a': 2012-04-16 0
2012-04-17 1
2012-04-18 2
2012-04-19 3
2012-04-20 4
2012-04-21 5
2012-04-22 6
2012-04-23 7
2012-04-24 8
2012-04-25 9
Freq: D,
'b': 2012-04-16 0
2012-04-17 1
2012-04-18 2
2012-04-19 3
2012-04-20 4
2012-04-21 5
2012-04-22 6
2012-04-23 7
2012-04-24 8
2012-04-25 9
Freq: D,
'c': 2012-04-16 0
2012-04-17 1
2012-04-18 2
2012-04-19 3
2012-04-20 4
2012-04-21 5
2012-04-22 6
2012-04-23 7
2012-04-24 8
2012-04-25 9
Freq: D}
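For reference, a dict of Series like the one above can be rebuilt with, for example:

import pandas as pd

dates = pd.date_range('2012-04-16', periods=10)
data = dict((k, pd.Series(range(10), index=dates)) for k in ['a', 'b', 'c'])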
In [21]: df = pd.concat(data, axis=1, keys=['a', 'b', 'c'])
In [22]: df
Out[22]:
a b c
2012-04-16 0 0 0
2012-04-17 1 1 1
2012-04-18 2 2 2
2012-04-19 3 3 3
2012-04-20 4 4 4
2012-04-21 5 5 5
2012-04-22 6 6 6
2012-04-23 7 7 7
2012-04-24 8 8 8
2012-04-25 9 9 9
In [23]: df2 = pd.concat([df, df], axis=1, keys=['group1', 'group2'])
In [24]: df2
Out[24]:
group1 group2
a b c a b c
2012-04-16 0 0 0 0 0 0
2012-04-17 1 1 1 1 1 1
2012-04-18 2 2 2 2 2 2
2012-04-19 3 3 3 3 3 3
2012-04-20 4 4 4 4 4 4
2012-04-21 5 5 5 5 5 5
2012-04-22 6 6 6 6 6 6
2012-04-23 7 7 7 7 7 7
2012-04-24 8 8 8 8 8 8
2012-04-25 9 9 9 9 9 9
You have then:
In [25]: df2['group2']
Out[25]:
a b c
2012-04-16 0 0 0
2012-04-17 1 1 1
2012-04-18 2 2 2
2012-04-19 3 3 3
2012-04-20 4 4 4
2012-04-21 5 5 5
2012-04-22 6 6 6
2012-04-23 7 7 7
2012-04-24 8 8 8
2012-04-25 9 9 9
or even
In [27]: df2.xs('b', axis=1, level=1)
Out[27]:
group1 group2
2012-04-16 0 0
2012-04-17 1 1
2012-04-18 2 2
2012-04-19 3 3
2012-04-20 4 4
2012-04-21 5 5
2012-04-22 6 6
2012-04-23 7 7
2012-04-24 8 8
2012-04-25 9 9
You can have arbitrarily many levels:
In [29]: pd.concat([df2, df2], axis=1, keys=['tier1', 'tier2'])
Out[29]:
tier1 tier2
group1 group2 group1 group2
a b c a b c a b c a b c
2012-04-16 0 0 0 0 0 0 0 0 0 0 0 0
2012-04-17 1 1 1 1 1 1 1 1 1 1 1 1
2012-04-18 2 2 2 2 2 2 2 2 2 2 2 2
2012-04-19 3 3 3 3 3 3 3 3 3 3 3 3
2012-04-20 4 4 4 4 4 4 4 4 4 4 4 4
2012-04-21 5 5 5 5 5 5 5 5 5 5 5 5
2012-04-22 6 6 6 6 6 6 6 6 6 6 6 6
2012-04-23 7 7 7 7 7 7 7 7 7 7 7 7
2012-04-24 8 8 8 8 8 8 8 8 8 8 8 8
2012-04-25 9 9 9 9 9 9 9 9 9 9 9 9
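Tying this back to the question's edit: passing names to concat labels each level, and a named level can then be selected or cross-sectioned without storing the column key separately. A sketch, reusing df from In [22]; the level names 'Group' and 'Letter' are made up for illustration:

# `names` labels both column levels, so cross-sections can use the
# level name rather than an integer position.
df3 = pd.concat([df, df], axis=1, keys=['group1', 'group2'],
                names=['Group', 'Letter'])

df3.xs('b', axis=1, level='Letter')  # same as xs('b', axis=1, level=1)
df3['group1']                        # selects one Group, dropping that level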