I'm using the excellent pandas package to deal with a large amount of varied meteorological diagnostic data, and I'm quickly running out of dimensions as I stitch the data together. Looking at the documentation, it may be that a MultiIndex will solve my problem, but I'm not sure how to apply it to my situation: the documentation shows examples of creating MultiIndexes with random data and DataFrames, but not with Series of pre-existing timeseries data.
Background
The basic data structure I'm using contains two main fields:

- metadata, a dictionary of key-value pairs describing what the numbers are
- data, a pandas data structure containing the numbers themselves

The lowest common denominator is timeseries data, so the basic structure has a pandas Series object as the data entry, and the metadata field describes what those numbers actually are (e.g. vector RMS error for 10-meter wind over the Eastern Pacific for a 24-hour forecast from experiment Test1).
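For concreteness, here is a minimal sketch of such a container; the class name, field values, and metadata keys are hypothetical, not the actual implementation:

import pandas

class Result(object):
    """Hypothetical container mirroring the description above."""
    def __init__(self, data, metadata):
        self.data = data          # pandas Series holding the timeseries values
        self.metadata = metadata  # dict of key-value pairs describing the data

dates = pandas.date_range('2012-04-16', periods=10)
result = Result(pandas.Series(range(10), index=dates),
                {'Experiment': 'Test1', 'Region': 'Eastern Pacific',
                 'Lead Time': '24h', 'Variable': '10m wind vector RMS error'})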
I'm looking at taking that lowest common denominator and gluing the various timeseries together to make the results more useful and allow for easy combinations. For instance, I may want to look at all the different lead times: I have a filter routine that takes timeseries sharing the same metadata entries except for lead time (e.g. experiment, region, etc.) and returns a new object whose metadata field consists of only the common entries (i.e. Lead Time has been removed) and whose data field is a pandas DataFrame with the column labels given by the Lead Time values. I can extend this again and say I want to take the resulting frames and group them together with only one other entry varying (e.g. the Experiment) to give me a pandas Panel, where the item index is given by the Experiment metadata values of the constituent frames, and the object's new metadata contains neither Lead Time nor Experiment.
When I iterate over these composite objects, I have an iterseries routine for the frame and an iterframes routine for the panel that reconstruct the appropriate metadata/data pairing as I drop one dimension (i.e. the series from the frame with lead time varying across the columns will have all the metadata of its parent, plus the Lead Time field restored with the value taken from the column label). This works great.
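As a rough sketch of what iterseries might look like, assuming the hypothetical Result container above and a frame-backed object exposing data, metadata, and the varying column_key (the real routine may differ):

def iterseries(frame_result):
    # Walk the DataFrame columns, yielding one Series-backed result per
    # column with the dropped metadata entry (e.g. 'Lead Time') restored.
    for label in frame_result.data.columns:
        metadata = frame_result.metadata.copy()
        metadata[frame_result.column_key] = label
        yield Result(frame_result.data[label], metadata)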
Problem
I've run out of dimensions (up to 3-D with a Panel), and I'm also not able to use things like dropna to remove empty columns once everything is aligned in the Panel (this has led to several bugs when plotting summary statistics). Reading about using pandas with higher-dimensional data has led me to the MultiIndex and its use. I've tried the examples given in the documentation, but I'm still a little unclear how to apply it to my situation. Any direction would be useful. I'd like to be able to (see the sketch after this list):

- Combine my Series-based data into a multi-indexed DataFrame along an arbitrary number of dimensions (this would be great: it would eliminate one call to create the frames from the series, and then another to create the panels from the frames)
- Iterate over the resulting multi-indexed DataFrame, dropping a single dimension so I can reset the component metadata
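Both wishes can be sketched against a toy dict of Series keyed by (Experiment, Lead Time) tuples; the names and keys here are made up for illustration:

import pandas

dates = pandas.date_range('2012-04-16', periods=10)
series = {('Test1', '24h'): pandas.Series(range(10), index=dates),
          ('Test1', '48h'): pandas.Series(range(10), index=dates),
          ('Test2', '24h'): pandas.Series(range(10), index=dates)}

# 1. Combine along two dimensions at once: the tuple keys become a
#    two-level MultiIndex in the columns.
df = pandas.concat(series, axis=1, names=['Experiment', 'Lead Time'])

# 2. Iterate over one dimension, dropping it: selecting an outer level
#    value leaves a plain DataFrame keyed by the remaining level.
for experiment in df.columns.levels[0]:
    sub = df[experiment]  # columns are now just the Lead Time values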
Edit - Add Code Sample
Wes McKinney's answer below is almost exactly what I need; the issue is in the initial translation from the Series-backed storage objects I have to work with into my DataFrame-backed objects once I start grouping elements together. The DataFrame-backed class has the following method, which takes in a list of the Series-based objects and the metadata field that will vary across the columns.
@classmethod
def from_list(cls, results_list, column_key):
    """
    Populate object from a list of results that all share the metadata except
    for the field `column_key`.
    """
    # Need two copies of the input results - one for building the object
    # data and one for building the object metadata
    for_data, for_metadata = itertools.tee(results_list)
    self = cls()
    self.column_key = column_key
    self.metadata = next(for_metadata).metadata.copy()
    if column_key in self.metadata:
        del self.metadata[column_key]
    # `itertools` and `pandas` are imported at module level; `transform` is a
    # key-normalizing helper defined elsewhere.
    self.data = pandas.DataFrame(dict((transform(r[column_key]), r.data)
                                      for r in for_data))
    return self
Once I have the frame given by this routine, I can easily apply the various operations suggested below. Of particular utility is being able to use the names field when I call concat: this eliminates the need to store the name of the column key internally, since it's stored in the MultiIndex as the name of that index dimension. I'd like to be able to implement the solution below and just take in the list of matching Series-backed classes and a list of keys and do the grouping sequentially. However, I don't know what the columns will represent ahead of time.
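In other words, something like the following sketch, where the list of Series and the lead-time keys are placeholders:

import pandas

dates = pandas.date_range('2012-04-16', periods=10)
lead_series = [pandas.Series(range(10), index=dates) for _ in range(3)]

# `names` records what the key dimension means, so the column key no
# longer needs to be stored on the object itself.
df = pandas.concat(lead_series, axis=1,
                   keys=['24h', '48h', '72h'], names=['Lead Time'])
print(df.columns.name)  # prints: Lead Time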
I might suggest using pandas.concat along with its keys argument to glue together Series (or DataFrames) to create a MultiIndex in the columns:
In [20]: data
Out[20]:
{'a': 2012-04-16 0
2012-04-17 1
2012-04-18 2
2012-04-19 3
2012-04-20 4
2012-04-21 5
2012-04-22 6
2012-04-23 7
2012-04-24 8
2012-04-25 9
Freq: D,
'b': 2012-04-16 0
2012-04-17 1
2012-04-18 2
2012-04-19 3
2012-04-20 4
2012-04-21 5
2012-04-22 6
2012-04-23 7
2012-04-24 8
2012-04-25 9
Freq: D,
'c': 2012-04-16 0
2012-04-17 1
2012-04-18 2
2012-04-19 3
2012-04-20 4
2012-04-21 5
2012-04-22 6
2012-04-23 7
2012-04-24 8
2012-04-25 9
Freq: D}
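For reference, a dict of Series like the one above can be rebuilt with, for example:

import pandas as pd

dates = pd.date_range('2012-04-16', periods=10)
data = dict((k, pd.Series(range(10), index=dates)) for k in ['a', 'b', 'c'])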
In [21]: df = pd.concat(data, axis=1, keys=['a', 'b', 'c'])
In [22]: df
Out[22]:
a b c
2012-04-16 0 0 0
2012-04-17 1 1 1
2012-04-18 2 2 2
2012-04-19 3 3 3
2012-04-20 4 4 4
2012-04-21 5 5 5
2012-04-22 6 6 6
2012-04-23 7 7 7
2012-04-24 8 8 8
2012-04-25 9 9 9
In [23]: df2 = pd.concat([df, df], axis=1, keys=['group1', 'group2'])
In [24]: df2
Out[24]:
group1 group2
a b c a b c
2012-04-16 0 0 0 0 0 0
2012-04-17 1 1 1 1 1 1
2012-04-18 2 2 2 2 2 2
2012-04-19 3 3 3 3 3 3
2012-04-20 4 4 4 4 4 4
2012-04-21 5 5 5 5 5 5
2012-04-22 6 6 6 6 6 6
2012-04-23 7 7 7 7 7 7
2012-04-24 8 8 8 8 8 8
2012-04-25 9 9 9 9 9 9
You have then:
In [25]: df2['group2']
Out[25]:
a b c
2012-04-16 0 0 0
2012-04-17 1 1 1
2012-04-18 2 2 2
2012-04-19 3 3 3
2012-04-20 4 4 4
2012-04-21 5 5 5
2012-04-22 6 6 6
2012-04-23 7 7 7
2012-04-24 8 8 8
2012-04-25 9 9 9
or even
In [27]: df2.xs('b', axis=1, level=1)
Out[27]:
group1 group2
2012-04-16 0 0
2012-04-17 1 1
2012-04-18 2 2
2012-04-19 3 3
2012-04-20 4 4
2012-04-21 5 5
2012-04-22 6 6
2012-04-23 7 7
2012-04-24 8 8
2012-04-25 9 9
You can have arbitrarily many levels:
In [29]: pd.concat([df2, df2], axis=1, keys=['tier1', 'tier2'])
Out[29]:
tier1 tier2
group1 group2 group1 group2
a b c a b c a b c a b c
2012-04-16 0 0 0 0 0 0 0 0 0 0 0 0
2012-04-17 1 1 1 1 1 1 1 1 1 1 1 1
2012-04-18 2 2 2 2 2 2 2 2 2 2 2 2
2012-04-19 3 3 3 3 3 3 3 3 3 3 3 3
2012-04-20 4 4 4 4 4 4 4 4 4 4 4 4
2012-04-21 5 5 5 5 5 5 5 5 5 5 5 5
2012-04-22 6 6 6 6 6 6 6 6 6 6 6 6
2012-04-23 7 7 7 7 7 7 7 7 7 7 7 7
2012-04-24 8 8 8 8 8 8 8 8 8 8 8 8
2012-04-25 9 9 9 9 9 9 9 9 9 9 9 9
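Tying this back to the question's edit: passing names to concat labels each level, and a named level can then be selected or cross-sectioned without storing the column key separately. A sketch, reusing df from In [22]; the level names 'Group' and 'Letter' are made up for illustration:

# `names` labels both column levels, so cross-sections can use the
# level name rather than an integer position.
df3 = pd.concat([df, df], axis=1, keys=['group1', 'group2'],
                names=['Group', 'Letter'])

df3.xs('b', axis=1, level='Letter')  # same as xs('b', axis=1, level=1)
df3['group1']                        # selects one Group, dropping that level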