Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to handle meta data associated with a pandas dataframe?

Q1: What is the best practice for saving meta information to a dataframe? I know of the following coding practice

import pandas as pd
df = pd.DataFrame([])
df.currency = 'USD'
df.measure = 'Price'
df.frequency = 'daily'

But as stated in this post Adding meta-information/metadata to pandas DataFrame this is associated with the risk of losing the information by appling functions such as "groupby, pivot, join or loc" as they may return "a new DataFrame without the metadata attached".

Is this still valid or has there been an update to meta information processing in the meantime? Is it good coding practice to subclass pandas for this purpose?

Q2: What would be an alternative coding practice?

I do not think building a seperate object is very suitable. Also working with Multiindex does not convince me. Lets say I want to divide a dataframe with prices by a dataframe with earnings. Working with Multiindices would be very involved.

#define price DataFrame
p_index = pd.MultiIndex.from_tuples([['Apple', 'price', 'daily'],['MSFT', 'price', 'daily']])
price = pd.DataFrame([[90, 20], [85, 30], [70, 25]], columns=p_index)

# define earnings dataframe
e_index = pd.MultiIndex.from_tuples(
    [['Apple', 'earnings', 'daily'], ['MSFT', 'earnings', 'daily']])
earnings=pd.DataFrame([[5000, 2000], [5800, 2200], [5100, 3000]], 
                columns=e_index)

price.divide(earnings.values, level=1, axis=0)

In the example above I do not even ensure that the company indices really match. I would probably need to invoke a pd.DataFrame.reindex() or similar. This cannot be a good coding practice in my point of view.

Is there a straightforward solution to the problem of handling meta information in that context that I don't see?

Thank you in advance

like image 891
quantguy Avatar asked Sep 28 '16 15:09

quantguy


2 Answers

Although building a custom object is not your first choice, it might be your only feasible option, and has the significant advantage of being extremely flexible. Here's a really simple example:

df=pd.DataFrame({'stock': 'AAPL AAPL MSFT MSFT'.split(),
                 'price':[ 445.,455.,195.,205.]})

col_labels = { 'stock' : 'Ticker Symbol',
               'price' : 'Closing Price in USD' }

That's just a dictionary of column labels, but often the majority of metadata is related to specific columns. Here's the sample data, with labels:

df.rename(columns=col_labels)

#   Ticker Symbol  Closing Price in USD
# 0          AAPL                 445.0
# 1          AAPL                 455.0
# 2          MSFT                 195.0
# 3          MSFT                 205.0

The nice thing is that the labels "persist" in the sense that you can basically apply them to any data whose columns are a subset or superset of the original columns:

df.groupby('stock').mean().rename(columns=col_labels)

#        Closing Price in USD
# stock                      
# AAPL                  450.0
# MSFT                  200.0

You can get some limited persistence if you use the attrs attribute:

df.attrs = col_labels

But it's fairly limited. It will persist for dataframes derived via .copy(),loc[], or iloc[], but not for a groupby(). You can of course reattach to any derivative dataframe with, for example,

df2.attrs = df.attrs

But as noted in the documentation (or lack thereof), this is an experimental feature and subject to change. Seems slightly better than nothing, and maybe will be expanded in the future. I couldn't find much info at all regarding attrs, but it appears to be initialized as an empty dictionary, and can only be a dictionary (or similar) although of course lists could be nested below the top level.

like image 82
JohnE Avatar answered Sep 19 '22 16:09

JohnE


I think that MultiIndexes is the way to go, but this way:

daily_price_data = pd.DataFrame({'Apple': [90, 85, 30], 'MSFT':[20, 30, 25]})
daily_earnings_data = pd.DataFrame({'Apple': [5000, 58000, 5100], 'MSFT':[2000, 2200, 3000]})
data = pd.concat({'price':daily_price_data, 'earnings': daily_earnings_data}, axis=1)
data


    earnings        price
    Apple   MSFT    Apple   MSFT
0   5000    2000    90      20
1   58000   2200    85      30
2   5100    3000    30      25

Then, to divide:

data['price'] / data['earnings']

If you find that your workflow makes more sense to have companies listed on the first level of the index, then pandas.DataFrame.xs will be very helpful:

data2 = data.reorder_levels([1,0], axis=1).sort_index(axis=1)
data2.xs('price', axis=1, level=-1) / data2.xs('earnings', axis=1, level=-1)
like image 21
Gordon Bean Avatar answered Sep 18 '22 16:09

Gordon Bean