Pandas Design Considerations for MultiIndexed Dataframes

Tags:

python

pandas

The purpose of this question is to explore MultiIndex DataFrames further and to ask what the best approach is for a few common tasks.

Create the DataFrame

import pandas as pd

df = pd.DataFrame({'index_date' : ['12/07/2016','12/07/2016','12/07/2016','12/07/2016','12/07/2016'], 
               'portfolio' : ['A','B','C','D','E'], 
               'reporting_ccy' : ['GBP','GBP','GBP','GBP','GBP'],
               'portfolio_ccy' : ['JPY','USD','USD','EUR','EUR'],
               'amount' : [100,200,300,400,500],
               'injection' : [1,2,3,4,5],
               'to_usd' : [1.3167,1.3167,1.3167,1.3167,1.3167],
               'to_ccy' : [0.009564,1,1,1.1093,1.1093],
               'm5' : [2,4,6,8,10],
               'm6' : [1,3,5,7,9]})

Pivot the DataFrame

df_pivot = df.pivot_table(index='index_date',columns=['portfolio','portfolio_ccy','reporting_ccy']).swaplevel(0, 1, axis=1).sort_index(axis=1)

Rename the columns

df_pivot.columns.names = ['portfolio','measures', 'portfolio_ccy', 'reporting_ccy']

This yields a pivoted representation of the data such that:

1. a portfolio may have one or many measures
2. it shows the portfolio default currency
3. it shows the portfolio reporting currency
4. a measure may have one or many reporting currencies.

In terms of point 4, what is the best approach for implementation, given that we have the exchange rates for the currencies?

The goal is a DataFrame like the one built by hand here:

Create DataFrame

df1 = pd.DataFrame({'index_date' : ['12/07/2016','12/07/2016','12/07/2016','12/07/2016','12/07/2016'], 
           'portfolio' : ['A','B','C','D','E'], 
           'reporting_ccy' : ['JPY','USD','USD','EUR','EUR'],
           'portfolio_ccy' : ['JPY','USD','USD','EUR','EUR'],
           'amount' : [13767.2522, 263.34, 395.01, 474.785901, 593.4823763],
           'injection' : [1,2,3,4,5],
           'to_usd' : [0.009564, 1, 1, 1.1093, 1.1093],
           'to_ccy' : [1.3167, 1.3167, 1.3167, 1.3167, 1.3167],
           'm5' : [2,4,6,8,10],
           'm6' : [1,3,5,7,9]})
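
The hand-typed amounts in df1 look like they follow amount * to_usd / to_ccy applied to df (for example 100 * 1.3167 / 0.009564 is roughly 13767.25 for portfolio A). Assuming that reading of the two rate columns is right, a minimal sketch that derives the converted rows from df instead of typing them in could look like this. The names to_portfolio_ccy and df_converted are made up for the illustration, and the frame is trimmed to the columns the conversion needs.

```python
import pandas as pd

# Trimmed copy of df: only the columns needed for the conversion.
df = pd.DataFrame({
    'index_date': ['12/07/2016'] * 5,
    'portfolio': ['A', 'B', 'C', 'D', 'E'],
    'reporting_ccy': ['GBP'] * 5,
    'portfolio_ccy': ['JPY', 'USD', 'USD', 'EUR', 'EUR'],
    'amount': [100, 200, 300, 400, 500],
    'to_usd': [1.3167] * 5,
    'to_ccy': [0.009564, 1, 1, 1.1093, 1.1093],
})


def to_portfolio_ccy(frame):
    """Hypothetical helper: restate 'amount' in each row's portfolio currency.

    Assumption: to_usd converts the reporting currency to USD, and to_ccy is
    the USD value of one unit of the portfolio currency.
    """
    return frame.assign(
        amount=frame['amount'] * frame['to_usd'] / frame['to_ccy'],
        reporting_ccy=frame['portfolio_ccy'],
    )


df_converted = to_portfolio_ccy(df)
print(df_converted[['portfolio', 'reporting_ccy', 'amount']])
```

assign returns a new frame, so df keeps its GBP rows and the two frames can still be combined afterwards.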

Concatenate & Pivot the DataFrames

df_concat = pd.concat([df,df1])
df_pivot1 = df_concat.pivot_table(index='index_date',columns=['portfolio','portfolio_ccy','reporting_ccy']).swaplevel(0, 1, axis=1).sort_index(axis=1)
df_pivot1.columns.names = ['portfolio','measures', 'portfolio_ccy', 'reporting_ccy']

This now shows 1 measure having many currencies.

df_pivot1.xs(('amount', 'A'), level=('measures','portfolio'), drop_level=False, axis=1)

Question

Is there a better way, such as adding data directly to the MultiIndexed DataFrame at level 3, the reporting_ccy level returned by df_pivot1.columns.get_level_values(3).unique()?

I would like to be able to iterate through each level and add new measures either derived from other measures using df.assign() or other methods.

The use case here is to add other currencies to the measures where applicable. The concatenation and re-pivot as above does not seem optimal.
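
One possible way to derive a new measure from existing ones without re-pivoting is to select the measures with xs, compute on the aligned frames, and concatenate the result back as new columns. The sketch below is only an illustration under assumptions: amount_usd is a hypothetical measure, it assumes to_usd is the rate from the reporting currency to USD, and the frame is trimmed to two portfolios.

```python
import pandas as pd

levels = ['portfolio', 'measures', 'portfolio_ccy', 'reporting_ccy']

# Trimmed sample: two portfolios, two measures.
df = pd.DataFrame({
    'index_date': ['12/07/2016'] * 2,
    'portfolio': ['A', 'B'],
    'reporting_ccy': ['GBP'] * 2,
    'portfolio_ccy': ['JPY', 'USD'],
    'amount': [100, 200],
    'to_usd': [1.3167, 1.3167],
})
df_pivot = (df.pivot_table(index='index_date',
                           columns=['portfolio', 'portfolio_ccy', 'reporting_ccy'])
              .swaplevel(0, 1, axis=1)
              .sort_index(axis=1))
df_pivot.columns.names = levels

# xs drops the 'measures' level, so both frames share the same 3-level columns.
amount = df_pivot.xs('amount', level='measures', axis=1)
to_usd = df_pivot.xs('to_usd', level='measures', axis=1)

# Hypothetical derived measure: the amount expressed in USD.
amount_usd = amount * to_usd

# Put the 'measures' level back, then restore the original level order.
amount_usd = pd.concat({'amount_usd': amount_usd}, axis=1)
amount_usd.columns.names = ['measures', 'portfolio', 'portfolio_ccy', 'reporting_ccy']
amount_usd = amount_usd.reorder_levels(levels, axis=1)

# Attach the new columns next to the existing ones.
df_pivot = pd.concat([df_pivot, amount_usd], axis=1).sort_index(axis=1)
print(df_pivot.xs('A', level='portfolio', axis=1, drop_level=False))
```

Because the arithmetic runs on whole column blocks, the same pattern applies to every portfolio and currency combination at once rather than level by level.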


asked Jul 13 '16 13:07

ctrl-alt-delete

1 Answer

You can append df1 to df_pivot row by row instead of rebuilding the pivot from the two frames concatenated together.

Appending to the existing DataFrame should use less memory than concatenating and rebuilding the pivot from scratch every time new data is received.

import pandas as pd

df = pd.DataFrame({'index_date' : ['12/07/2016','12/07/2016','12/07/2016','12/07/2016','12/07/2016'], 
               'portfolio' : ['A','B','C','D','E'], 
               'reporting_ccy' : ['GBP','GBP','GBP','GBP','GBP'],
               'portfolio_ccy' : ['JPY','USD','USD','EUR','EUR'],
               'amount' : [100,200,300,400,500],
               'injection' : [1,2,3,4,5],
               'to_usd' : [1.3167,1.3167,1.3167,1.3167,1.3167],
               'to_ccy' : [0.009564,1,1,1.1093,1.1093],
               'm5' : [2,4,6,8,10],
               'm6' : [1,3,5,7,9]})

# %%
df_pivot = df.pivot_table(index='index_date',columns=['portfolio','portfolio_ccy','reporting_ccy']).swaplevel(0, 1, axis=1).sort_index(axis=1)
df1 = pd.DataFrame({'index_date' : ['12/07/2016','12/07/2016','12/07/2016','12/07/2016','12/07/2016'], 
           'portfolio' : ['A','B','C','D','E'], 
           'reporting_ccy' : ['JPY','USD','USD','EUR','EUR'],
           'portfolio_ccy' : ['JPY','USD','USD','EUR','EUR'],
           'amount' : [13767.2522, 263.34, 395.01, 474.785901, 593.4823763],
           'injection' : [1,2,3,4,5],
           'to_usd' : [0.009564, 1, 1, 1.1093, 1.1093],
           'to_ccy' : [1.3167, 1.3167, 1.3167, 1.3167, 1.3167],
           'm5' : [2,4,6,8,10],
           'm6' : [1,3,5,7,9]})

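df_pivot.columns.names = ['portfolio','measures', 'portfolio_ccy', 'reporting_ccy']
# Instead of concatenating the two frames, add df1 to df_pivot one row at a time.
# .loc replaces the removed .ix indexer; assigning to a full column tuple
# that does not exist yet creates that column.
for i in range(len(df1)):
    row = df1.iloc[i]
    for measure in 'amount injection m5 m6 to_ccy to_usd'.split():
        df_pivot.loc[row.index_date, (row.portfolio, measure, row.portfolio_ccy, row.reporting_ccy)] = row[measure]

# New columns are appended at the end, so restore the sorted column order.
df_pivot = df_pivot.sort_index(axis=1)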

#%% check the end result
print(df_pivot.xs(('amount', 'A'), 
      level=('measures','portfolio'), drop_level=False, axis=1))
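
If the new rows arrive as a whole frame rather than one at a time, another option is to pivot only the new frame and join it column-wise onto the existing pivot. This is a sketch rather than a tested replacement for the loop above: pivot_measures is a hypothetical helper, the frames are trimmed to two portfolios and two measures, and it assumes the new frame introduces no column tuple that already exists in df_pivot.

```python
import pandas as pd

LEVELS = ['portfolio', 'measures', 'portfolio_ccy', 'reporting_ccy']


def pivot_measures(frame):
    """Hypothetical helper wrapping the pivot used in this answer."""
    out = (frame.pivot_table(index='index_date',
                             columns=['portfolio', 'portfolio_ccy', 'reporting_ccy'])
                .swaplevel(0, 1, axis=1)
                .sort_index(axis=1))
    out.columns.names = LEVELS
    return out


# Trimmed versions of df and df1.
df = pd.DataFrame({
    'index_date': ['12/07/2016'] * 2,
    'portfolio': ['A', 'B'],
    'reporting_ccy': ['GBP', 'GBP'],
    'portfolio_ccy': ['JPY', 'USD'],
    'amount': [100, 200],
    'injection': [1, 2],
})
df1 = pd.DataFrame({
    'index_date': ['12/07/2016'] * 2,
    'portfolio': ['A', 'B'],
    'reporting_ccy': ['JPY', 'USD'],
    'portfolio_ccy': ['JPY', 'USD'],
    'amount': [13767.2522, 263.34],
    'injection': [1, 2],
})

df_pivot = pivot_measures(df)

# Pivot only the new data and attach it as extra columns.
df_pivot = pd.concat([df_pivot, pivot_measures(df1)], axis=1).sort_index(axis=1)

subset = df_pivot.xs(('amount', 'A'), level=('measures', 'portfolio'),
                     drop_level=False, axis=1)
print(subset)
```

The existing pivot is never rebuilt from raw rows; only the incoming frame is pivoted. If a column tuple could repeat across frames, concat would produce duplicate columns, so that case would need explicit handling.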


answered Nov 01 '22 21:11

PabTorre
