What is a good design pattern to combine datasets that are related but stored in different dataframes?

Suppose we want to construct a stock portfolio. To decide which stocks to include in the portfolio and what weight to assign to each, we use different metrics, e.g., price, earnings per share (EPS), dividend yield, etc. All these metrics are stored in individual pandas DataFrames, where rows specify a point in time and columns are associated with a specific stock (e.g., IBM, MSFT, ...):

import pandas as pd

price = pd.DataFrame([[-1.332298,  0.396217,  0.574269, -0.679972, -0.470584,  0.234379],
                      [-0.222567,  0.281202, -0.505856, -1.392477,  0.941539,  0.974867],
                      [-1.139867, -0.458111, -0.999498,  1.920840,  0.478174, -0.315904],
                      [-0.189720, -0.542432, -0.471642,  1.506206, -1.506439,  0.301714]],
                     columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'], 
                     index=pd.date_range('2000', freq='D', periods=4))

eps = pd.DataFrame([[-1.91,  1.63,  0.51, -.32, -0.84,  0.37],
                      [-0.56,  0.02, 0.56, 1.77,  0.99,  0.97],
                      [-1.67, -0.41, -0.98,  1.20,  0.74, -0.04],
                      [-0.80, -0.43, -0.12,  1.06, 1.59,  0.34]],
                     columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'], 
                     index=pd.date_range('2000', freq='D', periods=4))


price

    IBM MSFT    APPL    ORCL    FB  TWTR
2000-01-01  -1.332298   0.396217    0.574269    -0.679972   -0.470584   0.234379
2000-01-02  -0.222567   0.281202    -0.505856   -1.392477   0.941539    0.974867
2000-01-03  -1.139867   -0.458111   -0.999498   1.920840    0.478174    -0.315904
2000-01-04  -0.189720   -0.542432   -0.471642   1.506206    -1.506439   0.301714


eps

    IBM MSFT    APPL    ORCL    FB  TWTR
2000-01-01  -1.91   1.63    0.51    -0.32   -0.84   0.37
2000-01-02  -0.56   0.02    0.56    1.77    0.99    0.97
2000-01-03  -1.67   -0.41   -0.98   1.20    0.74    -0.04
2000-01-04  -0.80   -0.43   -0.12   1.06    1.59    0.34

The different dataframes are obviously closely connected. However, they are all stored in separate variables. In a large application, it can become difficult to keep track of which variables belong together and form a coherent unit. What is a good design paradigm for organizing related datasets like these?

Using an object-oriented design pattern, I would construct something like a StockList() object that stores individual Stock() objects, which in turn store the information (time series) that correspond to a specific stock.

class Stock:
    def __init__(self, price_series, eps_series, div_yield_series):
        self.price = price_series
        self.eps = eps_series
        self.div_yield = div_yield_series

class StockList:
    def __init__(self, stock_list):
        self.stock_list = stock_list

    def append(self, stock):
        self.stock_list.append(stock)
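
For illustration, this is how the classes would be used — a minimal sketch trimmed to two metrics and two tickers, with the frames sliced apart column by column:

```python
import pandas as pd

# Trimmed versions of the classes above (two metrics only, for brevity)
class Stock:
    def __init__(self, price_series, eps_series):
        self.price = price_series
        self.eps = eps_series

class StockList:
    def __init__(self, stocks):
        self.stocks = stocks  # here: a dict keyed by ticker (a design choice, not fixed)

# Minimal two-ticker data standing in for the full frames above
idx = pd.date_range('2000', freq='D', periods=2)
price = pd.DataFrame({'IBM': [-1.33, -0.22], 'MSFT': [0.40, 0.28]}, index=idx)
eps = pd.DataFrame({'IBM': [-1.91, -0.56], 'MSFT': [1.63, 0.02]}, index=idx)

# Slicing each frame by column is what "takes the time series apart"
stocks = StockList({t: Stock(price[t], eps[t]) for t in price.columns})
print(stocks.stocks['IBM'].price.iloc[0])  # -1.33
```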


But is this a viable option when working with dataframes? I suspect that taking the time series apart and merging them back together when queried leads to a considerable loss in performance and a superfluous set of operations.

Alternatively, the StockList() could store the dataframes directly, without constructing single Stock() objects (serving more or less as a data structure). However, is this an appropriate compromise?

I generally wonder whether a separate object should be created at all, or whether these individual dataframes should just be left as separate variables. The latter would most likely increase performance, reduce memory usage, support parallel computing, and foster a functional programming style.

But how can we then bundle data that belongs together?
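
For example, the simplest bundle I can think of is a plain dict keyed by metric name, which keeps each DataFrame intact so vectorized operations stay cheap (a sketch; the key names are arbitrary):

```python
import pandas as pd

# Minimal two-ticker data standing in for the full frames above
idx = pd.date_range('2000', freq='D', periods=2)
price = pd.DataFrame({'IBM': [-1.33, -0.22], 'MSFT': [0.40, 0.28]}, index=idx)
eps = pd.DataFrame({'IBM': [-1.91, -0.56], 'MSFT': [1.63, 0.02]}, index=idx)

# One namespace per coherent unit of related frames
metrics = {'price': price, 'eps': eps}

# The frames stay whole, so cross-metric math remains a single vectorized op
pe_ratio = metrics['price'] / metrics['eps']
print(pe_ratio.loc['2000-01-01', 'IBM'])
```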

asked Jul 30 '20 by quantguy



1 Answer

This example has several related measures per stock, so I would store them in a single pandas Series with a 3-level MultiIndex:

  • metric (eps, price)
  • date (2000-01-01, 2000-01-02, ...)
  • ticker ('APPL', 'FB', ...)

First, create the eps and price data frames as per the original post:

import pandas as pd

price = pd.DataFrame([[-1.332298,  0.396217,  0.574269, -0.679972, -0.470584,  0.234379],
                      [-0.222567,  0.281202, -0.505856, -1.392477,  0.941539,  0.974867],
                      [-1.139867, -0.458111, -0.999498,  1.920840,  0.478174, -0.315904],
                      [-0.189720, -0.542432, -0.471642,  1.506206, -1.506439,  0.301714]],
                     columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'], 
                     index=pd.date_range('2000', freq='D', periods=4))

eps = pd.DataFrame([[-1.91,  1.63,  0.51, -.32, -0.84,  0.37],
                      [-0.56,  0.02, 0.56, 1.77,  0.99,  0.97],
                      [-1.67, -0.41, -0.98,  1.20,  0.74, -0.04],
                      [-0.80, -0.43, -0.12,  1.06, 1.59,  0.34]],
                     columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'], 
                     index=pd.date_range('2000', freq='D', periods=4))

Second, combine these to create the new stock table (with multi-index):

# re-shape `eps` data frame
eps.index.name = 'date'
eps.columns.name = 'ticker'
eps = (eps.assign(metric='eps')
       .set_index('metric', append=True)
       .stack()
       .swaplevel('metric', 'date')
       .sort_index()
      )

# re-shape `price` data frame
price.index.name = 'date'
price.columns.name = 'ticker'
price = (price.assign(metric='price')
         .set_index('metric', append=True)
         .stack()
         .swaplevel('metric', 'date')
         .sort_index())

# you could put, say, `volume` data frame here...

# concatenate
stock_data = pd.concat([eps, price]).rename('value')

# display
print(stock_data.head(8))

metric  date        ticker
eps     2000-01-01  APPL      0.51
                    FB       -0.84
                    IBM      -1.91
                    MSFT      1.63
                    ORCL     -0.32
                    TWTR      0.37
        2000-01-02  APPL      0.56
                    FB        0.99
Name: value, dtype: float64
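
As a side note, the same long Series can also be built in fewer steps by passing a dict to pd.concat, which adds the metric level directly — an alternative sketch on miniature data, not the approach above:

```python
import pandas as pd

# Minimal two-ticker versions of the original frames
idx = pd.date_range('2000', freq='D', periods=2)
price = pd.DataFrame({'IBM': [-1.33, -0.22], 'MSFT': [0.40, 0.28]}, index=idx)
eps = pd.DataFrame({'IBM': [-1.91, -0.56], 'MSFT': [1.63, 0.02]}, index=idx)

# concat with dict keys prepends a 'metric' index level in one step
combined = pd.concat({'eps': eps, 'price': price})
combined.index.names = ['metric', 'date']
combined.columns.name = 'ticker'

# stack the tickers to get the same (metric, date, ticker) long Series
long = combined.stack().sort_index().rename('value')
print(long.loc[('eps', pd.Timestamp('2000-01-01'), 'IBM')])  # -1.91
```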

The pandas MultiIndex is powerful but can be unintuitive for DataFrames; it is more straightforward for Series, where every selection is expressed with .loc[]. We can then use .unstack() to re-shape for further downstream processing (e.g., create a DataFrame with dates on the rows and tickers on the columns, and create plots with Matplotlib).

# index level 0, scalar
t0 = stock_data.loc['eps']

# index level 1, range
t1 = stock_data.loc[:, '2000-01-02':'2000-01-03']

# index level 2, list
t2 = stock_data.loc[:, :, ['APPL', 'MSFT', 'TWTR']]
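
To illustrate the .unstack() step, here is a small synthetic stock_data with placeholder values (the level names mirror the ones above):

```python
import pandas as pd

# Tiny stand-in for `stock_data` (placeholder values, not real prices)
idx = pd.MultiIndex.from_product(
    [['eps', 'price'],
     pd.date_range('2000', freq='D', periods=2),
     ['APPL', 'IBM']],
    names=['metric', 'date', 'ticker'])
stock_data = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
                       index=idx, name='value')

# Dates on the rows, tickers on the columns, for one metric
wide = stock_data.loc['price'].unstack('ticker')

# Or pivot the metric level out to compute across metrics
by_metric = stock_data.unstack('metric')    # columns: eps, price
pe = by_metric['price'] / by_metric['eps']  # one P/E value per (date, ticker)
print(pe.loc[(pd.Timestamp('2000-01-01'), 'APPL')])  # 10.0 / 1.0
```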
answered Sep 28 '22 by jsmart