pandas: Replicate / Broadcast single indexed DataFrame on MultiIndex DataFrame: HowTo and Memory Efficiency

Problem

ML data preparation for stock trading. I have a 3-level MultiIndex on a large DataFrame (roughly n=800000 rows x f=20 columns). One index level is the date, with about dt=1000 distinct values; the other levels identify m=800 different stocks (with 20 features each, individual for each stock). So for each date, there are 800 x 20 different values.

Now I have dt=1000 x g=30 global time series (like DJIA, currency exchange rates etc.), so 30 values for each date that are the same for every stock. This is a single indexed DataFrame with only the date as an index.

Question 1

How do I merge these two datasets so that the 30 series are broadcast onto every stock to end up with shape (800000 x 50)?

Question 2

Is there a way to achieve this without replicating the data of the latter 30 columns, using a view on the original data to save memory? With the numbers I mentioned, I would still be at about 300 MB (800000 x 50 float64 values), which is still OK. But I'm curious.

Example

Here is a minimal example with f=2, g=1, m=4 and dt=3 of what I've got:

import pandas as pd

data = {
    'x': [5,6,7,3,4,5,1,1,0,12,15,14],
    'y': [4,6,5,5,4,3,2,0,1,13,14,13]
}

dates = [pd.to_datetime('2018-01-01'), pd.to_datetime('2018-01-02'), pd.to_datetime('2018-01-03')]

index = pd.MultiIndex.from_arrays([
    ['alpha'] * 6 + ['beta'] * 6,
    ['A'] * 3 + ['B'] * 3 + ['C'] * 3 + ['D'] * 3,
    dates * 4,
])
df1 = pd.DataFrame(data, index=index)

df1.index.names = ['level', 'name', 'date']


df2 = pd.DataFrame([123,124,125], index=dates, columns=['z'])
df2.index.name = "date"

print(df1)
print(df2)
-------------------------------
                        x   y
level name date              
alpha A    2018-01-01   5   4
           2018-01-02   6   6
           2018-01-03   7   5
      B    2018-01-01   3   5
           2018-01-02   4   4
           2018-01-03   5   3
beta  C    2018-01-01   1   2
           2018-01-02   1   0
           2018-01-03   0   1
      D    2018-01-01  12  13
           2018-01-02  15  14
           2018-01-03  14  13

              z
date           
2018-01-01  123
2018-01-02  124
2018-01-03  125

And what I would like to have:

                        x   y     z
level name date              
alpha A    2018-01-01   5   4   123
           2018-01-02   6   6   124
           2018-01-03   7   5   125
      B    2018-01-01   3   5   123
           2018-01-02   4   4   124
           2018-01-03   5   3   125
beta  C    2018-01-01   1   2   123
           2018-01-02   1   0   124
           2018-01-03   0   1   125
      D    2018-01-01  12  13   123
           2018-01-02  15  14   124
           2018-01-03  14  13   125
ascripter asked Feb 15 '18 12:02

1 Answer

I think you need join, which aligns on the index level named date that both DataFrames share:

df = df1.join(df2)
print(df)
                        x   y    z
level name date                   
alpha A    2018-01-01   5   4  123
           2018-01-02   6   6  124
           2018-01-03   7   5  125
      B    2018-01-01   3   5  123
           2018-01-02   4   4  124
           2018-01-03   5   3  125
beta  C    2018-01-01   1   2  123
           2018-01-02   1   0  124
           2018-01-03   0   1  125
      D    2018-01-01  12  13  123
           2018-01-02  15  14  124
           2018-01-03  14  13  125
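
For reference, here is a self-contained sketch of the same join on the example data from the question, with a manual lookup next to it to make the broadcast explicit. The variable names joined, dates_per_row, lookup and manual are made up for illustration. The memory_usage call at the end is only a way to inspect the footprint: as far as I can tell, join stores z once per row rather than as a view, so this sketch does not address Question 2.

```python
import pandas as pd

dates = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'])

# Example data from the question: 4 stocks x 3 dates, 2 features each
index = pd.MultiIndex.from_arrays(
    [
        ['alpha'] * 6 + ['beta'] * 6,
        ['A'] * 3 + ['B'] * 3 + ['C'] * 3 + ['D'] * 3,
        list(dates) * 4,
    ],
    names=['level', 'name', 'date'],
)
df1 = pd.DataFrame(
    {
        'x': [5, 6, 7, 3, 4, 5, 1, 1, 0, 12, 15, 14],
        'y': [4, 6, 5, 5, 4, 3, 2, 0, 1, 13, 14, 13],
    },
    index=index,
)

# One global series, indexed by date only
df2 = pd.DataFrame({'z': [123, 124, 125]}, index=dates.rename('date'))

# join matches df2's 'date' index against the 'date' level of df1
joined = df1.join(df2)

# Equivalent manual broadcast: look up every row's date in df2,
# then attach the result under df1's MultiIndex
dates_per_row = df1.index.get_level_values('date')
lookup = df2.reindex(dates_per_row)
lookup.index = df1.index
manual = pd.concat([df1, lookup], axis=1)

print(joined.shape)                       # (12, 3)
print(joined.memory_usage(index=False))   # bytes per column
```

Both joined and manual carry the z values repeated for every stock, so on the full data the 30 global columns would occupy the same memory as any other 30 columns.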
jezrael answered Oct 27 '22 21:10
