pandas: Replicate / Broadcast single indexed DataFrame on MultiIndex DataFrame: HowTo and Memory Efficiency

Problem

I am preparing data for machine learning on stock trading. I have a large DataFrame (roughly n=800000 rows x f=20 columns) with a 3-level MultiIndex. One index level is the date, with about dt=1000 distinct values; the other levels identify m=800 different stocks, each with its own 20 features. So for each date, there are 800 x 20 different values.

I also have g=30 global time series (DJIA, currency exchange rates, etc.) covering the same dt=1000 dates, so 30 values per date that are the same for every stock. They are stored in a single-indexed DataFrame with only the date as its index.

Question 1

How do I merge these two datasets so that the 30 series are broadcast onto every stock to end up with shape (800000 x 50)?

Question 2

Is there a way to achieve this without replicating the data of the 30 global columns, using a view on the original data to save memory? With the numbers I mentioned, the merged result is only about 300 MB at float64 precision, which is still OK, but I'm curious.
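As a rough check of the memory figures in Question 2, the arithmetic can be sketched as follows. This is only a back-of-the-envelope estimate assuming every column is float64 (8 bytes per value) and ignoring index overhead; the variable names are made up for illustration.

```python
# Sizes taken from the question: m=800 stocks x dt=1000 dates = 800000 rows.
n_rows = 800 * 1_000
f, g, dt = 20, 30, 1_000
BYTES_PER_FLOAT64 = 8

# Full merged frame: all 50 columns stored for every row.
merged_bytes = n_rows * (f + g) * BYTES_PER_FLOAT64

# Part of the merged frame that is pure replication of the global series.
replicated_bytes = n_rows * g * BYTES_PER_FLOAT64

# The global series stored once, without replication.
global_bytes = dt * g * BYTES_PER_FLOAT64

print(f"merged:     {merged_bytes / 1e6:.2f} MB")
print(f"replicated: {replicated_bytes / 1e6:.2f} MB")
print(f"global:     {global_bytes / 1e6:.2f} MB")
```

Under these assumptions the merged frame needs 320 MB, of which 192 MB is replicated data that occupies only 0.24 MB in its original form.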

Example

Here is a minimal example of f=2, g=1, m=4 and dt=3 of what I've got:

import pandas as pd

data = {
    'x': [5,6,7,3,4,5,1,1,0,12,15,14],
    'y': [4,6,5,5,4,3,2,0,1,13,14,13]
}

dates = [pd.to_datetime('2018-01-01'), pd.to_datetime('2018-01-02'), pd.to_datetime('2018-01-03')]

index = pd.MultiIndex.from_arrays([
    ['alpha'] * 6 + ['beta'] * 6,
    ['A'] * 3 + ['B'] * 3 + ['C'] * 3 + ['D'] * 3,
    dates * 4,
])
df1 = pd.DataFrame(data, index=index)

df1.index.names = ['level', 'name', 'date']


df2 = pd.DataFrame([123,124,125], index=dates, columns=['z'])
df2.index.name = "date"

print(df1)
print(df2)
-------------------------------
                        x   y
level name date              
alpha A    2018-01-01   5   4
           2018-01-02   6   6
           2018-01-03   7   5
      B    2018-01-01   3   5
           2018-01-02   4   4
           2018-01-03   5   3
beta  C    2018-01-01   1   2
           2018-01-02   1   0
           2018-01-03   0   1
      D    2018-01-01  12  13
           2018-01-02  15  14
           2018-01-03  14  13

              z
date           
2018-01-01  123
2018-01-02  124
2018-01-03  125

And this is what I would like to have:

                        x   y     z
level name date              
alpha A    2018-01-01   5   4   123
           2018-01-02   6   6   124
           2018-01-03   7   5   125
      B    2018-01-01   3   5   123
           2018-01-02   4   4   124
           2018-01-03   5   3   125
beta  C    2018-01-01   1   2   123
           2018-01-02   1   0   124
           2018-01-03   0   1   125
      D    2018-01-01  12  13   123
           2018-01-02  15  14   124
           2018-01-03  14  13   125

asked Feb 15 '18 12:02

ascripter

1 Answer

I think you need join, which aligns on the index level name date that is shared by both DataFrames:

df = df1.join(df2)
print(df)
                        x   y    z
level name date                   
alpha A    2018-01-01   5   4  123
           2018-01-02   6   6  124
           2018-01-03   7   5  125
      B    2018-01-01   3   5  123
           2018-01-02   4   4  124
           2018-01-03   5   3  125
beta  C    2018-01-01   1   2  123
           2018-01-02   1   0  124
           2018-01-03   0   1  125
      D    2018-01-01  12  13  123
           2018-01-02  15  14  124
           2018-01-03  14  13  125

answered Oct 27 '22 21:10

jezrael


Tags: python, pandas, sklearn-pandas
