Problem
I am preparing data for machine learning on stock trading. I have a large DataFrame (roughly n=800000 rows x f=20 columns) with a 3-level MultiIndex. One index level is date, with about dt=1000 distinct values; the other two identify m=800 different stocks, each with its own 20 features. So for each date there are 800 x 20 different values.
I also have g=30 global time series of length dt=1000 (such as the DJIA, currency exchange rates, etc.), so there are 30 values for each date that are the same for every stock. These are stored in a single-indexed DataFrame with only the date as its index.
Question 1
How do I merge these two datasets so that the 30 series are broadcast onto every stock, ending up with shape (800000 x 50)?
Question 2
Is there a way to achieve this without replicating the data of the latter 30 columns, using a view on the original data instead, to save memory? With the numbers I mentioned, I would end up at roughly 300 MB at float64 precision, which is still OK. But I'm curious.
Example
Here is a minimal example with f=2, g=1, m=4 and dt=3 of what I've got:
import pandas as pd
data = {
    'x': [5, 6, 7, 3, 4, 5, 1, 1, 0, 12, 15, 14],
    'y': [4, 6, 5, 5, 4, 3, 2, 0, 1, 13, 14, 13],
}
dates = [pd.to_datetime('2018-01-01'), pd.to_datetime('2018-01-02'), pd.to_datetime('2018-01-03')]
index = pd.MultiIndex.from_arrays([
    ['alpha'] * 6 + ['beta'] * 6,
    ['A'] * 3 + ['B'] * 3 + ['C'] * 3 + ['D'] * 3,
    dates * 4,
])
df1 = pd.DataFrame(data, index=index)
df1.index.names = ['level', 'name', 'date']
df2 = pd.DataFrame([123, 124, 125], index=dates, columns=['z'])
df2.index.name = 'date'
print(df1)
print(df2)
-------------------------------
                        x   y
level name date
alpha A    2018-01-01   5   4
           2018-01-02   6   6
           2018-01-03   7   5
      B    2018-01-01   3   5
           2018-01-02   4   4
           2018-01-03   5   3
beta  C    2018-01-01   1   2
           2018-01-02   1   0
           2018-01-03   0   1
      D    2018-01-01  12  13
           2018-01-02  15  14
           2018-01-03  14  13
              z
date
2018-01-01  123
2018-01-02  124
2018-01-03  125
And this is what I would like to have:
                        x   y    z
level name date
alpha A    2018-01-01   5   4  123
           2018-01-02   6   6  124
           2018-01-03   7   5  125
      B    2018-01-01   3   5  123
           2018-01-02   4   4  124
           2018-01-03   5   3  125
beta  C    2018-01-01   1   2  123
           2018-01-02   1   0  124
           2018-01-03   0   1  125
      D    2018-01-01  12  13  123
           2018-01-02  15  14  124
           2018-01-03  14  13  125
A multi-level index is created by calling the pandas set_index() method with a list of column names. The DataFrame then has hierarchical indexing, also called multi-indexing. To revert from a MultiIndex to a single default index, use the pandas method reset_index().
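As a small illustration of the round trip described above, here is a minimal sketch. The frame `flat` and its column values are made up for the example and are not part of the question's data.

```python
import pandas as pd

# Hypothetical flat table: one row per (stock name, date) pair.
flat = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2018-01-01', '2018-01-02'] * 2),
    'x': [5, 6, 3, 4],
})

# Passing a list of column names builds a two-level MultiIndex.
indexed = flat.set_index(['name', 'date'])

# reset_index() moves the index levels back into ordinary columns.
restored = indexed.reset_index()

print(indexed.index.names)
print(list(restored.columns))
```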
When it comes to selecting data from a DataFrame, loc is one of the most commonly used tools. Accessing data in a MultiIndex DataFrame works much as it does for a single-index DataFrame, and : can still be used to return all data along an axis.
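A short sketch of loc on a two-level index follows. The frame is a hypothetical, cut-down version of the question's data, and `pd.Timestamp` is used for the date labels to keep the lookups unambiguous.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['A', 'B'], pd.to_datetime(['2018-01-01', '2018-01-02'])],
    names=['name', 'date'],
)
df = pd.DataFrame({'x': [5, 6, 3, 4]}, index=index)

# A label from the outer level selects every row below it.
all_a = df.loc['A']

# A full tuple of labels plus a column name selects a single value.
one_value = df.loc[('B', pd.Timestamp('2018-01-02')), 'x']

# slice(None) plays the role of ":" for a level inside a tuple key.
per_date = df.loc[(slice(None), pd.Timestamp('2018-01-01')), :]

print(all_a)
print(one_value)
print(per_date)
```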
Multi-level columns are used when you want to group columns together. A MultiIndex DataFrame is a DataFrame with multiple levels of (hierarchical) indexing. A MultiIndex can be created in several ways.
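The source does not list the ways it has in mind, so the following is only a sketch of three constructors that pandas provides (`from_arrays`, `from_tuples`, `from_product`); the level names `name` and `step` are invented for the example.

```python
import pandas as pd

# One array per level, all of the same length.
from_arrays = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], [1, 2, 1, 2]], names=['name', 'step'])

# One tuple per row.
from_tuples = pd.MultiIndex.from_tuples(
    [('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['name', 'step'])

# Cartesian product of the level values.
from_product = pd.MultiIndex.from_product(
    [['A', 'B'], [1, 2]], names=['name', 'step'])

# All three describe the same index here.
print(from_arrays.equals(from_tuples), from_tuples.equals(from_product))
```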
Using IndexSlice: pandas IndexSlice allows a more natural slicing syntax. With a MultiIndex you can do fairly sophisticated analysis, especially on higher-dimensional data. Accessing data is the first step when working with a MultiIndex DataFrame.
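As a sketch of the IndexSlice syntax, the example below rebuilds the `x` column of the question's `df1` and selects one date across all stocks. Sorting the index first is a precaution, since slicing across levels expects a lexsorted index.

```python
import pandas as pd

dates = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'])
index = pd.MultiIndex.from_arrays(
    [
        ['alpha'] * 6 + ['beta'] * 6,
        ['A'] * 3 + ['B'] * 3 + ['C'] * 3 + ['D'] * 3,
        list(dates) * 4,
    ],
    names=['level', 'name', 'date'],
)
df = pd.DataFrame(
    {'x': [5, 6, 7, 3, 4, 5, 1, 1, 0, 12, 15, 14]}, index=index
).sort_index()

idx = pd.IndexSlice

# Every stock on a single date: ":" for the first two levels.
on_date = df.loc[idx[:, :, pd.Timestamp('2018-01-02')], :]

# Every row under the 'alpha' level, column x only.
alpha_x = df.loc[idx['alpha', :, :], 'x']

print(on_date)
print(alpha_x)
```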
I think you need join, which aligns on the index level name date that both DataFrames share:
df = df1.join(df2)
print(df)
                        x   y    z
level name date
alpha A    2018-01-01   5   4  123
           2018-01-02   6   6  124
           2018-01-03   7   5  125
      B    2018-01-01   3   5  123
           2018-01-02   4   4  124
           2018-01-03   5   3  125
beta  C    2018-01-01   1   2  123
           2018-01-02   1   0  124
           2018-01-03   0   1  125
      D    2018-01-01  12  13  123
           2018-01-02  15  14  124
           2018-01-03  14  13  125
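Regarding Question 2: the joined frame stores its own copy of z for every row, so the global columns are replicated rather than viewed. One possible workaround, sketched here as an assumption rather than a built-in pandas feature, is to keep df2 separate, compute each row's integer position in df2 once, and gather the global values only for the rows currently needed. The helper global_rows is hypothetical.

```python
import pandas as pd

dates = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'])
index = pd.MultiIndex.from_arrays(
    [
        ['alpha'] * 6 + ['beta'] * 6,
        ['A'] * 3 + ['B'] * 3 + ['C'] * 3 + ['D'] * 3,
        list(dates) * 4,
    ],
    names=['level', 'name', 'date'],
)
df1 = pd.DataFrame(
    {'x': [5, 6, 7, 3, 4, 5, 1, 1, 0, 12, 15, 14],
     'y': [4, 6, 5, 5, 4, 3, 2, 0, 1, 13, 14, 13]},
    index=index,
)
df2 = pd.DataFrame({'z': [123, 124, 125]}, index=dates.rename('date'))

# Question 1: join on the shared 'date' level (materializes z per row).
joined = df1.join(df2)

# Question 2 sketch: one integer per df1 row instead of g replicated floats.
date_pos = df2.index.get_indexer(df1.index.get_level_values('date'))
global_values = df2.to_numpy()


def global_rows(row_slice):
    """Hypothetical helper: gather global values for a slice of df1 rows."""
    return global_values[date_pos[row_slice]]


first_stock = global_rows(slice(0, 3))
print(joined.shape)
print(first_stock.ravel())
```

The gather still copies, but only for the requested batch, so peak memory depends on the batch size rather than on all 800000 rows.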