I'm having difficulty constructing a 3D DataFrame in Pandas. I want something like this:

    A            B            C
    start  end   start  end   start  end   ...
    7      20    42     52    90     101
    11     21                 213    34
    56     74                 9      45
    45     12
Where A, B, etc. are the top-level descriptors and start and end are subdescriptors. The numbers that follow come in pairs, and A, B, etc. do not all have the same number of pairs. Observe that A has four such pairs, B has only one, and C has three.
I'm not sure how to proceed in constructing this DataFrame. Modifying this example didn't give me the desired output:
    import numpy as np
    import pandas as pd

    A = np.array(['one', 'one', 'two', 'two', 'three', 'three'])
    B = np.array(['start', 'end']*3)
    C = [np.random.randint(10, 99, 6)]*6
    df = pd.DataFrame(zip(A, B, C), columns=['A', 'B', 'C'])
    df.set_index(['A', 'B'], inplace=True)
    df
yielded:
                                        C
    A     B
    one   start  [22, 19, 16, 20, 63, 54]
          end    [22, 19, 16, 20, 63, 54]
    two   start  [22, 19, 16, 20, 63, 54]
          end    [22, 19, 16, 20, 63, 54]
    three start  [22, 19, 16, 20, 63, 54]
          end    [22, 19, 16, 20, 63, 54]
Is there any way of breaking up the lists in C into their own columns?
EDIT: The structure of my C is important. It looks like the following:
C = [[7,11,56,45], [20,21,74,12], [42], [52], [90,213,9], [101, 34, 45]]
And the desired output is the one at the top. It represents the starting and ending points of subsequences within a certain sequence (A, B, C are the different sequences). Depending on the sequence itself, a differing number of subsequences satisfy the condition I'm looking for. As a result, there are a differing number of start:end pairs for A, B, etc.
First, I think you need to pad the sublists of C with NaN so that they all have the same length and the missing values are represented explicitly:
    In [341]: max_len = max(len(sublist) for sublist in C)

    In [344]: for sublist in C:
         ...:     sublist.extend([np.nan] * (max_len - len(sublist)))

    In [345]: C
    Out[345]:
    [[7, 11, 56, 45],
     [20, 21, 74, 12],
     [42, nan, nan, nan],
     [52, nan, nan, nan],
     [90, 213, 9, nan],
     [101, 34, 45, nan]]
Then, convert to a numpy array, transpose, and pass to the DataFrame constructor along with the columns.
    In [288]: C = np.array(C)

    In [289]: df = pd.DataFrame(data=C.T, columns=pd.MultiIndex.from_tuples(zip(A,B)))

    In [349]: df
    Out[349]:
        one        two       three
      start end start end start  end
    0     7  20    42  52    90  101
    1    11  21   NaN NaN   213   34
    2    56  74   NaN NaN     9   45
    3    45  12   NaN NaN   NaN  NaN
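For reference, the two steps above can also be sketched as one self-contained script. This is only an illustration under some assumptions that are not in the original answer: the top-level labels are taken to be 'A', 'B', 'C' as in the question's desired output, itertools.zip_longest is swapped in for the in-place extend loop so that C is left unmodified, and the names rows, columns and pairs_B are made up for the example.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

# Data from the question: top-level labels, sub-labels, and ragged value lists.
# The labels 'A', 'B', 'C' are an assumption based on the desired output.
A = ['A', 'A', 'B', 'B', 'C', 'C']
B = ['start', 'end'] * 3
C = [[7, 11, 56, 45], [20, 21, 74, 12], [42], [52], [90, 213, 9], [101, 34, 45]]

# zip_longest pads the shorter lists with NaN and transposes in one pass,
# so each tuple it yields is one row of the final frame. C is not modified.
rows = list(zip_longest(*C, fillvalue=np.nan))

# Two-level column index: sequence name on top, start/end underneath.
columns = pd.MultiIndex.from_arrays([A, B])
df = pd.DataFrame(rows, columns=columns)

# Recover the start/end pairs of one sequence by dropping the padding rows.
pairs_B = df['B'].dropna()

print(df)
print(pairs_B)
```

Selecting a top-level label (df['B']) returns an ordinary two-column frame with start and end, and dropna() removes the rows that exist only as padding, so each sequence can still be read back with its own number of pairs. Note that columns containing NaN are stored as floats.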