I want to be able to create n-dimensional dataframes. I've heard of a method for 3D dataframes using panels in pandas but, if possible, I would like to go past 3 dimensions by combining different datasets into a super dataframe.
I tried this but I cannot figure out how to use these methods with my test dataset -> Constructing 3D Pandas DataFrame
Also, this did not help for my case -> Pandas Dataframe or Panel to 3d numpy array
I made a random test dataset with arbitrary axis data, trying to mimic a real situation; there are 3 axes (i.e. patients, years, and samples). I tried adding a bunch of dataframes to a list and then making a dataframe from that, but it didn't work :( I even tried a panel as in the 2nd link above, but I couldn't get that to work either.
Does anybody know how to create an N-dimensional pandas dataframe w/ labels?
The first way I tried:
import numpy as np
import pandas as pd
#Reproducibility
np.random.seed(1618033)
#Set 3 axis labels/dims
axis_1 = np.arange(2000,2010) #Years
axis_2 = np.arange(0,20) #Samples
axis_3 = np.array(["patient_%d" % i for i in range(0,3)]) #Patients
#Create random 3D array to simulate data from dims above
A_3D = np.random.random((axis_1.size, axis_2.size, axis_3.size)) #(10, 20, 3)
#Create empty list to store 2D dataframes (axis_2=rows, axis_3=columns) along axis_1
list_of_dataframes=[]
#Iterate through all of the year indices
for i in range(axis_1.size):
    #Create dataframe of (samples, patients)
    DF_slice = pd.DataFrame(A_3D[i,:,:],index=axis_2,columns=axis_3)
    list_of_dataframes.append(DF_slice)
# print(DF_slice) #preview of the 2D dataframes "slice" of the 3D array
# patient_0 patient_1 patient_2
# 0 0.727753 0.154701 0.205916
# 1 0.796355 0.597207 0.897153
# 2 0.603955 0.469707 0.580368
# 3 0.365432 0.852758 0.293725
# 4 0.906906 0.355509 0.994513
# 5 0.576911 0.336848 0.265967
# ...
# 19 0.583495 0.400417 0.020099
# DF_3D = pd.DataFrame(list_of_dataframes,index=axis_2, columns=axis_1)
# Error
# Shape of passed values is (1, 10), indices imply (10, 20)
The second way I tried:
DF = pd.DataFrame(axis_3,columns=axis_2)
#Error:
#Shape of passed values is (1, 3), indices imply (20, 3)
# p={}
# for i in axis_1:
# p[i]=DF
# panel= pd.Panel(p)
I could do something like this I guess, but I really like pandas and would rather use one of its methods if one exists:
#Set data for query
query_year = 2007
query_sample = 15
query_patient = "patient_1"
#Index based on query
A_3D[
(axis_1 == query_year).argmax(),
(axis_2 == query_sample).argmax(),
(axis_3 == query_patient).argmax()
]
#0.1231212416981845
It would be awesome to access the data in this way:
DF_3D[query_year][query_sample][query_patient]
#Where DF_3D[query_year] would give a list of 2D arrays (row=sample, col=patient)
# DF_3D[query_year][query_sample] would give a 1D vector/list of patient data for a particular year, of a particular sample.
# and DF_3D[query_year][query_sample][query_patient] would be a particular sample of a particular patient of a particular year
Rather than using an n-dimensional Panel, you are probably better off using a two-dimensional representation of the data, with a MultiIndex for the index, the columns, or both.
For example:
import numpy as np
import pandas as pd
np.random.seed(1618033)
#Set 3 axis labels/dims
years = np.arange(2000,2010) #Years
samples = np.arange(0,20) #Samples
patients = np.array(["patient_%d" % i for i in range(0,3)]) #Patients
#Create random 3D array to simulate data from dims above
A_3D = np.random.random((years.size, samples.size, len(patients))) #(10, 20, 3)
# Create the MultiIndex from years, samples and patients.
midx = pd.MultiIndex.from_product([years, samples, patients])
# Create sample data for each patient, and add the MultiIndex.
patient_data = pd.DataFrame(np.random.randn(len(midx), 3), index = midx)
>>> patient_data.head()
                         0         1         2
2000 0 patient_0 -0.128005  0.371413 -0.078591
       patient_1 -0.378728 -2.003226 -0.024424
       patient_2  1.339083  0.408708  1.724094
     1 patient_0 -0.997879 -0.251789 -0.976275
       patient_1  0.131380 -0.901092  1.456144
Once you have data in this form, it is relatively easy to juggle it around. For example:
>>> patient_data.unstack(level=0).head() # Years.
0 ... 2
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ... 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
0 patient_0 -0.128005 0.051558 1.251120 0.666061 -1.048103 0.259231 1.535370 0.156281 -0.609149 0.360219 ... -0.078591 -2.305314 -2.253770 0.865997 0.458720 1.479144 -0.214834 -0.791904 0.800452 0.235016
patient_1 -0.378728 -0.117470 -0.306892 0.810256 2.702960 -0.748132 -1.449984 -0.195038 1.151445 0.301487 ... -0.024424 0.114843 0.143700 1.732072 0.602326 1.465946 -1.215020 0.648420 0.844932 -1.261558
patient_2 1.339083 -0.915771 0.246077 0.820608 -0.935617 -0.449514 -1.105256 -0.051772 -0.671971 0.213349 ... 1.724094 0.835418 0.000819 1.149556 -0.318513 -0.450519 -0.694412 -1.535343 1.035295 0.627757
1 patient_0 -0.997879 -0.242597 1.028464 2.093807 1.380361 0.691210 -2.420800 1.593001 0.925579 0.540447 ... -0.976275 1.928454 -0.626332 -0.049824 -0.912860 0.225834 0.277991 0.326982 -0.520260 0.788685
patient_1 0.131380 0.398155 -1.671873 -1.329554 -0.298208 -0.525148 0.897745 -0.125233 -0.450068 -0.688240 ... 1.456144 -0.503815 -1.329334 0.475751 -0.201466 0.604806 -0.640869 -1.381123 0.524899 0.041983
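The example above fills the frame with fresh random numbers rather than the values in A_3D. As a rough sketch of how the existing 3D array could be moved into the same MultiIndex layout, the snippet below flattens A_3D into a Series and then unstacks one level. The level names ("year", "sample", "patient") and the variable names s, wide and back are my own choices for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

np.random.seed(1618033)
years = np.arange(2000, 2010)
samples = np.arange(0, 20)
patients = np.array(["patient_%d" % i for i in range(3)])
A_3D = np.random.random((years.size, samples.size, patients.size))

# Level names are illustrative assumptions, not from the original answer.
midx = pd.MultiIndex.from_product(
    [years, samples, patients], names=["year", "sample", "patient"]
)

# from_product varies the last level fastest, which matches the order
# produced by a C-order ravel() of an array shaped (year, sample, patient).
s = pd.Series(A_3D.ravel(), index=midx, name="value")

# Move the patient level into the columns: rows become (year, sample) pairs.
wide = s.unstack(level="patient")
print(wide.shape)  # (200, 3)

# Going back to a plain 3D array is a reshape, because the labels are
# already sorted in the same order as the original axes.
back = wide.to_numpy().reshape(years.size, samples.size, patients.size)
print(np.array_equal(back, A_3D))  # True
```

The reshape at the end only lines up because every label combination is present and already sorted; with missing or unsorted labels you would need to reindex first.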
In order to select the data, please refer to the docs for MultiIndexing.
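As a minimal sketch of what that selection can look like for the year/sample/patient query in the question, assuming the 3D array has been flattened into a Series with a three-level MultiIndex (the level names and variable names here are my own, chosen for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(1618033)
years = np.arange(2000, 2010)
samples = np.arange(0, 20)
patients = ["patient_%d" % i for i in range(3)]
A_3D = np.random.random((len(years), len(samples), len(patients)))

# Level names are illustrative assumptions.
midx = pd.MultiIndex.from_product(
    [years, samples, patients], names=["year", "sample", "patient"]
)
s = pd.Series(A_3D.ravel(), index=midx)

# Partial keys drop the matched levels, one "dimension" at a time.
year_slice = s.loc[2007]                 # indexed by (sample, patient)
sample_slice = s.loc[(2007, 15)]         # indexed by patient
value = s.loc[(2007, 15, "patient_1")]   # a single scalar

# The same lookup written as chained label-based steps.
chained = s.loc[2007].loc[15].loc["patient_1"]

# xs() takes a cross-section on a level that is not the leading one.
one_patient = s.xs("patient_1", level="patient")  # indexed by (year, sample)

print(year_slice.shape, sample_slice.shape, one_patient.shape)
```

This gets close to the DF_3D[query_year][query_sample][query_patient] access pattern asked for, with .loc making it explicit that the keys are labels rather than positions.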
You should consider using xarray instead. From their documentation:
Panel, pandas’ data structure for 3D arrays, was always a second class data structure compared to the Series and DataFrame. To allow pandas developers to focus more on its core functionality built around the DataFrame, pandas removed Panel in favor of directing users who use multi-dimensional arrays to xarray.
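For the test data in the question, a labeled 3D array in xarray might look roughly like the sketch below. It assumes xarray is installed; the dimension names ("year", "sample", "patient") and variable names are my own, not taken from the question.

```python
import numpy as np
import xarray as xr

np.random.seed(1618033)
years = np.arange(2000, 2010)
samples = np.arange(0, 20)
patients = ["patient_%d" % i for i in range(3)]
A_3D = np.random.random((len(years), len(samples), len(patients)))

# Dimension names are illustrative assumptions.
da = xr.DataArray(
    A_3D,
    dims=("year", "sample", "patient"),
    coords={"year": years, "sample": samples, "patient": patients},
)

# Label-based lookup of one value, equivalent to the argmax() query.
value = da.sel(year=2007, sample=15, patient="patient_1").item()

# Selecting on one dimension leaves a labeled 2D (sample, patient) array.
year_slice = da.sel(year=2007)

# Convert to a pandas Series with a three-level MultiIndex when needed.
s = da.to_series()

print(year_slice.dims, year_slice.shape)
```

The same constructor accepts arrays with more than three dimensions, so adding a fourth axis is a matter of adding another entry to dims and coords.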