
python pandas beginner: multi-dimensional data-analysis workflow (groupby+agg+plot)

I'm new to pandas and am trying to learn how to process my multi-dimensional data.

My data

Let's assume my data is a big CSV with the columns ['A', 'B', 'C', 'D', 'E', 'F', 'G']. This data describes some simulation results, where ['A', 'B', ..., 'F'] are simulation parameters and 'G' is one of the outputs (the only output in this example!).

EDIT / UPDATE: As Boud suggested in the comments, let's generate some data that is compatible with mine:

import pandas as pd
import itertools
import numpy as np

npData = np.zeros(5000, dtype=[('A','i4'),('B','f4'),('C','i4'), ('D', 'i4'), ('E', 'f4'), ('F', 'i4'), ('G', 'f4')])

A = [0,1,2,3,6] # param A: int
B = [1000.0, 10.000] # param B: float
C = [100,150,200,250,300] # param C: int
D = [10,15,20,25,30] # param D: int
E = [0.1, 0.3] # param E: float
F = [0,1,2,3,4,5,6,7,8,9] # param F = random-seed = int -> 10 runs per scenario

# some beta-distribution parameters for randomizing the results in column "G"
aDistParams = [ (6,1),
                (5,2),
                (4,3),
                (3,4),
                (2,5),
                (1,6),
                (1,7) ]

counter = 0
for i in itertools.product(A,B,C,D,E,F):
    npData[counter]['A'] = i[0]
    npData[counter]['B'] = i[1]
    npData[counter]['C'] = i[2]
    npData[counter]['D'] = i[3]
    npData[counter]['E'] = i[4]
    npData[counter]['F'] = i[5]

    np.random.seed(i[5])  # call seed(); assigning to np.random.seed would overwrite the function
    npData[counter]['G'] = np.random.beta(a=aDistParams[i[0]][0], b=aDistParams[i[0]][1])
    counter += 1

data = pd.DataFrame(npData)
data = data.reindex(np.random.permutation(data.index)) # shuffle rows because my original data doesn't give any guarantees

Because the parameters ['A', 'B', ..., 'F'] are generated as a Cartesian product (meaning nested for-loops, a priori), I want to use groupby to obtain each possible 'simulation scenario' before analysing the output.

The parameter 'F' describes multiple runs for each scenario (each scenario is defined by 'A', 'B', ..., 'E'; let's assume that 'F' is the random seed), so my code becomes:

grouped = data.groupby(['A','B','C','D','E'])
# -> every group defines one simulation scenario

grouped_agg = grouped.agg({'G': np.mean})
# -> the mean of the simulation output in 'G' over 'F' is calculated for each group/scenario
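
As a minimal sketch of what this aggregation returns, here is the same pattern on a tiny hand-made frame. The name `toy` and all of its values are made up for illustration, and only two scenario parameters ('A', 'D') are used to keep the output short:

```python
import pandas as pd

# Hypothetical toy data: two scenario parameters (A, D), seed F, output G.
toy = pd.DataFrame({
    'A': [0, 0, 0, 0, 1, 1, 1, 1],
    'D': [10, 10, 15, 15, 10, 10, 15, 15],
    'F': [0, 1, 0, 1, 0, 1, 0, 1],
    'G': [0.2, 0.4, 0.6, 0.8, 1.0, 3.0, 5.0, 7.0],
})

# One group per scenario; the mean is taken over the seeds F.
toy_agg = toy.groupby(['A', 'D']).agg({'G': 'mean'})

# The result is indexed by a MultiIndex with one level per grouping column.
print(toy_agg.index.names)
print(toy_agg)
```

Each row of `toy_agg` then corresponds to one scenario, addressed by a tuple such as `(0, 10)`.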

What do I want to do now?

  • I: display all the (unique) values of each scenario parameter within these groups -> the index of grouped_agg gives me an iterable of tuples, where for example the entries at position 0 give me all the values for 'A' (so with a few lines of Python I would get my unique values, but maybe there is a function for that)

    • Update: my approach
    • list(set(grouped_agg.index.get_level_values('A'))) (when interested in 'A'; using set to obtain unique values; probably not what you want if you need high performance)
    • => [0, 1, 2, 3, 6]
  • II: generate some plots (of lower dimension) -> I need to hold some variables constant and filter/select my data before plotting (which is why step I is needed) =>

    • 'B' const
    • 'C' const
    • 'E' const
    • 'D' = x-axis
    • 'G' = y-axis / output from my aggregation
    • 'A' = one more dimension = multiple colors within 2d-plot -> one G/y-axis for each value of 'A'

    How would I generate a plot like that?

    I think that reshaping my data is the key step, and pandas' plotting capabilities will handle the rest. Maybe a shape with 5 columns (one for each value of parameter 'A') holding the corresponding 'G' values for each index selection is enough, but I wasn't able to achieve that form yet.

Thanks for your input!

(I'm using pandas 0.12 within Enthought Canopy)

Sascha

asked by sascha, Nov 08 '13 15:11


1 Answer

I: If I understand your example and desired output, I don't see why grouping is necessary.

data.A.unique()
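
A small sketch of both routes, reading the unique values from the raw column and from one level of an aggregated index. The frame `toy` and its values are hypothetical and only stand in for the simulation data:

```python
import pandas as pd

# Hypothetical toy data standing in for the simulation results.
toy = pd.DataFrame({
    'A': [3, 0, 6, 0, 3, 6],
    'D': [10, 10, 10, 15, 15, 15],
    'G': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# Straight from the column: unique() keeps the order of first appearance,
# so sort it to get a stable listing.
from_column = sorted(toy['A'].unique())

# From an aggregated frame: unique values of one index level.
toy_agg = toy.groupby(['A', 'D']).agg({'G': 'mean'})
from_index = list(toy_agg.index.get_level_values('A').unique())

print(from_column)
print(from_index)
```

Both lists contain the same values here; the index route is only needed if the un-aggregated frame is no longer at hand.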

II: Updated....

I will implement the example you sketch above. Assume that we have averaged 'G' over the random seed ('F') like so:

data = data.groupby(['A','B','C','D','E']).agg({'G': np.mean}).reset_index()

Start by selecting the rows where B, C, and E have some constant values that you specify.

df1 = data[(data['B'] == const1) & (data['C'] == const2) & (data['E'] == const3)]

Now we want to plot 'G' as a function of 'D', with a different color for every value of 'A'.

df1.set_index('D').groupby('A')['G'].plot(legend=True)

I tested the above on some dummy data, and it works as you describe. The 'G' values corresponding to each 'A' are plotted in a distinct color on the same axes.
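
An alternative is the reshaping the question describes: one column per value of 'A', indexed by 'D'. This is a rough sketch, not the approach tested above. The frame `agg` and its numbers are invented, and only 'B' is held constant to keep it short:

```python
import pandas as pd

# Hypothetical aggregated data: mean G per (A, B, D) scenario.
agg = pd.DataFrame({
    'A': [0, 0, 1, 1, 0, 0, 1, 1],
    'B': [10.0, 10.0, 10.0, 10.0, 1000.0, 1000.0, 1000.0, 1000.0],
    'D': [10, 15, 10, 15, 10, 15, 10, 15],
    'G': [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
})

# Hold B constant, then spread A across the columns with D as the index.
subset = agg[agg['B'] == 10.0]
wide = subset.pivot(index='D', columns='A', values='G')

print(wide)
# wide.plot() would then draw one line per column (requires matplotlib).
```

With the data in this wide shape, every column is one curve of 'G' over 'D', so the plotting call needs no further grouping.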

III: I don't know how to answer that broad question.

IV: No, I don't think that's an issue for you here.

I suggest playing with simpler, small data sets and getting more familiar with pandas.

answered by Dan Allan, Sep 27 '22 17:09