
Pandas - How to group and unstack on multiple variables?

I currently have a dataset that is structured as follows:

import pandas as pd

data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        'step_name': ['first', 'first', 'second', 'third', 'second', 'first', 'first', 'first', 'second', 'third'],
        'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'acceptable', 'acceptable'],
        'colour': ['blue', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'blue', 'blue', 'green'],
        'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
df = pd.DataFrame(data, columns=['participant', 'step_name', 'title', 'colour', 'class'])

which looks like:

+----+---------------+-------------+----------------+----------+---------+
|    |   participant | step_name   | title          | colour   | class   |
|----+---------------+-------------+----------------+----------+---------|
|  0 |           100 | first       | acceptable     | blue     | A       |
|  1 |           101 | first       | acceptable     | blue     | B       |
|  2 |           102 | second      | not acceptable | blue     | B       |
|  3 |           103 | third       | acceptable     | green    | A       |
|  4 |           104 | second      | not acceptable | green    | B       |
|  5 |           105 | first       | acceptable     | blue     | A       |
|  6 |           106 | first       | not acceptable | green    | A       |
|  7 |           107 | first       | acceptable     | blue     | A       |
|  8 |           108 | second      | acceptable     | blue     | A       |
|  9 |           109 | third       | acceptable     | green    | B       |
+----+---------------+-------------+----------------+----------+---------+

Now I want to aggregate the dataset so that each row holds counts of the repeated values. I've managed to do this along two variables (step_name and title) as follows:

count_df = df[['participant', 'step_name', 'title']].groupby(['step_name', 'title']).count()
count_df = count_df.unstack()
count_df.fillna(0, inplace=True)
count_df.columns = count_df.columns.get_level_values(1)
count_df

+--------+--------------+------------------+
|        |   acceptable |   not acceptable |
|--------+--------------+------------------|
| first  |            4 |                1 |
| second |            1 |                2 |
| third  |            2 |                0 |
+--------+--------------+------------------+

Now, though, I'd like an extra set of columns holding the values of the other variables (colour and class). Basically, I want to group and then unstack on those variables as well, but I'm not sure how to do it with more than two variables. Ultimately, I'd like my final table to look like this:

+-------+--------+--------+--------------+------------------+
| class | colour | step   |   acceptable |   not acceptable |
|-------+--------+--------+--------------+------------------|
| A     | blue   | first  |            3 |                0 |
| B     | blue   | first  |            1 |                0 |
| A     | green  | first  |            0 |                1 |
| B     | green  | first  |            0 |                0 |
| A     | blue   | second |            1 |                0 |
| B     | blue   | second |            0 |                1 |
| A     | green  | second |            0 |                0 |
| B     | green  | second |            0 |                1 |
| A     | blue   | third  |            0 |                0 |
| B     | blue   | third  |            0 |                0 |
| A     | green  | third  |            1 |                0 |
| B     | green  | third  |            1 |                0 |
+-------+--------+--------+--------------+------------------+

How do I reshape my data so that it looks like my final example? Do I still use the unstack and group functions?

asked May 09 '16 17:05 by orange1


1 Answer

I think you need pivot_table with aggfunc=len, reset_index and rename_axis (new in pandas 0.18.0):

df = df.pivot_table(index=['class','colour','step_name'],
                    columns='title',
                    aggfunc=len,
                    values='participant',
                    fill_value=0).reset_index().rename_axis(None, axis=1)
print(df)
      class colour step_name  acceptable  not acceptable
0         A   blue     first           3               0
1         A   blue    second           1               0
2         A  green     first           0               1
3         A  green     third           1               0
4         B   blue     first           1               0
5         B   blue    second           0               1
6         B  green    second           0               1
7         B  green     third           1               0
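
Note that this result has 8 rows, while the desired table in the question has 12: combinations that never occur in the data (for example class B, green, first) are left out rather than shown as all-zero rows. The following is a minimal sketch of one way to get those rows as well, and it also shows that the groupby/unstack route from the question still works with more than two variables. It rebuilds the sample frame under its own name (sample_df, an illustrative name) so it does not depend on df having been overwritten above; the use of MultiIndex.from_product plus reindex is an assumption about what the asker wants, not part of the original answer.

```python
import pandas as pd

# Same sample data as in the question, rebuilt so the sketch is self-contained.
data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        'step_name': ['first', 'first', 'second', 'third', 'second',
                      'first', 'first', 'first', 'second', 'third'],
        'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable',
                  'not acceptable', 'acceptable', 'not acceptable',
                  'acceptable', 'acceptable', 'acceptable'],
        'colour': ['blue', 'blue', 'blue', 'green', 'green',
                   'blue', 'green', 'blue', 'blue', 'green'],
        'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
sample_df = pd.DataFrame(data)

keys = ['class', 'colour', 'step_name']

# Count rows per (class, colour, step_name, title), then move title into columns.
counts = (sample_df.groupby(keys + ['title'])
                   .size()
                   .unstack('title', fill_value=0))

# Every possible (class, colour, step_name) combination, observed or not.
full_index = pd.MultiIndex.from_product(
    [sorted(sample_df[k].unique()) for k in keys], names=keys)

# Reindex onto the full grid; unobserved combinations become rows of zeros.
full_counts = (counts.reindex(full_index, fill_value=0)
                     .reset_index()
                     .rename_axis(None, axis=1))
print(full_counts)
```

The rows come out sorted by class, then colour, then step_name, so the order differs from the desired table in the question; a sort_values call on those columns can rearrange them if the order matters.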
answered Sep 30 '22 10:09 by jezrael