Pivot a two-column dataframe

Tags: python, pandas

Question

I have a dataframe untidy

  attribute value
0       age    49
1       sex     M
2    height   176
3       age    27
4       sex     F
5    height   172

where the values in the 'attribute' column repeat periodically. The desired output is tidy

  age sex height
0  49   M    176
1  27   F    172

(The row and column order or additional labels don't matter, I can clean this up myself.)

Code for instantiation:

import pandas as pd

untidy = pd.DataFrame([['age', 49],['sex', 'M'],['height', 176],['age', 27],['sex', 'F'],['height', 172]], columns=['attribute', 'value'])
tidy = pd.DataFrame([[49, 'M', 176], [27, 'F', 172]], columns=['age', 'sex', 'height'])

Attempts

This looks like a simple pivot operation, but my initial approach introduces NaN values:

>>> untidy.pivot(columns='attribute', values='value')
attribute  age height  sex
0           49    NaN  NaN
1          NaN    NaN    M
2          NaN    176  NaN
3           27    NaN  NaN
4          NaN    NaN    F
5          NaN    172  NaN

Some messy attempts to fix this:

>>> untidy.pivot(columns='attribute', values='value').apply(lambda c: c.dropna().reset_index(drop=True))
attribute age height sex
0          49    176   M
1          27    172   F


>>> untidy.set_index([untidy.index//untidy['attribute'].nunique(), 'attribute']).unstack('attribute')
          value           
attribute   age height sex
0            49    176   M
1            27    172   F

What's the idiomatic way to do this?

actual_panda asked Jan 23 '19 08:01


2 Answers

Use DataFrame.pivot with GroupBy.cumcount to build the new index values, and rename_axis to remove the axis names:

# 'row' is a temporary helper column holding the record number of each value
df = (untidy.assign(row=untidy.groupby('attribute').cumcount())
            .pivot(index='row', columns='attribute', values='value')
            .rename_axis(index=None, columns=None))
print(df)
  age height sex
0  49    176   M
1  27    172   F
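
To see why this works, it may help to look at the helper on its own. `groupby('attribute').cumcount()` numbers each occurrence of an attribute, so the n-th 'age', 'sex' and 'height' rows all share the label n. A minimal sketch (the variable name `record` is only for illustration):

```python
import pandas as pd

untidy = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['age', 27], ['sex', 'F'], ['height', 172]],
    columns=['attribute', 'value'])

# Number each occurrence of an attribute: first 'age' -> 0, second 'age' -> 1, ...
record = untidy.groupby('attribute').cumcount()
print(record.tolist())  # [0, 0, 0, 1, 1, 1]
```

Without such a label, pivot falls back on the original row index 0..5, which is why every value ended up on its own row in the question.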

Another solution:

df = (untidy.set_index([untidy.groupby('attribute').cumcount(), 'attribute'])['value']
            .unstack()
            .rename_axis(None, axis=1))

jezrael answered Oct 09 '22 03:10


An alternative approach is to first add a new column that counts how many 'age' rows have been seen so far, which then serves as a record number:

untidy["index"] = (untidy["attribute"] == "age").cumsum() - 1

Now untidy looks like

  attribute value  index
0       age    49      0
1       sex     M      0
2    height   176      0
3       age    27      1
4       sex     F      1
5    height   172      1
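
Note that this marker assumes every record starts with an 'age' row. A small sketch of what happens otherwise, using a made-up frame (`shuffled` is hypothetical, not from the question) in which the second record lists 'sex' first, compared with the per-attribute cumcount used in the other answer:

```python
import pandas as pd

# Hypothetical data: the second record lists 'sex' before 'age'.
shuffled = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['sex', 'F'], ['age', 27], ['height', 172]],
    columns=['attribute', 'value'])

# Record number derived from the position of the 'age' rows.
by_age = (shuffled['attribute'] == 'age').cumsum() - 1
# Record number derived from counting occurrences per attribute.
by_count = shuffled.groupby('attribute').cumcount()

print(by_age.tolist())    # [0, 0, 0, 0, 1, 1]
print(by_count.tolist())  # [0, 0, 0, 1, 1, 1]
```

With the 'age' marker, the 'F' row is attached to record 0, so the approach is best used when the attribute order is known to be fixed.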

You can then build a dataframe with a multi-index on index and attribute, and unstack the attribute level:

tidy = untidy.set_index(["index", "attribute"]).unstack()

Which leads to the following format

          value           
attribute   age height sex
index                     
0            49    176   M
1            27    172   F

The only remaining problem is that the columns are now a multi-index with one level too many. You can get rid of it by transposing so that the columns become the index, dropping the extra index level, and transposing back:

tidy = tidy.T.reset_index(level=0).drop("level_0", axis=1).T
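
As a possible shortcut, the extra column level can also be removed without transposing, assuming a pandas version that provides `DataFrame.droplevel`. A minimal sketch that rebuilds the intermediate frame from the question's data (the name `flat` is only for illustration):

```python
import pandas as pd

untidy = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['age', 27], ['sex', 'F'], ['height', 172]],
    columns=['attribute', 'value'])

# Same steps as above: record number, then unstack the attribute level.
untidy['index'] = (untidy['attribute'] == 'age').cumsum() - 1
tidy = untidy.set_index(['index', 'attribute']).unstack()

# Drop the outer ('value') level of the column multi-index directly.
flat = tidy.droplevel(0, axis=1)
print(flat.columns.tolist())  # ['age', 'height', 'sex']
```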

The final result is your tidy data frame

attribute age height sex
index                   
0          49    176   M
1          27    172   F

You can of course combine the second and third steps into one. I am not sure whether this is more idiomatic, but to me it is at least more intuitive.
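
One way the combined version might look: selecting the 'value' column before unstacking means the extra column level never appears, so no clean-up step is needed. This is a sketch under that assumption, using `assign` so that the original frame is left unmodified:

```python
import pandas as pd

untidy = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['age', 27], ['sex', 'F'], ['height', 172]],
    columns=['attribute', 'value'])

tidy = (
    # Record number: how many 'age' rows have been seen so far, starting at 0.
    untidy.assign(index=(untidy['attribute'] == 'age').cumsum() - 1)
          .set_index(['index', 'attribute'])['value']  # a Series, not a DataFrame
          .unstack()                                   # attribute level becomes the columns
)
print(tidy)
```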

Eelco van Vliet answered Oct 09 '22 04:10