Question
I have a dataframe untidy
attribute value
0 age 49
1 sex M
2 height 176
3 age 27
4 sex F
5 height 172
where the values in the 'attribute' column repeat periodically. The desired output is tidy:
age sex height
0 49 M 176
1 27 F 172
(The row and column order or additional labels don't matter, I can clean this up myself.)
Code for instantiation:
import pandas as pd
untidy = pd.DataFrame([['age', 49],['sex', 'M'],['height', 176],['age', 27],['sex', 'F'],['height', 172]], columns=['attribute', 'value'])
tidy = pd.DataFrame([[49, 'M', 176], [27, 'F', 172]], columns=['age', 'sex', 'height'])
Attempts
This looks like a simple pivot operation, but my initial approach introduces NaN values:
>>> untidy.pivot(columns='attribute', values='value')
attribute age height sex
0 49 NaN NaN
1 NaN NaN M
2 NaN 176 NaN
3 27 NaN NaN
4 NaN NaN F
5 NaN 172 NaN
Some messy attempts to fix this:
>>> untidy.pivot(columns='attribute', values='value').apply(lambda c: c.dropna().reset_index(drop=True))
attribute age height sex
0 49 176 M
1 27 172 F
>>> untidy.set_index([untidy.index//untidy['attribute'].nunique(), 'attribute']).unstack('attribute')
value
attribute age height sex
0 49 176 M
1 27 172 F
What's the idiomatic way to do this?
Use DataFrame.pivot with GroupBy.cumcount to build the new index values, and rename_axis to remove the axis names:
# 'idx' is a temporary helper column that numbers the records
df = (untidy.assign(idx=untidy.groupby('attribute').cumcount())
            .pivot(index='idx', columns='attribute', values='value')
            .rename_axis(index=None, columns=None))
print (df)
age height sex
0 49 176 M
1 27 172 F
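To see why the counter gives the right row labels, here is a minimal self-contained sketch. It assumes a recent pandas version, where pivot is called on the frame with column names; the helper column name `record` is only illustrative.

```python
import pandas as pd

untidy = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['age', 27], ['sex', 'F'], ['height', 172]],
    columns=['attribute', 'value'],
)

# cumcount numbers the occurrences of each attribute: the first 'age' gets 0,
# the second 'age' gets 1, and likewise for 'sex' and 'height'.
record = untidy.groupby('attribute').cumcount()
print(record.tolist())  # [0, 0, 0, 1, 1, 1]

# Using that counter as the row label gives pivot exactly one value
# per (row, column) pair, so no NaN padding appears.
df = (untidy.assign(record=record)   # 'record' is a hypothetical helper column
            .pivot(index='record', columns='attribute', values='value')
            .rename_axis(index=None, columns=None))
print(df)
```

Note that this relies on every attribute appearing once per record; a record with a missing attribute would shift later values into the wrong row.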
Another solution:
df = (untidy.set_index([untidy.groupby('attribute').cumcount(), 'attribute'])['value']
            .unstack()
            .rename_axis(None, axis=1))
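One detail that may matter afterwards: because 'value' mixes numbers and strings, the reshaped columns come out with object dtype. A small sketch, assuming you want numeric columns back, is to call infer_objects on the result:

```python
import pandas as pd

untidy = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['age', 27], ['sex', 'F'], ['height', 172]],
    columns=['attribute', 'value'],
)

df = (untidy.set_index([untidy.groupby('attribute').cumcount(), 'attribute'])['value']
            .unstack()
            .rename_axis(None, axis=1))

# Every column is object dtype at this point, since 'value' held mixed types.
# infer_objects converts columns that contain only integers back to an integer dtype.
df = df.infer_objects()
print(df.dtypes)
```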
An alternative approach is to first add a new column holding the cumulative count of 'age' rows, which acts as a record number:
untidy["index"] = (untidy["attribute"] == "age").cumsum() - 1
Now untidy looks like
attribute value index
0 age 49 0
1 sex M 0
2 height 176 0
3 age 27 1
4 sex F 1
5 height 172 1
With this column you can create a multi-index dataframe based on index and attribute, and unstack it:
tidy = untidy.set_index(["index", "attribute"]).unstack()
Which leads to the following format
value
attribute age height sex
index
0 49 176 M
1 27 172 F
The only remaining problem is that the columns are now a multi-index with one level too many. You can get rid of it by transposing so the columns become the index, dropping that index level, and transposing back:
tidy = tidy.T.reset_index(level=0).drop("level_0", axis=1).T
The final result is your tidy data frame
attribute age height sex
index
0 49 176 M
1 27 172 F
You can of course combine the second and third steps into one. I am not sure if this is more idiomatic, but for me it is at least more intuitive.
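As a rough sketch of how those steps could be combined, here are two variants that avoid the double transpose. Both assume every record starts with an 'age' row; the name `record` is only illustrative.

```python
import pandas as pd

untidy = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['age', 27], ['sex', 'F'], ['height', 172]],
    columns=['attribute', 'value'],
)

# Assumption: each 'age' row opens a new record, so the running count of
# 'age' rows (minus 1) is the record number. Renamed to avoid a duplicate
# level name, since the boolean mask inherits the name 'attribute'.
record = ((untidy['attribute'] == 'age').cumsum() - 1).rename('record')

# Variant 1: unstack the whole frame, then drop the outer 'value' column level.
wide = untidy.set_index([record, 'attribute']).unstack()
flat = wide.droplevel(0, axis=1)

# Variant 2: select the 'value' Series first, so unstack never creates
# the extra column level.
tidy = untidy.set_index([record, 'attribute'])['value'].unstack()

print(tidy.rename_axis(index=None, columns=None))
```

Unlike the cumcount-based key, this key depends on row order: it breaks if a record does not begin with 'age'.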