Consider this Pandas dataframe:
import pandas as pd

df = pd.DataFrame({
    'User ID': [1, 2, 2, 3],
    'Cupcakes': [1, 5, 4, 2],
    'Biscuits': [2, 5, 3, 3],
    'Score': [0.65, 0.12, 0.15, 0.9]
})
i.e.:
   User ID  Cupcakes  Biscuits  Score
0        1         1         2   0.65
1        2         5         5   0.12
2        2         4         3   0.15
3        3         2         3   0.90
I want to tidy ("melt") this data so that each dessert type is a separate observation. But I also want to keep the score for each user.
Using melt() directly doesn't work:
df.melt(
    id_vars=['User ID'],
    value_vars=['Cupcakes', 'Biscuits'],
    var_name='Dessert', value_name='Enjoyment'
)
...gives:
   User ID   Dessert  Enjoyment
0        1  Cupcakes          1
1        2  Cupcakes          5
2        2  Cupcakes          4
3        3  Cupcakes          2
4        1  Biscuits          2
5        2  Biscuits          5
6        2  Biscuits          3
7        3  Biscuits          3
I've lost the score data!
I can't use wide_to_long() because I don't have a common "stub name" for my dessert types.
I can't join or merge the tidied data with the original data because the tidied data is reindexed and the user ID is not unique for each observation.
How do I tidy this data but retain columns that aren't involved in the tidying?
Add the Score column to id_vars in DataFrame.melt:
id_vars : tuple, list, or ndarray, optional
    Column(s) to use as identifier variables.
df1 = df.melt(
    id_vars=['User ID', 'Score'],
    value_vars=['Cupcakes', 'Biscuits'],
    var_name='Dessert', value_name='Enjoyment'
)
print(df1)
   User ID  Score   Dessert  Enjoyment
0        1   0.65  Cupcakes          1
1        2   0.12  Cupcakes          5
2        2   0.15  Cupcakes          4
3        3   0.90  Cupcakes          2
4        1   0.65  Biscuits          2
5        2   0.12  Biscuits          5
6        2   0.15  Biscuits          3
7        3   0.90  Biscuits          3
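For the concern in the question that the melted data is reindexed and so can't be joined back, here is a rough alternative sketch. It assumes pandas 1.1 or newer, where melt accepts ignore_index, and it assumes the original frame has a unique index (true for the default 0..3 index here). The names melted and tidy are just illustrative. Putting Score in id_vars is still the simpler route; this only shows that a join is possible when the original row labels are kept.

```python
import pandas as pd

df = pd.DataFrame({
    'User ID': [1, 2, 2, 3],
    'Cupcakes': [1, 5, 4, 2],
    'Biscuits': [2, 5, 3, 3],
    'Score': [0.65, 0.12, 0.15, 0.9],
})

# ignore_index=False keeps the original row labels (0..3), repeated once
# per melted column, instead of renumbering the result 0..7.
melted = df.melt(
    id_vars=['User ID'],
    value_vars=['Cupcakes', 'Biscuits'],
    var_name='Dessert', value_name='Enjoyment',
    ignore_index=False,
)

# The original index is unique, so an index-based join attaches the right
# Score to every melted row even though User ID has duplicates.
tidy = melted.join(df[['Score']])
print(tidy)
```

The join works on row labels rather than on User ID, which is why the duplicated user 2 is not a problem here.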
If you need to melt every column except User ID and Score, omit value_vars:
df.melt(
    id_vars=['User ID', 'Score'],
    var_name='Dessert', value_name='Enjoyment'
)
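As a quick sanity check, here is a minimal sketch comparing the two calls on the question's data. The variable names explicit and implicit are only for illustration. It relies on melt using every column not listed in id_vars when value_vars is left out, so for this frame both forms should produce the same result.

```python
import pandas as pd

df = pd.DataFrame({
    'User ID': [1, 2, 2, 3],
    'Cupcakes': [1, 5, 4, 2],
    'Biscuits': [2, 5, 3, 3],
    'Score': [0.65, 0.12, 0.15, 0.9],
})

# value_vars spelled out explicitly.
explicit = df.melt(
    id_vars=['User ID', 'Score'],
    value_vars=['Cupcakes', 'Biscuits'],
    var_name='Dessert', value_name='Enjoyment',
)

# value_vars omitted: all columns that are not id_vars get melted.
implicit = df.melt(
    id_vars=['User ID', 'Score'],
    var_name='Dessert', value_name='Enjoyment',
)

# Compare the two results cell by cell.
print(explicit.equals(implicit))
```

Note that the shorter form will also pick up any extra columns added to df later, so list value_vars explicitly if that would be unwanted.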