I have a Polars dataframe that contains multiple questions and answers. The problem is that each answer is contained in its own column, which means that I have a lot of redundant information. Therefore, I would like to have only one column for the questions and another for the answers.
Here is an example of the data:
import polars as pl
data = {
"ID" : [1,1,1],
"Question" : ["A","B","C"],
"Answer A" : ["Answer A", "Answer A", "Answer A"],
"Answer B" : ["Answer B", "Answer B", "Answer B"],
"Answer C" : ["Answer C", "Answer C", "Answer C"]
}
df = pl.DataFrame(data)
df
My approach is to create a separate filtered dataframe for each question and then concatenate them. However, I would like a more elegant solution to this problem.
My current approach:
A_df = (
df
.drop(["Answer B","Answer C"])
.filter(pl.col("Question") == "A")
.rename({"Answer A" : "Answer"})
)
B_df = (
df
.drop(["Answer A","Answer C"])
.filter(pl.col("Question") == "B")
.rename({"Answer B" : "Answer"})
)
C_df = (
df
.drop(["Answer A","Answer B"])
.filter(pl.col("Question") == "C")
.rename({"Answer C" : "Answer"})
)
df_final = pl.concat([A_df,B_df,C_df])
TLDR.
import polars.selectors as cs
(
df
.unpivot(
on=cs.starts_with("Answer"),
index=["ID", "Question"],
variable_name="Source",
value_name="Answer",
)
.filter(
pl.col("Question") == pl.col("Source").str.strip_prefix("Answer ")
)
.drop("Source")
)
shape: (3, 3)
┌─────┬──────────┬───────────────────┐
│ ID  ┆ Question ┆ Answer            │
│ --- ┆ ---      ┆ ---               │
│ i64 ┆ str      ┆ str               │
╞═════╪══════════╪═══════════════════╡
│ 1   ┆ A        ┆ Some answer       │
│ 1   ┆ B        ┆ Some other answer │
│ 1   ┆ C        ┆ Another answer    │
└─────┴──────────┴───────────────────┘
Explanation.
A more general approach is to unpivot the dataframe on the answer columns with pl.DataFrame.unpivot (called melt in older Polars versions). This gives you a long-format dataframe that contains, for each original row, one row per answer column.
import polars.selectors as cs
(
df
.unpivot(
on=cs.starts_with("Answer"),
index=["ID", "Question"],
variable_name="Source",
value_name="Answer",
)
)
shape: (9, 4)
┌─────┬──────────┬──────────┬───────────────────┐
│ ID  ┆ Question ┆ Source   ┆ Answer            │
│ --- ┆ ---      ┆ ---      ┆ ---               │
│ i64 ┆ str      ┆ str      ┆ str               │
╞═════╪══════════╪══════════╪═══════════════════╡
│ 1   ┆ A        ┆ Answer A ┆ Some answer       │
│ 1   ┆ B        ┆ Answer A ┆ Some answer       │
│ 1   ┆ C        ┆ Answer A ┆ Some answer       │
│ 1   ┆ A        ┆ Answer B ┆ Some other answer │
│ 1   ┆ B        ┆ Answer B ┆ Some other answer │
│ 1   ┆ C        ┆ Answer B ┆ Some other answer │
│ 1   ┆ A        ┆ Answer C ┆ Another answer    │
│ 1   ┆ B        ┆ Answer C ┆ Another answer    │
│ 1   ┆ C        ┆ Answer C ┆ Another answer    │
└─────┴──────────┴──────────┴───────────────────┘
From here, it is easy to keep only the rows in which the Source column, once the "Answer " prefix is stripped, matches the Question column (see TLDR).
If there are a limited number of answer columns, this can be done with a simple pl.when().then() chain:
df2 = df.select(
"ID",
"Question",
pl.when(pl.col("Question") == "A")
.then("Answer A")
.when(pl.col("Question") == "B")
.then("Answer B")
.when(pl.col("Question") == "C")
.then("Answer C")
.alias("Answer"),
)
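This sets Answer to the value from the column Answer A if Question == "A", and so on. In recent Polars versions, a string passed to .then() is parsed as a column name rather than a literal. Because there is no .otherwise(), rows that match none of the conditions get null.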