
How to reorder duplicate answers in a Polars dataframe

I have a Polars dataframe that contains multiple questions and answers. The problem is that each answer is contained in its own column, which means that I have a lot of redundant information. Therefore, I would like to have only one column for the questions and another for the answers.

Here is an example of the data:

import polars as pl

data = {
    "ID" : [1,1,1],
    "Question" : ["A","B","C"],
    "Answer A" : ["Answer A", "Answer A", "Answer A"],
    "Answer B" : ["Answer B", "Answer B", "Answer B"],
    "Answer C" : ["Answer C", "Answer C", "Answer C"]
}

df = pl.DataFrame(data)
df

My approach is to create one filtered dataframe per question and then concatenate them. However, I would like a more elegant solution to this problem.

My current approach:

A_df = (
    df
    .drop(["Answer B","Answer C"])
    .filter(pl.col("Question") == "A")
    .rename({"Answer A" : "Answer"})
)

B_df = (
    df
    .drop(["Answer A","Answer C"])
    .filter(pl.col("Question") == "B")
    .rename({"Answer B" : "Answer"})
)

C_df = (
    df
    .drop(["Answer A","Answer B"])
    .filter(pl.col("Question") == "C")
    .rename({"Answer C" : "Answer"})
)

df_final = pl.concat([A_df,B_df,C_df])
user24900119 asked Dec 07 '25 08:12


2 Answers

TLDR.

import polars as pl
import polars.selectors as cs

(
    df
    .unpivot(
        on=cs.starts_with("Answer"),
        index=["ID", "Question"],
        variable_name="Source",
        value_name="Answer",
    )
    .filter(
        pl.col("Question") == pl.col("Source").str.strip_prefix("Answer ")
    )
    .drop("Source")
)
shape: (3, 3)
┌─────┬──────────┬───────────────────┐
│ ID  ┆ Question ┆ Answer            │
│ --- ┆ ---      ┆ ---               │
│ i64 ┆ str      ┆ str               │
╞═════╪══════════╪═══════════════════╡
│ 1   ┆ A        ┆ Some answer       │
│ 1   ┆ B        ┆ Some other answer │
│ 1   ┆ C        ┆ Another answer    │
└─────┴──────────┴───────────────────┘

Explanation.

A more general approach is to unpivot (melt) the dataframe on the answer columns with pl.DataFrame.unpivot. This gives a long-format dataframe that contains, for each original row, one row per answer column.

import polars.selectors as cs

(
    df
    .unpivot(
        on=cs.starts_with("Answer"),
        index=["ID", "Question"],
        variable_name="Source",
        value_name="Answer",
    )
)
shape: (9, 4)
┌─────┬──────────┬──────────┬───────────────────┐
│ ID  ┆ Question ┆ Source   ┆ Answer            │
│ --- ┆ ---      ┆ ---      ┆ ---               │
│ i64 ┆ str      ┆ str      ┆ str               │
╞═════╪══════════╪══════════╪═══════════════════╡
│ 1   ┆ A        ┆ Answer A ┆ Some answer       │
│ 1   ┆ B        ┆ Answer A ┆ Some answer       │
│ 1   ┆ C        ┆ Answer A ┆ Some answer       │
│ 1   ┆ A        ┆ Answer B ┆ Some other answer │
│ 1   ┆ B        ┆ Answer B ┆ Some other answer │
│ 1   ┆ C        ┆ Answer B ┆ Some other answer │
│ 1   ┆ A        ┆ Answer C ┆ Another answer    │
│ 1   ┆ B        ┆ Answer C ┆ Another answer    │
│ 1   ┆ C        ┆ Answer C ┆ Another answer    │
└─────┴──────────┴──────────┴───────────────────┘

From here, strip the "Answer " prefix from the Source column and keep only the rows where the result equals the Question column (see TLDR).

Hericks answered Dec 12 '25 11:12


If there are only a few answer columns, this can be done with a simple pl.when().then() chain:

df2 = df.select(
    "ID",
    "Question",
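    # A plain string passed to .then() is read as a column name, not a literal.
    pl.when(pl.col("Question") == "A")
    .then("Answer A")
    .when(pl.col("Question") == "B")
    .then("Answer B")
    .when(pl.col("Question") == "C")
    .then("Answer C")
    # No .otherwise(): rows matching none of the conditions become null.
    .alias("Answer"),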
)

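This takes the value from the column Answer A as Answer if Question == "A", and so on. Strings passed to .then() are parsed as column names, and because there is no .otherwise(), any row that matches none of the conditions gets a null Answer.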

Dogbert answered Dec 12 '25 11:12


