How to reorder duplicate answers in a Polars dataframe

Question

I have a Polars dataframe that contains multiple questions and answers. The problem is that each answer is contained in its own column, which means that I have a lot of redundant information. Therefore, I would like to have only one column for the questions and another for the answers.

Here is an example of the data:

import polars as pl

data = {
    "ID" : [1,1,1],
    "Question" : ["A","B","C"],
    "Answer A" : ["Answer A", "Answer A", "Answer A"],
    "Answer B" : ["Answer B", "Answer B", "Answer B"],
    "Answer C" : ["Answer C", "Answer C", "Answer C"]
}

df = pl.DataFrame(data)
df

My approach is to create one filtered dataframe per question and then concatenate them. However, I would like a more elegant approach to this problem.

My current approach:

A_df = (
    df
    .drop(["Answer B","Answer C"])
    .filter(pl.col("Question") == "A")
    .rename({"Answer A" : "Answer"})
)

B_df = (
    df
    .drop(["Answer A","Answer C"])
    .filter(pl.col("Question") == "B")
    .rename({"Answer B" : "Answer"})
)

C_df = (
    df
    .drop(["Answer A","Answer B"])
    .filter(pl.col("Question") == "C")
    .rename({"Answer C" : "Answer"})
)

df_final = pl.concat([A_df,B_df,C_df])

Hericks · Accepted Answer

TLDR.

import polars as pl
import polars.selectors as cs

(
    df
    .unpivot(
        on=cs.starts_with("Answer"),
        index=["ID", "Question"],
        variable_name="Source",
        value_name="Answer",
    )
    .filter(
        pl.col("Question") == pl.col("Source").str.strip_prefix("Answer ")
    )
    .drop("Source")
)

shape: (3, 3)
┌─────┬──────────┬───────────────────┐
│ ID  ┆ Question ┆ Answer            │
│ --- ┆ ---      ┆ ---               │
│ i64 ┆ str      ┆ str               │
╞═════╪══════════╪═══════════════════╡
│ 1   ┆ A        ┆ Some answer       │
│ 1   ┆ B        ┆ Some other answer │
│ 1   ┆ C        ┆ Another answer    │
└─────┴──────────┴───────────────────┘

Explanation.

A more general approach is to melt (pl.DataFrame.unpivot) the dataframe on the answer columns. This gives you a long-format dataframe that contains, for each original row, one row per answer column.

import polars.selectors as cs

(
    df
    .unpivot(
        on=cs.starts_with("Answer"),
        index=["ID", "Question"],
        variable_name="Source",
        value_name="Answer",
    )
)

shape: (9, 4)
┌─────┬──────────┬──────────┬───────────────────┐
│ ID  ┆ Question ┆ Source   ┆ Answer            │
│ --- ┆ ---      ┆ ---      ┆ ---               │
│ i64 ┆ str      ┆ str      ┆ str               │
╞═════╪══════════╪══════════╪═══════════════════╡
│ 1   ┆ A        ┆ Answer A ┆ Some answer       │
│ 1   ┆ B        ┆ Answer A ┆ Some answer       │
│ 1   ┆ C        ┆ Answer A ┆ Some answer       │
│ 1   ┆ A        ┆ Answer B ┆ Some other answer │
│ 1   ┆ B        ┆ Answer B ┆ Some other answer │
│ 1   ┆ C        ┆ Answer B ┆ Some other answer │
│ 1   ┆ A        ┆ Answer C ┆ Another answer    │
│ 1   ┆ B        ┆ Answer C ┆ Another answer    │
│ 1   ┆ C        ┆ Answer C ┆ Another answer    │
└─────┴──────────┴──────────┴───────────────────┘

From here, after stripping the "Answer " prefix from the Source column, it is easy to filter for rows in which Source matches the question (see TLDR).

Dogbert · Answer

If there are a limited number of answer columns, this can be done with a simple pl.when().then() chain:

df2 = df.select(
    "ID",
    "Question",
    pl.when(pl.col("Question") == "A")
    .then("Answer A")
    .when(pl.col("Question") == "B")
    .then("Answer B")
    .when(pl.col("Question") == "C")
    .then("Answer C")
    .alias("Answer"),
)

Strings passed to .then() are parsed as column names, so this takes the value from the column Answer A as Answer if Question == "A", and so on.

Tags: python, dataframe, python-polars

user24900119
