I am involving in a data-mining project and have some problems whiling doing feature engineering. One of my goal is to aggregate data according to the primary key, and to produce new columns. So I write this:
df = df.group_by("case_id").agg(date_exprs(df,df_base))
def date_expr(df, df_base):
# Join df and df_base on 'case_id' column
df = df.join(df_base[['case_id','date_decision']], on="case_id", how="left")
for col in df.columns:
if col[-1] in ("D",):
df = df.with_columns(pl.col(col) - pl.col("date_decision"))
df = df.with_columns(pl.col(col).dt.total_days())
cols = [col for col in df.columns if col[-1] in ("D",)]
# Generate expressions for max, min, mean, mode, and std of date differences
expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
expr_min = [pl.min(col).alias(f"min_{col}") for col in cols]
expr_mean = [pl.mean(col).alias(f"mean_{col}") for col in cols]
expr_mode = [pl.mode(col).alias(f"mode_{col}") for col in cols]
expr_std = [pl.std(col).alias(f"std_{col}") for col in cols]
return expr_max + expr_min + expr_mean + expr_mode + expr_std
However, there goes an error: AttributeError: module 'polars' has no attribute 'mode'.
I looked up document of polars on github and found there was no Dataframe.mode() but Series.mode(), which I thought might be the reason of error? I referred to chatGPT, which could not help because these codes with error were just from it.
Besides, here is only an example of dealing with float type. What about string type? Can I also apply your method?
I am looking forward to your kind help!!
In your example it fails because there's no syntactic sugar for Expr.mode()
as it is for aggregate functions (for example, pl.max()
is a syntactic sugar for Expr.max()
. The mode()
is actually not aggregation function but computation one, which means it just calculates the most occuring value(s) within the column.
So, given the DataFrame like this:
df = (
pl.DataFrame({
'aD' : [200, 200, 300, 400, 1, 3],
'bD': [2, 3, 6, 4, 5, 1],
'case_id': [1,1,1,2,2,2]
})
)
┌─────┬─────┬─────────┐
│ aD ┆ bD ┆ case_id │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════════╡
│ 200 ┆ 2 ┆ 1 │
│ 200 ┆ 3 ┆ 1 │
│ 300 ┆ 6 ┆ 1 │
│ 400 ┆ 4 ┆ 2 │
│ 1 ┆ 5 ┆ 2 │
│ 3 ┆ 1 ┆ 2 │
└─────┴─────┴─────────┘
you can calculate mode()
with the following code:
df.with_columns(
pl.col('aD').mode(),
pl.col('bD').mode()
)
┌─────┬─────┬─────────┐
│ aD ┆ bD ┆ case_id │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════════╡
│ 200 ┆ 1 ┆ 1 │
│ 200 ┆ 5 ┆ 1 │
│ 200 ┆ 6 ┆ 1 │
│ 200 ┆ 4 ┆ 2 │
│ 200 ┆ 2 ┆ 2 │
│ 200 ┆ 3 ┆ 2 │
└─────┴─────┴─────────┘
given that, we still can calculate the results you need. I'll simplify you function a bit by using selectors
and Expr.prefix()
:
import polars.selectors as cs
def date_expr():
# Generate expressions for max, min, mean, mode, and std of date differences
expr_max = cs.ends_with('D').max().name.prefix("max_")
expr_min = cs.ends_with('D').min().name.prefix("min_")
expr_mean = cs.ends_with('D').mean().name.prefix("mean_")
expr_mode = cs.ends_with('D').mode().first().name.prefix("mode_")
expr_std = cs.ends_with('D').std().name.prefix("std_")
return expr_max, expr_min, expr_mean, expr_std, expr_mode
df.group_by("case_id").agg(date_expr())
┌─────────┬────────┬────────┬────────┬───┬────────────┬──────────┬─────────┬─────────┐
│ case_id ┆ max_aD ┆ max_bD ┆ min_aD ┆ … ┆ std_aD ┆ std_bD ┆ mode_aD ┆ mode_bD │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ f64 ┆ f64 ┆ i64 ┆ i64 │
╞═════════╪════════╪════════╪════════╪═══╪════════════╪══════════╪═════════╪═════════╡
│ 2 ┆ 400 ┆ 5 ┆ 1 ┆ … ┆ 229.787583 ┆ 2.081666 ┆ 3 ┆ 4 │
│ 1 ┆ 300 ┆ 6 ┆ 200 ┆ … ┆ 57.735027 ┆ 2.081666 ┆ 200 ┆ 2 │
└─────────┴────────┴────────┴────────┴───┴────────────┴──────────┴─────────┴─────────┘
note that I've used Expr.first()
to the one of the values for mode
- as there might be different ones with the same frequency. You can use list
expressions to specify which one you'd like to get.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With