
How can I efficiently get both a column and a scalar using Polars expressions?

Polars recommends using expressions to avoid eager execution, so that all expressions are executed together at the very end. I am unsure how this works if I want both a column and a scalar. For example, let's say I start with a single column 'test' and want to calculate its mean and produce a centered column. It is trivial to express this using expressions:

>>> import polars as pl
>>> import numpy as np
>>> df = pl.DataFrame({"test": np.array([0.,1,2,3,4])})
>>> df
shape: (5, 1)
┌──────┐
│ test │
│ ---  │
│ f64  │
╞══════╡
│ 0.0  │
│ 1.0  │
│ 2.0  │
│ 3.0  │
│ 4.0  │
└──────┘

>>> mean = pl.col('test').mean().alias('mean')
>>> df.select(mean)
shape: (1, 1)
┌──────┐
│ mean │
│ ---  │
│ f64  │
╞══════╡
│ 2.0  │
└──────┘
>>> centered = pl.col('test') - mean
>>> df.select(centered)
shape: (5, 1)
┌──────┐
│ test │
│ ---  │
│ f64  │
╞══════╡
│ -2.0 │
│ -1.0 │
│ 0.0  │
│ 1.0  │
│ 2.0  │
└──────┘

Of course you could select them both, but then the mean is broadcast over all rows, which does not seem storage-efficient. Is there a good way to obtain both the column and the scalar?

In this case the best thing to do may be to calculate the mean eagerly and then proceed with the centering. But of course this might not work as well for more general cases.

Felix B. asked Nov 03 '25 22:11


2 Answers

Polars has a concept called a ScalarColumn, which holds scalars. Just because the value is broadcast on your screen doesn't mean it is copied for every row in memory. Unfortunately, that is not guaranteed, so sometimes you will get that copy.

That said, if you want what you see to match the memory layout, then .implode() your non-scalar results.

df.select(
    pl.col("test").mean().alias("mean"),
    (pl.col("test")-pl.col("test").mean()).implode().alias("centered")
    )

shape: (1, 2)
┌──────┬─────────────────────┐
│ mean ┆ centered            │
│ ---  ┆ ---                 │
│ f64  ┆ list[f64]           │
╞══════╪═════════════════════╡
│ 2.0  ┆ [-2.0, -1.0, … 2.0] │
└──────┴─────────────────────┘
Dean MacGregor answered Nov 05 '25 11:11


The "Polars" way to do the subtraction would be the following:

import polars as pl
import numpy as np

df = pl.DataFrame({"test": np.array([0.0, 1, 2, 3, 4])})


output = df.with_columns(
    (pl.col("test") - pl.col("test").mean()).alias("test_less_mean")
)
print(output)

shape: (5, 2)
┌──────┬────────────────┐
│ test ┆ test_less_mean │
│ ---  ┆ ---            │
│ f64  ┆ f64            │
╞══════╪════════════════╡
│ 0.0  ┆ -2.0           │
│ 1.0  ┆ -1.0           │
│ 2.0  ┆ 0.0            │
│ 3.0  ┆ 1.0            │
│ 4.0  ┆ 2.0            │
└──────┴────────────────┘

However, if you want to store the mean as a value to do further operations with, I would suggest:

import polars as pl
import numpy as np

df = pl.DataFrame({"test": np.array([0.0, 1, 2, 3, 4])})
mean = (
    df.select(pl.col("test").mean()).item()
)

output = (
    df.with_columns(
        (pl.col("test") - mean).alias("test_less_mean")
    )
)
print(output)

shape: (5, 2)
┌──────┬────────────────┐
│ test ┆ test_less_mean │
│ ---  ┆ ---            │
│ f64  ┆ f64            │
╞══════╪════════════════╡
│ 0.0  ┆ -2.0           │
│ 1.0  ┆ -1.0           │
│ 2.0  ┆ 0.0            │
│ 3.0  ┆ 1.0            │
│ 4.0  ┆ 2.0            │
└──────┴────────────────┘
Simon answered Nov 05 '25 12:11


