How can I efficiently get both a column and a scalar using Polars expressions?

Question

Polars suggests the usage of Expressions to avoid eager execution and then execute all expressions together at the very end. I am unsure how this is possible if I want a column and a scalar. For example let's say I start with a single column 'test' and want to calculate its mean and produce a centered column. It is trivial to express this using expressions:

>>> import polars as pl
>>> import numpy as np
>>> df = pl.DataFrame({"test": np.array([0.,1,2,3,4])})
>>> df
shape: (5, 1)
┌──────┐
│ test │
│ ---  │
│ f64  │
╞══════╡
│ 0.0  │
│ 1.0  │
│ 2.0  │
│ 3.0  │
│ 4.0  │
└──────┘

>>> mean = pl.col('test').mean().alias('mean')
>>> df.select(mean)
shape: (1, 1)
┌──────┐
│ mean │
│ ---  │
│ f64  │
╞══════╡
│ 2.0  │
└──────┘
>>> centered = pl.col('test') - mean
>>> df.select(centered)
shape: (5, 1)
┌──────┐
│ test │
│ ---  │
│ f64  │
╞══════╡
│ -2.0 │
│ -1.0 │
│ 0.0  │
│ 1.0  │
│ 2.0  │
└──────┘

Of course you could select them both in one context, but then the mean gets broadcast over all rows, which does not seem storage efficient. Is there a good way to obtain both the column and the scalar?

In this case the best thing to do may be to calculate the mean eagerly and then proceed with the centering. But of course this might not work as well for more general cases.

Dean MacGregor · Accepted Answer

Polars has a concept called a ScalarColumn which holds scalars. Just because the value is broadcast on your screen doesn't mean it is always copied for every row. Unfortunately, that is not guaranteed, so sometimes you will get that copy.

That said, if you want what you see to match the memory layout, then what you want to do is .implode() your non-scalar results. That packs the whole column into a single list value, so the result has one row with the scalar next to it.

df.select(
    pl.col("test").mean().alias("mean"),
    (pl.col("test") - pl.col("test").mean()).implode().alias("centered"),
)

shape: (1, 2)
┌──────┬─────────────────────┐
│ mean ┆ centered            │
│ ---  ┆ ---                 │
│ f64  ┆ list[f64]           │
╞══════╪═════════════════════╡
│ 2.0  ┆ [-2.0, -1.0, … 2.0] │
└──────┴─────────────────────┘

Simon · Answer

The "polars" way to do the subtraction would be the following:

import polars as pl
import numpy as np

df = pl.DataFrame({"test": np.array([0.0, 1, 2, 3, 4])})


output = df.with_columns(
    (pl.col("test") - pl.col("test").mean()).alias("test_less_mean")
)
print(output)

shape: (5, 2)
┌──────┬────────────────┐
│ test ┆ test_less_mean │
│ ---  ┆ ---            │
│ f64  ┆ f64            │
╞══════╪════════════════╡
│ 0.0  ┆ -2.0           │
│ 1.0  ┆ -1.0           │
│ 2.0  ┆ 0.0            │
│ 3.0  ┆ 1.0            │
│ 4.0  ┆ 2.0            │
└──────┴────────────────┘

However, if you want to store the mean as a value to do further operations with, I would suggest:

import polars as pl
import numpy as np

df = pl.DataFrame({"test": np.array([0.0, 1, 2, 3, 4])})
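# Compute the mean eagerly and pull it out as a plain Python float.
mean = df.select(pl.col("test").mean()).item()

# Reuse the stored value when building the centered column.
output = df.with_columns(
    (pl.col("test") - mean).alias("test_less_mean")
)
print(output)
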
shape: (5, 2)
┌──────┬────────────────┐
│ test ┆ test_less_mean │
│ ---  ┆ ---            │
│ f64  ┆ f64            │
╞══════╪════════════════╡
│ 0.0  ┆ -2.0           │
│ 1.0  ┆ -1.0           │
│ 2.0  ┆ 0.0            │
│ 3.0  ┆ 1.0            │
│ 4.0  ┆ 2.0            │
└──────┴────────────────┘

Tags:

python

dataframe

python-polars

Felix B.
