Aggregating multiple columns in polars with missing values

Question

I am trying to perform some aggregations. I love polars, but there are certain things I have been unable to do. Here are my approach and questions for reference.

import polars as pl
import polars.selectors as cs

import numpy as np

data = pl.DataFrame({'x': ['a', 'b', 'a', 'b', 'a', 'a', 'a', 'b', 'a'],
                     'y': [2, 3, 4, 5, 6, 7, 8, 9, 10],
                     'z': [4, np.nan, np.nan, 8, 1, 1, 3, 4, 0],
                     'm': [np.nan, 8, 1, np.nan, 3, 4, 8, 7, 1]})

I have a DataFrame like the one above. Here are my questions and the corresponding attempts.

1. How do I calculate multiple summaries on multiple columns? I get a duplicate column error; how do I fix it?

Attempt:

data.group_by('x').agg(pl.all().mean(),
                       pl.all().sum())

2. Why does the median come out as a valid value while the mean doesn't? Possible answer: is it because the median is calculated by sorting and selecting the middle value, and since the central value here is not null, the result is valid? (I am not sure this is the reason.)

print(data.select(pl.col('m').median())) ## line 1
print(data.select(pl.col('m').mean())) ## line 2

3. If I replace np.nan with None, the mean calculation on "line 2" in the code above works fine. Why?

4. Why doesn't the following work? I get a ComputeError that says "expanding more than one `col` is not allowed". What does that really mean? Basically, I want to filter the rows that have a missing value in either column.


data.filter(pl.col(['z']).is_nan() | pl.col(['m']).is_nan())

5. How do I replace NaN in multiple columns in one go? I wrote this code and it works, but it's clunky. Is there a better way?

mean_impute = np.nanmean(data.select(pl.col(['z', 'm'])).to_numpy(), axis=0)


def replace_na(data, colname, i):
    return data.with_columns(pl.when(pl.col(colname).is_nan()
                                        ).then(mean_impute[i]).otherwise(pl.col(colname)).alias(colname)).select(colname).to_numpy().flatten()

data.with_columns(z = replace_na(data, 'z', 0),
                  m = replace_na(data, 'm', 1))

Thanks for reading the question and answering. I don't want to post a duplicate on SO. I understand the rules, so please let me know if any of these are duplicates and I will gladly delete them. I either couldn't solve some of these or wrote a solution that might not be great. Thanks again!

Edit:

running python version: '3.10.9'

running polars version: '0.20.31'

Henry Harbeck · Accepted Answer

1:

Aggregations will take the input column's name, so you need to alias if you are doing multiple aggregations on one or more columns.

This can be achieved with the .name.suffix, .name.prefix and .name.map methods, which rename multi-column expressions (such as pl.all()) at once. .alias is another option if you are just renaming a single column.

# Assuming `data` from the question is already defined

# Solution - multiple aggregations on a single column
data.group_by('x').agg(
    pl.col("y").sum().alias("y_sum"),
    pl.col("y").mean().alias("y_mean"),
)

# Solution - multiple aggregations on multiple columns
data.group_by('x').agg(
    pl.all().sum().name.suffix("_sum"),
    pl.all().mean().name.suffix("_mean"),
)

Expressions are composable, so you can use alias, suffix, prefix, etc. according to your needs.

2 & 3:

There is an important distinction to be made between NaN and null in Polars. null is meant for missing data while NaN is a type of floating point data. In your data DataFrame, you can see columns z and m are converted to floating point due to the presence of NaNs.

If you are looking to represent missing data, convert NaNs to null, or define the data with None instead of np.nan. The Polars documentation on missing data has more information.

NaNs are propagated to the mean. I suspect your possible answer relating to the median is correct. This behavior is common, as the comparison below shows: NumPy has a dedicated function to ignore NaNs, and pandas probably defaults to skipping NaNs because there is no null in pandas.

# Assumes np and pl are imported as in the question
import pandas as pd

np.mean(np.array([0, 1, 2, np.nan])) # NaN
np.nanmean(np.array([0, 1, 2, np.nan])) # 1.0
pd.DataFrame([0, 1, 2, np.nan]).mean() # 1.0
pd.DataFrame([0, 1, 2, np.nan]).mean(skipna=False) # NaN
pl.DataFrame([0, 1, 2, np.nan]).mean() # NaN
pl.DataFrame([0, 1, 2, None]).mean() # 1.0

As you observed, Polars skips nulls when computing the mean, and null is also the preferred way to represent missing data.

4:

It looks to be the square brackets inside col that cause the error:

ComputeError: expanding more than one `col` is not allowed

From my intuition based on the error, the square brackets indicate to Polars to expand the expression to more than one column, which is not valid when doing a logical or with another column. Removing the square brackets fixes it.

data.filter(pl.col('z').is_nan() | pl.col('m').is_nan())

5:

The code you posted looks to be replacing the NaNs with the mean.

If you just want to replace NaN with null, this will do the trick:

data.with_columns(pl.col("z", "m").fill_nan(None))

And if you want to replace NaNs with the mean, still prefer replacing NaNs with null first; then you won't have to go to NumPy to calculate the mean.

data.with_columns(
    pl.when(pl.col("z", "m").is_nan())
    # Fill the NaNs with null, then calculate the mean
    .then(pl.col("z", "m").fill_nan(None).mean())
    # Otherwise keep the original value
    .otherwise(pl.col("z", "m"))
)

Tags: python, dataframe, python-polars

PKumar
