In Pandas, one can perform boolean operations on boolean DataFrames with the all and any methods, providing an axis argument. For example:
import pandas as pd
data = dict(A=["a","b","?"], B=["d","?","f"])
pd_df = pd.DataFrame(data)
To get a boolean mask on columns containing the element "?":
(pd_df == "?").any(axis=0)
and to get a mask on rows:
(pd_df == "?").any(axis=1)
Also, to get a single boolean:
(pd_df == "?").any().any()
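For reference, the three Pandas reductions above can be collected into one runnable snippet. This is only a sketch of the calls already shown; the names mask, col_mask, row_mask and any_hit are introduced here for illustration and are not part of the original code.

```python
import pandas as pd

data = dict(A=["a", "b", "?"], B=["d", "?", "f"])
pd_df = pd.DataFrame(data)

# Element-wise comparison gives a boolean DataFrame of the same shape.
mask = pd_df == "?"

col_mask = mask.any(axis=0)        # one boolean per column
row_mask = mask.any(axis=1)        # one boolean per row
any_hit = bool(mask.any().any())   # single boolean for the whole frame

print(col_mask.tolist())  # [True, True]
print(row_mask.tolist())  # [False, True, True]
print(any_hit)            # True
```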
In comparison, in polars, the best I could come up with is the following:
import polars as pl
pl_df = pl.DataFrame(data)
To get a mask on columns:
(pl_df == "?").select(pl.all().any())
To get a mask on rows:
pl_df.select(
    pl.concat_list(pl.all() == "?").alias("mask")
).select(
    pl.col("mask").list.eval(pl.element().any()).list.first()
)
And to get a single boolean value:
pl_df.select(
    pl.concat_list(pl.all() == "?").alias("mask")
).select(
    pl.col("mask").list.eval(pl.element().any()).list.first()
)["mask"].any()
The last two cases seem particularly verbose and convoluted for such a straightforward task, so I'm wondering whether there are shorter/faster equivalents?
Polars added dedicated horizontal methods in version 0.18.7 for "row-wise" operations.
The relevant ones for these examples are:
pl.all_horizontal()
pl.any_horizontal()
If we start with your sample frame:
df = pl.DataFrame(dict(A=["a","b","?"], B=["d","?","f"]))
boolean mask:
df.select(pl.all() == "?")
shape: (3, 2)
┌───────┬───────┐
│ A     ┆ B     │
│ ---   ┆ ---   │
│ bool  ┆ bool  │
╞═══════╪═══════╡
│ false ┆ false │
│ false ┆ true │
│ true ┆ false │
└───────┴───────┘
mask on columns:
df.select((pl.all() == "?").any())
shape: (1, 2)
┌──────┬──────┐
│ A    ┆ B    │
│ ---  ┆ ---  │
│ bool ┆ bool │
╞══════╪══════╡
│ true ┆ true │
└──────┴──────┘
horizontal mask / mask on rows:
df.select(pl.any_horizontal(pl.all() == "?"))
shape: (3, 1)
┌───────┐
│ any   │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true │
│ true │
└───────┘
.list also received any/all methods in 0.18.5, so it could also be written along the lines of your example:
df.select(pl.concat_list(pl.all() == "?").list.any())
single boolean for horizontal mask:
df.select(pl.any_horizontal(pl.all() == "?").any())
shape: (1, 1)
┌──────┐
│ any  │
│ ---  │
│ bool │
╞══════╡
│ true │
└──────┘
If you want to extract it as a single value into Python, you can use .item():
df.select(pl.any_horizontal(pl.all() == "?").any()).item()
# True
I think one thing that might be making this more confusing is that you're not doing everything in the select context. In other words, don't do this: (pl_df == "?")
The first thing we can do is:
pl_df.select(pl.all()=="?")
shape: (3, 2)
┌───────┬───────┐
│ A     ┆ B     │
│ ---   ┆ ---   │
│ bool  ┆ bool  │
╞═══════╪═══════╡
│ false ┆ false │
│ false ┆ true │
│ true ┆ false │
└───────┴───────┘
When we call pl.all(), it means all of the columns. For each column, we're converting its original values into bools indicating whether or not they're equal to "?".
Now let's do this:
pl_df.select((pl.all()=="?").any())
shape: (1, 2)
┌──────┬──────┐
│ A    ┆ B    │
│ ---  ┆ ---  │
│ bool ┆ bool │
╞══════╪══════╡
│ true ┆ true │
└──────┴──────┘
This gives you the per-column result. All we did was add .any(), which returns true if any value in the parenthesized expression that precedes it is true.
Now let's do:
pl_df.select(pl.any_horizontal(pl.all()=="?"))
shape: (3, 1)
┌───────┐
│ any   │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true │
│ true │
└───────┘
When we call pl.any_horizontal(...), it performs that check row-wise across whatever ... is.
Lastly, if we put them together...
pl_df.select(pl.any_horizontal(pl.all()=="?").any())
shape: (1, 1)
┌──────┐
│ any  │
│ ---  │
│ bool │
╞══════╡
│ true │
└──────┘
then we get the single value indicating that somewhere in the dataframe there is an item equal to "?".