
Calculate what quantile a value is in a Polars column AKA Polars CDF

I would like to calculate which quantile each value in a Polars column falls in. Polars has a quantile function that returns the value corresponding to an input quantile (the inverse CDF), but it does not seem to have any kind of empirical CDF function.

Does Polars possess this functionality currently?

GBPU asked Oct 16 '25 19:10



1 Answer

Original answer; scroll to the end for a more concise solution

You can derive an ECDF by sorting by the value in question and then dividing the cumulative count by the total count (cum_count/count).

For example, let's compare that to plotly's ecdf

import polars as pl
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

df=pl.DataFrame({'a':np.random.normal(10,5,1000)})
df_ecdf = df.sort('a').with_columns(
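         # running row number 1..n over the total row count; this avoids cum_count,
         # whose starting value (0 or 1) depends on the Polars version
         ecdf=pl.int_range(1, pl.len() + 1) / pl.len()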
)
myecdf=px.line(df_ecdf,
        x='a', y='ecdf')
myecdf.update_traces(line_color='red')
pxecdf=px.ecdf(df,
        x='a')
fig=go.Figure()
fig.add_trace(list(myecdf.select_traces())[0])
fig.add_trace(list(pxecdf.select_traces())[0])
fig.show()


Plotly's ECDF seems to have more of a stairstep to it, which I couldn't explain at first. It is easier to see when zooming in on an arbitrary part of the plot.


That said, the difference may lie only in how px.line and px.ecdf draw the line rather than in the underlying values.

If we extract the data from pxecdf then we can compare numerically.

compare=pl.DataFrame({'plotly_ecdf': pxecdf._data[0]['y'],
                      'plotly_x':pxecdf._data[0]['x']})

compare=df_ecdf.join(compare, left_on='a', right_on='plotly_x')
compare.select(diff=(pl.col('ecdf')-pl.col('plotly_ecdf')).abs().sum())
### returns 0

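Since the two are numerically the same, the visible stairsteps must come from how the curve is drawn rather than from the data: px.ecdf renders it as steps, while px.line joins consecutive points with straight segments by default. Passing line_shape='hv' to px.line should make the two plots look alike.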

Response to comment and more concise way of doing it

Here's a way you can generate the ecdf for any number of columns in parallel via expressions only.

df=pl.DataFrame({'a':np.random.normal(10,5,1000),
                 'b':np.random.normal(10,5,1000),
                 'c':np.random.normal(10,5,1000)})
(
    df
    .with_columns(**{
        # pl.len() replaces pl.count(), which is deprecated in recent Polars versions
        f"{x}_ecdf":pl.int_range(1,pl.len()+1).sort_by(pl.arg_sort_by(x))/pl.len()
        for x in df.columns # change df.columns to list of columns for subset only
        })
)
┌───────────┬───────────┬───────────┬────────┬────────┬────────┐
│ a         ┆ b         ┆ c         ┆ a_ecdf ┆ b_ecdf ┆ c_ecdf │
│ ---       ┆ ---       ┆ ---       ┆ ---    ┆ ---    ┆ ---    │
│ f64       ┆ f64       ┆ f64       ┆ f64    ┆ f64    ┆ f64    │
╞═══════════╪═══════════╪═══════════╪════════╪════════╪════════╡
│ 3.115462  ┆ 15.602951 ┆ 1.041053  ┆ 0.085  ┆ 0.873  ┆ 0.033  │
│ 4.481795  ┆ 1.868424  ┆ 9.563978  ┆ 0.121  ┆ 0.044  ┆ 0.477  │
│ 12.686753 ┆ 11.747184 ┆ 9.464207  ┆ 0.673  ┆ 0.644  ┆ 0.462  │
│ 11.416048 ┆ 13.163161 ┆ -0.304657 ┆ 0.598  ┆ 0.739  ┆ 0.02   │
│ 18.453647 ┆ 11.83464  ┆ 8.279558  ┆ 0.956  ┆ 0.649  ┆ 0.359  │
└───────────┴───────────┴───────────┴────────┴────────┴────────┘

In theory, you should do that in lazy mode (or pre-generate the int_range in another context); otherwise it will generate the int_range series for every column rather than making it once and reusing it. In practice, it probably doesn't matter since it's such a trivial operation.

Dean MacGregor answered Oct 18 '25 23:10
