4. Yes we SCAN
The operations we've seen so far have all been elementwise, e.g.:
- for each row, we calculated the absolute value
- for each row, we summed the respective values in two columns
Let's do something (completely) different - instead of working with each row in isolation, we'll calculate a quantity which depends on the rows which precede it.
We're going to implement cum_sum.
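To make the difference concrete, here is a small plain-Python sketch (no Polars involved; the function names abs_elementwise and cum_sum_simple are made up for illustration). It shows that an elementwise operation gives the same answer whether you process the column in one go or in pieces, whereas a cumulative sum does not. Nulls are left out here to keep it short.

```python
from itertools import accumulate


def abs_elementwise(values):
    # Each output depends only on the input at the same position.
    return [abs(v) for v in values]


def cum_sum_simple(values):
    # Each output depends on every input up to and including that position.
    return list(accumulate(values))


data = [3, -1, 4, -1, 5]
first, second = data[:2], data[2:]

# Elementwise: processing the two pieces separately changes nothing.
assert abs_elementwise(first) + abs_elementwise(second) == abs_elementwise(data)

# Not elementwise: the second piece has lost the running total of the first.
print(cum_sum_simple(data))                            # [3, 2, 6, 5, 10]
print(cum_sum_simple(first) + cum_sum_simple(second))  # [3, 2, 4, 3, 8]
```

Keep this property in mind, it's the reason the is_elementwise flag exists.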
Python side
Add this to minimal_plugin/__init__.py:
def cum_sum(expr: IntoExprColumn) -> pl.Expr:
    return register_plugin_function(
        args=[expr],
        plugin_path=LIB,
        function_name="cum_sum",
        is_elementwise=False,
    )
Note that we set is_elementwise=False.
You'll see why this is so important at the end of this page.
Rust
Time to learn a new Rust function: scan.
If you're not familiar with it, please take a little break from this tutorial
and read the scan docs.
Welcome back! Let's use our newfound scan-superpowers to implement cum_sum. Here's what goes into src/expressions.rs:
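#[polars_expr(output_type_func=same_output_type)]
fn cum_sum(inputs: &[Series]) -> PolarsResult<Series> {
    let s = &inputs[0];
    let ca: &Int64Chunked = s.i64()?;
    let out: Int64Chunked = ca
        .iter()
        // `state` holds the running sum and starts at 0.
        .scan(0_i64, |state: &mut i64, x: Option<i64>| {
            // The outer `Some` tells `scan` to keep iterating;
            // the inner `Option` is the value emitted for this row.
            match x {
                Some(x) => {
                    *state += x;
                    Some(Some(*state))
                },
                // Null row: leave `state` untouched and emit a null.
                None => Some(None),
            }
        })
        .collect_trusted();
    Ok(out.into_series())
}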
The cum_sum definition may look complex, but it's not too bad once we break it down:
- we hold the running sum in state
- we iterate over rows, initialising state to be 0
- if the current row is Some, then add the current row's value to state and emit the current value of state
- if the current row is None, then don't modify state and emit None
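If Rust's scan is still sinking in, the steps above can be sketched in plain Python. This is only a model of the logic, not what the plugin runs: cum_sum_model is a made-up name, and Python's None stands in for a null row.

```python
from typing import Iterable, Iterator, Optional


def cum_sum_model(values: Iterable[Optional[int]]) -> Iterator[Optional[int]]:
    # The running sum, initialised to 0 (the first argument to `scan` in Rust).
    state = 0
    for x in values:
        if x is not None:
            # Non-null row: update the running sum and emit it.
            state += x
            yield state
        else:
            # Null row: leave the running sum alone and emit a null.
            yield None


print(list(cum_sum_model([1, 5, 2])))              # [1, 6, 8]
print(list(cum_sum_model([1, 2, 3, 4, None, 5])))  # [1, 3, 6, 10, None, 15]
```

Note that a null row does not reset the running sum: the row after it picks up where the last non-null row left off.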
Note how we use collect_trusted at the end, rather than collect.
collect would work as well, but if we know the length of the output
(and we do in this case, cum_sum doesn't change the column's length)
then we can safely use collect_trusted and save some precious time.
Let's compile with maturin develop (or maturin develop --release if you're benchmarking), change the last line of run.py to print(df.with_columns(a_cum_sum=mp.cum_sum('a'))), and then run python run.py:
shape: (3, 3)
┌─────┬──────┬───────────┐
│ a   ┆ b    ┆ a_cum_sum │
│ --- ┆ ---  ┆ ---       │
│ i64 ┆ i64  ┆ i64       │
╞═════╪══════╪═══════════╡
│ 1   ┆ 3    ┆ 1         │
│ 5   ┆ null ┆ 6         │
│ 2   ┆ -1   ┆ 8         │
└─────┴──────┴───────────┘
Elementwise, my dear Watson
Why was it so important to set is_elementwise correctly? Let's see with an example.
Put the following in run.py:
import polars as pl
import minimal_plugin as mp
df = pl.DataFrame({
    'a': [1, 2, 3, 4, None, 5],
    'b': [1, 1, 1, 2, 2, 2],
})
print(df.with_columns(a_cum_sum=mp.cum_sum('a')))
Then, run python run.py.
Finally, go to minimal_plugin/__init__.py and change is_elementwise from False to True, and run python run.py again.
In both cases, you should see the following output:
shape: (6, 3)
┌──────┬─────┬───────────┐
│ a    ┆ b   ┆ a_cum_sum │
│ ---  ┆ --- ┆ ---       │
│ i64  ┆ i64 ┆ i64       │
╞══════╪═════╪═══════════╡
│ 1    ┆ 1   ┆ 1         │
│ 2    ┆ 1   ┆ 3         │
│ 3    ┆ 1   ┆ 6         │
│ 4    ┆ 2   ┆ 10        │
│ null ┆ 2   ┆ null      │
│ 5    ┆ 2   ┆ 15        │
└──────┴─────┴───────────┘
So, what's the deal with is_elementwise?
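The deal is that we need it in order for window functions / group_bys to be correct. Change the last line of run.py so that the cumulative sum is computed as a window function over column b, e.g. print(df.with_columns(a_cum_sum=mp.cum_sum('a').over('b'))).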
Now, we get:
- with is_elementwise=True:
- with is_elementwise=False:
Only is_elementwise=False actually respected the window! This is why it's important to set is_elementwise correctly.
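To see where the difference comes from, here is a simplified plain-Python model of the two behaviours. It is a sketch under assumptions, not Polars' actual implementation: it assumes the window partitions by column b, and the helper names (cum_sum_rows, over_as_elementwise, over_per_group) are made up for illustration. The idea is that a function declared elementwise may be run once over the whole column, since grouping shouldn't matter for it, whereas a non-elementwise function has to be run separately on each group.

```python
from collections import defaultdict


def cum_sum_rows(values):
    # Same logic as the plugin: running sum, nulls (None) pass through.
    state, out = 0, []
    for x in values:
        if x is None:
            out.append(None)
        else:
            state += x
            out.append(state)
    return out


def over_as_elementwise(func, values, keys):
    # Model of trusting an "elementwise" flag: run once on the whole
    # column and ignore the groups entirely.
    return func(values)


def over_per_group(func, values, keys):
    # Model of a real window: run func on each group's rows, then put
    # the results back in the original row positions.
    positions = defaultdict(list)
    for i, key in enumerate(keys):
        positions[key].append(i)
    out = [None] * len(values)
    for idx in positions.values():
        group_result = func([values[j] for j in idx])
        for i, r in zip(idx, group_result):
            out[i] = r
    return out


a = [1, 2, 3, 4, None, 5]
b = [1, 1, 1, 2, 2, 2]
print(over_as_elementwise(cum_sum_rows, a, b))  # [1, 3, 6, 10, None, 15]
print(over_per_group(cum_sum_rows, a, b))       # [1, 3, 6, 4, None, 9]
```

In the first result the running sum carries straight on from group 1 into group 2; in the second it restarts at the group boundary, which is what a window over b should do.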