4. Yes we SCAN
The operations we've seen so far have all been elementwise, e.g.:
- for each row, we calculated the absolute value
- for each row, we summed the respective values in two columns
Let's do something (completely) different - instead of working with each row in isolation, we'll calculate a quantity which depends on the rows which precede it.
We're going to implement cum_sum.
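To make the distinction concrete, here is a minimal plain-Rust sketch (standard library only, no Polars; the function names `abs_elementwise` and `running_sum` are made up for illustration). The first function can compute each output from its own input alone, whereas the second has to carry an accumulator from one row to the next:

```rust
/// Elementwise: each output depends only on the input at the same position.
fn abs_elementwise(values: &[i64]) -> Vec<i64> {
    values.iter().map(|x| x.abs()).collect()
}

/// Not elementwise: each output depends on every input that came before it.
fn running_sum(values: &[i64]) -> Vec<i64> {
    let mut total = 0_i64;
    let mut out = Vec::with_capacity(values.len());
    for &x in values {
        total += x; // carry state from one row to the next
        out.push(total);
    }
    out
}
```

For the input `[1, 5, 2]`, `running_sum` returns `[1, 6, 8]`. Splitting the rows into separate batches would change that answer, which is not true of `abs_elementwise`.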
Python side
Add this to minimal_plugin/__init__.py:
def cum_sum(expr: IntoExprColumn) -> pl.Expr:
    return register_plugin_function(
        args=[expr],
        plugin_path=LIB,
        function_name="cum_sum",
        is_elementwise=False,
    )
Note that we set is_elementwise=False.
You'll see why this is so important at the end of this page.
Rust
Time to learn a new Rust function: scan.
If you're not familiar with it, please take a little break from this tutorial
and read the scan docs.
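As a warm-up, here is a small sketch of `Iterator::scan` from the Rust standard library on plain integers (the helper names are hypothetical). `scan` threads a mutable state through the iteration; the closure returns `Some(item)` to emit an item, or `None` to stop the iteration altogether:

```rust
/// Running maximum: the state is the largest value seen so far.
fn running_max(values: &[i64]) -> Vec<i64> {
    values
        .iter()
        .scan(i64::MIN, |state: &mut i64, &x| {
            *state = (*state).max(x);
            Some(*state) // `Some` means "emit this item and keep going"
        })
        .collect()
}

/// Running sum which stops as soon as the total would exceed `limit`.
fn running_sum_until(values: &[i64], limit: i64) -> Vec<i64> {
    values
        .iter()
        .scan(0_i64, |state: &mut i64, &x| {
            if *state + x > limit {
                None // returning `None` ends the iteration early
            } else {
                *state += x;
                Some(*state)
            }
        })
        .collect()
}
```

So the `Option` returned by a `scan` closure is a keep-going-or-stop signal, separate from whatever value is being emitted.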
Welcome back! Let's use our newfound scan-superpowers to implement cum_sum. Here's what goes into src/expressions.rs:
#[polars_expr(output_type_func=same_output_type)]
fn cum_sum(inputs: &[Series]) -> PolarsResult<Series> {
    let s = &inputs[0];
    let ca: &Int64Chunked = s.i64()?;
    let out: Int64Chunked = ca
        .iter()
        .scan(0_i64, |state: &mut i64, x: Option<i64>| {
            match x {
                Some(x) => {
                    *state += x;
                    Some(Some(*state))
                },
                None => Some(None),
            }
        })
        .collect_trusted();
    Ok(out.into_series())
}
The cum_sum definition may look complex, but it's not too bad once we
break it down:
- we hold the running sum in state
- we iterate over rows, initialising state to be 0
- if the current row is Some, then add the current row's value to state and emit the current value of state
- if the current row is None, then don't modify state and emit None
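The same logic can be tried out without Polars. The sketch below is an illustration using only the standard library, with a slice of `Option<i64>` standing in for a nullable column and a hypothetical function name; it is not the plugin code itself. The outer `Some` tells `scan` to continue, while the inner `Option` is the (possibly missing) value for that row:

```rust
/// Cumulative sum over a nullable column, modelled as a slice of `Option<i64>`.
fn cum_sum_nullable(values: &[Option<i64>]) -> Vec<Option<i64>> {
    values
        .iter()
        .scan(0_i64, |state: &mut i64, x: &Option<i64>| match x {
            Some(x) => {
                *state += *x;
                Some(Some(*state)) // emit the updated running sum
            }
            // Missing row: leave the running sum untouched and emit a null.
            None => Some(None),
        })
        .collect()
}
```

Given `[1, 2, 3, 4, null, 5]`, this returns `[1, 3, 6, 10, null, 15]`: the null row is passed through, and the running sum picks up again afterwards.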
Note how we use collect_trusted at the end, rather than collect.
collect would work as well, but if we know the length of the output
(and we do in this case, cum_sum doesn't change the column's length)
then we can safely use collect_trusted and save some precious time.
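One way to get a feel for why the length matters: in the standard library, an iterator produced by `scan` cannot promise how many items it will yield, because the closure is allowed to stop early. The sketch below (standard library only, with a hypothetical helper name; it says nothing about how collect_trusted is implemented internally) shows that the lower bound of `size_hint` drops to 0 after `scan`, even though we, as the authors, know that our closure never stops early:

```rust
/// Returns the `size_hint` of a slice iterator before and after `scan`.
fn hints(values: &[i64]) -> ((usize, Option<usize>), (usize, Option<usize>)) {
    let before = values.iter().size_hint();
    let after = values
        .iter()
        .scan(0_i64, |state: &mut i64, &x| {
            *state += x;
            Some(*state) // never returns `None`, so the length is unchanged
        })
        .size_hint();
    (before, after)
}
```

For a three-element slice this gives `(3, Some(3))` before the `scan` and `(0, Some(3))` after it.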
Let's compile with maturin develop (or maturin develop --release
if you're benchmarking), change the last line of run.py to
print(df.with_columns(a_cum_sum=mp.cum_sum('a')))
and run python run.py:
shape: (3, 3)
┌─────┬──────┬───────────┐
│ a   ┆ b    ┆ a_cum_sum │
│ --- ┆ ---  ┆ ---       │
│ i64 ┆ i64  ┆ i64       │
╞═════╪══════╪═══════════╡
│ 1   ┆ 3    ┆ 1         │
│ 5   ┆ null ┆ 6         │
│ 2   ┆ -1   ┆ 8         │
└─────┴──────┴───────────┘
Elementwise, my dear Watson
Why was it so important to set is_elementwise correctly? Let's see
with an example.
Put the following in run.py:
import polars as pl
import minimal_plugin as mp
df = pl.DataFrame({
    'a': [1, 2, 3, 4, None, 5],
    'b': [1, 1, 1, 2, 2, 2],
})
print(df.with_columns(a_cum_sum=mp.cum_sum('a')))
Then, run python run.py.
Finally, go to minimal_plugin/__init__.py and change is_elementwise
from False to True, and run python run.py again.
In both cases, you should see the following output:
shape: (6, 3)
┌──────┬─────┬───────────┐
│ a    ┆ b   ┆ a_cum_sum │
│ ---  ┆ --- ┆ ---       │
│ i64  ┆ i64 ┆ i64       │
╞══════╪═════╪═══════════╡
│ 1    ┆ 1   ┆ 1         │
│ 2    ┆ 1   ┆ 3         │
│ 3    ┆ 1   ┆ 6         │
│ 4    ┆ 2   ┆ 10        │
│ null ┆ 2   ┆ null      │
│ 5    ┆ 2   ┆ 15        │
└──────┴─────┴───────────┘
So what's the deal with is_elementwise?
The deal is that we need it in order for window functions / group_bys
to be correct. Change the last line of run.py to
Now, we get:
- with is_elementwise=True:
- with is_elementwise=False:
Only is_elementwise=False actually respected the window! This is why
it's important to set is_elementwise correctly.
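To see what respecting the window means, here is a plain-Rust simulation (standard library only; the function name is hypothetical, and this is not how Polars evaluates windows). It keeps one running sum per key in b, using the a and b columns from run.py above, instead of a single running sum for the whole column:

```rust
use std::collections::HashMap;

/// Cumulative sum of `a`, restarted for every distinct key in `b`.
fn cum_sum_over(a: &[Option<i64>], b: &[i64]) -> Vec<Option<i64>> {
    assert_eq!(a.len(), b.len(), "columns must have the same length");
    // One running total per group key.
    let mut totals: HashMap<i64, i64> = HashMap::new();
    a.iter()
        .zip(b)
        .map(|(x, key)| {
            // Null rows stay null and leave their group's total untouched.
            x.map(|x| {
                let total = totals.entry(*key).or_insert(0);
                *total += x;
                *total
            })
        })
        .collect()
}
```

With a = `[1, 2, 3, 4, null, 5]` and b = `[1, 1, 1, 2, 2, 2]`, this gives `[1, 3, 6, 4, null, 9]`: the running sum restarts when b changes from 1 to 2. A single running sum over the whole column would instead continue with 10, null, 15.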