9.0 Weighted-mean watchers
According to one YouTube talk, the list namespace is one of Polars' main selling points.
If you're also a fan of it, this section will teach you how to extend it even further.
Motivation
Say you have:
In [10]: df = pl.DataFrame({
    ...:     'values': [[1, 3, 2], [5, 7]],
    ...:     'weights': [[.5, .3, .2], [.1, .9]]
    ...: })
In [11]: df
Out[11]:
shape: (2, 2)
┌───────────┬─────────────────┐
│ values    ┆ weights         │
│ ---       ┆ ---             │
│ list[i64] ┆ list[f64]       │
╞═══════════╪═════════════════╡
│ [1, 3, 2] ┆ [0.5, 0.3, 0.2] │
│ [5, 7]    ┆ [0.1, 0.9]      │
└───────────┴─────────────────┘
Can you calculate the mean of the values in 'values', weighted by the values in 'weights'?
So, since the weights in each row sum to 1, the weighted mean is just the weighted sum:
.5*1 + .3*3 + .2*2 = 1.8
5*.1 + 7*.9 = 6.8
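If you'd like to double-check that arithmetic without any plugin, here's a minimal pure-Python sketch. The `weighted_mean` helper below is made up for illustration (it is not part of Polars or of the plugin). It divides by the sum of the weights, so it also works when the weights don't sum to 1.

```python
from typing import Optional, Sequence


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> Optional[float]:
    """Weighted mean of one row: sum(v * w) / sum(w), or None for an empty row."""
    if len(values) == 0:
        return None
    numerator = sum(v * w for v, w in zip(values, weights))
    denominator = sum(weights)
    return numerator / denominator


# The two rows from the DataFrame above.
rows = [
    ([1, 3, 2], [0.5, 0.3, 0.2]),
    ([5, 7], [0.1, 0.9]),
]
results = [weighted_mean(v, w) for v, w in rows]
print(results)  # approximately [1.8, 6.8], up to floating-point rounding
```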
I don't know of an easy way to do this with Polars expressions. There probably is a way - but as you'll see here, it's not that hard to write a plugin, and it's probably faster too.
Weighted mean
On the Python side, this'll be similar to sum_i64:
def weighted_mean(expr: IntoExprColumn, weights: IntoExprColumn) -> pl.Expr:
    return register_plugin_function(
        args=[expr, weights],
        plugin_path=LIB,
        function_name="weighted_mean",
        is_elementwise=True,
    )
On the Rust side, we'll define a helper function which will let us work with pairs of list chunked arrays:
fn binary_amortized_elementwise<'a, T, K, F>(
    lhs: &'a ListChunked,
    rhs: &'a ListChunked,
    mut f: F,
) -> ChunkedArray<T>
where
    T: PolarsDataType,
    T::Array: ArrayFromIter<Option<K>>,
    F: FnMut(&AmortSeries, &AmortSeries) -> Option<K> + Copy,
{
    {
        let (lhs, rhs) = align_chunks_binary(lhs, rhs);
        lhs.amortized_iter()
            .zip(rhs.amortized_iter())
            .map(|(lhs, rhs)| match (lhs, rhs) {
                (Some(lhs), Some(rhs)) => f(&lhs, &rhs),
                _ => None,
            })
            .collect_ca(PlSmallStr::EMPTY)
    }
}
That's a bit of a mouthful, so let's try to make sense of it.
- As we learned about in Prerequisites, Polars Series are backed by chunked arrays. align_chunks_binary just ensures that the chunks have the same lengths. It may need to rechunk under the hood for us;
- amortized_iter returns an iterator of AmortSeries, each of which corresponds to a row from our input.
We'll explain more about AmortSeries in a future iteration of this tutorial.
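If the Rust generics get in the way, it may help to look at the control flow on its own. Here's a rough pure-Python model of what the helper does: pair up the rows, call the function only when both rows are present, and emit a null otherwise. The name `binary_elementwise_model` is made up for this sketch, and plain Python lists stand in for ListChunked and AmortSeries. Nothing here models chunk alignment.

```python
from typing import Callable, Iterable, List, Optional, Sequence, TypeVar

K = TypeVar("K")
Row = Optional[Sequence[float]]


def binary_elementwise_model(
    lhs: Iterable[Row],
    rhs: Iterable[Row],
    f: Callable[[Sequence[float], Sequence[float]], Optional[K]],
) -> List[Optional[K]]:
    """Pair up rows; call f only when both rows are present, else emit None."""
    out: List[Optional[K]] = []
    for left, right in zip(lhs, rhs):
        if left is not None and right is not None:
            out.append(f(left, right))
        else:
            # Mirrors the `_ => None` arm of the Rust match.
            out.append(None)
    return out


# Hypothetical usage: a dot product per row, with a null row on the right.
dots = binary_elementwise_model(
    [[1, 2], [3, 4], [5]],
    [[10, 20], None, [2]],
    lambda a, b: sum(x * y for x, y in zip(a, b)),
)
print(dots)  # [50, None, 10]
```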
For now, let's just look at how to use this utility:
- we pass it ListChunked as inputs;
- we also pass a function which takes two AmortSeries and produces a scalar value.
#[polars_expr(output_type=Float64)]
fn weighted_mean(inputs: &[Series]) -> PolarsResult<Series> {
    let values = inputs[0].list()?;
    let weights = &inputs[1].list()?;
    // Fail early with a readable error if the dtypes aren't what we expect.
    polars_ensure!(
        values.dtype() == &DataType::List(Box::new(DataType::Int64)),
        ComputeError: "Expected `values` to be of type `List(Int64)`, got: {}", values.dtype()
    );
    polars_ensure!(
        weights.dtype() == &DataType::List(Box::new(DataType::Float64)),
        ComputeError: "Expected `weights` to be of type `List(Float64)`, got: {}", weights.dtype()
    );

    let out: Float64Chunked = binary_amortized_elementwise(
        values,
        weights,
        |values_inner: &AmortSeries, weights_inner: &AmortSeries| -> Option<f64> {
            // Safe to unwrap: the inner dtypes were checked above.
            let values_inner = values_inner.as_ref().i64().unwrap();
            let weights_inner = weights_inner.as_ref().f64().unwrap();
            if values_inner.len() == 0 {
                // Mirror Polars, and return None for empty mean.
                return None;
            }
            let mut numerator: f64 = 0.;
            let mut denominator: f64 = 0.;
            // Accumulate sum(v * w) and sum(w), skipping pairs with a null on either side.
            values_inner
                .iter()
                .zip(weights_inner.iter())
                .for_each(|(v, w)| {
                    if let (Some(v), Some(w)) = (v, w) {
                        numerator += v as f64 * w;
                        denominator += w;
                    }
                });
            Some(numerator / denominator)
        },
    );
    Ok(out.into_series())
}
If you just need to get a problem solved, this function works! But let's note its limitations:
- it assumes that each inner element of values and weights has the same length; it would be better to raise an error if this assumption is not met
- it only accepts Int64 values and Float64 weights (see section 2 for how you could make it more generic).
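To make the first limitation concrete: the Rust closure zips the two inner iterators, and zip simply stops at the shorter one, so a mismatched row would be silently truncated. Here's a pure-Python sketch of the stricter per-row behaviour one might want instead. The name `weighted_mean_strict` is made up for illustration, and how best to surface such an error from the Rust closure is left open here.

```python
from typing import Optional, Sequence


def weighted_mean_strict(
    values: Sequence[Optional[float]],
    weights: Sequence[Optional[float]],
) -> Optional[float]:
    """Like the plugin's closure, but refuses rows of unequal length."""
    if len(values) != len(weights):
        raise ValueError(
            "`values` and `weights` must have the same length per row, "
            f"got {len(values)} and {len(weights)}"
        )
    if len(values) == 0:
        # Same convention as the plugin: an empty row has no mean.
        return None
    numerator = 0.0
    denominator = 0.0
    for v, w in zip(values, weights):
        # Skip pairs where either side is null, as the Rust closure does.
        if v is not None and w is not None:
            numerator += v * w
            denominator += w
    # Note: Python raises ZeroDivisionError if the weights sum to zero,
    # whereas Rust's f64 division would yield NaN or infinity.
    return numerator / denominator
```

Calling `weighted_mean_strict([1, 2], [0.5])` raises a ValueError rather than quietly using only the first pair.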
To try it out, we compile with maturin develop (or maturin develop --release if you're benchmarking), and then we should be able to run run.py:
import polars as pl
import minimal_plugin as mp

df = pl.DataFrame({
    'values': [[1, 3, 2], [5, 7]],
    'weights': [[.5, .3, .2], [.1, .9]]
})
print(df.with_columns(weighted_mean=mp.weighted_mean('values', 'weights')))
shape: (2, 3)
┌───────────┬─────────────────┬───────────────┐
│ values    ┆ weights         ┆ weighted_mean │
│ ---       ┆ ---             ┆ ---           │
│ list[i64] ┆ list[f64]       ┆ f64           │
╞═══════════╪═════════════════╪═══════════════╡
│ [1, 3, 2] ┆ [0.5, 0.3, 0.2] ┆ 1.8           │
│ [5, 7]    ┆ [0.1, 0.9]      ┆ 6.8           │
└───────────┴─────────────────┴───────────────┘
Gimme chocolate challenge
Could you implement a weighted standard deviation calculator?
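If you take this on, you'll want something to check your plugin's output against. Here's a small pure-Python reference, assuming the population-style definition sqrt(sum(w * (v - mean)^2) / sum(w)). That choice is an assumption on my part: other definitions (for example with a bias correction) exist, so pick whichever one you actually need. The name `weighted_std` is made up for this sketch.

```python
import math
from typing import Optional, Sequence


def weighted_std(values: Sequence[float], weights: Sequence[float]) -> Optional[float]:
    """Population-style weighted std: sqrt(sum(w * (v - mean)^2) / sum(w))."""
    if len(values) == 0:
        return None
    total_weight = sum(weights)
    # Weighted mean, as in the plugin above.
    mean = sum(v * w for v, w in zip(values, weights)) / total_weight
    # Weighted average of squared deviations from that mean.
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total_weight
    return math.sqrt(variance)


# The second row of the example DataFrame: mean 6.8, variance 0.36.
print(round(weighted_std([5, 7], [0.1, 0.9]), 6))  # 0.6
```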