9.0 Weighted-mean watchers
According to one YouTube talk,
the list namespace is one of Polars' main selling points.
If you're also a fan of it, this section will teach you how to extend it even further.
Motivation
Say you have
In [10]: df = pl.DataFrame({
    ...:     'values': [[1, 3, 2], [5, 7]],
    ...:     'weights': [[.5, .3, .2], [.1, .9]]
    ...: })
In [11]: df
Out[11]:
shape: (2, 2)
┌───────────┬─────────────────┐
│ values    ┆ weights         │
│ ---       ┆ ---             │
│ list[i64] ┆ list[f64]       │
╞═══════════╪═════════════════╡
│ [1, 3, 2] ┆ [0.5, 0.3, 0.2] │
│ [5, 7]    ┆ [0.1, 0.9]      │
└───────────┴─────────────────┘
Can you calculate the mean of the values in 'values', weighted by the values in 'weights'?
So:
.5*1 + .3*3 + .2*2 = 1.8
5*.1 + 7*.9 = 6.8
I don't know of an easy way to do this with Polars expressions. There probably is a way - but as you'll see here, it's not that hard to write a plugin, and it's probably faster too.
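As a sanity check on the target numbers, here is a minimal pure-Python sketch of the same calculation, independent of Polars. The helper name weighted_mean_py is hypothetical and only illustrates the arithmetic: the sum of value times weight, divided by the sum of the weights.

```python
import math

def weighted_mean_py(values, weights):
    """Hypothetical reference helper: sum(v * w) / sum(w) for one row."""
    numerator = sum(v * w for v, w in zip(values, weights))
    denominator = sum(weights)
    return numerator / denominator

rows = [
    ([1, 3, 2], [0.5, 0.3, 0.2]),
    ([5, 7], [0.1, 0.9]),
]
results = [weighted_mean_py(v, w) for v, w in rows]

# Compare with a tolerance, since floating-point sums are rarely exact.
assert math.isclose(results[0], 1.8)
assert math.isclose(results[1], 6.8)
```

The weights in this example already sum to 1, so dividing by their sum changes nothing here, but it keeps the result meaningful for weights that are not normalised.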
Weighted mean
On the Python side, this'll be similar to sum_i64:
def weighted_mean(expr: IntoExprColumn, weights: IntoExprColumn) -> pl.Expr:
    return register_plugin_function(
        args=[expr, weights],
        plugin_path=LIB,
        function_name="weighted_mean",
        is_elementwise=True,
    )
On the Rust side, we'll define a helper function which will let us work with pairs of list chunked arrays:
fn binary_amortized_elementwise<'a, T, K, F>(
    lhs: &'a ListChunked,
    rhs: &'a ListChunked,
    mut f: F,
) -> ChunkedArray<T>
where
    T: PolarsDataType,
    T::Array: ArrayFromIter<Option<K>>,
    F: FnMut(&AmortSeries, &AmortSeries) -> Option<K> + Copy,
{
    {
        let (lhs, rhs) = align_chunks_binary(lhs, rhs);
        lhs.amortized_iter()
            .zip(rhs.amortized_iter())
            .map(|(lhs, rhs)| match (lhs, rhs) {
                (Some(lhs), Some(rhs)) => f(&lhs, &rhs),
                _ => None,
            })
            .collect_ca(PlSmallStr::EMPTY)
    }
}
That's a bit of a mouthful, so let's try to make sense of it.
- As we learned about in Prerequisites, Polars Series are backed by chunked arrays. align_chunks_binary just ensures that the chunks have the same lengths. It may need to rechunk under the hood for us;
- amortized_iter returns an iterator of AmortSeries, each of which corresponds to a row from our input.
We'll explain more about AmortSeries in a future iteration of this tutorial.
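To make the shape of this helper concrete, here is a rough plain-Python analogy, not the Polars implementation. It walks two columns of optional rows in lockstep and only calls the function when both rows are present. The name binary_elementwise_py and the use of None for null rows are assumptions made for illustration; chunk alignment and the amortization that AmortSeries provides have no counterpart here.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

K = TypeVar("K")
Row = Optional[Sequence[float]]

def binary_elementwise_py(
    lhs: Sequence[Row],
    rhs: Sequence[Row],
    f: Callable[[Sequence[float], Sequence[float]], Optional[K]],
) -> List[Optional[K]]:
    """Hypothetical analogy: apply f row by row, propagating nulls as None."""
    out: List[Optional[K]] = []
    for left, right in zip(lhs, rhs):
        if left is not None and right is not None:
            out.append(f(left, right))
        else:
            # Mirrors the `_ => None` arm of the Rust match.
            out.append(None)
    return out

# A null row on either side yields a null output row.
lengths = binary_elementwise_py(
    [[1, 3, 2], None, [5, 7]],
    [[0.5, 0.3, 0.2], [1.0], [0.1, 0.9]],
    lambda a, b: len(a) + len(b),
)
print(lengths)  # [6, None, 4]
```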
For now, let's just look at how to use this utility:
- we pass it ListChunked as inputs;
- we also pass a function which takes two AmortSeries and produces a scalar value.
#[polars_expr(output_type=Float64)]
fn weighted_mean(inputs: &[Series]) -> PolarsResult<Series> {
    let values = inputs[0].list()?;
    let weights = &inputs[1].list()?;
    polars_ensure!(
        values.dtype() == &DataType::List(Box::new(DataType::Int64)),
        ComputeError: "Expected `values` to be of type `List(Int64)`, got: {}", values.dtype()
    );
    polars_ensure!(
        weights.dtype() == &DataType::List(Box::new(DataType::Float64)),
        ComputeError: "Expected `weights` to be of type `List(Float64)`, got: {}", weights.dtype()
    );
    let out: Float64Chunked = binary_amortized_elementwise(
        values,
        weights,
        |values_inner: &AmortSeries, weights_inner: &AmortSeries| -> Option<f64> {
            let values_inner = values_inner.as_ref().i64().unwrap();
            let weights_inner = weights_inner.as_ref().f64().unwrap();
            if values_inner.len() == 0 {
                // Mirror Polars, and return None for empty mean.
                return None
            }
            let mut numerator: f64 = 0.;
            let mut denominator: f64 = 0.;
            values_inner
                .iter()
                .zip(weights_inner.iter())
                .for_each(|(v, w)| {
                    if let (Some(v), Some(w)) = (v, w) {
                        numerator += v as f64 * w;
                        denominator += w;
                    }
                });
            Some(numerator / denominator)
        },
    );
    Ok(out.into_series())
}
If you just need to get a problem solved, this function works! But let's note its limitations:
- it assumes that each inner element of values and weights has the same length - it would be better to raise an error if this assumption is not met;
- it only accepts Int64 values and Float64 weights (see section 2 for how you could make it more generic).
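The first limitation is easy to see in a plain-Python mirror of the closure's logic. This is a sketch rather than the plugin itself: the names weighted_mean_row and weighted_mean_row_strict are hypothetical, and zip stands in for the zipped Rust iterators, which likewise stop at the shorter input.

```python
import math

def weighted_mean_row(values, weights):
    """Hypothetical mirror of the Rust closure for a single row."""
    if len(values) == 0:
        return None  # empty mean is null
    numerator = 0.0
    denominator = 0.0
    # zip stops at the shorter input, so extra elements are silently dropped.
    for v, w in zip(values, weights):
        if v is not None and w is not None:
            numerator += v * w
            denominator += w
    # Note: Rust float division by zero yields NaN or infinity,
    # whereas Python raises ZeroDivisionError here.
    return numerator / denominator

def weighted_mean_row_strict(values, weights):
    """Same calculation, but rejecting rows whose lengths differ."""
    if len(values) != len(weights):
        raise ValueError(
            f"values has {len(values)} elements but weights has {len(weights)}"
        )
    return weighted_mean_row(values, weights)

# The third value is ignored without any warning: (1*0.5 + 3*0.5) / 1.0
silent = weighted_mean_row([1, 3, 100], [0.5, 0.5])
assert math.isclose(silent, 2.0)

try:
    weighted_mean_row_strict([1, 3, 100], [0.5, 0.5])
except ValueError as exc:
    print(f"rejected: {exc}")
```

In the Rust version, the analogous check would compare values_inner.len() with weights_inner.len(). Since the closure passed to the helper returns an Option rather than a result, surfacing an error would likely need a fallible variant of the helper; that design is left open here.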
To try it out, we compile with maturin develop (or maturin develop --release if you're benchmarking), and then we should be able to run run.py:
import polars as pl
import minimal_plugin as mp
df = pl.DataFrame({
    'values': [[1, 3, 2], [5, 7]],
    'weights': [[.5, .3, .2], [.1, .9]]
})
print(df.with_columns(weighted_mean = mp.weighted_mean('values', 'weights')))
shape: (2, 3)
┌───────────┬─────────────────┬───────────────┐
│ values    ┆ weights         ┆ weighted_mean │
│ ---       ┆ ---             ┆ ---           │
│ list[i64] ┆ list[f64]       ┆ f64           │
╞═══════════╪═════════════════╪═══════════════╡
│ [1, 3, 2] ┆ [0.5, 0.3, 0.2] ┆ 1.8           │
│ [5, 7]    ┆ [0.1, 0.9]      ┆ 6.8           │
└───────────┴─────────────────┴───────────────┘
Gimme chocolate challenge
Could you implement a weighted standard deviation calculator?