9.1 Lists in, lists out, lists all about¶

Chapter 9.0 (Weighted-mean watchers) was fun. Let's do it all over again!

Or rather, let's do another list operation. We're going to start with a dataframe such as:

shape: (4, 1)
┌──────────────┐
│ dense        │
│ ---          │
│ list[i64]    │
╞══════════════╡
│ [0, 9]       │
│ [8, 6, 0, 9] │
│ null         │
│ [3, 3]       │
└──────────────┘

Before we start, however, let's take a look into how Polars stores lists in memory. As we saw, lists are backed up by chunks. Inside each chunk, Polars stores all the lists ("rows") as one single list, while keeping track of where each row starts, and how many elements they have. This is consistent with Apache Arrow's columnar format. It looks something like this:

Diagram showing how Polars stores lists under the hood

Back to where we were - we're going to try to count the indices which are non-zero. →

Note

You don't really need a plugin to do this, you can just do

df.with_columns(sparse_indices=pl.col('dense').list.eval(pl.arg_where(pl.element() != 0)))

But eval won't cover every need you ever have ever, so...it's good to learn how to do this as a plugin so you can then customize it according to your needs.

Polars has a helper function built-in for dealing with this: apply_amortized. We can use it to apply a function to each element of a List Series. In this case, we just want to find the indices of non-zero elements, so we'll do:

fn list_idx_dtype(input_fields: &[Field]) -> PolarsResult<Field> {
    let field = Field::new(input_fields[0].name.clone(), DataType::List(Box::new(IDX_DTYPE)));
    Ok(field.clone())
}

#[polars_expr(output_type_func=list_idx_dtype)]
fn non_zero_indices(inputs: &[Series]) -> PolarsResult<Series> {
    let ca = inputs[0].list()?;
    polars_ensure!(
        ca.dtype() == &DataType::List(Box::new(DataType::Int64)),
        ComputeError: "Expected `List(Int64)`, got: {}", ca.dtype()
    );

    let out: ListChunked = ca.apply_amortized(|s| {
        let s: &Series = s.as_ref();
        let ca: &Int64Chunked = s.i64().unwrap();
        let out: IdxCa = ca
            .iter()
            .enumerate()
            .filter(|(_idx, opt_val)| opt_val != &Some(0))
            .map(|(idx, _opt_val)| Some(idx as IdxSize))
            .collect_ca(PlSmallStr::EMPTY);
        out.into_series()
    });
    Ok(out.into_series())
}

apply_amortized is a bit like the apply_into_string_amortized function we used in How to STRING something together, in that it makes a big allocation upfront to amortize the allocation costs. Think of it as a list version of apply_values, where each element is itself a Series.

Something new in this example is:

IdxSize
IdxCa
IDX_DTYPE

IdxSize is either u32 or u64, depending on your platform, and are what Polars generally uses for counting-related operations. IdxCa is the associated ChunkedArray, and IDX_DTYPE the associated Polars dtype.

To finish this off, the Python side will be a bog-standard:

def non_zero_indices(expr: IntoExprColumn) -> pl.Expr:
    return register_plugin_function(
        args=[expr], plugin_path=LIB, function_name="non_zero_indices", is_elementwise=True
    )

If we then make run.py with

import polars as pl
import minimal_plugin as mp

pl.Config().set_fmt_table_cell_list_len(10)

df = pl.DataFrame({'dense': [[0, 9], [8, 6, 0, 9], None, [3, 3]]})
print(df)
print(df.with_columns(indices=mp.non_zero_indices('dense')))

and compile with maturin develop (or maturin develop --release if you're benchmarking!) then we'll see

shape: (4, 2)
┌──────────────┬───────────┐
│ dense        ┆ indices   │
│ ---          ┆ ---       │
│ list[i64]    ┆ list[u32] │
╞══════════════╪═══════════╡
│ [0, 9]       ┆ [1]       │
│ [8, 6, 0, 9] ┆ [0, 1, 3] │
│ null         ┆ null      │
│ [3, 3]       ┆ [0, 1]    │
└──────────────┴───────────┘

Yay, it worked! And not only that, but it's about 1.5x as fast as the list.eval solution noted above!