9.1 Lists in, lists out, lists all about¶
Chapter 9.0 (Weighted-mean watchers) was fun. Let's do it all over again!
Or rather, let's do another list operation. We're going to start with a dataframe such as:
shape: (4, 1)
┌──────────────┐
│ dense │
│ --- │
│ list[i64] │
╞══════════════╡
│ [0, 9] │
│ [8, 6, 0, 9] │
│ null │
│ [3, 3] │
└──────────────┘
Before we start, however, let's take a look into how Polars stores lists in memory. As we saw, lists are backed up by chunks. Inside each chunk, Polars stores all the lists ("rows") as one single list, while keeping track of where each row starts, and how many elements they have. This is consistent with Apache Arrow's columnar format. It looks something like this:
Back to where we were - we're going to try to count the indices which are non-zero. →
Note
You don't really need a plugin to do this, you can just do
But eval
won't cover every need you ever have ever, so...it's good
to learn how to do this as a plugin so you can then customize it according to your needs.
Polars has a helper function built-in for dealing with this: apply_amortized
. We can use it to apply
a function to each element of a List Series. In this case, we just want to find the indices of non-zero
elements, so we'll do:
fn list_idx_dtype(input_fields: &[Field]) -> PolarsResult<Field> {
let field = Field::new(input_fields[0].name.clone(), DataType::List(Box::new(IDX_DTYPE)));
Ok(field.clone())
}
#[polars_expr(output_type_func=list_idx_dtype)]
fn non_zero_indices(inputs: &[Series]) -> PolarsResult<Series> {
let ca = inputs[0].list()?;
polars_ensure!(
ca.dtype() == &DataType::List(Box::new(DataType::Int64)),
ComputeError: "Expected `List(Int64)`, got: {}", ca.dtype()
);
let out: ListChunked = ca.apply_amortized(|s| {
let s: &Series = s.as_ref();
let ca: &Int64Chunked = s.i64().unwrap();
let out: IdxCa = ca
.iter()
.enumerate()
.filter(|(_idx, opt_val)| opt_val != &Some(0))
.map(|(idx, _opt_val)| Some(idx as IdxSize))
.collect_ca(PlSmallStr::EMPTY);
out.into_series()
});
Ok(out.into_series())
}
apply_amortized
is a bit like the apply_into_string_amortized
function we used in How to STRING something together,
in that it makes a big allocation upfront to amortize the allocation costs. Think of it as a list version
of apply_values
, where each element is itself a Series
.
Something new in this example is:
IdxSize
IdxCa
IDX_DTYPE
IdxSize
is either u32
or u64
, depending on your platform, and are what Polars generally uses
for counting-related operations. IdxCa
is the associated ChunkedArray
, and IDX_DTYPE
the associated
Polars dtype.
To finish this off, the Python side will be a bog-standard:
def non_zero_indices(expr: IntoExprColumn) -> pl.Expr:
return register_plugin_function(
args=[expr], plugin_path=LIB, function_name="non_zero_indices", is_elementwise=True
)
If we then make run.py
with
import polars as pl
import minimal_plugin as mp
pl.Config().set_fmt_table_cell_list_len(10)
df = pl.DataFrame({'dense': [[0, 9], [8, 6, 0, 9], None, [3, 3]]})
print(df)
print(df.with_columns(indices=mp.non_zero_indices('dense')))
maturin develop
(or maturin develop --release
if you're benchmarking!)
then we'll see
shape: (4, 2)
┌──────────────┬───────────┐
│ dense ┆ indices │
│ --- ┆ --- │
│ list[i64] ┆ list[u32] │
╞══════════════╪═══════════╡
│ [0, 9] ┆ [1] │
│ [8, 6, 0, 9] ┆ [0, 1, 3] │
│ null ┆ null │
│ [3, 3] ┆ [0, 1] │
└──────────────┴───────────┘
Yay, it worked! And not only that, but it's about 1.5x as fast as the list.eval
solution
noted above!