
5. How to STRING something together

Tired of examples which only include numeric data? Me neither. But we need to address the elephant in the room: strings.

We're going to start by re-implementing a pig-latinnifier. This example is already part of the pyo3-polars repo examples, but we'll tackle it with a different spin here by first doing it the wrong way 😈.

Pig-latinnify - take 1

Let's start by doing this the wrong way. We'll use our abs example, and adapt it to the string case. We'll follow the same strategy:

  • iterate over arrow arrays;
  • for each element in each array, create a new output value.

Put the following in src/expressions.rs:

use std::borrow::Cow;
use std::fmt::Write;

#[polars_expr(output_type=String)]
fn pig_latinnify(inputs: &[Series]) -> PolarsResult<Series> {
    let s = &inputs[0];
    let ca = s.str()?;
    let out: StringChunked = ca.apply(|opt_v: Option<&str>| {
        opt_v.map(|value: &str| {
            // Not the recommended way to do it,
            // see below for a better way!
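            if let Some(first_char) = value.chars().next() {
                // Slice at the first char's byte length: `&value[1..]` would
                // panic if the first character is more than one byte long.
                Cow::Owned(format!("{}{}ay", &value[first_char.len_utf8()..], first_char))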
            } else {
                Cow::Borrowed(value)
            }
        })
    });
    Ok(out.into_series())
}

If you're not familiar with clone-on-write, don't worry about it - we're about to see a simpler and better way to do this anyway. What I'd like you to focus on is that for every row, we're creating a new String.

If you combine this with a Python definition (which you should put in minimal_plugin/__init__.py):

def pig_latinnify(expr: IntoExprColumn) -> pl.Expr:
    return register_plugin_function(
        args=[expr],
        plugin_path=LIB,
        function_name="pig_latinnify",
        is_elementwise=True,
    )

then you'll be able to pig-latinnify a column of strings! To see it in action, compile with maturin develop (or maturin develop --release if you're benchmarking) and put the following in run.py:

import polars as pl
import minimal_plugin as mp

df = pl.DataFrame({'a': ["I", "love", "pig", "latin"]})
print(df.with_columns(a_pig_latin=mp.pig_latinnify('a')))

shape: (4, 2)
┌───────┬─────────────┐
│ a     ┆ a_pig_latin │
│ ---   ┆ ---         │
│ str   ┆ str         │
╞═══════╪═════════════╡
│ I     ┆ Iay         │
│ love  ┆ ovelay      │
│ pig   ┆ igpay       │
│ latin ┆ atinlay     │
└───────┴─────────────┘

This will already be an order of magnitude faster than using map_elements. But as mentioned earlier, we're creating a new string for every single row.

Can we do better?

Pig-latinnify - take 2

Yes! StringChunked has a utility method, apply_into_string_amortized, which amortises the cost of creating a new string for each row by creating a single string upfront, clearing it, and repeatedly writing to it. This gives a 4x speedup over take 1! All you need to do is change pig_latinnify to:

#[polars_expr(output_type=String)]
fn pig_latinnify(inputs: &[Series]) -> PolarsResult<Series> {
    let ca: &StringChunked = inputs[0].str()?;
    let out: StringChunked = ca.apply_into_string_amortized(|value: &str, output: &mut String| {
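        if let Some(first_char) = value.chars().next() {
            // Slice at the first char's byte length so that multi-byte
            // first characters don't cause a panic.
            write!(output, "{}{}ay", &value[first_char.len_utf8()..], first_char).unwrap()
        }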
    });
    Ok(out.into_series())
}

Simpler, faster, and more memory-efficient. Thinking about allocations can really make a difference!
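
To make the "create upfront, clear, re-write" idea concrete, here is a minimal std-only sketch of the pattern. It is not how Polars implements apply_into_string_amortized: PackedStrings and apply_amortized are hypothetical names invented for illustration, and the packed buffer is only a rough stand-in for a string column.

```rust
use std::fmt::Write;

/// Hypothetical stand-in for a string column: all values are stored back to
/// back in one buffer, and `ends[i]` marks where value `i` stops.
struct PackedStrings {
    data: String,
    ends: Vec<usize>,
}

impl PackedStrings {
    /// Return the `i`-th value as a slice of the shared buffer.
    fn get(&self, i: usize) -> &str {
        let start = if i == 0 { 0 } else { self.ends[i - 1] };
        &self.data[start..self.ends[i]]
    }
}

/// Apply `f` to every input, reusing one scratch `String` for all rows.
fn apply_amortized<F>(inputs: &[&str], mut f: F) -> PackedStrings
where
    F: FnMut(&str, &mut String),
{
    // Allocated once; its capacity grows to fit the longest output seen so far.
    let mut scratch = String::new();
    let mut out = PackedStrings {
        data: String::new(),
        ends: Vec::with_capacity(inputs.len()),
    };
    for &value in inputs {
        scratch.clear(); // drops the contents but keeps the capacity
        f(value, &mut scratch);
        out.data.push_str(&scratch); // copy the row into the output buffer
        out.ends.push(out.data.len());
    }
    out
}

/// Same logic as the closure in `pig_latinnify` above.
fn pig_latin(value: &str, output: &mut String) {
    if let Some(first_char) = value.chars().next() {
        let rest = &value[first_char.len_utf8()..];
        write!(output, "{}{}ay", rest, first_char).unwrap();
    }
}
```

Calling apply_amortized(&["I", "love", "pig", "latin"], pig_latin) and reading the rows back with get gives Iay, ovelay, igpay and atinlay, without building a separate String for each row.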

So let's think about allocations!

If you have an elementwise function which produces String output, then chances are it does one of the following:

  • Creates a new string. In this case, you can use apply_into_string_amortized to amortise the cost of allocating a new string for each input row, as we did above in pig_latinnify. This works by allocating a String upfront and then repeatedly re-writing to it.
  • Slices the original string. In this case, you can use apply_values with Cow::Borrowed, for example:

    fn remove_last_extension(s: &str) -> &str {
        match s.rfind('.') {
            Some(pos) => &s[..pos],
            None => s,
        }
    }
    
    #[polars_expr(output_type=String)]
    fn remove_extension(inputs: &[Series]) -> PolarsResult<Series> {
        let s = &inputs[0];
        let ca = s.str()?;
        let out: StringChunked = ca.apply_values(|val| {
            // Borrow a slice of the input instead of allocating a new String.
            Cow::Borrowed(remove_last_extension(val))
        });
        Ok(out.into_series())
    }
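
If you'd like to convince yourself that the slicing case really doesn't allocate, here is a small std-only sketch. strip_extension and is_view_of are hypothetical helpers made up for this check and are not part of Polars. The idea is just that a Cow::Borrowed slice points into the memory of the original string.

```rust
use std::borrow::Cow;

fn remove_last_extension(s: &str) -> &str {
    match s.rfind('.') {
        Some(pos) => &s[..pos],
        None => s,
    }
}

/// Hypothetical helper: wrap the slice in a `Cow` without copying it.
fn strip_extension(s: &str) -> Cow<'_, str> {
    Cow::Borrowed(remove_last_extension(s))
}

/// Hypothetical helper: true when `part` lies inside the memory of `whole`.
fn is_view_of(part: &str, whole: &str) -> bool {
    let whole_start = whole.as_ptr() as usize;
    let part_start = part.as_ptr() as usize;
    part_start >= whole_start && part_start + part.len() <= whole_start + whole.len()
}
```

For a name such as "archive.tar.gz", strip_extension returns "archive.tar" as a Cow::Borrowed, and is_view_of confirms that the result is a view into the original string, not a copy.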
    

There are low-level optimisations you can do to take things further, but - if in doubt - apply_into_string_amortized / binary_elementwise_into_string_amortized are probably good enough.