5. How to STRING something together
Tired of examples which only include numeric data? Me neither. But we need to address the elephant in the room: strings.
We're going to start by re-implementing a pig-latinnifier.
This example is already part of the pyo3-polars repo examples, but we'll tackle it with a different spin here by first doing it the wrong way 😈.
Pig-latinnify - take 1
Let's start by doing this the wrong way.
We'll use our abs example, and adapt it to the string case. We'll follow the same strategy:
- iterate over arrow arrays;
- for each element in each array, create a new output value.
Put the following in src/expressions.rs:
use std::borrow::Cow;
use std::fmt::Write;

#[polars_expr(output_type=String)]
fn pig_latinnify(inputs: &[Series]) -> PolarsResult<Series> {
    let s = &inputs[0];
    let ca = s.str()?;
    let out: StringChunked = ca.apply(|opt_v: Option<&str>| {
        opt_v.map(|value: &str| {
            // Not the recommended way to do it,
            // see below for a better way!
            if let Some(first_char) = value.chars().next() {
                // Slice past the whole first character (it may span several
                // bytes) rather than assuming it is exactly one byte long.
                Cow::Owned(format!("{}{}ay", &value[first_char.len_utf8()..], first_char))
            } else {
                Cow::Borrowed(value)
            }
        })
    });
    Ok(out.into_series())
}
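To see where the per-row allocation in take 1 comes from, the closure body can be pulled out into a plain Rust function with no Polars dependency. This is only a sketch: `pig_latin_cow` is a hypothetical helper name that is not part of the plugin, and it assumes we slice at `first_char.len_utf8()` so that a multi-byte first character is handled too.

```rust
use std::borrow::Cow;

/// Hypothetical standalone helper (not part of the plugin) that mirrors
/// the per-element logic of the take 1 closure.
fn pig_latin_cow(value: &str) -> Cow<'_, str> {
    match value.chars().next() {
        // Non-empty input: `format!` builds a brand-new String on every call,
        // which is the per-row allocation we would like to avoid.
        Some(first_char) => {
            let rest = &value[first_char.len_utf8()..];
            Cow::Owned(format!("{rest}{first_char}ay"))
        }
        // Empty input: nothing to rearrange, so we can borrow it as-is.
        None => Cow::Borrowed(value),
    }
}
```

Every non-empty row ends up in the `Cow::Owned` branch, so a column with a million non-empty strings triggers a million separate `String` allocations.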
If you combine this with a Python definition (which you should put in minimal_plugin/__init__.py):
def pig_latinnify(expr: IntoExprColumn) -> pl.Expr:
    return register_plugin_function(
        args=[expr],
        plugin_path=LIB,
        function_name="pig_latinnify",
        is_elementwise=True,
    )
then you'll be able to pig-latinnify a column of strings. To see it in action, compile with maturin develop (or maturin develop --release if you're benchmarking) and put the following in run.py:
import polars as pl
import minimal_plugin as mp
df = pl.DataFrame({'a': ["I", "love", "pig", "latin"]})
print(df.with_columns(a_pig_latin=mp.pig_latinnify('a')))
shape: (4, 2)
┌───────┬─────────────┐
│ a     ┆ a_pig_latin │
│ ---   ┆ ---         │
│ str   ┆ str         │
╞═══════╪═════════════╡
│ I     ┆ Iay         │
│ love  ┆ ovelay      │
│ pig   ┆ igpay       │
│ latin ┆ atinlay     │
└───────┴─────────────┘
This will already be an order of magnitude faster than using map_elements.
But as mentioned earlier, we're creating a new string for every single row.
Can we do better?
Pig-latinnify - take 2
Yes! StringChunked has a utility method, apply_into_string_amortized, which amortises the cost of creating new strings for each row by creating a string upfront, clearing it, and repeatedly writing to it.
This gives a 4x speedup! All you need to do is change pig_latinnify to:
#[polars_expr(output_type=String)]
fn pig_latinnify(inputs: &[Series]) -> PolarsResult<Series> {
    let ca: &StringChunked = inputs[0].str()?;
    let out: StringChunked = ca.apply_into_string_amortized(|value: &str, output: &mut String| {
        if let Some(first_char) = value.chars().next() {
            // `output` is the reusable buffer. Slice past the whole first
            // character, which may be more than one byte long.
            write!(output, "{}{}ay", &value[first_char.len_utf8()..], first_char).unwrap()
        }
    });
    Ok(out.into_series())
}
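The amortisation idea itself can be illustrated in plain Rust, without Polars. The sketch below is not how Polars implements it internally; `pig_latin_all` is a hypothetical function, and the contiguous buffer plus offsets layout is an assumption chosen only to show the technique of allocating one scratch `String`, clearing it, and writing to it repeatedly.

```rust
use std::fmt::Write;

/// Hypothetical illustration (not Polars internals): one scratch String is
/// reused for every row. Results are appended to a single contiguous
/// `values` buffer, and `offsets[i]..offsets[i + 1]` marks row `i`.
fn pig_latin_all(rows: &[&str]) -> (String, Vec<usize>) {
    let mut scratch = String::new(); // allocated once, reused for every row
    let mut values = String::new();
    let mut offsets = vec![0];
    for &value in rows {
        scratch.clear(); // drops the old contents but keeps the capacity
        if let Some(first_char) = value.chars().next() {
            let rest = &value[first_char.len_utf8()..];
            // Writing into a String cannot fail, so `unwrap` is fine here.
            write!(scratch, "{rest}{first_char}ay").unwrap();
        }
        values.push_str(&scratch);
        offsets.push(values.len());
    }
    (values, offsets)
}
```

Because `clear` keeps the buffer's capacity, the scratch string only reallocates when a row needs more room than any earlier row did, rather than once per row.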
Simpler, faster, and more memory-efficient. Thinking about allocations can really make a difference!
So let's think about allocations!
If you have an elementwise function which produces String output, then chances are it does one of the following:
- Creates a new string. In this case, you can use apply_into_string_amortized to amortise the cost of allocating a new string for each input row, as we did above in pig_latinnify. This works by allocating a String upfront and then repeatedly re-writing to it.
- Slices the original string. In this case, you can use apply_values with Cow::Borrowed, for example:

  fn remove_last_extension(s: &str) -> &str {
      match s.rfind('.') {
          Some(pos) => &s[..pos],
          None => s,
      }
  }

  #[polars_expr(output_type=String)]
  fn remove_extension(inputs: &[Series]) -> PolarsResult<Series> {
      let s = &inputs[0];
      let ca = s.str()?;
      let out: StringChunked = ca.apply_values(|val| {
          Cow::Borrowed(remove_last_extension(val))
      });
      Ok(out.into_series())
  }
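As a quick check that the slicing case really avoids new string allocations, the helper can be exercised on its own in plain Rust. In this sketch `remove_last_extension` is copied from the example above, while `strip_extension` is a hypothetical wrapper that stands in for the closure passed to `apply_values`.

```rust
use std::borrow::Cow;

fn remove_last_extension(s: &str) -> &str {
    match s.rfind('.') {
        Some(pos) => &s[..pos],
        None => s,
    }
}

/// Hypothetical wrapper mirroring the closure passed to `apply_values`:
/// the result is always a borrowed view into the input, never a new String.
fn strip_extension(val: &str) -> Cow<'_, str> {
    Cow::Borrowed(remove_last_extension(val))
}
```

Since the returned slice always starts at the beginning of the input, comparing `as_ptr()` of the input and the output is an easy way to confirm that no copy was made.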
There are low-level optimisations you can do to take things further, but - if in doubt - apply_into_string_amortized / binary_elementwise_into_string_amortized are probably good enough.