0. Prerequisites

Knowledge

"But you know what I like more than materialistic things? Knowledge." Tai Lopez

How much Rust do you need to know to write your own Polars plugin? Less than you think.

I'd suggest starting out with the Rustlings course, which provides some fun and interactive exercises designed to make you familiar with the language. The following sections are a good place to start:

  • 00 intro
  • 01 variables
  • 02 functions
  • 03 if
  • 05 vecs
  • 12 options
  • 13 error handling

You'll also need basic Python knowledge: classes, decorators, and functions.

Alternatively, you could just clone this repo and then hack away at the examples trial-and-error style until you get what you're looking for - the compiler will probably help you more than you're expecting.

Software

To get started, please install cookiecutter.

Then, from your home directory (or wherever you store your Python projects) please run

cookiecutter https://github.com/MarcoGorelli/cookiecutter-polars-plugins
When prompted, please enter (let's suppose your name is "Maja Anima", but replace that with your preferred name):
[1/3] plugin_name (Polars Cookiecutter): Minimal Plugin
[2/3] project_slug (polars_minimal_plugin):
[3/3] author (anonymous): Maja Anima
This will create a folder called minimal_plugin. Please navigate to it with cd minimal_plugin.

Next, create a Python 3.8+ virtual environment, and install:

  • polars>=0.20.0
  • maturin>=1.4.0

Finally, you'll also need to install Rust.

That's it! However, you are highly encouraged to also install rust-analyzer if you want to improve your Rust-writing experience by exactly 120%.
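If you'd like to confirm that the virtual environment picked up the Python requirements listed above, a small check along these lines may help. It is only a sketch using the standard library: the helper names parse_version and check are made up for illustration, and the version parsing is deliberately naive (it only looks at the first three numeric components).

```python
import re
from importlib import metadata

# Minimum versions taken from the requirements listed above.
REQUIREMENTS = {"polars": "0.20.0", "maturin": "1.4.0"}


def parse_version(text):
    """Turn '0.20.3' (or '1.4.0rc1') into a 3-tuple of integers (naive)."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def check(requirements):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in the current environment.
            report[name] = (None, False)
            continue
        ok = parse_version(installed) >= parse_version(minimum)
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check(REQUIREMENTS).items():
        print(f"{name}: installed={installed}, ok={ok}")
```

Run it with the virtual environment activated. A package reported as installed=None is missing from that environment.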

What's in a Series?

If you take a look at a Series such as

In [9]: s = pl.Series([None, 2, 3]) + 42

In [10]: s
Out[10]:
shape: (3,)
Series: '' [i64]
[
        null
        44
        45
]
you may be tempted to conclude that it contains three values: [null, 44, 45].

However, if you print out s._get_buffers(), you'll see something different:

  • s._get_buffers()["values"]: [42, 44, 45]. These are the values.
  • s._get_buffers()["validity"]: [False, True, True]. These are the validities.

So we don't really have integers and null mixed together into a single array - we have a pair of arrays, one holding values and another one holding booleans indicating whether each value is valid or not. If a value appears as null to you, then there's no guarantee about what physical number is behind it! It was 42 here, but it could well be 43, or any other number, in another example.
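To make the "pair of arrays" idea concrete, here is a toy model in plain Python. It is not how Polars is implemented (the real buffers live in Rust, and validity is stored far more compactly than a list of booleans). The function names and the fill value of 0 are assumptions chosen so the numbers match the example above.

```python
def to_buffers(items, fill=0):
    """Split a list containing None into separate values and validity lists.

    The fill value is an arbitrary placeholder: nothing should rely on it.
    """
    values = [fill if item is None else item for item in items]
    validity = [item is not None for item in items]
    return values, validity


def add_scalar(values, validity, scalar):
    """Add to every physical value, valid or not; validity is left untouched."""
    return [v + scalar for v in values], validity


def to_logical(values, validity):
    """Recombine the two buffers into what the user sees."""
    return [v if ok else None for v, ok in zip(values, validity)]


values, validity = to_buffers([None, 2, 3])
values, validity = add_scalar(values, validity, 42)
print(values)                        # [42, 44, 45]
print(validity)                      # [False, True, True]
print(to_logical(values, validity))  # [None, 44, 45]
```

The 42 behind the null is simply whatever the placeholder happened to become after the addition, which is why a plugin should always consult the validity before trusting a value.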

What's a chunk?

A Series is backed by a chunked array: a sequence of arrays (chunks), each of which holds data that is contiguous in memory.

Here's an example of a Series backed by multiple chunks:

In [27]: s = pl.Series([1,2,3])

In [28]: s = s.append(pl.Series([99, 11]))

In [29]: s
Out[29]:
shape: (5,)
Series: '' [i64]
[
        1
        2
        3
        99
        11
]

In [30]: s.get_chunks()
Out[30]:
[shape: (3,)
 Series: '' [i64]
 [
        1
        2
        3
 ],
 shape: (2,)
 Series: '' [i64]
 [
        99
        11
 ]]
Chunked arrays will come up in several examples in this tutorial.
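As a mental model for the append example above, the sketch below shows a hypothetical ChunkedList class in plain Python. It is an illustration only, not Polars' actual data structure: appending keeps the two chunks side by side, and a rechunk step copies everything into a single contiguous chunk.

```python
class ChunkedList:
    """Toy chunked container: a list of separately stored lists."""

    def __init__(self, chunks=None):
        self.chunks = [list(chunk) for chunk in (chunks or [])]

    def append(self, other):
        """Keep the other container's chunks next to ours, unmerged."""
        return ChunkedList(self.chunks + other.chunks)

    def __len__(self):
        return sum(len(chunk) for chunk in self.chunks)

    def __iter__(self):
        # Logical order is just the chunks read one after another.
        for chunk in self.chunks:
            yield from chunk

    def rechunk(self):
        """Copy all elements into one single chunk."""
        return ChunkedList([list(self)])


s = ChunkedList([[1, 2, 3]]).append(ChunkedList([[99, 11]]))
print(s.chunks)            # [[1, 2, 3], [99, 11]]
print(list(s))             # [1, 2, 3, 99, 11]
print(s.rechunk().chunks)  # [[1, 2, 3, 99, 11]]
```

Code that works on one chunk at a time therefore has to loop over the chunks, which is one reason chunked arrays keep coming up when writing plugins.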