Polars is an open-source project that provides in-memory dataframes for Python and Rust. Despite its young age (its first commit was a mere two years ago, in the middle of the COVID-19 pandemic), it has already gained a lot of popularity thanks to its "lightning-fast" performance and the expressiveness of its API.
One of the most interesting things about Polars is that it offers two modes of operation: an eager mode, in which operations are executed immediately, much like in pandas, and a lazy mode, in which Polars builds a query plan and optimizes it before execution.
These ideas are not new: in fact, in our blog post about Vaex we covered its lazy computation capabilities. However, Polars takes them one step further by offering a functional API that is delightful to use.
The other secret sauce of Polars is Apache Arrow. While other libraries use Arrow for things like reading Parquet files, Polars is tightly coupled with it: by using a Rust-native implementation of the Arrow memory format for its columnar storage, Polars can leverage the highly optimized Arrow data structures and focus on the data manipulation operations.
Interested? Read on!
For this example, we will use a sample of Stack Overflow questions and their tags obtained from Kaggle. Our generic goal is to display the most highly voted Python questions.
We have published an Orchest pipeline that contains all the necessary files so you can run these code snippets on JupyterLab easily: the first step downloads the data using your Kaggle API key, and the second step performs some exploratory analysis.
Click the button below to import the project to your Orchest account:
You can install Polars with conda/mamba or pip:
-- CODE language-text --
mamba install -y -c conda-forge "polars=0.13.37"
pip install "polars==0.13.37"
Even though Polars is written in Rust, it distributes precompiled binary wheels on PyPI, so pip install will just work on all major Python versions from 3.6 onwards.
Let's load the Questions and Tags CSV files using `pl.read_csv`:
-- CODE language-python --
import polars as pl
df = pl.read_csv("/data/stacksample/Questions.csv", encoding="utf8-lossy")
tags = pl.read_csv("/data/stacksample/Tags.csv")
The type of both objects is `polars.internals.frame.DataFrame`, "a two-dimensional data structure that represents data as a table with rows and columns" (reference docs). Both dataframes have millions of rows, and the Questions one takes almost 2 GB of memory:
-- CODE language-python --
In [7]: len(df), len(tags)
Out[7]: (1264216, 3750994)
In [8]: print(f"Estimated size: {df.estimated_size() >> 20} MiB")
Estimated size: 1865 MiB
Polars dataframes have some typical methods we know from pandas to inspect the data. Notice that calling the print function on a DataFrame produces a tidy ASCII representation, in addition to the fancy HTML representation available in Jupyter:
-- CODE language-python --
In [9]: print(df.head(3)) # No `print` needed in Jupyter
shape: (3, 7)
┌─────┬─────────────┬─────────────────┬─────────────────┬───────┬─────────────────┬────────────────┐
│ Id ┆ OwnerUserId ┆ CreationDate ┆ ClosedDate ┆ Score ┆ Title ┆ Body │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ i64 ┆ str ┆ str │
╞═════╪═════════════╪═════════════════╪═════════════════╪═══════╪═════════════════╪════════════════╡
│ 80 ┆ 26 ┆ 2008-08-01T13:5 ┆ NA ┆ 26 ┆ SQLStatement.ex ┆ <p>I've │
│ ┆ ┆ 7:07Z ┆ ┆ ┆ ecute() - ┆ written a │
│ ┆ ┆ ┆ ┆ ┆ multipl... ┆ database │
│ ┆ ┆ ┆ ┆ ┆ ┆ gener... │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 90 ┆ 58 ┆ 2008-08-01T14:4 ┆ 2012-12-26T03:4 ┆ 144 ┆ Good branching ┆ <p>Are there │
│ ┆ ┆ 1:24Z ┆ 5:49Z ┆ ┆ and merging ┆ any really │
│ ┆ ┆ ┆ ┆ ┆ tutor... ┆ good tut... │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 120 ┆ 83 ┆ 2008-08-01T15:5 ┆ NA ┆ 21 ┆ ASP.NET Site ┆ <p>Has anyone │
│ ┆ ┆ 0:08Z ┆ ┆ ┆ Maps ┆ got experience │
│ ┆ ┆ ┆ ┆ ┆ ┆ cre... │
└─────┴─────────────┴─────────────────┴─────────────────┴───────┴─────────────────┴────────────────┘
In [10]: print(df.describe())
shape: (5, 8)
┌──────────┬─────────────┬─────────────┬──────────────┬────────────┬───────────┬───────┬──────┐
│ describe ┆ Id ┆ OwnerUserId ┆ CreationDate ┆ ClosedDate ┆ Score ┆ Title ┆ Body │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ str ┆ str ┆ str ┆ f64 ┆ str ┆ str │
╞══════════╪═════════════╪═════════════╪══════════════╪════════════╪═══════════╪═══════╪══════╡
│ mean ┆ 2.1327e7 ┆ null ┆ null ┆ null ┆ 1.781537 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ std ┆ 1.1514e7 ┆ null ┆ null ┆ null ┆ 13.663886 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ min ┆ 80.0 ┆ null ┆ null ┆ null ┆ -73.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ max ┆ 4.014338e7 ┆ null ┆ null ┆ null ┆ 5190.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ median ┆ 2.1725415e7 ┆ null ┆ null ┆ null ┆ 0.0 ┆ null ┆ null │
└──────────┴─────────────┴─────────────┴──────────────┴────────────┴───────────┴───────┴──────┘
In [11]: print(tags["Tag"].value_counts().head())
shape: (5, 2)
┌────────────┬────────┐
│ Tag ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════════╪════════╡
│ javascript ┆ 124155 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ java ┆ 115212 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c# ┆ 101186 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ php ┆ 98808 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ android ┆ 90659 │
└────────────┴────────┘
Following terminology similar to pandas, Polars dataframes contain several columns of type `polars.internals.series.Series`, each with its own data type:
-- CODE language-python --
In [12]: df["Title"].head(5)
Out[12]: shape: (5,)
Series: 'Title' [str]
[
"SQLStatement.e...
"Good branching...
"ASP.NET Site M...
"Function for c...
"Adding scripti...
]
In [13]: df.dtypes
Out[13]: [polars.datatypes.Int64,
polars.datatypes.Utf8,
polars.datatypes.Utf8,
polars.datatypes.Utf8,
polars.datatypes.Int64,
polars.datatypes.Utf8,
polars.datatypes.Utf8]
The essential building blocks in Polars are expressions: functions that receive a Series and transform it into another Series. Expressions start with a root, and then you can chain more operations:
-- CODE language-python --
(
pl.col("Score") # Root of the Expression (a single column)
.mean() # Returns another Expression
)
The most interesting feature is that expressions are not bound to a specific object, but instead they are generic. Chains of expressions define the computation, which is materialized by a DataFrame method (acting as an execution context).
Sounds too abstract? See it in action:
-- CODE language-python --
In [20]: print(df.select(pl.col("Score").mean()))
shape: (1, 1)
┌──────────┐
│ Score │
│ --- │
│ f64 │
╞══════════╡
│ 1.781537 │
└──────────┘
The df.select method can do much more than just select columns: it can execute any column-wise expression. In fact, when passed a list of such expressions, it broadcasts them automatically if the dimensions are compatible, and it executes them in parallel:
-- CODE language-python --
In [21]: print(df.select([
...: pl.col("Id").n_unique().alias("num_unique_users"),
...: pl.col("Score").mean().alias("mean_score"),
...: pl.col("Title").str.lengths().max().alias("max_title_length"),
   ...: # To run the above on all text columns,
   ...: # you can filter by data type:
   ...: # pl.col(pl.Utf8).str.lengths().max().suffix("_max_length"),
...: ]))
shape: (1, 3)
┌──────────────────┬────────────┬──────────────────┐
│ num_unique_users ┆ mean_score ┆ max_title_length │
│ --- ┆ --- ┆ --- │
│ u32 ┆ f64 ┆ u32 │
╞══════════════════╪════════════╪══════════════════╡
│ 1264216 ┆ 1.781537 ┆ 204 │
└──────────────────┴────────────┴──────────────────┘
It is now time to narrow down the analysis a bit and focus on the questions that are related to Python. Note that Polars algorithms require all the data to fit in memory, so the usual caveats about large datasets apply when using the eager API. Since the questions dataset is already quite big, performing an eager .join with the tags data can crash the kernel:
-- CODE language-python --
# Don't try this at home unless you have enough RAM!
# (
# df
# .join(tags, on="Id")
# .filter(pl.col("Tag").str.contains(r"(?i)python"))
# .sort("Id")
# )
But fear not, because Polars has the perfect solution: switching to lazy mode! By starting our chain of operations with .lazy() and calling .collect() at the end, you can leverage Polars' optimization capabilities to their fullest and perform operations that would otherwise be impossible:
-- CODE language-python --
In [22]: q_python = (
...: df.lazy() # Notice the .lazy() call
...: # The input of a lazy join needs to be lazy
   ...: # We use a 'semi' join: keep rows with a match, but only the left columns
...: .join(tags.lazy(), on="Id", how="semi")
   ...: .filter(pl.col("Tag").str.contains(r"(?i)python"))
...: .sort("Id")
...: ).collect() # Call .collect() at the end
...: print(q_python.head(3))
shape: (3, 7)
┌───────┬─────────────┬──────────────────┬────────────┬───────┬──────────────────┬─────────────────┐
│ Id ┆ OwnerUserId ┆ CreationDate ┆ ClosedDate ┆ Score ┆ Title ┆ Body │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ i64 ┆ str ┆ str │
╞═══════╪═════════════╪══════════════════╪════════════╪═══════╪══════════════════╪═════════════════╡
│ 11060 ┆ 912 ┆ 2008-08-14T13:59 ┆ NA ┆ 18 ┆ How should I ┆ <p>This is a │
│ ┆ ┆ :21Z ┆ ┆ ┆ unit test a ┆ difficult and │
│ ┆ ┆ ┆ ┆ ┆ code-ge... ┆ open-... │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 17250 ┆ 394 ┆ 2008-08-20T00:16 ┆ NA ┆ 24 ┆ Create an ┆ <p>I'm creating │
│ ┆ ┆ :40Z ┆ ┆ ┆ encrypted ZIP ┆ an ZIP file │
│ ┆ ┆ ┆ ┆ ┆ file in ... ┆ with... │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 19030 ┆ 745 ┆ 2008-08-20T22:50 ┆ NA ┆ 2 ┆ How to check set ┆ <p>I have a │
│ ┆ ┆ :55Z ┆ ┆ ┆ of files ┆ bunch of files │
│ ┆ ┆ ┆ ┆ ┆ confor... ┆ (TV e... │
In fact, if your raw CSV is so big that it doesn't fit in RAM to begin with, Polars also offers a lazy way of reading the file, using `scan_csv`:
-- CODE language-python --
# We create the query plan separately
plan = (
# scan_csv returns a lazy dataframe already
pl.scan_csv("/data/stacksample/Questions.csv", encoding="utf8-lossy")
.join(tags.lazy(), on="Id", how="semi")
    .filter(pl.col("Tag").str.contains(r"(?i)python"))
.sort("Score", reverse=True)
.limit(1_000)
)
top_voted_python_qs = plan.collect()
If you are curious about how Polars is doing all this work under the hood, notice that you can visualize the query plan!
Notice that, in the previous section, we did a "semi" join to filter the questions, but we still don't have the list of tags associated with such questions. To achieve that, we will use one of the most surprisingly pleasant features of Polars: its list-column handling.
-- CODE language-python --
In [30]: tag_list_lazy = (
...: tags.lazy()
...: .groupby("Id").agg(
...: pl.col("Tag")
...: .list() # Convert to a list of strings
...: .alias("TagList")
...: )
...: )
...: print(tag_list_lazy.limit(5).collect())
shape: (5, 2)
┌──────────┬─────────────────────────────────────┐
│ Id ┆ TagList │
│ --- ┆ --- │
│ i64 ┆ list [str] │
╞══════════╪═════════════════════════════════════╡
│ 994990 ┆ ["spring"] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 29087440 ┆ ["android", "android-intent"] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 12093870 ┆ ["asp.net", ".net", "sqldatasour... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 32889780 ┆ ["c", "extern", "function-declar... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 22436290 ┆ ["mysql", "sql", ... "multiple-t... │
└──────────┴─────────────────────────────────────┘
After grouping by "Id" and turning each row into a list of tags, it's time to add a boolean column "ContainsPython" that signals whether any of the tags in the list contains the substring "python". For that, let's use the `.arr.eval` context (also known as the List context):
-- CODE language-python --
tag_list_extended_lazy = tag_list_lazy.with_column(
pl.col("TagList")
.arr.eval(
pl.element()
        .str.contains(r"(?i)python")
.any()
).flatten().alias("ContainsPython")
)
The final join will provide the answer we are looking for:
-- CODE language-python --
top_python_questions = (
pl.scan_csv("/data/stacksample/Questions.csv", encoding="utf8-lossy")
.join(tag_list_extended_lazy, on="Id")
.filter(pl.col("ContainsPython"))
.sort("Score", reverse=True)
).limit(1_000).collect()
And the result:
Very neat!
Similarly to what happens with Vaex, Polars DataFrames don't have an index. The user guide goes as far as saying this:
Indexes are not needed! Not having them makes things easier - convince us otherwise!
The discussion of this contentious stance will be the subject of a future blog post. In any case, this allows Polars to simplify indexing operations, since strings will always refer to column names, and numbers in the first axis will always refer to row numbers:
-- CODE language-python --
In [36]: print(df[0]) # First row
shape: (1, 7)
┌─────┬─────────────┬───────────────────┬────────────┬───────┬──────────────────┬──────────────────┐
│ Id ┆ OwnerUserId ┆ CreationDate ┆ ClosedDate ┆ Score ┆ Title ┆ Body │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ i64 ┆ str ┆ str │
╞═════╪═════════════╪═══════════════════╪════════════╪═══════╪══════════════════╪══════════════════╡
│ 80 ┆ 26 ┆ 2008-08-01T13:57: ┆ NA ┆ 26 ┆ SQLStatement.exe ┆ <p>I've written │
│ ┆ ┆ 07Z ┆ ┆ ┆ cute() - ┆ a database │
│ ┆ ┆ ┆ ┆ ┆ multipl... ┆ gener... │
└─────┴─────────────┴───────────────────┴────────────┴───────┴──────────────────┴──────────────────┘
In [37]: df[0, 0] # First row, first column
Out[37]: 80
In [38]: df[0, "Id"] # First row, column by name
Out[38]: 80
In [39]: df["Id"].head(5) # Column by name
Out[39]: shape: (5,)
Series: 'Id' [i64]
[
80
90
120
180
260
]
On the other hand, even though indexing with boolean masks is supported in Polars as a way to bridge the gap for pandas users, its use is discouraged in favor of select and filter, and "the functionality may be removed in the future". However, as the examples above show, direct indexing is not needed as often as in pandas.
Beyond this short introduction, Polars has much more to offer, from window functions and complex aggregations to time-series processing.
As a downside, since it is a young project and it's evolving quite fast, you will notice that some areas of the documentation are a bit lacking, or that there are no comprehensive release notes yet. Fortunately, Ritchie Vink, the Polars creator and current maintainer, quickly answers Stack Overflow questions and GitHub issues, and releases with bug fixes and new features are frequent.
On the other hand, if you are looking for an ultimate solution for your larger-than-RAM datasets, Polars might not be for you. Its lazy processing capabilities can take you quite far, but at some point you will have to confront the fact that Polars is an in-memory dataframe library, similar to pandas.
In summary: Polars offers fast, expressive dataframes with both an eager and a lazy API, backed by Apache Arrow; its lazy mode stretches what you can do within memory limits but does not remove them; and, being a young project, it evolves quickly and its documentation is still catching up.
In upcoming articles of this series we will describe some more alternatives you might find interesting. Stay tuned!
Polars is an excellent tool to use when building data pipelines. Orchest is a pipeline orchestrator with first-class support for the full Python open source ecosystem. Import this example project on your Orchest Cloud containing the Jupyter notebooks with the code from this article.
Still no Orchest account? Get started for free!