SQL on Python, part 1: The simplicity of DuckDB

This post is the first part of our series “SQL on Python”, in which we will explore different Python libraries that help you manipulate and query your data using SQL or a SQL-inspired syntax.

  • Part 1 (this one): The simplicity of DuckDB
  • (Stay tuned for more!)

Why SQL, after all?

SQL (short for Structured Query Language, also known as ISO/IEC 9075-1:2016) was originally designed in the 1970s for managing relational databases, but nowadays it is widely used for analytics workloads as well.

SQL has lots of benefits for analytics, to name a few:

  • It’s easy to pick up: SQL is a domain-specific language, rather than a general-purpose language, and as such it has more limited scope and fewer syntax elements to learn.
  • It’s everywhere: SQL is a family of query languages available in many systems, and all of them share some core common characteristics. Once you learn one SQL dialect (PostgreSQL, SQL Server, Google Standard SQL, among others), your skills transfer easily to the rest.
  • It’s fast: SQL is a statically typed language, which allows query planning systems to perform sophisticated optimizations. This, along with the decades of accumulated knowledge about relational databases, allows SQL implementations to achieve performance that is difficult to beat.

However, if you are used to the Python or R ecosystems (pandas, Polars, data.table, dplyr), you are probably spoiled by how easy it is to download a CSV or Parquet file from somewhere, launch a Python or R process, read it, and start querying and manipulating it.
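For instance, the typical pandas bootstrapping process fits in a couple of lines. The sketch below writes a tiny stand-in CSV first so it is self-contained; the file name and data are hypothetical:

```python
import pandas as pd

# A tiny stand-in for a downloaded file (hypothetical data)
with open("comments.csv", "w") as f:
    f.write("subreddit_name,sentiment\npolitics,-0.1\npolitics,0.2\nscience,0.5\n")

# Read it and start querying right away
df = pd.read_csv("comments.csv")
counts = df.groupby("subreddit_name").size().sort_values(ascending=False)
```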

Comparatively, this bootstrapping process is a bit more tedious with SQL: assuming you have, say, a local PostgreSQL database up and running and a CSV file, you would need to create a table with the appropriate schema, import the data using COPY, and hope that there are no inconsistencies, missing data, or weird date formats. If the file happened to be Parquet, you would need to work a bit more.

To try to make the process a bit more lightweight, you could try to convert your CSV or Parquet to SQLite, a widely available, in-process SQL database. However, SQLite was designed with transactional use cases in mind, and therefore might not scale well with some analytical workloads.
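As a rough sketch of that conversion (table and file names hypothetical), you can push a dataframe into SQLite with pandas and the standard library sqlite3 module:

```python
import sqlite3

import pandas as pd

# Hypothetical data standing in for a downloaded CSV
df = pd.DataFrame({"subreddit_name": ["politics", "science"], "sentiment": [-0.1, 0.5]})

# Dump the dataframe into a local SQLite database...
conn = sqlite3.connect("analytics.db")
df.to_sql("comments", conn, if_exists="replace", index=False)

# ...and query it with plain SQL
rows = conn.execute("SELECT COUNT(*) FROM comments").fetchall()
```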

In summary: SQL is appealing, but the boilerplate not so much. What if you could run SQL for your analytics workloads without having to configure a database, just by importing a module in your Python or R process, and make your queries blazing fast? What if, rather than having to choose between Python or SQL, you could use both?

Enter DuckDB

DuckDB is an open source (MIT) high-performance, in-process SQL database for analytics. It is a relatively new project (the first public release was in June 2019), but it has become tremendously popular in a short period of time.

DuckDB popularity is growing (we like this image so much)

DuckDB can read data from different sources:

  • From CSV or Parquet files
  • From pandas DataFrame or Arrow Table objects in the process memory
  • From PostgreSQL tables (by reading the binary data directly!)

Some of the DuckDB operations have out-of-core capabilities (similar to Vaex or the new streaming mode of Polars), which means that it can read data that is larger than RAM!

Finally, DuckDB offers some additions on top of standard SQL that make it very pleasant to use, for example friendlier error messages or, behold, trailing commas!

Trying out DuckDB in Orchest

For this example, we will use a dataset containing all mentions of climate change on Reddit before September 2022, obtained from Kaggle. Our overall goal is to understand the sentiment of these mentions.

We have published an Orchest pipeline that contains all the necessary files so you can run these code snippets on JupyterLab easily: the first step downloads the data using your Kaggle API key, and the second step performs some exploratory analysis.

Click the button below to import the project to your Orchest account:

First steps with DuckDB

You can install DuckDB with conda/mamba or pip:

-- CODE language-text --
mamba install -y "python-duckdb=0.5.1"
# Or, alternatively, with pip
# pip install "duckdb==0.5.1"

The first step to start using DuckDB is creating a connection object. This mimics the Python Database API 2.0, which is also implemented by other projects like the standard library sqlite3 module and psycopg2:

-- CODE language-python --
import duckdb
conn = duckdb.connect()

By default, duckdb.connect() will return a connection to an in-memory database, which will be perfectly fine for reading data from external files. In fact, you can run a SQL query directly on the CSV file straight away!

-- CODE language-python --
In [3]: conn.execute("""
  ...: SELECT COUNT(*)
  ...: FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv'
  ...: """).fetchall()
Out[3]: [(4600698,)]

As you can see, the comments CSV file contains 4.6+ million rows. This took about 50 seconds on an Orchest instance though, which is not very impressive for just a COUNT(*) operation. What about converting the CSV to Parquet, as we did in our blog post about Arrow? This time, we can use DuckDB for that:

-- CODE language-python --
import os

csv_files = !ls /data/reddit-climate/*.csv

for filename in csv_files:
    print(f"Reading {filename}...")
    destination_file = os.path.splitext(filename)[0] + ".parquet"
    if os.path.isfile(destination_file):
        continue  # Skip files that were already converted
    conn.execute(f"""
    COPY (SELECT * FROM '{filename}')
    TO '{destination_file}' (FORMAT 'parquet')
    """)

And now, let’s repeat the query on the Parquet file:

-- CODE language-python --
In [8]: %%timeit
  ...: conn.execute("""
  ...: SELECT COUNT(*)
  ...: FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'
  ...: """).fetchall()
234 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Approximately a 200x speedup over the same operation using CSV! That is a better baseline for running the rest of the queries.

Querying Parquet files with DuckDB

Since you will be referring to the same file several times, it’s a good moment to create a view for it. This will allow you to query the Parquet file without copying all the data to memory:

-- CODE language-python --
conn.execute("""
CREATE VIEW comments AS
SELECT * FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'
""")

Next, let’s find out which subreddits had the most number of comments about climate change:

-- CODE language-python --
In [11]: conn.query("""
   ...: SELECT
   ...:   "subreddit.name" AS subreddit_name,
   ...:   COUNT(*) AS num_comments,
   ...: FROM comments
   ...: GROUP BY subreddit_name
   ...: ORDER BY num_comments DESC
   ...: LIMIT 10
   ...: """).fetchall()
[('politics', 370018),
('worldnews', 351195),
('askreddit', 259848),
('collapse', 94696),
('news', 94558),
('futurology', 89945),
('science', 71453),
('environment', 70444),
('canada', 66813),
('australia', 60239)]

Unsurprisingly, /r/politics, /r/worldnews, and /r/collapse were among the subreddits with the largest number of comments about climate change.

What about the overall sentiment of those comments?

-- CODE language-python --
In [12]: conn.query("""
   ...: SELECT
   ...:   AVG(sentiment) AS average_sentiment,
   ...:   STDDEV(sentiment) AS stddev_sentiment,
   ...: FROM comments
   ...: """).fetchall()
Out[12]: [(-0.005827451977706203, 0.6581439484369691)]

In [13]: conn.query("""
   ...: SELECT
   ...:   "subreddit.name" AS subreddit_name,
   ...:   COUNT(*) AS num_comments,
   ...:   AVG(sentiment) AS average_sentiment,
   ...:   STDDEV(sentiment) AS stddev_sentiment,
   ...: FROM comments
   ...: WHERE subreddit_name IN (
   ...:   SELECT "subreddit.name" AS subreddit_name
   ...:   FROM comments
   ...:   GROUP BY subreddit_name
   ...:   ORDER BY COUNT(*) DESC
   ...:   LIMIT 10
   ...: )
   ...: GROUP BY subreddit_name
   ...: ORDER BY num_comments DESC
   ...: """).fetchall()
[('politics', 370018, -0.018118589649651674, 0.6600297061408),
('worldnews', 351195, -0.058001587387908435, 0.6405990095462681),
('askreddit', 259848, -0.068637218639235, 0.6089748718101456),
('collapse', 94696, -0.1332661626390419, 0.6667106776062662),
('news', 94558, -0.09367126059175682, 0.6276134461239258),
('futurology', 89945, 0.0018637489115630797, 0.6506820198836241),
('science', 71453, 0.04588216852922973, 0.6248484283076333),
('environment', 70444, -0.015670189810189843, 0.6467846578160414),
('canada', 66813, 0.021118244331091468, 0.6408319443539487),
('australia', 60239, -0.021869519296548085, 0.6405803819103508)]

While the overall sentiment is slightly negative (with a large standard deviation), some subreddits like /r/askreddit and /r/collapse exhibited a sentiment more negative than average. Others like /r/science and /r/canada were slightly positive.

Most interestingly, all these queries ran in about 2 seconds!

DuckDB also has integration with Jupyter through the ipython-sql extension and the DuckDB SQLAlchemy driver, which allows you to query your data using an even more compact syntax:

-- CODE language-python --
In [1]: %load_ext sql
In [2]: %sql duckdb:///:memory:
In [3]: %%sql
  ...: CREATE VIEW comments AS
  ...: SELECT * FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'
* duckdb:///:memory:
Out[3]: []

In [4]: %sql SELECT COUNT(*) FROM comments
* duckdb:///:memory:
Out[4]: [(4600698,)]

Interoperability with Python dataframe libraries

Did you notice how we were using conn.execute() all the time? As we said above, this method follows the widely used Python DBAPI 2.0. However, DuckDB can return richer objects by using conn.query() instead:

-- CODE language-python --
rel = conn.query("""
SELECT
  "subreddit.name" AS subreddit_name,
  COUNT(*) AS num_comments,
FROM comments
GROUP BY subreddit_name
ORDER BY num_comments DESC
LIMIT 10
""")

This method returns an instance of DuckDBPyRelation, which can be pretty-printed in Jupyter:

-- CODE language-python --
In [5]: type(rel)
Out[5]: duckdb.DuckDBPyRelation

In [6]: rel
--- Relation Tree ---

-- Result Columns  --
- subreddit_name (VARCHAR)
- num_comments (BIGINT)

-- Result Preview  --
subreddit_name  num_comments
[ Rows: 10]
politics        370018
worldnews       351195
askreddit       259848
collapse        94696
news            94558
futurology      89945
science         71453
environment     70444
canada  66813
australia       60239

Moreover, you can efficiently retrieve the data from this relation and convert it to several Python objects:

  • A dictionary of masked NumPy arrays using .fetchnumpy()
  • A pandas DataFrame using .df() or its aliases (.fetchdf(), .fetch_df())
  • An Arrow Table using .arrow() or .fetch_arrow_table()
  • An Arrow record batch reader using .fetch_record_batch(chunk_size)

Therefore, you can easily convert query results to a pandas DataFrame, and also a Polars one (since you can pass an Arrow table directly):

-- CODE language-python --
In [6]: rel.df()  # pandas
  subreddit_name  num_comments
0       politics        370018
1      worldnews        351195
2      askreddit        259848
3       collapse         94696
4           news         94558
5     futurology         89945
6        science         71453
7    environment         70444
8         canada         66813
9      australia         60239

In [7]: import polars as pl

In [8]: data = rel.arrow()  # Arrow data

In [9]: pl.DataFrame(data)  # Polars
Out[9]: shape: (10, 2)
│ subreddit_name ┆ num_comments │
│ ---            ┆ ---          │
│ str            ┆ i64          │
│ politics       ┆ 370018       │
│ worldnews      ┆ 351195       │
│ askreddit      ┆ 259848       │
│ collapse       ┆ 94696        │
│ ...            ┆ ...          │
│ science        ┆ 71453        │
│ environment    ┆ 70444        │
│ canada         ┆ 66813        │
│ australia      ┆ 60239        │

Note: Result objects returned by conn.execute() also have these methods, but they consume the data after they are called and therefore are not so convenient.

Conversely, you can transfer data from pandas or Arrow to DuckDB. Or, more precisely: you can use DuckDB to query pandas or Arrow objects that live in memory! Moreover, DuckDB can read local variables without any registration step:

-- CODE language-python --
In [13]: df_most_comments = rel.df()

In [14]: df_most_comments.head()  # pandas
  subreddit_name  num_comments
0       politics        370018
1      worldnews        351195
2      askreddit        259848
3       collapse         94696
4           news         94558

In [15]: conn.execute("""
   ...: SELECT subreddit_name
   ...: FROM df_most_comments  -- Sorcery!
   ...: LIMIT 5
   ...: """).fetchall()
Out[15]: [('politics',), ('worldnews',), ('askreddit',), ('collapse',), ('news',)]

You can also manually register a compatible object with a given name:

-- CODE language-python --
In [17]: conn.register("most_comments_arrow", data)
Out[17]: <duckdb.DuckDBPyConnection at 0x7f9be41434f0>

In [18]: conn.execute("""
   ...: SELECT subreddit_name
   ...: FROM most_comments_arrow
   ...: LIMIT 5
   ...: """).fetchall()
Out[18]: [('politics',), ('worldnews',), ('askreddit',), ('collapse',), ('news',)]

Or, using the %sql magic as before:

-- CODE language-python --
In [21]: %sql output << SELECT subreddit_name FROM df_most_comments LIMIT 5
* duckdb:///:memory:
Returning data to local variable output

In [22]: output.DataFrame()  # pandas
0       politics
1      worldnews
2      askreddit
3       collapse
4           news

In other words: you can transparently go back and forth between DuckDB and your favourite Python dataframe library. Cool!

Other features

Without going into too much detail, there are a few extra interesting things about DuckDB you should check out:

  • The relational API: DuckDBPyRelation objects can be filtered, ordered, and aggregated through Python methods, without writing any SQL:

-- CODE language-python --
In [27]: rel.filter("num_comments > 100000").order("subreddit_name").df()
  subreddit_name  num_comments
0      askreddit        259848
1       politics        370018
2      worldnews        351195

The documentation of this API is still a work in progress, but the DuckDB team will likely expand it in the future!

  • The CLI: DuckDB has a command-line client you can use directly from your terminal, without even launching a Python or Jupyter interpreter:

-- CODE language-text --
$ ./duckdb -c '
>   SELECT "subreddit.name" AS subreddit_name,
>   COUNT(*) AS num_comments
> FROM "/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet"
> GROUP BY subreddit_name
> ORDER BY num_comments DESC
> LIMIT 10
> '
│ subreddit_name │ num_comments │
│ politics       │ 370018       │
│ worldnews      │ 351195       │
│ askreddit      │ 259848       │
│ collapse       │ 94696        │
│ news           │ 94558        │
│ futurology     │ 89945        │
│ science        │ 71453        │
│ environment    │ 70444        │
│ canada         │ 66813        │
│ australia      │ 60239        │

Should you use DuckDB?

Although DuckDB is a wonderful piece of technology, “there is no silver bullet”, and there are cases in which you might want to use something else. The project homepage itself hints at some of them:

  • For transactional workloads, you might want to use SQLite, or a more sophisticated transactional database like PostgreSQL. Remember, DuckDB was created for analytics!
  • When several people are reading or writing the same data, using a warehouse might make more sense.

That said, as you saw above, if you are looking for a lightweight and fast solution for in-process analytics, and you want to leverage both your general-purpose language of choice (Python, R, others) as well as SQL, DuckDB might be exactly what you want.

In upcoming articles of this series we will describe some more alternatives you might find interesting. Stay tuned!

Thanks to Alex Monahan and Elliana May for reviewing early drafts of this blog post. All remaining errors are my own.

DuckDB is an excellent tool to use when building data pipelines. Orchest is a pipeline orchestrator with first class support for the full Python open source ecosystem. Import this example project on your Orchest Cloud containing the Jupyter notebooks with the code from this article.

Still no Orchest account? Get started for free!