This post is the first part of our series “SQL on Python”, in which we will explore different Python libraries that help you manipulate and query your data using SQL or a SQL-inspired syntax.
SQL (short for Structured Query Language, standardized as ISO/IEC 9075) was originally designed in the 1970s for managing relational databases, but nowadays it is used for analytics workloads as well.
SQL has lots of benefits for analytics: it is declarative, widely known, largely portable across engines, and backed by decades of query-optimization work.
However, if you are used to the Python or R ecosystems (pandas, Polars, data.table, dplyr), you are probably spoiled by how easy it is to download a CSV or Parquet file from somewhere, launch a Python or R process, read it, and start querying and manipulating it.
Comparatively, this bootstrapping process is a bit more tedious with SQL: assuming you have, say, a local PostgreSQL database up and running and a CSV file, you would need to create a table with the appropriate schema, import the data using COPY, and hope that there are no inconsistencies, missing data, or weird date formats. If the file happened to be Parquet, you would need to work a bit more.
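To make this concrete, here is a minimal sketch of that boilerplate using psycopg2; the connection string, file name, and (heavily simplified) schema are hypothetical, not the actual dataset schema:

-- CODE language-python --
import psycopg2

# Hypothetical local database and a hand-written schema for the CSV columns
pg_conn = psycopg2.connect("dbname=analytics")
with pg_conn, pg_conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE comments (
            id TEXT,
            created_utc TIMESTAMP,
            body TEXT,
            sentiment DOUBLE PRECISION
        )
    """)
    # Bulk-load the file and hope the types, separators, and dates all line up
    with open("comments.csv") as f:
        cur.copy_expert(
            "COPY comments FROM STDIN WITH (FORMAT csv, HEADER true)", f
        )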
To make the process a bit more lightweight, you could convert your CSV or Parquet file to SQLite, a widely available, in-process SQL database. However, SQLite was designed with transactional use cases in mind, and therefore it might not scale well for some analytical workloads.
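For example, a quick way to do that conversion for a CSV is to let pandas infer the schema and write the rows into an on-disk SQLite file (the file names below are illustrative):

-- CODE language-python --
import sqlite3
import pandas as pd

# Read the CSV into memory and let pandas infer the column types
df = pd.read_csv("comments.csv")
# Write the rows into a local SQLite database file
with sqlite3.connect("comments.db") as sqlite_conn:
    df.to_sql("comments", sqlite_conn, index=False, if_exists="replace")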
In summary: SQL is appealing, but the boilerplate not so much. What if you could run SQL for your analytics workloads without having to configure a database, just by importing a module in your Python or R process, and make your queries blazing fast? What if, rather than having to choose between Python or SQL, you could use both?
DuckDB is an open source (MIT licensed), high-performance, in-process SQL database for analytics. It is a relatively new project (the first public release was in June 2019), but it has become tremendously popular in a short period of time.
DuckDB can read data from different sources: CSV and Parquet files on disk, as well as in-memory pandas DataFrames and Arrow tables.
Some DuckDB operations have out-of-core capabilities (similar to Vaex or the new streaming mode of Polars), which means that DuckDB can process data larger than RAM!
Finally, DuckDB offers some additions on top of standard SQL that make it very pleasant to use, for example friendlier error messages or, behold, trailing commas!
For this example, we will use a dataset containing all mentions of climate change on Reddit before September 2022, obtained from Kaggle. Our overall goal is to understand the sentiment of these mentions.
We have published an Orchest pipeline that contains all the necessary files so you can run these code snippets on JupyterLab easily: the first step downloads the data using your Kaggle API key, and the second step performs some exploratory analysis.
Click the button below to import the project to your Orchest account:
You can install DuckDB with conda/mamba or pip:
-- CODE language-text --
mamba install -y "python-duckdb=0.5.1"
# Or, alternatively, with pip
# pip install "duckdb==0.5.1"
The first step to start using DuckDB is creating a connection object. This mimics the Python Database API 2.0, also implemented by other projects like the standard library's sqlite3 module and psycopg2:
-- CODE language-python --
import duckdb
conn = duckdb.connect()
By default, duckdb.connect() will return a connection to an in-memory database, which will be perfectly fine for reading data from external files. In fact, you can run a SQL query directly on the CSV file straight away!
-- CODE language-python --
In [3]: conn.execute("""
...: SELECT COUNT(*)
...: FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv'
...: """).fetchall()
Out[3]: [(4600698,)]
As you can see, the comments CSV file contains 4.6+ million rows. This took about 50 seconds on an Orchest instance though, which is not very impressive for just a COUNT(*) operation. What about converting the CSV to Parquet, as we did in our blog post about Arrow? This time, we can use DuckDB for that:
-- CODE language-python --
import os

# `!ls` is IPython syntax: it runs the shell command and captures its output as a list
csv_files = !ls /data/reddit-climate/*.csv
for filename in csv_files:
    print(f"Reading {filename}...")
    destination_file = os.path.splitext(filename)[0] + ".parquet"
    if os.path.isfile(destination_file):
        continue
    conn.execute(f"""
        COPY (SELECT * FROM '{filename}')
        TO '{destination_file}' (FORMAT 'parquet')
    """)
And now, let’s repeat the query on the Parquet file:
-- CODE language-python --
In [8]: %%timeit
...: conn.execute("""
...: SELECT COUNT(*)
...: FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'
...: """).fetchall()
234 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Approximately a 200x speedup over the same operation using CSV! That is a better baseline for running the rest of the queries.
Since you will be referring to the same file several times, it’s a good moment to create a view for it. This will allow you to query the Parquet file without copying all the data to memory:
-- CODE language-python --
conn.execute("""
CREATE VIEW comments AS
SELECT * FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'
""")
Next, let’s find out which subreddits had the largest number of comments about climate change:
-- CODE language-python --
In [11]: conn.query("""
...: SELECT
...: "subreddit.name" AS subreddit_name,
...: COUNT(*) AS num_comments,
...: FROM comments
...: GROUP BY subreddit_name
...: ORDER BY num_comments DESC
...: LIMIT 10
...: """).fetchall()
Out[11]:
[('politics', 370018),
('worldnews', 351195),
('askreddit', 259848),
('collapse', 94696),
('news', 94558),
('futurology', 89945),
('science', 71453),
('environment', 70444),
('canada', 66813),
('australia', 60239)]
Unsurprisingly, /r/politics, /r/worldnews, and /r/collapse were among the subreddits with the largest number of comments about climate change.
What about the overall sentiment of those comments?
-- CODE language-python --
In [12]: conn.query("""
...: SELECT
...: AVG(sentiment) AS average_sentiment,
...: STDDEV(sentiment) AS stddev_sentiment,
...: FROM comments
...: """).fetchall()
Out[12]: [(-0.005827451977706203, 0.6581439484369691)]
In [13]: conn.query("""
...: SELECT
...: "subreddit.name" AS subreddit_name,
...: COUNT(*) AS num_comments,
...: AVG(sentiment) AS average_sentiment,
...: STDDEV(sentiment) AS stddev_sentiment,
...: FROM comments
...: WHERE subreddit_name IN (
...: SELECT "subreddit.name" AS subreddit_name
...: FROM comments
...: GROUP BY subreddit_name
...: ORDER BY COUNT(*) DESC
...: LIMIT 10
...: )
...: GROUP BY subreddit_name
...: ORDER BY num_comments DESC
...: """).fetchall()
Out[13]:
[('politics', 370018, -0.018118589649651674, 0.6600297061408),
('worldnews', 351195, -0.058001587387908435, 0.6405990095462681),
('askreddit', 259848, -0.068637218639235, 0.6089748718101456),
('collapse', 94696, -0.1332661626390419, 0.6667106776062662),
('news', 94558, -0.09367126059175682, 0.6276134461239258),
('futurology', 89945, 0.0018637489115630797, 0.6506820198836241),
('science', 71453, 0.04588216852922973, 0.6248484283076333),
('environment', 70444, -0.015670189810189843, 0.6467846578160414),
('canada', 66813, 0.021118244331091468, 0.6408319443539487),
('australia', 60239, -0.021869519296548085, 0.6405803819103508)]
While the overall sentiment is slightly negative (with a large standard deviation), some subreddits like /r/askreddit and /r/collapse exhibited a sentiment more negative than average. Others like /r/science and /r/canada were slightly positive.
Most interestingly, all these queries ran in about 2 seconds!
DuckDB also has integration with Jupyter through the ipython-sql extension and the DuckDB SQLAlchemy driver, which allows you to query your data using an even more compact syntax:
-- CODE language-python --
In [1]: %load_ext sql
In [2]: %sql duckdb:///:memory:
In [3]: %%sql
...: CREATE VIEW comments AS
...: SELECT * FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'
...:
...:
* duckdb:///:memory:
Done.
Out[3]: []
In [4]: %sql SELECT COUNT(*) FROM comments
* duckdb:///:memory:
Done.
Out[4]: [(4600698,)]
Did you notice how we were using conn.execute() all the time? As we said above, this method follows the widely used Python DBAPI 2.0. However, DuckDB can return richer objects by using conn.query() instead:
-- CODE language-python --
rel = conn.query("""
SELECT
"subreddit.name" AS subreddit_name,
COUNT(*) AS num_comments,
FROM comments
GROUP BY subreddit_name
ORDER BY num_comments DESC
LIMIT 10
""")
This method returns an instance of DuckDBPyRelation, which can be pretty-printed in Jupyter:
-- CODE language-python --
In [5]: type(rel)
Out[5]: duckdb.DuckDBPyRelation
In [6]: rel
Out[6]:
---------------------
--- Relation Tree ---
---------------------
Subquery
---------------------
-- Result Columns --
---------------------
- subreddit_name (VARCHAR)
- num_comments (BIGINT)
---------------------
-- Result Preview --
---------------------
subreddit_name num_comments
VARCHAR BIGINT
[ Rows: 10]
politics 370018
worldnews 351195
askreddit 259848
collapse 94696
news 94558
futurology 89945
science 71453
environment 70444
canada 66813
australia 60239
Moreover, you can efficiently retrieve the data from this relation and convert it to several Python objects: for example, rel.df() materializes it as a pandas DataFrame and rel.arrow() as an Arrow table.
Therefore, you can easily convert query results to a pandas DataFrame, and also to a Polars one (since you can pass an Arrow table directly):
-- CODE language-python --
In [6]: rel.df() # pandas
Out[6]:
subreddit_name num_comments
0 politics 370018
1 worldnews 351195
2 askreddit 259848
3 collapse 94696
4 news 94558
5 futurology 89945
6 science 71453
7 environment 70444
8 canada 66813
9 australia 60239
In [7]: import polars as pl
In [8]: data = rel.arrow() # Arrow data
In [9]: pl.DataFrame(data) # Polars
Out[9]: shape: (10, 2)
┌────────────────┬──────────────┐
│ subreddit_name ┆ num_comments │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════════════╪══════════════╡
│ politics ┆ 370018 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ worldnews ┆ 351195 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ askreddit ┆ 259848 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ collapse ┆ 94696 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ science ┆ 71453 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ environment ┆ 70444 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ canada ┆ 66813 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ australia ┆ 60239 │
└────────────────┴──────────────┘
Note: result objects returned by conn.execute() also have these methods, but they consume the data once called, so they are not as convenient.
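To see the difference, here is a quick sketch reusing the comments view from above: the relation returned by conn.query() can be materialized as many times as you like, while an execute() result hands over its data only once.

-- CODE language-python --
# The relation is lazily evaluated, so it can be materialized repeatedly
count_rel = conn.query("SELECT COUNT(*) FROM comments")
count_rel.df()  # a one-row DataFrame with the count
count_rel.df()  # works again: the relation is simply re-evaluated

# The execute() result behaves like a DBAPI cursor instead
count_res = conn.execute("SELECT COUNT(*) FROM comments")
count_res.fetchall()  # [(4600698,)]; a second fetch would not return the rows again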
Conversely, you can transfer data from pandas or Arrow to DuckDB. Or, more precisely: you can use DuckDB to query pandas or Arrow objects that live in memory! Moreover, DuckDB can pick up local variables by name without any extra registration:
-- CODE language-python --
In [13]: df_most_comments = rel.df()
In [14]: df_most_comments.head() # pandas
Out[14]:
subreddit_name num_comments
0 politics 370018
1 worldnews 351195
2 askreddit 259848
3 collapse 94696
4 news 94558
In [15]: conn.execute("""
...: SELECT subreddit_name
...: FROM df_most_comments -- Sorcery!
...: LIMIT 5
...: """).fetchall()
Out[15]: [('politics',), ('worldnews',), ('askreddit',), ('collapse',), ('news',)]
You can also manually register a compatible object with a given name:
-- CODE language-python --
In [17]: conn.register("most_comments_arrow", data)
Out[17]: <duckdb.DuckDBPyConnection at 0x7f9be41434f0>
In [18]: conn.execute("""
...: SELECT subreddit_name
...: FROM most_comments_arrow
...: LIMIT 5
...: """).fetchall()
Out[18]: [('politics',), ('worldnews',), ('askreddit',), ('collapse',), ('news',)]
Or, using the %sql magic as before:
-- CODE language-python --
In [21]: %sql output << SELECT subreddit_name FROM df_most_comments LIMIT 5
* duckdb:///:memory:
Done.
Returning data to local variable output
In [22]: output.DataFrame() # pandas
Out[22]:
subreddit_name
0 politics
1 worldnews
2 askreddit
3 collapse
4 news
In other words: you can transparently go back and forth between DuckDB and your favourite Python dataframe library. Cool!
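As a quick illustration, you can take the result into Polars, filter it there, and hand it straight back to DuckDB through Arrow (the top_subreddits name is just a label for this example):

-- CODE language-python --
import polars as pl

# DuckDB relation -> Arrow -> Polars
top = pl.DataFrame(rel.arrow())
# Keep only the subreddits with more than 100k climate-change comments
busiest = top.filter(pl.col("num_comments") > 100_000)
# Polars -> Arrow -> DuckDB, queryable under a registered name
conn.register("top_subreddits", busiest.to_arrow())
conn.execute("SELECT subreddit_name FROM top_subreddits").fetchall()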
Without going into too much detail, there are a few extra interesting things about DuckDB you should check out. For instance, relations expose a fluent, dataframe-like relational API that you can chain instead of writing SQL strings:
-- CODE language-python --
In [27]: rel.filter("num_comments > 100000").order("subreddit_name").df()
Out[27]:
subreddit_name num_comments
0 askreddit 259848
1 politics 370018
2 worldnews 351195
The documentation for this relational API is still a work in progress, but the DuckDB team will likely keep expanding it. DuckDB also ships a standalone command-line client, which lets you run the same kind of queries directly from your terminal:
-- CODE language-text --
$ ./duckdb -c '
> SELECT "subreddit.name" AS subreddit_name,
> COUNT(*) AS num_comments
> FROM "/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet"
> GROUP BY subreddit_name
> ORDER BY num_comments DESC
> LIMIT 10
> '
┌────────────────┬──────────────┐
│ subreddit_name │ num_comments │
├────────────────┼──────────────┤
│ politics │ 370018 │
│ worldnews │ 351195 │
│ askreddit │ 259848 │
│ collapse │ 94696 │
│ news │ 94558 │
│ futurology │ 89945 │
│ science │ 71453 │
│ environment │ 70444 │
│ canada │ 66813 │
│ australia │ 60239 │
└────────────────┴──────────────┘
Although DuckDB is a wonderful piece of technology, “there is no silver bullet”, and there are cases in which you might want to use something else. The project homepage itself hints at some of these cases.
Other than that, as you saw above, if you are looking for a lightweight and fast solution for in-process analytics and you want to leverage both your general-purpose language of choice (Python, R, and others) and SQL, DuckDB might be exactly what you want.
In upcoming articles of this series we will describe some more alternatives you might find interesting. Stay tuned!
Thanks to Alex Monahan and Elliana May for reviewing early drafts of this blog post. All remaining errors are my own.
DuckDB is an excellent tool to use when building data pipelines. Orchest is a pipeline orchestrator with first-class support for the full Python open source ecosystem. Import this example project, which contains the Jupyter notebooks with the code from this article, into your Orchest Cloud account.
Still no Orchest account? Get started for free!