This is the second part of our series "The great Python dataframe showdown", where we set out to uncover powerful Python libraries for working with dataframes other than pandas: less widespread, but interesting and useful.
Vaex is an open-source Python library that provides lazy, out-of-core dataframes "to visualize and explore big tabular datasets". It started as a GUI tool for visualizing the massive Gaia star catalog and later evolved into a powerful data manipulation library. But what do "lazy" and "out-of-core" mean in this context?
Vaex is not the only lazy, out-of-core dataframe library: other projects like Dask also apply these concepts, although in a very different way. We will cover Dask in a future post!
(If you're unsure about the pronunciation, the maintainers call it /vəks/, that is, with a short, neutral vowel sound.)
For this example we will use a dataset of Airline Delay and Cancellation Data from 2009 to 2018 obtained from Kaggle, which itself originates from the U.S. Department of Transportation.
We have published an Orchest pipeline that contains all the necessary files so you can run these code snippets yourself easily. You will only need a Kaggle API key to download the data: the first two steps of the pipeline will download it using the Kaggle CLI and then convert it to Apache Arrow (if you want to know more, check out the first part of this series!).
Click the button below to import the project to your Orchest account:
You can install Vaex with conda/mamba or pip:
-- CODE language-text --
mamba install -y "vaex=4.9"
# Or, alternatively
pip install "vaex==4.9"
vaex is actually a metapackage, so you might want to pick exactly the parts you are interested in. For this tutorial, you will only need vaex-core and vaex-viz:
-- CODE language-text --
mamba install -y "vaex-core=4.9" vaex-viz
# Or, alternatively
pip install "vaex-core==4.9" vaex-viz
You can find more detailed installation instructions in the official documentation.
Let's first load the most recent .parquet file of our dataset:
-- CODE language-python --
import vaex
df_2018 = vaex.open("/data/airline-delays/2018.parquet")
The type of the returned object is vaex.dataframe.DataFrameLocal, which is the "Base class for DataFrames that work with local file/data" (docs). You will notice that the open call finishes almost immediately, even though the dataframe has more than 7 million rows:
-- CODE language-python --
In [2]: len(df_2018)
Out[2]: 7213446
The reason is that, with binary file formats, Vaex uses memory-mapping, which allows you to manipulate datasets that are larger than RAM (similar to what PyArrow does, as covered in the first part of the series).
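To make memory-mapping more concrete, here is a minimal sketch using NumPy's np.memmap. This illustrates the operating-system mechanism that memory-mapping relies on, not Vaex's internal code, and the file name is made up for the example:

```python
import numpy as np

# Write a sizeable array to disk once (a stand-in for a binary data file)
data = np.arange(1_000_000, dtype=np.float64)
data.tofile("distances.bin")  # hypothetical file name

# Memory-map it: opening is near-instant because the OS only pages
# data in when it is actually accessed
mm = np.memmap("distances.bin", dtype=np.float64, mode="r")

# Operations stream through the file instead of loading it all upfront
total = mm.sum()
```

Because pages are loaded on demand and can be evicted by the OS, the working set can be much smaller than the file itself, which is what makes larger-than-RAM datasets tractable.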
Vaex supports several binary file formats (Feather, Parquet, HDF5, and domain-specific formats like FITS) as well as text-based formats (CSV, JSON, ASCII). However, the latter cannot be memory-mapped, so you need to be a bit more careful when using them.
Now, back to our dataframe: to display more information about it, you can call the .info() method, which will show:
In addition, .describe() will work like its pandas counterpart and display a summary of statistical information of every column:
As you can see, Vaex, like pandas, leverages the interactive capabilities of Jupyter to display this information in a visually appealing way.
As in pandas, you can access a specific column of a Vaex dataframe using square-bracket indexing:
-- CODE language-python --
In [10]: df_2018["OP_CARRIER"] # Or df_2018.col.OP_CARRIER
Out[10]: Expression = OP_CARRIER
Length: 7,213,446 dtype: string (column)
----------------------------------------
0 UA
1 UA
2 UA
3 UA
4 UA
...
7213441 AA
7213442 AA
7213443 AA
7213444 AA
7213445 AA
But unlike in pandas, this returns an Expression: a representation of a computation step that has not been evaluated yet. Expressions are one of Vaex's most powerful features, since they allow you to chain several operations one after another lazily, without actually computing them:
-- CODE language-python --
# Normal subtraction of two columns
df_2018["ACTUAL_ELAPSED_TIME"] - df_2018["CRS_ELAPSED_TIME"]
# Creation of an equivalent expression by using indexing
df_2018["ACTUAL_ELAPSED_TIME - CRS_ELAPSED_TIME"]
# NumPy functions also create expressions
import numpy as np
np.sqrt(df_2018["DISTANCE"])
# Or they can be used inside the indexing itself!
df_2018["sqrt(DISTANCE)"]
There are two ways to fully evaluate an expression: aggregating it (with methods like .sum() or .mean()) or materializing its values (with .values or .to_numpy()):
-- CODE language-python --
In [12]: df_2018["CANCELLED"].sum()
Out[12]: array(116584.)
For more complex expressions and filters, Vaex supports displaying interactive progress bars:
-- CODE language-python --
In [15]: # We exclude "CANCELLED" flights and ones that arrived on time
...: # to compute the mean delay time
...: delayed = df_2018[(df_2018["CANCELLED"] == 0.0) & (df_2018["DEP_DELAY"] > 0)]
...: delayed["DEP_DELAY"].mean(progress="rich") # Fancy progress bar!
mean ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.05s
└── vaex.agg.mean('DEP_DELAY') ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.05s
├── vaex.agg.sum('DEP_DELAY') ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.04s[1]
└── vaex.agg.count('DEP_DELAY') ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.04s[1]
Out[15]: array(38.18255826)
Now that you know how it's done, let's read all the .parquet files at once and perform the same computation:
-- CODE language-python --
In [16]: df = vaex.open("/data/airline-delays/*.parquet")
In [17]: len(df)
Out[17]: 61556964
In [18]: delayed = df[(df["CANCELLED"] == 0.0) & (df["DEP_DELAY"] > 0)]
In [19]: delayed["DEP_DELAY"].mean(progress="rich")
mean ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s
└── vaex.agg.mean('DEP_DELAY') ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s
├── vaex.agg.sum('DEP_DELAY') ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s[1]
└── vaex.agg.count('DEP_DELAY') ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s[1]
Out[19]: array(32.15845342)
Vaex crunches more than 60 million rows in less than 10 seconds 🔥
Finally, Vaex has some built-in visualization capabilities that intelligently wrap its aggregation and binning functions to offer a higher-level interface:
One of the interesting differences between Vaex and pandas is that the concept of an index does not exist. This means that, when converting data from pandas using vaex.from_pandas, you will need to decide whether to include the index as a normal column (passing copy_index=True) or discard it entirely (copy_index=False, the default behavior).
On the other hand, Vaex does not have an equivalent of the .loc accessor, so to filter by rows and columns you will have to chain several indexing operations. This is not a problem though, because Vaex does not copy the data.
If you compare Vaex and pandas running times in an interactive setting (for example, on JupyterLab), you might observe that sometimes Vaex is slightly slower. However, if you perform proper microbenchmarking, you will notice that Vaex is actually faster than pandas, even for data that fits in RAM:
-- CODE language-python --
In [20]: %%timeit -n1 -r1
...: df = vaex.open("/data/airline-delays/2018.parquet")
...: delayed = df[(df["CANCELLED"] == 0.0) & (df["DEP_DELAY"] > 0)]
...: mean_delayed = delayed["DEP_DELAY"].mean()
...:
1.31 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [21]: %%timeit -n1 -r1
...: df_pandas = pd.read_parquet("/data/airline-delays/2018.parquet")
...: delayed_pandas = df_pandas.loc[(df_pandas["CANCELLED"] == 0.0) & (df_pandas["DEP_DELAY"] > 0)]
...: mean_delayed_pandas = delayed_pandas["DEP_DELAY"].mean()
...:
3.81 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
And this is without taking into account that reading all the data with pandas in a RAM-constrained environment is simply impossible:
-- CODE language-python --
# Equivalent of
# df = vaex.open("/data/airline-delays/*.parquet")
# Jupyter shortcut to read all files
fnames = !ls /data/airline-delays/*.parquet
# Loads everything in memory,
# and therefore it might blow up!
df = pd.concat([
pd.read_parquet(fname) for fname in fnames
])
Moral of the story: benchmarking is difficult!
In addition to what we saw in this short introduction, Vaex has powerful visualization capabilities, as well as some features that are not found in other libraries that can prove useful if you are working with scientific data, like propagation of uncertainties, just-in-time compilation of math-heavy expressions, and much more.
If you already have your data in a supported binary format on disk, or it's in a text-based format and you have enough space to convert it, Vaex is an excellent tool for processing it efficiently. As you have seen, Vaex mimics the pandas API, but it is not based on pandas and deviates from it in some small places. On the other hand, it is a younger library with a smaller user base, and some parts of the documentation might not be as complete; fortunately, the maintainers compensate by responding quickly to issues and feedback.
In summary:
In upcoming articles of this series we will describe some more alternatives you might find interesting. Stay tuned!
Vaex is an excellent tool to use when building data pipelines. Orchest is a pipeline orchestrator with first-class support for the full Python open source ecosystem. Import this example project on your Orchest Cloud containing the Jupyter notebooks with the code from this article.
Still no Orchest account? Get started for free!