The great Python dataframe showdown, part 2: out-of-core processing with Vaex

This is the second part of our series "The great Python dataframe showdown", where we set out to uncover powerful Python libraries for working with dataframes other than pandas that are less widespread but interesting and useful.

Vaex: a new approach to dataframes

Vaex is an open-source Python library that provides lazy, out-of-core dataframes "to visualize and explore big tabular datasets". Initially, it was a GUI tool created to visualize the massive Gaia star catalog, and later on, it evolved to become a powerful data manipulation library. But what do "lazy" and "out-of-core" mean in this context?

  • "Lazy" means that certain operations are not fully executed immediately: instead, their evaluation is delayed until either the result is explicitly requested or an aggregation is performed.
  • "Out-of-core" refers to a set of techniques that allow the user to manipulate data that is larger than the available RAM by reading chunks from the disk. This happens transparently.
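
To make the out-of-core idea concrete, here is a minimal pure-Python sketch (not Vaex's actual implementation) that computes a mean by streaming fixed-size chunks from disk, so only `chunk_size` values are ever in memory at once:

```python
import tempfile

# Write a file with 100,000 numbers, one per line (a stand-in for a big dataset)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    for i in range(100_000):
        f.write(f"{i}\n")
    path = f.name

def chunked_mean(path, chunk_size=10_000):
    """Mean over the whole file while holding at most `chunk_size` values in memory."""
    total, count = 0.0, 0
    with open(path) as f:
        while True:
            # zip() with the file iterator consumes at most `chunk_size` lines
            chunk = [float(line) for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            total += sum(chunk)
            count += len(chunk)
    return total / count

print(chunked_mean(path))  # 49999.5
```

Vaex applies the same principle transparently, with efficient binary formats and parallelism instead of a line-by-line loop.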

Vaex is not the only lazy, out-of-core dataframe library: other projects like Dask also apply these concepts, although in a very different way. We will cover Dask in a future post!

(If you're unsure about the pronunciation: the maintainers say /vəks/, that is, with a short, neutral vowel sound.)

Vaex logo

Trying out Vaex in Orchest

For this example we will use a dataset of Airline Delay and Cancellation Data from 2009 to 2018 obtained from Kaggle, which itself originates from the U.S. Department of Transportation.

We have published an Orchest pipeline that contains all the necessary files so you can run these code snippets yourself easily. You will only need a Kaggle API key to download the data: the first two steps of the pipeline will download it using the Kaggle CLI and then convert it to Apache Arrow (if you want to know more, check out the first part of this series!).

Click the button below to import the project to your Orchest account:

First steps with Vaex

You can install Vaex with conda/mamba or pip:

-- CODE language-text --
mamba install -y "vaex=4.9"
# Or, alternatively
pip install "vaex==4.9"

vaex  is actually a metapackage, so you might want to pick exactly which parts of it you are interested in. For this tutorial, you will only need vaex-core and vaex-viz:

-- CODE language-text --
mamba install -y "vaex-core=4.9" vaex-viz
# Or, alternatively
pip install "vaex-core==4.9" vaex-viz

You can find more detailed installation instructions in the official documentation.

Let's first load the last .parquet file of our dataset by doing

-- CODE language-python --
import vaex
df_2018 = vaex.open("/data/airline-delays/2018.parquet")

The type of the returned object is vaex.dataframe.DataFrameLocal, which is the "Base class for DataFrames that work with local file/data" (docs). You will notice that the open call finishes almost immediately, even though the dataframe has more than 7 million rows:

-- CODE language-python --
In [2]: len(df_2018)
Out[2]: 7213446

The reason is that, with binary file formats, Vaex uses memory-mapping, hence allowing you to manipulate datasets that are larger than RAM (similar to what PyArrow does, as covered in the first part of the series).
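
Memory-mapping itself is easy to demonstrate with NumPy (this is the general mechanism, not Vaex-specific code): the array below is backed by a file on disk, and bytes are only read from it when they are actually accessed:

```python
import os
import tempfile
import numpy as np

# Write a small binary file to map (a stand-in for a large column on disk)
path = os.path.join(tempfile.mkdtemp(), "column.bin")
np.arange(1_000_000, dtype=np.float64).tofile(path)

# "Opening" the file is instant: nothing is read until the data is accessed
col = np.memmap(path, dtype=np.float64, mode="r")

print(len(col))           # 1000000
print(col[:3].tolist())   # [0.0, 1.0, 2.0]
print(float(col.mean()))  # 499999.5
```

The operating system pages data in and out as needed, which is why opening a multi-gigabyte file can still return immediately.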

Vaex supports several binary file formats (Feather, Parquet, and scientific formats like HDF5 and FITS) as well as text-based formats (CSV, JSON, ASCII). However, the latter cannot be memory-mapped, so you need to be a bit more careful when using them:

  • If the data fits in memory, call the function as normal, taking into account that the whole dataset will be loaded.
  • If the data does not fit in memory, pass convert=True and a chunk_size  parameter to open so that Vaex converts the data to a binary format behind the scenes, using extra disk space in the process.

Now, back to our dataframe: to display more information about it, you can call the .info()  method, which will show:

  • Number of rows
  • Name and type information of the columns
  • First five and last five rows of the dataframe
Beginning of `.info()` result

In addition, .describe()  will work like its pandas counterpart and display a summary of statistical information of every column:

Output of `.describe()` result

As you can see, Vaex, like pandas, leverages the interactive capabilities of Jupyter to display the information in a visually appealing way.

The power of expressions

Like in pandas, you can select a specific column of a Vaex dataframe using square-bracket indexing:

-- CODE language-python --
In [10]: df_2018["OP_CARRIER"]  # Or df_2018.col.OP_CARRIER
Out[10]: Expression = OP_CARRIER
Length: 7,213,446 dtype: string (column)
     0  UA
     1  UA
     2  UA
     3  UA
     4  UA
7213441  AA
7213442  AA
7213443  AA
7213444  AA
7213445  AA

But unlike pandas, this returns an Expression: a representation of a computation step that has not been evaluated yet. Expressions are one of the power features of Vaex, since they allow you to chain several operations one after another in a lazy way, hence without actually computing them:

-- CODE language-python --
# Normal subtraction of two columns
df_2018["ARR_DELAY"] - df_2018["DEP_DELAY"]

# Creation of an equivalent expression by using indexing
df_2018["ARR_DELAY - DEP_DELAY"]

# NumPy functions also create expressions
import numpy as np
np.abs(df_2018["ARR_DELAY"] - df_2018["DEP_DELAY"])

# Or they can be used inside the indexing itself!
df_2018["abs(ARR_DELAY - DEP_DELAY)"]

There are two ways to fully evaluate an expression:

  • Requesting an in-memory representation, by retrieving the .values  property (which will return a NumPy array) or calling .to_pandas_series() . Make sure the data fits in RAM!
  • Performing an aggregation (like .mean() , .max() , and so forth), like this:

-- CODE language-python --
In [12]: df_2018["CANCELLED"].sum()
Out[12]: array(116584.)

For more complex expressions and filters, Vaex supports displaying interactive progress bars:

-- CODE language-python --
In [15]: # We exclude "CANCELLED" flights and ones that arrived on time
   ...: # to compute the mean delay time
   ...: delayed = df_2018[(df_2018["CANCELLED"] == 0.0) & (df_2018["DEP_DELAY"] > 0)]
   ...: delayed["DEP_DELAY"].mean(progress="rich")  # Fancy progress bar!
mean                                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.05s
└── vaex.agg.mean('DEP_DELAY')         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.05s
    ├── vaex.agg.sum('DEP_DELAY')      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.04s[1]
    └── vaex.agg.count('DEP_DELAY')    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.04s[1]
Out[15]: array(38.18255826)

Now that you know how it's done, let's read all the .parquet  files at once and perform the same computation:

-- CODE language-python --
In [16]: df = vaex.open("/data/airline-delays/*.parquet")

In [17]: len(df)
Out[17]: 61556964

In [18]: delayed = df[(df["CANCELLED"] == 0.0) & (df["DEP_DELAY"] > 0)]

In [19]: delayed["DEP_DELAY"].mean(progress="rich")
mean                                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s
└── vaex.agg.mean('DEP_DELAY')         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s
    ├── vaex.agg.sum('DEP_DELAY')      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s[1]
    └── vaex.agg.count('DEP_DELAY')    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s[1]
Out[19]: array(32.15845342)

Vaex crunches more than 60 million rows in less than 10 seconds 🔥

Finally, Vaex has some built-in visualization capabilities that intelligently wrap its aggregation and binning functions to offer a higher-level interface:

Fast histogram of 60+ million rows
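
Under the hood, a histogram like the one above is just a binned count aggregation. A rough NumPy sketch of that same operation (on synthetic delays, since this is not the Vaex call itself):

```python
import numpy as np

# Synthetic departure delays in minutes; the real plot runs on the Vaex column
rng = np.random.default_rng(42)
delays = rng.exponential(scale=30.0, size=100_000)

# A histogram is just a binned count aggregation over the data
counts, edges = np.histogram(delays, bins=20, range=(0, 200))

print(len(counts), edges[0], edges[-1])  # 20 0.0 200.0
```

Because the binned counts can be computed chunk by chunk, Vaex can draw such plots over out-of-core data without ever loading it all.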

Some differences with pandas

One of the interesting differences between Vaex and pandas is that the concept of the index does not exist. This means that, when converting data from pandas using vaex.from_pandas, you will need to decide whether to include the index as a normal column (passing copy_index=True) or discard it entirely (copy_index=False, the default behavior).
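
On the pandas side, copy_index=True corresponds roughly to materializing the index as an ordinary column before handing the data over, which you can replicate with reset_index (a sketch of the idea, not the Vaex conversion itself; the "carrier" name is illustrative):

```python
import pandas as pd

# A small frame with a meaningful index
df_pandas = pd.DataFrame({"delay": [5, 12, 0]}, index=["AA", "UA", "DL"])

# Like copy_index=True: the index is materialized as a regular column
with_index = df_pandas.reset_index().rename(columns={"index": "carrier"})

# Like copy_index=False (the default): the index is simply dropped
without_index = df_pandas.reset_index(drop=True)

print(list(with_index.columns))     # ['carrier', 'delay']
print(list(without_index.columns))  # ['delay']
```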

On the other hand, Vaex does not have an equivalent of the .loc accessor, so to filter by rows and columns you will have to chain several indexing operations. This is not a problem though, because Vaex does not copy the data.
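
In pandas terms, a single .loc row-and-column selection is equivalent to chaining a boolean filter and then a column selection, which is the pattern Vaex expects (pandas is used here only to illustrate the equivalence):

```python
import pandas as pd

df_pandas = pd.DataFrame(
    {"CANCELLED": [0.0, 1.0, 0.0], "DEP_DELAY": [15.0, -2.0, 30.0]}
)

# pandas style: one .loc call selects rows and a column at once
loc_result = df_pandas.loc[df_pandas["CANCELLED"] == 0.0, "DEP_DELAY"]

# Vaex style: chain a row filter, then index the column
chained_result = df_pandas[df_pandas["CANCELLED"] == 0.0]["DEP_DELAY"]

print(loc_result.equals(chained_result))  # True
```

In pandas, chained indexing can trigger copies (and the infamous SettingWithCopyWarning); in Vaex it is the idiomatic style precisely because filters are lazy views, not copies.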

A note on performance

If you compare Vaex and pandas running times in an interactive setting (for example, in JupyterLab), you might observe that sometimes Vaex is slightly slower. However, if you perform proper microbenchmarking, you will notice that Vaex is actually faster than pandas, even for data that fits in RAM:

-- CODE language-python --
In [20]: %%timeit -n1 -r1
   ...: df = vaex.open("/data/airline-delays/2018.parquet")
   ...: delayed = df[(df["CANCELLED"] == 0.0) & (df["DEP_DELAY"] > 0)]
   ...: mean_delayed = delayed["DEP_DELAY"].mean()
1.31 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [21]: %%timeit -n1 -r1
   ...: df_pandas = pd.read_parquet("/data/airline-delays/2018.parquet")
   ...: delayed_pandas = df_pandas.loc[(df_pandas["CANCELLED"] == 0.0) & (df_pandas["DEP_DELAY"] > 0)]
   ...: mean_delayed_pandas = delayed_pandas["DEP_DELAY"].mean()
   ...:
3.81 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

And this is without taking into account that reading all the data with pandas in a RAM-constrained environment is simply impossible:

-- CODE language-python --
# Equivalent of
# df = vaex.open("/data/airline-delays/*.parquet")

# Jupyter shortcut to list all files
fnames = !ls /data/airline-delays/*.parquet

# Loads everything in memory,
# and therefore it might blow up!
df = pd.concat([
    pd.read_parquet(fname) for fname in fnames
])
Moral of the story: benchmarking is difficult!

Should you use Vaex?

In addition to what we saw in this short introduction, Vaex has powerful visualization capabilities, as well as some features that are not found in other libraries that can prove useful if you are working with scientific data, like propagation of uncertainties, just-in-time compilation of math-heavy expressions, and much more.

If you already have your data in a supported binary format on disk, or if it's in a text-based format and you have enough space to convert it, Vaex is an excellent solution for processing it efficiently. As you can see, Vaex mimics the pandas API, but it is not based on it and deviates in some small places. On the other hand, it is a younger library with a smaller user base, and some parts of the documentation might not be as complete; fortunately, the maintainers compensate by responding quickly to issues and feedback.

In summary:

  • Use Vaex if you have large amounts of data on disk that don't fit into memory, if you want to leverage fast visualization capabilities, if you have a scientific use case, or if you are not worried about learning an API slightly different from pandas.
  • Don't use Vaex if you are looking for solutions to quickly migrate a large pandas codebase, or if you are on a storage-constrained environment reading lots of data from the network or in CSV format.

In upcoming articles of this series we will describe some more alternatives you might find interesting. Stay tuned!

Vaex is an excellent tool to use when building data pipelines. Orchest is a pipeline orchestrator with first class support for the full Python open source ecosystem. Import this example project on your Orchest Cloud containing the Jupyter notebooks with the code from this article.

Still no Orchest account? Get started for free!