7 Airflow alternatives for every type of data team
January 9, 2023
Data teams have a love/hate relationship with Apache Airflow. We appreciate Airflow’s development, scheduling, and monitoring of batch-oriented data workflows. However, data team needs and skillsets are changing. This means that a one-size-fits-all solution just doesn’t cut it anymore.
Let’s look at some exciting Airflow alternatives and see how they cater to the needs of different data teams…
Who’s it for?
Orchest’s tagline is “Build data pipelines, the easy way”. It’s designed for small data teams who want an easy-to-learn tool with a friendly UI and fast time to value. Orchest Pro offers a fully managed service that lets users scale their data pipelines without having to worry about managing any infrastructure.
Fast turnaround times: Orchest lets you move from an idea to scheduled pipeline in minutes by leveraging existing code and automating complicated infrastructure tasks for you.
Friendly browser-based UI: There is no complicated framework to learn, just an intuitive UI to build your DAGs and a powerful JupyterLab editor to code your tasks.
Stable & scalable pipelines: Orchest Pro runs on a fully managed, industry-grade Kubernetes cluster, providing stability and scalability.
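To make the term concrete: a DAG (directed acyclic graph) pipeline is a set of steps plus the dependencies between them, and an orchestrator's core job is to run each step only after its upstream steps have finished. The sketch below uses only Python's standard library. The step names and the `run_pipeline` helper are hypothetical and are not part of Orchest's own API.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a step, each value is the set of
# steps that must finish before that step can start.
PIPELINE = {
    "extract": set(),
    "clean": {"extract"},
    "train": {"clean"},
    "report": {"clean", "train"},
}


def run_pipeline(dag, steps):
    """Call every step in an order that respects the DAG's dependencies."""
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        steps[name]()
    return order


if __name__ == "__main__":
    log = []
    # Stand-in steps that only record their own name when called.
    steps = {name: (lambda n=name: log.append(n)) for name in PIPELINE}
    run_pipeline(PIPELINE, steps)
    print(log)
```

If the dependencies contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError` instead of returning an order, which is why these tools insist on the graph being acyclic.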
Prefect describes itself as the “air traffic control for the modern data stack”. It’s designed for data engineers who like Airflow’s development paradigm but are fed up with its various shortcomings. It’s primarily open source with some proprietary services.
Code as workflows: Prefect lets you orchestrate with Python code using their Orion engine. It has first-class functions, type annotations, and async support.
Prefect Orion UI: Prefect’s UI lets you set up notifications, view run history, and schedule workflows.
Security-first hybrid execution model: Prefect OSS's orchestration and execution layers can be managed independently, while Prefect Cloud manages the orchestration of your workflows and your data remains in your own environment.
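The “code as workflows” idea can be illustrated without any orchestrator at all: tasks are ordinary typed Python functions, some of them async, and the workflow is just a function that calls them. The following is a minimal standard-library sketch under that assumption, not Prefect's API. The names `fetch_rows`, `count_rows`, and `row_count_flow` are hypothetical.

```python
import asyncio


async def fetch_rows(source: str) -> list[int]:
    """Stand-in for an I/O-bound task such as querying an API."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return [len(source), len(source) * 2]


def count_rows(rows: list[int]) -> int:
    """A plain synchronous task."""
    return len(rows)


async def row_count_flow(sources: list[str]) -> dict[str, int]:
    """Fetch all sources concurrently, then count the rows of each result."""
    results = await asyncio.gather(*(fetch_rows(s) for s in sources))
    return {s: count_rows(r) for s, r in zip(sources, results)}


if __name__ == "__main__":
    print(asyncio.run(row_count_flow(["orders", "users"])))
```

An orchestrator in this style layers scheduling, retries, and run history on top of functions like these, so the pipeline logic itself stays readable as normal Python.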
Dagster describes itself as “The single pane of glass for your data team to observe, optimize, and debug even the most complex data workflows.” It’s designed for data teams who prefer a heavily software-engineering-oriented approach to their data pipelines. It covers the entire end-to-end process of developing and deploying data applications. Be prepared for formal typing/explicit coding and some advanced Python language features.
Productivity platform: Dagster lets you define “Software-defined assets” and use CI/CD to build reusable components, spot data quality issues, and flag bugs early.
Robust orchestration engine: Dagster has a flexible architecture for real-time/event-based, fault-tolerant run scheduling.
Unified control plane: Dagster lets you centralize metadata in a tool with built-in observability, diagnostics, cataloging, and lineage.
Databricks Pipelines is designed for data teams who need more than just workflow orchestration and want to use Spark and a Lakehouse data architecture. Be prepared to use Java/Scala.
Batch and streaming: Use a single platform and unified API to ingest, transform and incrementally process batch and streaming data at scale.
Infrastructure automation: Databricks automatically manages your infrastructure and the operational components of your production workflows.
Lakehouse Platform: The Lakehouse Platform provides a foundation to build and share centrally governed data assets.
Connectivity: Connect your preferred data engineering tools for data ingestion, ETL/ELT and orchestration.
Azure Data Factory describes itself as “a fully managed, serverless data integration service”. It is designed for data teams who are looking for a serverless solution that integrates strongly with Microsoft-specific solutions like Azure Blob Storage or Microsoft SQL Server. Be prepared for a heavy emphasis on no-code pipeline components.
No code platform: Rehost SQL Server Integration Services (SSIS) and build ETL/ELT pipelines with built-in Git and CI/CD with no code.
Fully managed serverless: Pay-as-you-go, fully managed serverless cloud service that scales on demand.
Built-in connectors: Ingest all your on-premises and software-as-a-service (SaaS) data with more than 90 built-in connectors. Orchestrate and monitor at scale.
Microsoft Azure Platform: Strong integrations with the broader Microsoft Azure platform.
Amazon SageMaker describes itself as the “First purpose-built CI/CD service for machine learning”. It’s designed for users who want ML workflows with easy integration to AWS services. Be prepared to accept fewer workflow orchestration capabilities than other solutions on this list in exchange for running everything in AWS.
Manage ML workflows: Create ML workflows with a Python SDK, and visualise your workflow with Amazon SageMaker Studio.
ML model registry: Track and choose the best version of your ML model in a central repository.
Automatic model tracking: Log workflow steps and create an audit trail of model components such as training data, platform configurations, model parameters, and learning gradients.
ML CI/CD: Dev/prod environment parity, version control, on-demand testing, and end-to-end automation.
Google Vertex AI Workbench describes itself as “The single development environment for the entire data science workflow.” It’s designed for users who run heavy ML/GPU workflows and want easy integration with GCP services. Note: this product is still quite new, so expect breaking changes, missing features, and the occasional bug. Be prepared to work with code packaged in custom Docker images.
Fully managed compute: A Jupyter-based fully managed, scalable, enterprise-ready compute infrastructure with security controls and user management capabilities.
Interactive data and ML experience: Explore data and train ML models with connections to Google Cloud's big data solutions.
Portal to complete end-to-end ML training: Develop and deploy AI solutions on Vertex AI with minimal transition.
Google Cloud Platform integration: Natively analyze your data with a reduction in context switching between services on the Google Cloud Platform.
There are a lot of great data orchestration tools out there. Each one is designed with specific data teams’ needs and skillsets in mind. A big shoutout to Airflow for kickstarting such a variety of amazing data engineering tooling.
If you sound like the sort of person Orchest is designed for, we’d really appreciate you taking a look! You can either play around with an Orchest Cloud Free instance or try a one-week free trial of Orchest Cloud Pro.