
This guide walks through installing Airflow on an Ubuntu 18.04 server. It assumes you can access a web server with a fresh Ubuntu install as the root user.

Introduction

This post uses Redis and Celery to scale out Airflow. Redis is a simple caching server that scales out quite well, and it can be made resilient by deploying it as a cluster. In a previous post, the Airflow scale-out was done using Celery with RabbitMQ as the message broker; on the whole, maintaining RabbitMQ is a bit fiddly unless you happen to be an expert in it. Note that one of Apache Airflow's default dependencies pulls in a GPL library (unidecode). If this is a concern, you can force a non-GPL library by running export SLUGIFY_USES_TEXT_UNIDECODE=yes before proceeding with the normal installation.

This guide also expects that you're generally familiar with the concepts behind Airflow, like executors, DAGs, and Hooks, but it's not necessary to be an Airflow expert. If you've just started using Airflow, it's helpful to complete the project's quickstart tutorial first, before attempting to deploy a production instance of Airflow.

By the time you've finished this guide, you'll have created the following:

  • An Airflow Webserver that's accessible through a web browser. Optionally, we'll show how to enforce HTTPS connections to the Airflow Webserver if you've purchased a custom domain name.
  • An Airflow Scheduler that uses the Celery executor. We'll use a single-node setup, but using the Celery executor will make it easier for you to adapt to a multiple node setup later, and it will also enable you to trigger DAG runs manually through a web browser.

Depending on the cloud hosting provider you've chosen for your Ubuntu server, the exact steps for configuring the Webserver will be slightly different, and we can't provide detailed instructions for all the possible vendors. This guide will provide instructions for an AWS EC2 instance, but the steps are similar if you're on DigitalOcean, Linode, or another provider.

We recommend using a virtual server with at least 2 CPUs and 4 GB of RAM, but the instructions below should work on smaller instances, too.

So, let's get started!

Install Airflow dependencies: Postgres, Python 3, and Redis

For this tutorial we'll use Python 3 and a Postgres database for Airflow's metadata.

The Postgres metadata database is necessary for Airflow to record the state of each task that it runs. The database can also securely store credentials that allow Airflow to connect to other systems, such as Salesforce, S3, or Redshift.

Redis is necessary to allow the Airflow Celery Executor to orchestrate its jobs across multiple nodes and to communicate with the Airflow Scheduler.

Nginx will be used as a reverse proxy for the Airflow Webserver, and is necessary if you plan to run Airflow on a custom domain, such as airflow.corbettanalytics.com.
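
On a fresh Ubuntu 18.04 install, all of these can be installed with apt. A minimal set of packages looks like this (the names below are the standard Ubuntu ones; adjust if your setup differs):

    sudo apt-get update
    sudo apt-get install -y python3-pip postgresql postgresql-contrib redis-server nginx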

Configure Postgres

When you installed Postgres it also installed commands for creating new Postgres databases and users. Create a new database and user for Airflow, and name them both airflow.
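
One way to do this is with the createuser and createdb commands, run via the postgres system user (the names below assume you call both the database and the user airflow, as described above):

    sudo -u postgres createuser --pwprompt airflow
    sudo -u postgres createdb --owner airflow airflow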

The createuser command will prompt you for a password for the airflow user. Later, we'll need to store that password in Airflow's configuration file in order for Airflow to function properly.

Create a non-root user for Airflow

Running Airflow as root is dangerous, since you'll be executing code written by anyone at your company. If they make a mistake and damage the server, say by accidentally deleting important files, you'll have to re-provision an entirely new instance. To reduce the chance of that happening, we'll create a non-root user to run Airflow.
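
For example, assuming we call the new user airflow (any username works, but the rest of this guide uses airflow):

    sudo adduser airflow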

Switch to the airflow user and cd into its home directory.
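
For example:

    # switch to the airflow user; the - flag also lands you in /home/airflow
    sudo su - airflow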

Install Airflow with pip

We're now ready to install Airflow. Note that the package's name is actually apache-airflow and that we're including the optional Postgres and Celery dependencies by appending [postgres,celery] to the package name. We've also included the package email_validator, which is necessary for authenticating with the Airflow webserver.
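
A sketch of the install command, assuming you're installing into the airflow user's home directory with --user (you may also want to pin a specific Airflow version; this guide was written against the 1.10 series):

    pip3 install --user "apache-airflow[postgres,celery]" email_validator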

On Ubuntu 18.04, pip will install the Airflow executable in ~/.local/bin (for the airflow user, that's /home/airflow/.local/bin), but by default that directory is not searched when running commands.

To fix this, add the .local/bin directory name to your $PATH variable.
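
For example, for a bash login shell:

    echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.profile
    source ~/.profile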

You should now be able to access the Airflow CLI.
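
A quick check:

    airflow version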

Note that most of the Airflow CLI's commands will be broken at this point because the Airflow metadata database isn't configured. To fix this, we'll need to edit Airflow's configuration file.

Configure Airflow

By default, Airflow will look for its configuration in a file located at $HOME/airflow/airflow.cfg. We need to make several changes to that file in order for Airflow to function correctly.

  1. Metadata database connection: We need to tell Airflow how to access its metadata database, which we do by setting the sql_alchemy_conn value. That will point to the local Postgres installation we previously configured.
  2. Webserver authentication: Airflow supports a few different authentication schemes, but we'll use the default which just requires a username and password.
  3. Celery Executor selection: Airflow uses the SequentialExecutor by default, but we'll change that to use the more useful CeleryExecutor.

To make these changes, update the following values in the airflow.cfg file.
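
As a sketch only: the section and option names below match the Airflow 1.10 series, YOUR_PASSWORD stands in for the Postgres password you chose earlier, and the connection URLs assume Postgres and Redis are running locally on their default ports. Depending on your exact Airflow version, you may also need an auth backend or RBAC setting in [webserver].

    [core]
    executor = CeleryExecutor
    sql_alchemy_conn = postgresql+psycopg2://airflow:YOUR_PASSWORD@localhost:5432/airflow

    [webserver]
    authenticate = True

    [celery]
    broker_url = redis://localhost:6379/0
    result_backend = db+postgresql://airflow:YOUR_PASSWORD@localhost:5432/airflow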

Note that it's important you make these changes in the correct sections - ensure you're replacing the existing values in the configuration, rather than pasting in new lines. If you place the authenticate = True value in the [core] section, Airflow won't know to require authentication for the Webserver.

Initialize the Airflow metadata database

Airflow is now ready to initialize its metadata database.
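
In the 1.10 series this is a single command (Airflow 2 renamed it to airflow db init):

    airflow initdb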

You'll see Airflow log some information about the migrations that it's running on the database. To see the list of tables that Airflow created, use the following command.
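
One way to list the tables, depending on how your Postgres authentication is set up, is with psql:

    psql -h localhost -U airflow -d airflow -c '\dt'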

At this point the Airflow installation is working correctly, and we're ready to add the first Airflow user.

Create an Airflow user

The Airflow CLI provides a create_user command to create a user that can log in to the Airflow Webserver.
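
A sketch of the command with placeholder values (this is the 1.10 CLI; in Airflow 2 the equivalent is airflow users create):

    airflow create_user \
        --role Admin \
        --username admin \
        --email admin@example.com \
        --firstname Jane \
        --lastname Doe \
        --password changeme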

You'll now be able to log in with the username and password you just configured. Make sure to modify the email and password in the command above to your own values.

Start the Airflow services

For our installation, we'll need to run three Airflow services: the Webserver, the Scheduler, and the Worker.

  • The scheduler is responsible for determining whether a DAG should be executed.
  • The webserver is responsible for creating the DAG visualizations and 'front end' that you can access from the web browser. If your DAGs are not running, it is likely an issue with the scheduler and not the webserver.
  • The worker is responsible for actually executing the jobs in a DAG. After a worker finishes running a DAG's job, it will log the status of the job in the Airflow metadata database.

Start the Airflow services now. We'll add the --daemon flag to run the processes as daemons, so they'll continue running even after you log off.
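
For example:

    airflow webserver --daemon
    airflow scheduler --daemon
    airflow worker --daemon

(In Airflow 2 the worker command became airflow celery worker.)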

You can test that the webserver is running locally. Use curl to test that the Webserver replies with a valid response: you should get a 301 status code redirecting you to /home or /login.
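
For example, against the Webserver's default port of 8080:

    curl -I http://localhost:8080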

At this point, you won't be able to access the Airflow Webserver from your web browser. To access the webserver through your browser, you'll need to have Nginx forward requests to the application.

Configure Nginx

Switch back to the Ubuntu root user by typing exit. Run the following command to confirm you're no longer using the non-root airflow user.
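
    exit
    whoami

whoami should now print your original username (for example ubuntu or root) rather than airflow.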

Create a new Nginx config file named airflow in the /etc/nginx/sites-available directory.

Paste the following into that file. Update the values for server_name with your own server's domain or IP address.
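
A minimal reverse-proxy server block looks something like this, with airflow.example.com standing in for your domain or IP address and 8080 being the Airflow Webserver's default port:

    server {
        listen 80;
        server_name airflow.example.com;

        location / {
            proxy_pass http://localhost:8080;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }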

To enable Nginx to use that site's configuration, link the file into the /etc/nginx/sites-enabled directory.
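
For example:

    sudo ln -s /etc/nginx/sites-available/airflow /etc/nginx/sites-enabled/airflow

You may also want to remove the default site from /etc/nginx/sites-enabled so it doesn't shadow your new configuration.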

Additionally, you'll want to update the /etc/nginx/nginx.conf file if you're using a custom domain to host your Airflow server. The following parameters enable you to use a custom domain, such as airflow.corbettanalytics.com.
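
Which parameters you need depends on your setup; one directive that commonly has to be raised in the http block when using long server names is server_names_hash_bucket_size:

    # in the http { ... } block of /etc/nginx/nginx.conf
    server_names_hash_bucket_size 64;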

Restart the Nginx service for the new configuration to take effect.
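
For example:

    sudo nginx -t
    sudo systemctl restart nginx

nginx -t checks the configuration for syntax errors before the restart.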

You should now be able to access the Airflow web interface by typing your server's IP address or domain name into your browser's address bar.

Note #1: to access Airflow through the custom domain name, ensure you've already set up an A-record with your DNS provider that points to your server's IP address. If you're using AWS's Route 53, create those records now in the Route 53 console.

Note #2: if you're running the server on AWS, you'll also need to update your server's security group to allow public access to ports 80 and 443. Follow the instructions from AWS's knowledge center to allow access to your server over the public internet.

Optional: enforce HTTPS only

To enforce HTTPS connections, you'll need to have configured a custom domain that correctly points to your server. Ensure that you can load the Airflow Webserver at your custom domain before proceeding. For example, I was able to load Airflow at http://airflow.corbettanalytics.com before configuring HTTPS-only connections.

Install the Let's Encrypt certbot.
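
On Ubuntu 18.04 something along these lines should work (exact package names vary by release; Certbot's own documentation also describes PPA- and snap-based installs):

    sudo apt-get install -y certbot python3-certbot-nginx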

Start the process to automatically configure HTTPS connections to your server.
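
With the Nginx plugin installed, that's:

    sudo certbot --nginx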

The Certbot command will prompt you with a series of questions. It should automatically detect that the certificates should be generated for your custom domain; for example, it correctly picked up airflow.corbettanalytics.com.

You should also ensure that you select the option for Certbot to update the Nginx configuration to redirect HTTP requests to HTTPS. Certbot will make the necessary changes to the /etc/nginx/nginx.conf file and then restart Nginx automatically.

If everything has been configured correctly, you should now be able to access your Airflow Webserver over HTTPS.
