Run Apache Airflow on Windows 10
Apache Airflow is a great tool to manage and schedule all steps of a data pipeline. However, running it on Windows 10 can be challenging. Airflow’s official Quick Start promises a smooth start, but only for Linux users. What about us Windows 10 people who want to avoid Docker? These steps worked for me and hopefully will work for you, too.
Photo by Geran de Klerk on Unsplash
After struggling with incorrect configuration, I eventually found a way to install and launch my first Airflow instance. With high spirits I applied it to a data pipeline with Spark EMR clusters. I am happy to share my insights and list the steps that worked for me. If they also work for you, even better!
TL;DR
How to install and run Airflow locally with Windows subsystem for Linux (WSL) with these steps:
- Open Microsoft Store, search for Ubuntu, install it, then restart
- Open cmd and type `wsl`
- Update everything: `sudo apt update && sudo apt upgrade`
- Install pip3:

```
sudo apt-get install software-properties-common
sudo apt-add-repository universe
sudo apt-get update
sudo apt-get install python3-pip
```

- Install Airflow: `pip3 install apache-airflow`
- Run `sudo nano /etc/wsl.conf`, insert the block below, then save and exit with `ctrl+s`, `ctrl+x`

```
[automount]
root = /
options = "metadata"
```

- Run `nano ~/.bashrc`, insert the line below, then save and exit with `ctrl+s`, `ctrl+x`

```
export AIRFLOW_HOME=/c/users/YOURNAME/airflowhome
```

- Restart the terminal, activate `wsl`, run `airflow info`
- Everything is fine if you see something like `Apache Airflow [1.10.12]`
- If you get errors due to missing packages, install them with `pip3 install [package-name]` and try `airflow info` again
- If it does not work by now, try to follow the instructions in the error message. You might want to revert to Docker.
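If you prefer copy and paste, here are the shell commands from the list above in one block (the two nano edits are not included):

```
sudo apt update && sudo apt upgrade
sudo apt-get install software-properties-common
sudo apt-add-repository universe
sudo apt-get update
sudo apt-get install python3-pip
pip3 install apache-airflow
```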
Airflow on Windows WSL
I managed to make it work with the Windows Subsystem for Linux (WSL), which was recommended on blogs and Stack Overflow. However, even these resources led into dead ends.
After a lot of trial and error I want to help you with an approach that worked for me. Follow these steps. If you get stuck, try to resolve the error by installing missing dependencies, restarting the terminal, or carefully re-checking the instructions.
- Open Microsoft Store, search for Ubuntu, install it, then restart

Run the following commands in the terminal:

- Bring everything up to date with `sudo apt update && sudo apt upgrade`
- Install pip3 by running

```
sudo apt-get install software-properties-common
sudo apt-add-repository universe
sudo apt-get update
sudo apt-get install python3-pip
```

- Install Airflow: `pip3 install apache-airflow`
- Type `sudo nano /etc/wsl.conf`
- To access directories like `/c/users/philipp` instead of `/mnt/c/users/philipp`, insert the code block below, then save and exit with `ctrl+s`, `ctrl+x`

```
[automount]
root = /
options = "metadata"
```

- Type `nano ~/.bashrc`
- Define the environment variable `AIRFLOW_HOME` by adding the line below, then save and exit with `ctrl+s`, `ctrl+x`

```
export AIRFLOW_HOME=/c/Users/philipp/AirflowHome
```

- Close the terminal, open cmd again, type `wsl`, and run `airflow info`
- Everything is fine if you see something like `Apache Airflow [1.10.12]`
- If you get errors due to missing packages, install them with `pip3 install [package-name]` and try `airflow info` again
- If it does not work by now, try to follow the instructions in the error message. You might want to revert to Docker.
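After restarting WSL, a quick sanity check (a sketch; the paths follow my setup above, so adjust them to yours) confirms that both edits took effect:

```
# the Windows drive should now be mounted at /c instead of /mnt/c
ls /c/Users

# AIRFLOW_HOME should point at your Windows-side folder
echo $AIRFLOW_HOME
```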
Photo by Zhipeng Ya on Unsplash
Other ways to install Airflow
Docker offers a controlled environment (container) to run applications. Since Airflow only runs on Linux, it is a natural candidate for a Docker container. However, Docker can be hard to debug, clunky, and may add another layer of confusion. If you want to run Airflow with Docker, see this tutorial.
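For orientation, here is a minimal sketch of that route. It assumes Docker Desktop is installed and uses the community image puckel/docker-airflow, which was popular for Airflow 1.10; the image name is an assumption on my part, not something this article depends on:

```
# pull a community Airflow image (assumption: puckel/docker-airflow)
docker pull puckel/docker-airflow

# run the webserver in a container and expose the UI on localhost:8080
docker run -d -p 8080:8080 puckel/docker-airflow webserver
```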
How to run an Airflow instance
Now it is time to have a look at Airflow! Is `AIRFLOW_HOME` where you expect it to be? Open two cmd windows, activate `wsl` in each, and run:
```
# check whether AIRFLOW_HOME was set correctly
env | grep AIRFLOW_HOME

# initialize the database in AIRFLOW_HOME
airflow initdb

# start the scheduler
airflow scheduler

# use the second cmd window to run the webserver
airflow webserver

# access the UI on localhost:8080 in your browser
```
Unfortunately, WSL does not support background tasks (daemons). This is why we have to open one terminal for `airflow webserver` and one for `airflow scheduler`.
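If juggling two windows bothers you, one possible workaround (a sketch; it only survives as long as the terminal stays open, and behavior under WSL can vary) is to push the scheduler into the background of a single shell:

```
# start the scheduler in the background and capture its output in a log file
airflow scheduler > scheduler.log 2>&1 &

# then run the webserver in the foreground of the same shell
airflow webserver
```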
Setup Airflow in a project setting
Copying your DAGs back and forth from a project folder to Airflow home directory is cumbersome. Fortunately, we can automate this with a bash script. For example, my project root directory is in /c/users/philipp/projects/project_name/
and contains one folder with all scripts related to data collection and processing named ./src/data/
. I also have one folder for all Airflow-related files in ./src/airflow/
. In this folder
Have a look at my project Run Spark EMR clusters with Airflow on GitHub to see the project structure. You will find the script `deploy.sh` in `./src/airflow`.
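The real `deploy.sh` lives in the repository; as a rough sketch of the idea (the `dags` folder name is my assumption, so adjust the paths to your setup), such a script might look like this:

```
#!/bin/bash
# deploy.sh (sketch): copy DAG files from the project folder into AIRFLOW_HOME

PROJECT_ROOT=/c/users/philipp/projects/project_name   # adjust to your project
DAG_SOURCE="$PROJECT_ROOT/src/airflow/dags"           # assumption: DAGs live here
TARGET="${AIRFLOW_HOME:-/c/Users/philipp/AirflowHome}/dags"

# make sure the target folder exists, then copy all DAG files over
mkdir -p "$TARGET"
cp -r "$DAG_SOURCE/." "$TARGET/"
echo "Copied DAGs from $DAG_SOURCE to $TARGET"
```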
I am thankful to Cookiecutter Data Science for inspiration on the project structure.