We had the spring hackathon at SurveyMonkey over the past two days. My team designed and demoed a general data pipeline built on Airflow to help test and deliver features quickly, and we were the runner-up for the Best Platform Innovation prize. Glad this project got some recognition.
What is Airflow
Apache Airflow is a workflow management tool: you can use it to easily create, schedule, and monitor your workflows. A common and popular use case is hosting data pipelines.
The project originated at Airbnb, was then donated to Apache, and is still in incubation.
DAGs
Airflow uses DAGs (Directed Acyclic Graphs) to organize workflows: a DAG is basically a collection of tasks arranged in a certain way (sequential, branching, parallel) that together form a task graph. This means you can design even very complex pipelines with DAGs.
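Here is a minimal sketch of what a DAG looks like in code (using the Airflow 1.x API); the task names and schedule are made up just to show the graph structure:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# The DAG object holds the schedule and ties the tasks together.
dag = DAG(
    dag_id="example_task_graph",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
transform_a = BashOperator(task_id="transform_a", bash_command="echo transform a", dag=dag)
transform_b = BashOperator(task_id="transform_b", bash_command="echo transform b", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# extract fans out to two parallel transforms, and both feed into load.
extract >> transform_a >> load
extract >> transform_b >> load
```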
Operators
Every task (a node in the DAG) is an operator. The Airflow library provides some basic operators you can use out of the box, and you can also extend the base operators to create customized operators that fit your own requirements. Besides operators, Airflow provides hooks to manage external connections, such as database and message bus connections.
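As a rough sketch (the operator name and its greeting parameter are made up for illustration), a custom operator is just a class that extends BaseOperator and implements execute():

```python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class GreetingOperator(BaseOperator):
    """A made-up operator that just logs a message when its task runs."""

    @apply_defaults
    def __init__(self, greeting, *args, **kwargs):
        super(GreetingOperator, self).__init__(*args, **kwargs)
        self.greeting = greeting

    def execute(self, context):
        # execute() is what the worker calls when the task instance runs.
        self.log.info("%s (execution date: %s)", self.greeting, context["ds"])
```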
Plugins
Airflow plugins are reusable modules that package operators, hooks, and other components so they can be shared across DAGs. Some plugins already exist, but since the project is still young, there is a lot of room for more powerful ones.
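Registering a plugin looks roughly like this (assuming a custom operator like the one above, imported from a hypothetical module): you subclass AirflowPlugin, drop the file into the plugins/ folder, and Airflow picks up whatever it exposes.

```python
from airflow.plugins_manager import AirflowPlugin

# GreetingOperator stands in for whatever custom operators/hooks the plugin bundles.
from my_plugin.operators import GreetingOperator  # hypothetical module path


class DataPipelinePlugin(AirflowPlugin):
    name = "data_pipeline_plugin"
    operators = [GreetingOperator]
    hooks = []
```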
How we want to use it
We started by designing a general data pipeline that has a data collector, a data processor, and actions. Our idea is that people in the company should have an easy way to access data, play with it, and take actions fast.
Data Collector is a plugin we created that connects to a database: you just provide the table name, or even the table schema, and it fetches all the data and passes it on to the next task.
Data Processor is where you wrangle the data, experiment with your ideas, and get insights. It's a simple Python script: you just drop it into the data processor and it gets executed by the Airflow scheduler.
Actions: there are many actions we could take, so we made email and Slack notifications into plugins. Based on your data analysis, you may want to take some action; you just plug it into your data pipeline.
With this simple pipeline, we can tackle a lot of use cases that have business value. One is tracking the survey response rate for users: we can find users with low response rates and recommend tools and tips to help them improve it.
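Here is a hedged sketch of that pipeline using stock Airflow pieces in place of our hackathon plugins (a PostgresHook as the collector, a PythonOperator as the processor, an EmailOperator as the action); the connection id, table, columns, and threshold are all made up:

```python
from datetime import datetime

from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator
from airflow.operators.email_operator import EmailOperator

dag = DAG(
    dag_id="survey_response_pipeline",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)


def collect(**context):
    # Data Collector: pull rows from a table and hand them to the next task via XCom.
    hook = PostgresHook(postgres_conn_id="analytics_db")  # hypothetical connection id
    return hook.get_records("SELECT user_id, response_rate FROM survey_stats")


def process(**context):
    # Data Processor: the user-supplied analysis script goes here.
    rows = context["ti"].xcom_pull(task_ids="collect")
    return [user_id for user_id, rate in rows if rate < 0.1]


collect_task = PythonOperator(task_id="collect", python_callable=collect,
                              provide_context=True, dag=dag)
process_task = PythonOperator(task_id="process", python_callable=process,
                              provide_context=True, dag=dag)

# Action: notify the team; a Slack plugin could slot in here the same way.
notify_task = EmailOperator(
    task_id="notify",
    to="research-team@example.com",
    subject="Users with low survey response rate",
    html_content="{{ ti.xcom_pull(task_ids='process') }}",
    dag=dag,
)

collect_task >> process_task >> notify_task
```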
We also designed another pipeline for automation tests that integrates with Selenium. It gives us an overview of web page navigation on our site (you could also call it the user journey) from a real user's perspective, and it can also help us find the slow parts of the site.
The good and bad of microservices
The reason we want this data pipeline is that in a microservices architecture, cross-service work is painful: for a very simple task, we may need to touch a lot of services, or even create a new service, and many releases, coordination, and operations work come along the way.
Our microservices are designed around separation of responsibilities, and each microservice provides a set of REST APIs. So why not just code against those APIs directly? They are already there, and that is all you need.
Best practices in other companies
More and more big companies are using Airflow. Some key takeaways I got:
- Have a good setup for running and monitoring Airflow itself. It looks easy but can get complicated; Airflow itself is a service whose SLA you need to care about.
- Don't put overly complex pipelines on it; it's still a project in incubation.
Some good articles
https://engineering.pandora.com/apache-airflow-at-pandora-1d7a844d68ee
https://www.zillow.com/data-science/airflow-at-zillow/