Our data pipelines are built on top of the AWS ecosystem, and we ingest our data directly into an S3-backed data lake. To efficiently process all of our datasets, we leverage the distributed computation provided by PySpark, the Python API to Apache Spark. Spark is an open-source, unified analytics engine optimized for large-scale data processing, and we take advantage of its in-memory computation, parallelization, and fault tolerance.
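As a rough illustration of this pattern, a PySpark job reads raw data straight from S3, transforms it in parallel across the cluster, and writes results back to the lake. The bucket names, paths, and columns below are placeholders rather than our actual layout.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Illustrative sketch only -- bucket names, paths, and schema are placeholders.
spark = (
    SparkSession.builder
    .appName("example-ingestion-job")
    .getOrCreate()
)

# Read raw events directly from the S3-backed data lake.
raw = spark.read.json("s3://example-data-lake/raw/events/")

# A simple transformation: filter and aggregate, distributed across executors.
daily_counts = (
    raw.filter(F.col("event_type").isNotNull())
       .groupBy(F.to_date("event_time").alias("event_date"), "event_type")
       .count()
)

# Write the processed dataset back to the lake in a columnar format.
daily_counts.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-data-lake/processed/daily_event_counts/"
)
```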
We use CloudFormation to dynamically and repeatably provision scalable clusters on AWS, onto which containerized Spark is deployed to run our data processing and machine learning jobs.
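A minimal sketch of what programmatic provisioning can look like with boto3 is shown below; the stack name, template URL, and parameter keys are hypothetical stand-ins, not our actual templates.

```python
import boto3

# Hypothetical sketch: provision a Spark cluster stack from a stored template.
# Stack name, template location, and parameter keys are placeholders.
cloudformation = boto3.client("cloudformation", region_name="us-east-1")

response = cloudformation.create_stack(
    StackName="spark-cluster-nightly",
    TemplateURL="https://example-bucket.s3.amazonaws.com/templates/spark-cluster.yaml",
    Parameters=[
        {"ParameterKey": "InstanceType", "ParameterValue": "r5.2xlarge"},
        {"ParameterKey": "WorkerCount", "ParameterValue": "10"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# Block until the stack is fully created before submitting Spark jobs to it.
waiter = cloudformation.get_waiter("stack_create_complete")
waiter.wait(StackName=response["StackId"])
```

Driving the same template through code is what makes the provisioning repeatable: every cluster comes up from an identical, version-controlled definition.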
Airflow has been a critical part of our visibility into the state of our data pipelines and is particularly well-suited to the challenges of complex dependencies, resilience, and monitoring that arise when extracting and transforming large amounts of raw data from disparate sources. Throughout the system, our workloads are designed to tolerate faults and to account for resource availability and inter-task dependencies.
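A condensed sketch of how such a pipeline can be expressed as an Airflow DAG follows; the DAG id, task names, and callables are hypothetical, but the retry settings and dependency wiring illustrate how fault tolerance and inter-task ordering are declared.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables standing in for real extraction/transformation code.
def extract_source_a(**context):
    ...

def extract_source_b(**context):
    ...

def transform_and_load(**context):
    ...

# Retries and failure alerting give each task a degree of fault tolerance.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,
}

with DAG(
    dag_id="example_etl_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract_a = PythonOperator(task_id="extract_source_a", python_callable=extract_source_a)
    extract_b = PythonOperator(task_id="extract_source_b", python_callable=extract_source_b)
    transform = PythonOperator(task_id="transform_and_load", python_callable=transform_and_load)

    # Both extractions must succeed before the transform runs.
    [extract_a, extract_b] >> transform
```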