A Data Lakes Guide — Modern Batch Data Warehousing
written by Daniel Mateus Pires
Redefining the batch data extraction patterns and the data lake using “Functional Data Engineering”
The last decades have been greatly transformative for the Data Analytics landscape. With the decrease in storage costs and the adoption of cloud computing, the restrictions that guided the design of data tools became obsolete, and so data engineering tools and techniques had to evolve.
The author of this article, Daniel Mateus Pires, suspects that many data teams are taking on complex projects to modernize their data engineering stack and use the new technologies at their disposal. Many others are designing new data ecosystems from scratch in companies pursuing new business opportunities made possible by advances in Machine Learning and Cloud computing.
A pre-Hadoop batch data infrastructure was typically made of a Data Warehouse (DW) appliance tightly coupled with its storage (e.g. Oracle or Teradata DW), an Extract Transform Load (ETL) tool (e.g. SSIS or Informatica) and a Business Intelligence (BI) tool (e.g. Looker or MicroStrategy). In this setup, the philosophy and design principles of the data organization were driven by well-established methodologies, as outlined in books such as Ralph Kimball’s The Data Warehouse Toolkit (1996) or Bill Inmon’s Building The Data Warehouse (1992).
Daniel contrasts this approach with its modern version, which was born of Cloud technology innovations and reduced storage costs. In a modern stack, the roles that were handled by the Data Warehouse appliance are now handled by specialized components: file formats (e.g. Parquet, Avro, Hudi), cheap cloud storage (e.g. AWS S3, Google Cloud Storage), metadata engines (e.g. Hive metastore), query/compute engines (e.g. Hive, Presto, Impala, Spark) and, optionally, cloud-based DWs (e.g. Snowflake, Redshift). Drag and drop ETL tools are less common. Instead, a scheduler/orchestrator (e.g. Airflow, Luigi) and “ad-hoc” software logic take on this role. The “ad-hoc” ETL software sometimes lives in separate applications and sometimes within the scheduler framework, which is extensible by design (Operator in Airflow, Task in Luigi). It often relies on external compute systems such as Spark clusters or DWs for heavy Transformation. The BI side also saw the rise of an open source alternative called Superset, sometimes complemented by Druid to create roll-ups and Online Analytical Processing (OLAP) cubes and to provide a fast read-only storage and query engine.
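To make the scheduler-plus-"ad-hoc"-logic pattern more concrete, here is a minimal sketch in plain Python. It is not Airflow or Luigi code: `ExtractTask`, `fake_source`, the `ds=` partition naming and the CSV output are all hypothetical stand-ins chosen for illustration (a real stack would more likely write Parquet to S3). The idea it tries to show is that an extraction task is parameterized by a run date and always writes to the same date-partitioned location, so re-running it overwrites the partition rather than duplicating data.

```python
import csv
from datetime import date
from pathlib import Path


class ExtractTask:
    """Hypothetical stand-in for an Airflow Operator or a Luigi Task."""

    def __init__(self, source, lake_root, table, fieldnames):
        self.source = source              # callable: run date -> list of row dicts
        self.lake_root = Path(lake_root)  # local directory standing in for cloud storage
        self.table = table
        self.fieldnames = fieldnames

    def output_path(self, ds):
        # One deterministic location per table and run date.
        return self.lake_root / self.table / f"ds={ds.isoformat()}" / "part-0.csv"

    def run(self, ds):
        rows = self.source(ds)
        path = self.output_path(ds)
        path.parent.mkdir(parents=True, exist_ok=True)
        # Opening with "w" overwrites the partition, so a re-run is idempotent.
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=self.fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        return path


def fake_source(ds):
    # Assumption: a real source would query an operational database for date `ds`.
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
```

A scheduler would call something like `ExtractTask(fake_source, "/tmp/lake", "orders", ["id", "amount"]).run(date(2020, 1, 1))` once per run date. Running it again for the same date rewrites the same file.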
Daniel found himself working on a migration from a pre-Hadoop stack to a modern stack. It is not only a technology shift but also a major paradigm shift. Some modeling and architectural decisions call for distancing ourselves from outdated wisdom and best practices, and it is important to understand why. He found Maxime Beauchemin’s resources extremely helpful, and found that studying Apache Airflow’s design choices (Airflow was created by Maxime) gives a lot of practical understanding of how to implement the approach Maxime advocates. This guide aims to take an opinionated approach to defining and designing a Data Lake.
Daniel picks specific technologies to make this guide more practical, and hopes most of it is applicable to other tools in a modern stack. The choice of one technology over another is generally a result of his experience (or lack thereof). For example, he mentions AWS tools because that is the cloud provider he has experience with.
This blog post defines the E of ETL, and describes the role of the Data Lake.