Beginner’s Guide to Data Engineering - A 3-Part Series
written by Robert Chang
This is an excellent 3-part series of articles written from the vantage point of a data scientist looking to learn more about the field of data engineering.
Organization of This Guide
The scope of the discussion is not exhaustive; it centers heavily on Airflow, batch data processing, and SQL-like languages. That said, this focus should not prevent readers from gaining a basic understanding of data engineering, and it may pique their interest in learning more about this fast-growing, emerging field.
Part I is designed to be a high-level introductory post. Using a combination of personal anecdotes and expert insights, the author contextualizes what data engineering is, why it is challenging, and how it can help you or your organization scale. The primary audience is aspiring data scientists who need to learn the basics to evaluate job opportunities, as well as early-stage founders who are about to build their company’s first data team.
Part II is more technical in nature. This post focuses on Airflow, Airbnb’s open-sourced tool for programmatically authoring, scheduling, and monitoring workflows. Specifically, Robert demonstrates how to write a Hive batch job in Airflow, how to design table schemas using techniques such as the star schema, and finally highlights some best practices for building ETLs. This post is suited to early-career data scientists and data engineers who are trying to hone their data engineering skills.
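For readers unfamiliar with the star schema design mentioned above, here is a minimal sketch of the idea: facts (measurable events) live in a central table that references small descriptive dimension tables. It uses SQLite from the Python standard library purely for illustration, and all table and column names are hypothetical; the series itself works with Hive.

```python
# Minimal star-schema sketch using SQLite (stdlib). Table and column
# names are hypothetical examples, not taken from the series.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: descriptive attributes about each user.
cur.execute("""
    CREATE TABLE dim_user (
        user_id INTEGER PRIMARY KEY,
        country TEXT
    )
""")

# Fact table: one row per measurable event, keyed to dimension rows.
cur.execute("""
    CREATE TABLE fact_booking (
        booking_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES dim_user(user_id),
        amount_usd REAL
    )
""")

cur.executemany("INSERT INTO dim_user VALUES (?, ?)",
                [(1, "US"), (2, "FR")])
cur.executemany("INSERT INTO fact_booking VALUES (?, ?, ?)",
                [(10, 1, 120.0), (11, 1, 80.0), (12, 2, 50.0)])

# A typical star-schema query: join facts to a dimension and aggregate.
cur.execute("""
    SELECT u.country, SUM(f.amount_usd)
    FROM fact_booking f
    JOIN dim_user u ON u.user_id = f.user_id
    GROUP BY u.country
    ORDER BY u.country
""")
result = cur.fetchall()  # one (country, total) row per country
```

The appeal of this layout for batch ETL is that each nightly job can rebuild or append to one fact or dimension table independently, while analysts answer most questions with a single join-and-aggregate query like the one above.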
Part III is the final post of the series, in which Robert describes advanced data engineering patterns, higher-level abstractions, and extended frameworks that make building ETLs much easier and more efficient. Many of these patterns were taught to him by Airbnb’s experienced data engineers, who learned them the hard way. These insights may be particularly useful for seasoned data scientists and data engineers looking to further optimize their workflows.
I'm a data engineer working to advance data-driven cultures by integrating disparate data sources and empowering end users to uncover key insights that tell a bigger story.