Keeping Up with Data Engineering — A Resource Guide
Photo by Kevin Ku on Unsplash
written by Sheel Choksi
The data engineering space is continuously evolving: new open-source projects and tools are released all the time, “best practices” are constantly changing, and data engineers are stretched thinner than ever as businesses increase their data demands. On top of that, data engineers have recently seen firsthand the hybridization of the data lake and the data warehouse, with modern warehouses blurring the distinction by separating compute from storage. This hybridization has turned the basic “ETL” pipelines of the past into more complex orchestrations that often both read from and write to warehouses. Additionally, as data scientists and data analysts operationalize their work, data engineers need to collaborate more closely with these roles and further empower them. With all of this pressure and constant change, it can be difficult to keep up with the space (or even enter it!).
Below is a roundup of resources for both new data engineers and seasoned professionals looking to stay up to date with the latest in the world of data engineering. These follow the software community’s trend of creating GitHub repositories with curated lists of resources focused on a specific area (aka the “awesome” list). Luckily, there is an “awesome” list for Big Data, which includes subsections for Data Engineering and Public Datasets (those always seem to come in handy). The curation seemed a bit tighter before the project exploded in popularity, but fret not: even a giant list of public datasets (in the energy space, for example) is incredibly valuable.
Data Engineering ‘Awesome’ Lists
- Big Data
- Public Datasets
- Hadoop - Framework for distributed storage and processing of very large data sets.
- Data Engineering
- Apache Spark - Unified engine for large-scale data processing.
- Qlik - Business intelligence platform for data visualization, analytics, and reporting apps.
- Splunk - Platform for searching, monitoring, and analyzing structured and unstructured machine-generated big data in real-time.
Databases ‘Awesome’ Lists
- MongoDB - NoSQL database.
- TinkerPop - Graph computing framework.
- PostgreSQL - Object-relational database.
- CouchDB - Document-oriented NoSQL database.
- HBase - Distributed, scalable, big data store.
- NoSQL Guides - Help on using non-relational, distributed, open-source, and horizontally scalable databases.
- Contexture - Abstracts queries/filters and results/aggregations from different backing data stores like ElasticSearch and MongoDB.
- Database Tools - Everything that makes working with databases easier.
Data Exchange ‘Awesome’ Lists
- JSON - Text based data interchange format.
- CSV - A text file format that stores tabular data and uses a comma to separate values.
Analytics ‘Awesome’ Lists
The data engineering space is moving quickly, whether measured by hiring growth rate, business needs, or toolset evolution. Fortunately, the available resources are keeping pace. Although it’s easy to get lost in the day-to-day grind of projects, I’ve been able to rekindle my enthusiasm and keep up with the space by taking a step back to learn about newer paradigms and technologies. With this list of resources, I hope you all find a jumping-off point to do the same!
I'm a data engineering manager working to advance data-driven cultures by wrangling disparate data sources and empowering end users to uncover key insights that tell a bigger story.