Data Engineering Technologies 2021
written by Vijay Ram
- Getdbt (dbt), Hits the sweet spot of Apache Spark by bringing a simplified SQL-based pipeline!
- Prefect, Designed to make workflow management easier & better than with Apache Airflow
- DVC, Open-source version control system for machine learning projects & well suited to MLOps
- Great_Expectations, Data science testing framework, it’s already amazing!
- Amundsen, An open-sourced data discovery and metadata engine
- Marquez, Open source metadata with an amazing UI
- Dagster, A data orchestrator for machine learning, very programming-based & in a similar space to Airflow/Prefect, but it emphasizes state flow
- Apache Calcite, Framework for building SQL databases and data management systems without owning data. Hive, Flink, and others use Calcite.
- maiot-ZenML, Open-sourced MLOps framework, having a bit of everything.
- Apache Superset, Open source BI with many connectors available
- Metabase, An Open source BI with amazing visualizations
- Hopsworks, Open-sourced MLOps feature store
- Feast, Open-source feature store, now with Tecton
- MLflow, Machine learning platform, first of its kind
- Pachyderm, MLOps platform, in a similar space to MLflow
- Montecarlodata (Monte Carlo), Data governance, data discovery, and data observability
- Tecton, Enterprise feature store
- Fiddler, Enterprise explainable AI
- Cnvrg, Enterprise MLOps
- RAPIDS, Data science on GPU
- Dask, Data science purely in Python
- Trino, formerly PrestoSQL. With a clear separation from Presto, Trino can now focus heavily on features.
- Apache Pinot, Real-time distributed OLAP datastore. Its growth is amazing & it’s in a similar space to Druid, but not exactly the same!
- Databricks, With the new SQL Analytics and the lakehouse paper, expect more amazing OSS
- Delta Lake, ACID on Apache Spark
- Koalas, Pandas on Apache Spark
- Apache Beam, Simplifies stream processing, is gaining lots of attention, and is slowly moving from being GCP-only to something more generic.
- Apache Arrow, Essential because of its non-JVM, in-memory columnar format & vectorized processing
- Ray, Distributed machine learning and now streaming
- Anodot, Monitors all your data in real time for lightning-fast detection of incidents
- DataRobot, Solid ML platform with a strong focus on enterprise MLOps
- Dataiku, Enterprise AI/MLOps platform
- Fivetran, Data integration pipeline
- DataFrame Whale, Extremely simple data discovery tool
- Nextflow, Data-driven computational pipelines designed for bioinformatics, but it can go beyond that
- Confluent, Apache Kafka & accompanying ecosystem
- Papermill, Parameterizes notebooks, making data science more interesting and easier.
- Algorithmia, Enterprise MLOps
- Abacus AI, Enterprise AI with AutoML, in a similar space to DataRobot
*Source: Medium