Stack
The Modern Data Stack: Open-source edition
- Collection & Integration: Getting data into the data warehouse/lake for further processing
 - Event Processing: Collecting events and getting them into the warehouse
 - Data Integration: Moving data from various sources into the warehouse and vice versa (Reverse ETL)
  - Airbyte
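
The two directions of data integration noted above can be sketched in a few lines. This is a minimal illustration, not how Airbyte works internally: an in-memory SQLite database stands in for the warehouse, and `SOURCE_ROWS`, `load_into_warehouse`, and `reverse_etl` are hypothetical names invented for the example.

```python
import sqlite3

# Hypothetical source records, standing in for an API or application database.
SOURCE_ROWS = [
    {"user_id": 1, "plan": "free"},
    {"user_id": 2, "plan": "pro"},
]


def load_into_warehouse(conn, rows):
    """Integration: copy source rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (user_id INTEGER PRIMARY KEY, plan TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO users (user_id, plan) VALUES (:user_id, :plan)",
        rows,
    )
    conn.commit()


def reverse_etl(conn):
    """Reverse ETL: read warehouse data back out as payloads for an operational tool."""
    cursor = conn.execute(
        "SELECT user_id, plan FROM users WHERE plan = 'pro' ORDER BY user_id"
    )
    return [{"user_id": user_id, "segment": "paying"} for user_id, _plan in cursor]


warehouse = sqlite3.connect(":memory:")  # in-memory stand-in for a real warehouse
load_into_warehouse(warehouse, SOURCE_ROWS)
payloads = reverse_etl(warehouse)
print(payloads)
```

In a real setup the payloads would be pushed to a CRM or marketing tool instead of printed.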
 
 
- Data Lake
- Data Warehousing: Similar to a data lake, but an all-in-one solution with the layers more tightly integrated.
 - ClickHouse
  - Distributed
  - PBs of data
 - Hydra
  - Single node
  - PostgreSQL extension #sql
  - Hundreds of GBs
 
 
- Streaming & Realtime: Emerging category for low-latency or operational use cases
 - Use cases
  - Collecting data (e.g., events) and ingesting it into a data warehouse/lake
  - Processing data in flight for enrichment before loading it into long-term storage
  - Real-time analytics and machine learning where latency matters, e.g., fraud detection and operational analysis
 - Kafka
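
The in-flight enrichment use case above can be sketched without a broker. This is a simplified illustration using plain Python generators rather than Kafka itself: `COUNTRY_BY_USER`, `enrich`, and `load` are hypothetical names, and a list stands in for long-term storage.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical lookup table used to enrich events (an assumption for this sketch).
COUNTRY_BY_USER = {"u1": "DE", "u2": "US"}


def enrich(events: Iterable[Dict]) -> Iterator[Dict]:
    """Add a country field to each event as it passes through the stream."""
    for event in events:
        country = COUNTRY_BY_USER.get(event["user_id"], "unknown")
        yield {**event, "country": country}


def load(events: Iterable[Dict], sink: List[Dict]) -> None:
    """Append each enriched event to the sink, one at a time."""
    for event in events:
        sink.append(event)


incoming = [
    {"user_id": "u1", "action": "click"},
    {"user_id": "u3", "action": "purchase"},
]
long_term_storage: List[Dict] = []  # stand-in for a warehouse or lake table
load(enrich(iter(incoming)), long_term_storage)
print(long_term_storage)
```

Because `enrich` is a generator, each event is enriched and loaded as it arrives instead of waiting for a full batch, which is the property that matters for the low-latency use cases listed above.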
- Transformations & Compute
 
- Data Orchestration: Managing how data pipelines are defined, structured, and orchestrated.
 - SQL-centric #sql
  - dbt
 - All-rounder
  - Apache Airflow: Legacy
  - Dagster: Focuses on the resulting data product
  - Prefect: Modern Airflow
 
 
- Apache Airflow
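
At its core, orchestration means running tasks in an order that respects their dependencies. The sketch below shows only that idea, using the standard library's `graphlib` (Python 3.9+). It is not the API of Airflow, Dagster, or Prefect, and the task names and `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

log = []  # records which tasks ran, in order


def extract():
    log.append("extract")


def transform():
    log.append("transform")


def load():
    log.append("load")


# Hypothetical pipeline definition: task name -> callable.
TASKS = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks it depends on.
DEPENDENCIES = {"transform": {"extract"}, "load": {"transform"}}


def run_pipeline(tasks, dependencies):
    """Run every task after all of its dependencies have run."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


order = run_pipeline(TASKS, DEPENDENCIES)
print(order)
```

Real orchestrators add scheduling, retries, and monitoring on top of this dependency ordering.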
 
- Data Catalogs: Finding the right data asset for the problem
 - Amundsen
 - DataHub
 
- BI & Data Apps: The consumption layer of the data stack, with subcategories including Notebooks, Self-Serve, Dashboards, Code-first Data Apps, and Product/Web Analytics.
 - Self-Serve & Dashboards (Tableau-like)
  - Metabase
  - Lightdash
  - Superset
 - Notebooks
  - Jupyter #py
 - End-to-end Product Analytics
  - PostHog
  - Plausible
 
 
