Stack
The Modern Data Stack: Open-source edition
- Collection & Integration:
- Getting data into the data warehouse/lake for further processing
- Event Processing: Collecting events and getting them into the warehouse
- Data Integration: Moving data from various sources into the warehouse and vice versa (Reverse ETL)
- Airbyte
- Data Lake
- Data Warehousing
- Similar to a data lake but an all-in-one solution with all the layers more tightly integrated.
- ClickHouse
- Distributed
- PBs of data
- Hydra
- Single node
- PostgreSQL extension #sql
- Hundreds of GBs
- Streaming & Realtime
- Emerging category for low-latency or operational use cases
- Use cases
- Collecting data (e.g., events) and ingesting it into a data warehouse/lake
- Processing data in-flight for enrichment before loading into the long-term storage
- Real-time analytics and machine learning where latency matters, e.g. fraud detection and operational analysis
- Kafka
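
The "processing data in-flight" use case above can be sketched without any streaming library. This is a minimal illustration only: `COUNTRY_BY_USER`, `enrich`, `load`, and the event fields are hypothetical names, and a plain Python list stands in for both the event stream and the warehouse. In practice the events would come from a broker such as Kafka.

```python
from typing import Iterable, Iterator

# Hypothetical lookup table standing in for a reference dataset.
COUNTRY_BY_USER = {"u1": "DE", "u2": "US"}


def enrich(events: Iterable[dict]) -> Iterator[dict]:
    """Attach a country to each event while it is still in flight."""
    for event in events:
        # Build a new dict so the caller's event is not mutated.
        country = COUNTRY_BY_USER.get(event["user_id"], "unknown")
        yield {**event, "country": country}


def load(events: Iterable[dict], sink: list) -> int:
    """Append events to the sink and return how many were loaded."""
    count = 0
    for event in events:
        sink.append(event)
        count += 1
    return count


raw_events = [
    {"user_id": "u1", "type": "page_view"},
    {"user_id": "u3", "type": "signup"},
]
warehouse: list = []  # stand-in for long-term storage
loaded = load(enrich(raw_events), warehouse)
```

Because `enrich` is a generator, each event is enriched one at a time as `load` consumes it, which mirrors the enrich-then-load order of a streaming pipeline.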
- Transformations & Compute
- Data Orchestration
- Managing how data pipelines are defined, structured, and orchestrated.
- SQL Centric #sql
- dbt
- All-rounder
- Apache Airflow
- Legacy
- Dagster
- Focuses on the resulting data product (assets) rather than on tasks
- Prefect
- Modern Airflow
- Apache Airflow
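
As a rough illustration of what "defining, structuring, and orchestrating" a pipeline means, the sketch below models a pipeline as a dependency graph and derives a valid run order with Python's standard `graphlib` module (Python 3.9+). The task names and the `run_order` helper are hypothetical, and this is not how Airflow, Dagster, or Prefect are implemented. Those tools add scheduling, retries, and monitoring on top of the same basic idea.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "publish_report": {"join_orders_customers"},
}


def run_order(dag: dict) -> list:
    """Return one valid execution order in which dependencies come first."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
```

The two extract tasks have no dependencies, so their relative order is not fixed. The only guarantee is that every task appears after the tasks it depends on.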
- Data Catalogs
- Finding the right data asset for the problem
- Amundsen
- DataHub
- BI & Data Apps
- The consumption layer of the data stack with subcategories, including Notebooks, Self-serve, Dashboards, Code-first Data Apps, and Product/Web Analytics.
- Self-Serve & Dashboards (Tableau-like)
- Metabase
- Lightdash
- Superset
- Notebooks
- Jupyter #py
- End-to-end Product Analytics
- PostHog
- Plausible