Data Lake
- Store large volumes of data in raw form
- Data is processed and then used as the basis for analytics
- Store all types of data
  - Structured
    - Database tables
    - Excel sheets
  - Semi-structured
    - XML files
    - Webpages
  - Unstructured
    - Images
    - Audio files
- Structure
  - Typically stored in staged zones
    - Raw
    - Cleansed
    - Curated
- Use cases
  - Big data analytics
  - Machine learning
  - Predictive analytics
- Workloads
  - Big data processing
  - SQL queries #sql
  - Text mining
  - Streaming analytics
  - Machine learning
- Eliminates silos
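
The raw, cleansed and curated zones listed above can be illustrated with a minimal sketch. It assumes the zones are plain directories and that the input is a small CSV of orders; the zone names, file names, columns and helper functions are all hypothetical and chosen only for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical zone names mirroring the raw / cleansed / curated stages.
ZONES = ("raw", "cleansed", "curated")


def build_lake(root: Path) -> dict:
    """Create one directory per zone and return them keyed by zone name."""
    zones = {name: root / name for name in ZONES}
    for path in zones.values():
        path.mkdir(parents=True, exist_ok=True)
    return zones


def ingest_raw(zones: dict, name: str, payload: str) -> Path:
    """Land the source data untouched in the raw zone."""
    target = zones["raw"] / name
    target.write_text(payload, encoding="utf-8")
    return target


def cleanse(zones: dict, name: str) -> Path:
    """Parse the raw CSV, drop rows without an amount, normalise values."""
    rows = []
    with (zones["raw"] / name).open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if not row["amount"].strip():
                continue  # incomplete record, excluded from the cleansed zone
            rows.append(
                {
                    "customer": row["customer"].strip().lower(),
                    "amount": float(row["amount"]),
                }
            )
    target = zones["cleansed"] / (Path(name).stem + ".json")
    target.write_text(json.dumps(rows), encoding="utf-8")
    return target


def curate(zones: dict, name: str) -> Path:
    """Aggregate cleansed rows into per-customer totals for analytics."""
    rows = json.loads((zones["cleansed"] / name).read_text(encoding="utf-8"))
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    target = zones["curated"] / "totals_by_customer.json"
    target.write_text(json.dumps(totals, sort_keys=True), encoding="utf-8")
    return target


def run_demo() -> dict:
    """Push one small file through all three zones and return the result."""
    with tempfile.TemporaryDirectory() as tmp:
        zones = build_lake(Path(tmp))
        ingest_raw(zones, "orders.csv", "customer,amount\nAda,10.5\n Bob ,\nada,4.5\n")
        cleanse(zones, "orders.csv")
        curated = curate(zones, "orders.json")
        return json.loads(curated.read_text(encoding="utf-8"))
```

The raw file is never modified; each later zone is derived from the one before it, so a cleansing rule can be changed and re-run without re-ingesting the source.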
Data Lake vs Data Warehouse vs Data Lakehouse
| | lake | warehouse | lakehouse |
|---|---|---|---|
| type | structured, semi-structured, unstructured | structured | structured, semi-structured, unstructured |
| | relational, non-relational | relational | relational, non-relational |
| schema | schema on read | schema on write | schema on read/write |
| format | raw, unfiltered | processed, vetted | raw, unfiltered, processed, curated, Delta format files |
| sources | big data, IoT, social media, streaming data | application, business, transactional data, batch reporting | big data, IoT, social media, streaming data, application, business, transactional data, batch reporting |
| scalability | easy to scale at low cost | difficult and expensive to scale | easy to scale at low cost |
| users | data scientists, data engineers | data warehouse professionals, business analysts | business analysts, data engineers, data scientists |
| use cases | machine learning, predictive analytics, real-time analytics | core reporting, BI | core reporting, BI, machine learning, predictive analytics |
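
The schema row of the comparison can be made concrete with a small sketch. It assumes a two-field schema and JSON text as the raw format; `SCHEMA`, `SchemaOnWriteStore` and `SchemaOnReadStore` are hypothetical names invented for this example, not part of any product.

```python
import json

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "name": str}


class SchemaOnWriteStore:
    """Warehouse-style: a record is validated before it is stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, record: dict) -> None:
        if set(record) != set(self.schema):
            raise ValueError(f"unexpected fields: {sorted(record)}")
        for field, expected in self.schema.items():
            if not isinstance(record[field], expected):
                raise ValueError(f"{field} must be {expected.__name__}")
        self.rows.append(record)


class SchemaOnReadStore:
    """Lake-style: raw text is stored as-is; the reader applies structure."""

    def __init__(self):
        self.blobs = []

    def write(self, raw: str) -> None:
        self.blobs.append(raw)  # nothing is rejected at write time

    def read(self, schema):
        """Yield only the blobs that can be interpreted under the schema."""
        for raw in self.blobs:
            try:
                record = json.loads(raw)
            except json.JSONDecodeError:
                continue  # not parseable under this reader's assumptions
            if not isinstance(record, dict):
                continue
            try:
                yield {field: cast(record[field]) for field, cast in schema.items()}
            except (KeyError, TypeError, ValueError):
                continue  # missing or unconvertible field
```

With schema on write, bad data fails loudly at ingestion. With schema on read, ingestion always succeeds and each reader decides what counts as valid, which is what lets the lake hold data whose use is not yet known.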
Data Lakehouse
- Layers on top of the data lake
- Delta Lake storage layer
  - Handles ACID transactions for data reliability and supports streaming integrations
  - Provides data versioning and schema enforcement
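
To illustrate what versioning and schema enforcement mean in practice, here is a toy in-memory table. It is not the Delta Lake API: `ToyVersionedTable` is a hypothetical class, and it only models all-or-nothing commits, schema checks and reading older versions, not isolation between concurrent writers or durability.

```python
import copy
from typing import Optional


class ToyVersionedTable:
    """In-memory sketch: every commit appends a new snapshot to a log."""

    def __init__(self, schema: dict):
        self.schema = schema  # column name -> expected Python type
        self._log = [[]]      # version 0 is the empty table

    @property
    def version(self) -> int:
        return len(self._log) - 1

    def _validate(self, row: dict) -> None:
        """Schema enforcement: reject rows with wrong columns or types."""
        if set(row) != set(self.schema):
            raise ValueError(
                f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
            )
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise TypeError(f"{column!r} expects {expected.__name__}")

    def commit(self, rows: list) -> int:
        """All-or-nothing append: validate every row, then publish once."""
        for row in rows:
            self._validate(row)
        snapshot = self._log[-1] + copy.deepcopy(rows)
        self._log.append(snapshot)
        return self.version

    def read(self, version: Optional[int] = None) -> list:
        """Return the latest snapshot, or an earlier one by version number."""
        index = self.version if version is None else version
        return copy.deepcopy(self._log[index])
```

Because validation happens before anything is appended to the log, a batch with one bad row leaves the table at its previous version, and earlier versions stay readable after later commits.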
Architectures
- Resource management and orchestration
  - Consistently execute tasks
- Connectors for easy access
  - Workflows to allow users to access and share data in the right form
- Reliable analytics
  - Fast, scalable and distributed
  - Diverse range of workload categories across multiple languages
- Data classification
  - Profiling, cataloging and archiving
  - Track
    - Content, quality, location and history
- Extract, load, transform processes
  - Extracted from multiple sources
  - Loaded into raw zone
  - Cleaned and transformed after loading
- Security & support
  - Masking
  - Auditing
  - Encryption
  - Access monitoring
- Governance & stewardship
  - Users educated on architecture
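
The extract, load, transform order described above can be sketched with the standard-library `sqlite3` module standing in for the lake's store. The source systems, table names and columns (`SOURCES`, `raw_orders`, `curated_totals`) are hypothetical; the point is only that data lands untouched first and is cleaned afterwards, inside the store.

```python
import sqlite3

# Hypothetical source systems, each yielding loosely formatted rows.
SOURCES = {
    "crm": [("  Ada ", "10.50"), ("Bob", "")],
    "shop": [("ada", "4.50"), ("Cy", "7")],
}


def run_elt(sources: dict) -> list:
    conn = sqlite3.connect(":memory:")

    # Load target: everything is kept as text, exactly as received.
    conn.execute(
        "CREATE TABLE raw_orders (source TEXT, customer TEXT, amount TEXT)"
    )

    # Extract from each source and load into the raw table without changes.
    for source, rows in sources.items():
        conn.executemany(
            "INSERT INTO raw_orders VALUES (?, ?, ?)",
            [(source, customer, amount) for customer, amount in rows],
        )

    # Transform after loading: trim, normalise case, drop empty amounts,
    # and aggregate into a curated table.
    conn.execute(
        """
        CREATE TABLE curated_totals AS
        SELECT lower(trim(customer)) AS customer,
               CAST(SUM(CAST(amount AS REAL)) AS REAL) AS total
        FROM raw_orders
        WHERE trim(amount) <> ''
        GROUP BY lower(trim(customer))
        """
    )

    result = conn.execute(
        "SELECT customer, total FROM curated_totals ORDER BY customer"
    ).fetchall()
    conn.close()
    return result
```

Keeping `raw_orders` intact means the transform step can be rewritten and re-run later without going back to the source systems.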
Components
- Storage Layer
  - Operates the physical storage
- File Layer
  - Determines the format in which data is stored in the storage layer
- Metadata Layer
  - Defines structure (e.g. which files correspond to which tables) and simplifies data operations and governance
- Compute Layer
  - Performs data processing and querying
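
How the four layers fit together can be shown with a toy sketch. Every class and function here is hypothetical: storage is a dictionary, the file format is JSON Lines (picked only for simplicity), and the compute layer is a plain scan with a filter.

```python
import json


class StorageLayer:
    """Physical storage, modelled as a dict from path to bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, path: str, data: bytes) -> None:
        self._objects[path] = data

    def get(self, path: str) -> bytes:
        return self._objects[path]


class FileLayer:
    """File format: decides how rows are encoded inside a stored object."""

    @staticmethod
    def encode(rows: list) -> bytes:
        return "\n".join(json.dumps(row) for row in rows).encode("utf-8")

    @staticmethod
    def decode(data: bytes) -> list:
        return [json.loads(line) for line in data.decode("utf-8").splitlines() if line]


class MetadataLayer:
    """Catalog: records which files make up which table."""

    def __init__(self):
        self._tables = {}

    def register(self, table: str, path: str) -> None:
        self._tables.setdefault(table, []).append(path)

    def files_for(self, table: str) -> list:
        return list(self._tables.get(table, []))


class ComputeLayer:
    """Query engine: resolves a table to files, then scans and filters."""

    def __init__(self, storage: StorageLayer, metadata: MetadataLayer):
        self.storage = storage
        self.metadata = metadata

    def query(self, table: str, predicate):
        for path in self.metadata.files_for(table):
            for row in FileLayer.decode(self.storage.get(path)):
                if predicate(row):
                    yield row


def write_table(storage, metadata, table: str, path: str, rows: list) -> None:
    """Write one file and register it under a table name."""
    storage.put(path, FileLayer.encode(rows))
    metadata.register(table, path)
```

A query names only the table; the metadata layer is what turns that name into a list of files, which is why the compute layer never needs to know paths or formats in advance.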