Data Lake

  • Store large volumes of data in raw form
  • Data processed and used as basis for analytics
  • Store all types of data
    • Structured
      • Database tables
      • Excel sheets
    • Semi-structured
      • XML files
      • Webpages
    • Unstructured
      • Images
      • Audio files
  • Typically stored in staged zones
    • Raw
    • Cleansed
    • Curated
  • Use cases
    • Big data analytics
    • Machine learning
    • Predictive analytics
  • Workloads
    • Big data processing
    • SQL queries
    • Text mining
    • Streaming analytics
    • Machine learning
  • Eliminates silos
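
The staged zones listed above (raw, cleansed, curated) can be sketched with plain directories. This is a minimal illustration, not a reference implementation: the directory layout, the function names, and the sensor record fields are all assumptions made up for the example.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake: Path, name: str, payload: str) -> Path:
    """Raw zone: store the payload exactly as received."""
    path = lake / "raw" / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload)
    return path


def cleanse(lake: Path, name: str) -> Path:
    """Cleansed zone: keep only well-formed records with normalized types."""
    records = []
    for line in (lake / "raw" / name).read_text().splitlines():
        try:
            rec = json.loads(line)
            records.append({"sensor": str(rec["sensor"]), "temp_c": float(rec["temp_c"])})
        except (ValueError, KeyError, TypeError):
            continue  # malformed or incomplete records remain only in the raw zone
    path = lake / "cleansed" / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(json.dumps(r) for r in records))
    return path


def curate(lake: Path, name: str) -> dict:
    """Curated zone: aggregate cleansed records into an analytics-ready view."""
    readings = {}
    for line in (lake / "cleansed" / name).read_text().splitlines():
        rec = json.loads(line)
        readings.setdefault(rec["sensor"], []).append(rec["temp_c"])
    averages = {sensor: sum(vals) / len(vals) for sensor, vals in readings.items()}
    path = lake / "curated" / "avg_temp.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(averages))
    return averages


# Hypothetical sample data: two bad lines are kept in raw but filtered later.
raw_payload = "\n".join([
    '{"sensor": "a", "temp_c": 20}',
    '{"sensor": "a", "temp_c": "22"}',
    "not json",
    '{"sensor": "b"}',
    '{"sensor": "b", "temp_c": 10.5}',
])

with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    land_raw(lake, "readings.jsonl", raw_payload)
    cleanse(lake, "readings.jsonl")
    averages = curate(lake, "readings.jsonl")

print(averages)
```

Keeping the raw file untouched means the cleansing rules can be changed and re-run later without going back to the source system.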

Data Lake vs Data Warehouse vs Data Lakehouse

| | lake | warehouse | lakehouse |
|---|---|---|---|
| type | structured, semi-structured, unstructured; relational, non-relational | structured; relational | structured, semi-structured, unstructured; relational, non-relational |
| schema | schema on read | schema on write | schema on read/write |
| format | raw, unfiltered | processed, vetted | raw, unfiltered, processed, curated, Delta format files |
| sources | big data, IoT, social media, streaming data | application, business, transactional data, batch reporting | big data, IoT, social media, streaming data, application, business, transactional data, batch reporting |
| scalability | easy to scale at low cost | difficult and expensive to scale | easy to scale at low cost |
| users | data scientists/engineers | data warehouse professionals, business analysts | business analysts, data engineers, data scientists |
| use cases | machine learning, predictive analytics, real-time analytics | core reporting, BI | core reporting, BI, machine learning, predictive analytics |
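
The schema row of the comparison is the easiest to show in code. The sketch below contrasts the two approaches under assumed names: `SCHEMA`, `write_with_schema`, and `read_with_schema` are hypothetical helpers for illustration, not part of any particular product.

```python
import json

# Hypothetical two-column schema used by both approaches.
SCHEMA = {"user_id": int, "amount": float}


def write_with_schema(table: list, record: dict) -> None:
    """Schema on write (warehouse style): validate first, reject bad records."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    table.append(record)


def read_with_schema(raw_lines: list) -> list:
    """Schema on read (lake style): store anything, apply structure when querying."""
    rows = []
    for line in raw_lines:
        try:
            obj = json.loads(line)
            rows.append({"user_id": int(obj["user_id"]), "amount": float(obj["amount"])})
        except (ValueError, KeyError, TypeError):
            continue  # the line does not fit this query's schema, so skip it
    return rows


warehouse = []
write_with_schema(warehouse, {"user_id": 1, "amount": 9.5})
try:
    write_with_schema(warehouse, {"user_id": "2", "amount": 3.0})
except ValueError as err:
    print("rejected on write:", err)

lake_lines = [
    '{"user_id": "2", "amount": "3.0", "note": "extra"}',
    "garbage",
    '{"user_id": 3}',
]
print(read_with_schema(lake_lines))
```

With schema on write, the bad record never enters the table. With schema on read, all three lines are stored and the reader decides which ones are usable for a given query.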

Data Lakehouse

  • Layers on top of the data lake
    • Delta Lake storage layer
    • Handles ACID transactions for data reliability and supports streaming integrations
      • Data versioning and schema enforcement
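
The ideas of data versioning and schema enforcement can be illustrated with a toy in-memory table backed by an append-only commit log. This is not Delta Lake's API or storage format; the `VersionedTable` class and its methods are assumptions invented for the sketch, and it only models all-or-nothing commits, not isolation or durability.

```python
class VersionedTable:
    """Toy table: each successful append is one commit in an append-only log."""

    def __init__(self, schema: dict):
        self.schema = schema  # column name -> required Python type
        self.log = []         # one entry per committed batch of rows

    def append(self, rows: list) -> int:
        """Validate the whole batch, then commit it; return the new version."""
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for col, expected in self.schema.items():
                if not isinstance(row[col], expected):
                    raise ValueError(f"{col} must be {expected.__name__}")
        # Reached only if every row passed, so a batch is all-or-nothing.
        self.log.append(tuple(dict(r) for r in rows))
        return len(self.log)

    def read(self, version=None) -> list:
        """Read the latest state, or the state as of an earlier version."""
        upto = len(self.log) if version is None else version
        return [row for batch in self.log[:upto] for row in batch]


table = VersionedTable({"id": int, "name": str})
v1 = table.append([{"id": 1, "name": "a"}])

try:
    # The second row violates the schema, so neither row is committed.
    table.append([{"id": 2, "name": "b"}, {"id": "3", "name": "c"}])
except ValueError as err:
    print("rejected:", err)

v2 = table.append([{"id": 2, "name": "b"}])
print(table.read())            # latest version
print(table.read(version=v1))  # earlier version
```

Because old commits are never modified, reading an earlier version is just a matter of replaying a shorter prefix of the log.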

Architectures

  • Resource management and orchestration
    • Consistently execute tasks
  • Connectors for easy access
    • Workflows to allow users to access and share data in the right form
  • Reliable analytics
    • Fast, scalable and distributed
    • Diverse range of workload categories across multiple languages
  • Data classification
    • Profiling, cataloging and archiving
    • Track
      • Content, quality, location and history
  • Extract, load, transform processes
    • Extracted from multiple sources
    • Loaded into raw zone
    • Cleaned and transformed after loading
  • Security & support
    • Masking
    • Auditing
    • Encryption
    • Access monitoring
  • Governance & stewardship
    • Users educated on architecture
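
The extract, load, transform steps above can be sketched with Python's built-in `sqlite3` module standing in for the lake's query engine. The source systems, table names, and columns are hypothetical; the point is only the ordering, where data is loaded as-is and cleaned afterwards.

```python
import sqlite3


def extract() -> list:
    """Extract: pull rows from two hypothetical sources, untouched."""
    crm = [("crm", " Alice ", "100"), ("crm", "Bob", "n/a")]
    shop = [("shop", "alice", "50.5")]
    return crm + shop


def run_elt() -> list:
    conn = sqlite3.connect(":memory:")

    # Load: everything lands in the raw zone as text, with no cleaning yet.
    conn.execute("CREATE TABLE raw_orders (source TEXT, customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", extract())

    # Transform: clean names, drop non-numeric amounts, aggregate per customer.
    conn.execute(
        """
        CREATE TABLE curated_orders AS
        SELECT lower(trim(customer)) AS customer,
               SUM(CAST(amount AS REAL)) AS total
        FROM raw_orders
        WHERE amount GLOB '[0-9]*'
        GROUP BY lower(trim(customer))
        """
    )
    rows = conn.execute(
        "SELECT customer, total FROM curated_orders ORDER BY customer"
    ).fetchall()
    raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
    conn.close()
    return rows, raw_count


curated_rows, raw_count = run_elt()
print(curated_rows, raw_count)
```

The raw table still holds all three source rows after the transform, including the one with the unusable amount, so a different transformation can be applied later without re-extracting.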

Components

  1. Storage Layer
    • operates the physical storage
  2. File Layer
    • determines the format in which data is stored in the storage layer
  3. Metadata Layer
    • defines structure (e.g. which files correspond to which tables) and simplifies data operations and governance
  4. Compute Layer
    • performs data processing and querying
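
A small sketch can show how the four components relate. Every class below is a hypothetical stand-in: an in-memory dictionary plays the storage layer, JSON lines plays the file format, and a simple filter scan plays the compute engine.

```python
import json


class StorageLayer:
    """Holds opaque bytes by path (stand-in for disks or object storage)."""

    def __init__(self):
        self._blobs = {}

    def put(self, path: str, data: bytes) -> None:
        self._blobs[path] = data

    def get(self, path: str) -> bytes:
        return self._blobs[path]


class FileLayer:
    """Decides how rows are encoded in storage; JSON lines is assumed here."""

    @staticmethod
    def encode(rows: list) -> bytes:
        return "\n".join(json.dumps(r) for r in rows).encode("utf-8")

    @staticmethod
    def decode(data: bytes) -> list:
        return [json.loads(line) for line in data.decode("utf-8").splitlines()]


class MetadataLayer:
    """Records which files make up which table."""

    def __init__(self):
        self._tables = {}

    def register(self, table: str, path: str) -> None:
        self._tables.setdefault(table, []).append(path)

    def files_for(self, table: str) -> list:
        return list(self._tables.get(table, []))


class ComputeLayer:
    """Runs queries by combining the three layers below it."""

    def __init__(self, storage, files, metadata):
        self.storage, self.files, self.metadata = storage, files, metadata

    def scan(self, table: str, predicate):
        for path in self.metadata.files_for(table):
            for row in self.files.decode(self.storage.get(path)):
                if predicate(row):
                    yield row


storage, metadata = StorageLayer(), MetadataLayer()
parts = {
    "events/part-0.jsonl": [{"user": "a", "clicks": 3}],
    "events/part-1.jsonl": [{"user": "b", "clicks": 7}, {"user": "c", "clicks": 1}],
}
for path, rows in parts.items():
    storage.put(path, FileLayer.encode(rows))
    metadata.register("events", path)

compute = ComputeLayer(storage, FileLayer, metadata)
busy = list(compute.scan("events", lambda r: r["clicks"] >= 3))
print(busy)
```

The compute layer never needs to know file paths: it asks the metadata layer which files belong to the table, which is what lets files be added or reorganized without changing queries.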