Skip to content

Reference Architecture: Hadoop to Google Cloud Data Lakehouse Migration

This document presents the reference architecture for migrating from a legacy Hadoop environment to a modern Data Lakehouse on Google Cloud, as implemented in this demo.

Architecture Diagram

graph LR
    subgraph source_env ["Source Environment"]
      hive[(Hive)]
      onprem_spark[Spark]
      hdfs[(HDFS)]
    end

    subgraph migration_tools ["Migration Tools"]
        bts[BigQuery Translation Service]
        ma[Migration Assessment]
        md[dwh-migration-dumper]
        sts[Storage Transfer Service]
    end


  subgraph target_env ["Target Environment"]
    bz[(Bronze Zone / BQ + Cloud Storage)]
    bronze_spark["Managed Spark"]
    sz[(Silver Zone / BQ + Cloud Storage)]
    silver_spark["Managed Spark"]
    gz[(Gold Zone / BQ + Cloud Storage)]
    lakehouse[(Lakehouse Runtime Catalog)]
    bigquery[(BigQuery)]
    managed_spark[Managed Spark]

  end

  subgraph serving ["Serving Layer"]
    bi[Analytics & BI]
    ml["Machine Learning & Notebooks"]
    agents[Agents]
    operational[Operational Data]
  end

    %% Flow
    hive --> bts
  hive --> md
  hive --> ma
  hdfs --> ma
  hdfs --> sts
  onprem_spark --> ma

  bz <--> lakehouse
  sz <--> lakehouse
  gz <--> lakehouse

  sts --> bz
  bz --> bronze_spark
  bronze_spark --> sz
  sz --> silver_spark
  silver_spark --> gz

  gz --> managed_spark
  gz --> bigquery

  bigquery --> bi
  bigquery --> operational

  managed_spark --> ml
  managed_spark --> agents

Component Descriptions

Source Environment

  • Hadoop Cluster: Simulated by a standalone cluster of "Managed Service for Apache Spark" (formally Dataproc)
    • HDFS: Distributed file system storing the raw data.
    • Hive Metastore: Manages metadata for Hive tables.

Migration Tools

  • migration-assesment-tool: Tool for assessing usage in hadoop, and predict usage on Google Cloud services.
  • dwh-migration-dumper: Extracts DDL and metadata from the source Hive system for assessment.
  • Storage Transfer Service (STS): Cloud-native service used to transfer data from HDFS to Google Cloud Storage.
  • BigQuery Translation Service: Translates legacy HiveQL queries to modern GoogleSQL for BigQuery.

Target Environment

Divided into the classic 3 tier "Medallion" architecture:

  • Bronze Zone: Used for storing raw data, as is.
  • Silver Zone: Used for storing processed data, cleaned and verified.
  • Gold Zone: Used for storing enriched data, ready for analytics and serving.

Each zone is comprised of:

  • Cloud Storage: Serves as the storage layer, from CSVs and binary files, to Parquet files with Apache Iceberg format.
  • Lakehouse Runtime Iceberg Catalog: Managed service for managing datasets in the Apache Iceberg format.
  • Managed Service for Apache Spark Serverless: Runs Spark jobs to convert process data from one zone to the next.
  • BigQuery: Enables querying Iceberg tables directly in Cloud Storage with BigQuery performance and security.

Serving Layer

The serving layer is responsible for serving data to end users. It is comprised of:

  • BI & Analytics: Tools for data visualization and analysis.
  • Machine Learning processes and Notebook: Tools for data scientists to develop ML models.
  • Agents: Agents deployed that need grounding in enterprise data.
  • Operational Data: Data that is being used by operational, user facing systems.