Reference Architecture: Hadoop to Google Cloud Data Lakehouse Migration
This document presents the reference architecture for migrating from a legacy Hadoop environment to a modern Data Lakehouse on Google Cloud, as implemented in this demo.
Architecture Diagram
graph LR
subgraph source_env ["Source Environment"]
hive[(Hive)]
onprem_spark[Spark]
hdfs[(HDFS)]
end
subgraph migration_tools ["Migration Tools"]
bts[BigQuery Translation Service]
ma[Migration Assessment]
md[dwh-migration-dumper]
sts[Storage Transfer Service]
end
subgraph target_env ["Target Environment"]
bz[(Bronze Zone / BQ + Cloud Storage)]
bronze_spark["Managed Spark"]
sz[(Silver Zone / BQ + Cloud Storage)]
silver_spark["Managed Spark"]
gz[(Gold Zone / BQ + Cloud Storage)]
lakehouse[(Lakehouse Runtime Catalog)]
bigquery[(BigQuery)]
managed_spark[Managed Spark]
end
subgraph serving ["Serving Layer"]
bi[Analytics & BI]
ml["Machine Learning & Notebooks"]
agents[Agents]
operational[Operational Data]
end
%% Flow
hive --> bts
hive --> md
hive --> ma
hdfs --> ma
hdfs --> sts
onprem_spark --> ma
bz <--> lakehouse
sz <--> lakehouse
gz <--> lakehouse
sts --> bz
bz --> bronze_spark
bronze_spark --> sz
sz --> silver_spark
silver_spark --> gz
gz --> managed_spark
gz --> bigquery
bigquery --> bi
bigquery --> operational
managed_spark --> ml
managed_spark --> agents
Component Descriptions
Source Environment
- Hadoop Cluster: Simulated by a standalone cluster of "Managed Service for Apache Spark" (formerly Dataproc).
- HDFS: Distributed file system storing the raw data.
- Hive Metastore: Manages metadata for Hive tables.
Migration Tools
- Migration Assessment: Assesses usage in the Hadoop environment and predicts usage on Google Cloud services.
- dwh-migration-dumper: Extracts DDL and metadata from the source Hive system for assessment.
- Storage Transfer Service (STS): Cloud-native service used to transfer data from HDFS to Google Cloud Storage.
- BigQuery Translation Service: Translates legacy HiveQL queries to modern GoogleSQL for BigQuery.
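As a rough illustration of how the source components map to the migration tools, the sketch below copies the source-to-tool edges from the architecture diagram into a small lookup. The node IDs (`hive`, `hdfs`, `onprem_spark`, `bts`, `ma`, `md`, `sts`) come from the diagram; the helper `tools_for` is a hypothetical name introduced only for this example and is not part of the demo.

```python
# Edges copied from the "Flow" section of the architecture diagram
# (source component -> migration tool). Node IDs are the diagram's own.
SOURCE_TO_TOOL_EDGES = [
    ("hive", "bts"),
    ("hive", "md"),
    ("hive", "ma"),
    ("hdfs", "ma"),
    ("hdfs", "sts"),
    ("onprem_spark", "ma"),
]

# Display names as they appear in the diagram.
TOOL_NAMES = {
    "bts": "BigQuery Translation Service",
    "ma": "Migration Assessment",
    "md": "dwh-migration-dumper",
    "sts": "Storage Transfer Service",
}


def tools_for(source: str) -> list[str]:
    """Return the sorted display names of the tools that read from `source`.

    `tools_for` is a hypothetical helper for illustration only.
    """
    return sorted(
        TOOL_NAMES[tool] for src, tool in SOURCE_TO_TOOL_EDGES if src == source
    )


if __name__ == "__main__":
    for source in ("hive", "hdfs", "onprem_spark"):
        print(source, "->", tools_for(source))
```

Read this way, the on-premises Spark workloads are only assessed, Hive feeds the assessment, the metadata dumper, and the SQL translation, and HDFS is both assessed and transferred.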
Target Environment
The target environment is divided into the classic three-tier "Medallion" architecture:
- Bronze Zone: Used for storing raw data, as is.
- Silver Zone: Used for storing processed data, cleaned and verified.
- Gold Zone: Used for storing enriched data, ready for analytics and serving.
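To make the three zones concrete, here is a minimal sketch of the bronze, silver, and gold steps in plain Python. It is not the demo's implementation: the demo runs these steps as Spark jobs, and the record fields (`order_id`, `country`, `amount`), the validation rule, and the function names `to_silver` and `to_gold` are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical raw records, kept "as is" in the Bronze Zone (all strings).
bronze = [
    {"order_id": "1", "country": "us", "amount": "10.50"},
    {"order_id": "2", "country": "DE", "amount": "n/a"},  # invalid amount
    {"order_id": "3", "country": "US", "amount": "4.50"},
]


def to_silver(rows):
    """Clean and verify: cast types, normalize casing, drop invalid rows."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # reject rows whose amount is not numeric
        cleaned.append(
            {
                "order_id": int(row["order_id"]),
                "country": row["country"].upper(),
                "amount": amount,
            }
        )
    return cleaned


def to_gold(rows):
    """Enrich for serving: aggregate the total amount per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

The point of the sketch is the division of responsibility: bronze is never modified, silver applies validation and typing, and gold holds only the shape that consumers query.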
Each zone consists of:
- Cloud Storage: Serves as the storage layer for everything from CSVs and binary files to Parquet files organized as Apache Iceberg tables.
- Lakehouse Runtime Iceberg Catalog: Managed service for managing datasets in the Apache Iceberg format.
- Managed Service for Apache Spark Serverless: Runs Spark jobs that process data from one zone to the next.
- BigQuery: Enables querying Iceberg tables directly in Cloud Storage with BigQuery performance and security.
Serving Layer
The serving layer is responsible for serving data to end users. It consists of:
- BI & Analytics: Tools for data visualization and analysis.
- Machine Learning and Notebooks: Tools for data scientists to develop ML models.
- Agents: Deployed agents that need grounding in enterprise data.
- Operational Data: Data used by operational, user-facing systems.
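One way to read the diagram is as data lineage: for any serving consumer, which components does the data pass through on its way from the transfer step? The sketch below copies the target and serving edges from the diagram and walks them backwards. The node IDs are the diagram's own; the bidirectional catalog edges (`bz`, `sz`, `gz` to `lakehouse`) are deliberately left out, and `lineage` is a hypothetical helper name used only here.

```python
# Directed edges copied from the diagram's target and serving flow.
# Catalog edges (zone <--> lakehouse) are omitted in this sketch.
EDGES = [
    ("sts", "bz"),
    ("bz", "bronze_spark"),
    ("bronze_spark", "sz"),
    ("sz", "silver_spark"),
    ("silver_spark", "gz"),
    ("gz", "managed_spark"),
    ("gz", "bigquery"),
    ("bigquery", "bi"),
    ("bigquery", "operational"),
    ("managed_spark", "ml"),
    ("managed_spark", "agents"),
]

# In this part of the diagram every node has at most one upstream node,
# so a child -> parent map is enough to walk the lineage backwards.
PARENT = {child: parent for parent, child in EDGES}


def lineage(consumer: str) -> list[str]:
    """Return the path from the first upstream node down to `consumer`."""
    path = [consumer]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


if __name__ == "__main__":
    for consumer in ("bi", "operational", "ml", "agents"):
        print(" -> ".join(lineage(consumer)))
```

In this reading, BI and operational consumers are served through BigQuery, while ML and agent workloads are served through Managed Spark, and both branches start from the Gold Zone.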