Skill: Zero-Copy Data Exploration & Ingestion
1. Description & Rationale
This skill enforces the Zero-Copy Architecture of the modern Google Cloud Lakehouse. It eliminates the overhead of copying data to external compute engines or creating intermediate files (CSVs/JSONs) on GCS.
It standardizes:
- Reading from BigQuery directly into Spark DataFrames using the Spark-BigQuery connector.
- Reading from BigQuery using BigQuery DataFrames (`bigframes.pandas`) for in-warehouse processing.
2. Environment Prerequisites
- Spark: Managed Service for Apache Spark Serverless (Runtime 3.0+ recommended) with the Spark BigQuery Connector pre-bundled.
- BigQuery: `bigframes` Python package installed.
3. Agent Execution Guidelines (System Prompts)
When generating data loading or exploration code:
- Never export to GCS first: Do not write code that exports BigQuery tables to GCS as CSV/Avro/Parquet before loading them into Spark or Pandas.
- Use BigQuery Storage API: Ensure Spark reads use the `"bigquery"` format, which utilizes the high-speed BigQuery Storage Read API.
- Use BigFrames for BQ: Ensure BigQuery-centric pipelines use `bigframes.pandas.read_gbq` to create references to the tables.
4. Opinionated Code Patterns
A. Spark-Centric Zero-Copy Load
Use the Spark-BigQuery connector to load tables directly.
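from pyspark.sql import SparkSession

# Reuse the active session (notebooks usually pre-create one) or start a new one
spark = SparkSession.builder.getOrCreate()

# Load BigQuery table directly into Spark DataFrame
users_df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.thelook_ecommerce.users")
    .load()
)
# Create temporary view for Spark SQL queries
users_df.createOrReplaceTempView("users")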
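The same load can be wrapped in a small helper so that every Spark read goes through one code path. This is a minimal sketch rather than part of the connector: the function name `load_bigquery_table` and its table-ID check are hypothetical, and `spark` is assumed to be an active SparkSession with the Spark BigQuery connector available.

```python
def load_bigquery_table(spark, table_id):
    """Read a BigQuery table into a Spark DataFrame without staging files on GCS.

    `spark` is assumed to be an active SparkSession. The three-part check on
    `table_id` is an illustrative convention, not a connector requirement.
    """
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected project.dataset.table, got {table_id!r}")
    # format("bigquery") selects the Spark BigQuery connector
    return spark.read.format("bigquery").option("table", table_id).load()
```

For example, `users_df = load_bigquery_table(spark, "bigquery-public-data.thelook_ecommerce.users")` would produce the same DataFrame as the pattern above.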
B. BigQuery-Centric Zero-Copy Load
Use BigQuery DataFrames to create a virtual DataFrame.
import bigframes.pandas as bpd

# Placeholder: replace with your Google Cloud project ID
PROJECT_ID = "your-project-id"

# Configure BigQuery DataFrames global options
bpd.options.bigquery.project = PROJECT_ID
bpd.options.bigquery.location = "US"  # Or your dataset location
# Load table reference (zero data movement)
users_df = bpd.read_gbq("bigquery-public-data.thelook_ecommerce.users")
5. Verification Checklist
- No GCS Export: Verify the code does not contain `bq extract` or GCS export steps before loading.
- Spark Format: Ensure the Spark read format is explicitly `"bigquery"`.
- BigFrames Config: Ensure `bpd.options.bigquery.project` is set before calling `read_gbq`.
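The checklist can also be approximated automatically. The sketch below is a rough, text-based heuristic, not an official linter: the function name `check_zero_copy` and its regular expressions are assumptions made for illustration, and they will miss or misreport code that is structured unusually (for example, a Spark read of a non-BigQuery source that is legitimate in your pipeline).

```python
import re

def check_zero_copy(source):
    """Return a list of checklist violations found in a Python source string."""
    problems = []

    # 1. No GCS export: flag `bq extract` commands and extract_table client calls
    if re.search(r"\bbq\s+extract\b|\bextract_table\s*\(", source):
        problems.append("GCS export step found before load")

    # 2. Spark format: every .read.format(...) call must name "bigquery"
    for fmt in re.findall(r"\.read\s*\.format\(\s*['\"]([^'\"]+)['\"]", source):
        if fmt != "bigquery":
            problems.append(f"Spark read format is {fmt!r}, expected 'bigquery'")

    # 3. BigFrames config: project must be assigned before the first read_gbq
    read_pos = source.find("read_gbq(")
    if read_pos != -1:
        match = re.search(r"options\.bigquery\.project\s*=", source)
        if match is None or match.start() > read_pos:
            problems.append("bpd.options.bigquery.project is not set before read_gbq")

    return problems
```

An empty list means none of the three heuristics fired; any returned strings name the checklist item that needs a manual look.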