Skill: Zero-Copy Pipeline Generator (Coordinator)

This is the coordinator skill that orchestrates the end-to-end creation of Spark-centric and BigQuery-centric data pipelines for a new use case. It manages the user interview, schema investigation, and architecture recommendation, then delegates code generation to specialized sub-skills.

Workflow

1. User Interview

Interview the user to understand the new use case. Ask about:

Business Goal: What is the prediction target (e.g., churn, recommendation)?
Datasets: What are the input BigQuery tables?
Features & Label: What are the features and the target label?
Data Prep Logic: How should the data be prepared/joined?
Model Complexity: Does the use case need a simple model (such as Logistic Regression) or a complex one (such as Random Forest or XGBoost)?
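
The interview answers can be kept in one record so that later steps read from a single place. The sketch below is illustrative only: the `UseCaseSpec` class, its field names, and the `missing_answers` helper are hypothetical and not part of this skill.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class UseCaseSpec:
    """Answers collected during the user interview (hypothetical structure)."""
    business_goal: str        # e.g. "churn"
    input_tables: List[str]   # BigQuery table references supplied by the user
    features: List[str]
    label: str
    data_prep_logic: str      # free-text description of joins and filters
    model_complexity: str     # "simple" or "complex"

    def missing_answers(self) -> List[str]:
        """Return the names of interview questions that are still unanswered."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

Checking `missing_answers()` before moving to schema investigation is one way to make sure no question was skipped.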

2. Schema Investigation

Use the bq CLI tool to inspect the schemas of the tables provided by the user.

For each table, run:

bq show --format=prettyjson <project_id>:<dataset_id>.<table_id>

Analyze the schema to confirm column names, types, and nullability. Present a summary to the user.
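
A minimal sketch of how the schema summary might be produced from the `bq show` output follows. It assumes the `bq` CLI is installed and authenticated, and that the JSON output carries the column list under `schema.fields` with `name`, `type`, and an optional `mode` key. The helper names `fetch_table_json` and `summarize_schema` are hypothetical.

```python
import json
import subprocess
from typing import Dict, List


def fetch_table_json(table_ref: str) -> dict:
    """Run `bq show --format=prettyjson <table_ref>` and parse its JSON output."""
    result = subprocess.run(
        ["bq", "show", "--format=prettyjson", table_ref],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def summarize_schema(table_json: dict) -> List[Dict[str, str]]:
    """Flatten the top-level columns into name/type/mode rows."""
    rows = []
    for column in table_json.get("schema", {}).get("fields", []):
        rows.append({
            "name": column["name"],
            "type": column["type"],
            # A column without an explicit mode is treated as NULLABLE.
            "mode": column.get("mode", "NULLABLE"),
        })
    return rows
```

Nested `RECORD` columns are not expanded here; a fuller version would recurse into each column's own `fields` list.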

3. Solution Recommendation

Based on the interview and schema, propose the architecture:

Pipeline Options:
- Spark-Centric: Recommend if the user has existing Spark workloads, needs custom PySpark logic, or prefers Spark MLlib.
- BigQuery-Centric: Recommend if the user wants a NoOps, SQL-backed pipeline using BigQuery DataFrames.
Serving Options:
- Lean JSON: Recommend if using Spark + Simple Model (Logistic Regression).
- ONNX: Recommend if using Spark + Complex Model (Random Forest/XGBoost).
- Native BQML: Recommend if using BigQuery-Centric pipeline.

Get the user's approval on the recommendation before proceeding.
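
The recommendation rules above can be sketched as two small functions. This is an illustration under stated assumptions: the function names and the string values `"spark"`, `"bigquery"`, `"simple"`, and `"complex"` are hypothetical, and falling back to the BigQuery-centric pipeline when no Spark signal is present is an assumption, not a rule stated by this skill.

```python
def recommend_pipeline(has_spark_workloads: bool,
                       needs_custom_pyspark: bool,
                       prefers_mllib: bool) -> str:
    """Pick the pipeline type from the interview signals."""
    if has_spark_workloads or needs_custom_pyspark or prefers_mllib:
        return "spark"
    # Assumption: with no Spark signal, default to the NoOps BigQuery option.
    return "bigquery"


def recommend_serving(pipeline: str, model_complexity: str) -> str:
    """Map the pipeline type and model complexity to a serving option."""
    if pipeline == "bigquery":
        return "Native BQML"
    if pipeline == "spark":
        if model_complexity == "simple":
            return "Lean JSON"
        if model_complexity == "complex":
            return "ONNX"
        raise ValueError(f"unknown model complexity: {model_complexity!r}")
    raise ValueError(f"unknown pipeline: {pipeline!r}")
```

The result is only a proposal; the user still has to approve it before any code is generated.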

4. Orchestrating Sub-Skills for Code Generation

Once the user approves the recommendation, generate the notebooks by activating the relevant sub-skills:

Spark-Centric Pipeline Generation

If generating the Spark-centric notebook (spark_centric_<use_case>.ipynb), apply the following sub-skills:

spark-lightning-optimization: Use to generate the Dataproc Serverless session with Lightning Engine.
zero-copy-ingestion: Use to generate the BigQuery data loading code using the connector.
model-evaluation: Use to generate the evaluation section (Accuracy or AUC-PR/Confusion Matrix).
unified-model-registry: Use to generate the model saving and serving code (Lean JSON or ONNX export, Vertex AI registration).

BigQuery-Centric Pipeline Generation

If generating the BigQuery-centric notebook (bq_centric_<use_case>.ipynb), apply the following sub-skills:

zero-copy-ingestion: Use to generate the BigFrames data loading code.
bigframes-bqml: Use to generate the feature engineering and BQML training code.
model-evaluation: Use to generate the evaluation section.
unified-model-registry: Use to generate the BQML export and Vertex AI Endpoint deployment code.
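
The two sub-skill sequences and notebook names listed above can be captured in a lookup table. The sketch below is one possible encoding; the `generation_plan` function and the `"spark"` / `"bigquery"` keys are hypothetical, while the sub-skill names and filename patterns are taken from the lists above.

```python
from typing import Dict, List

# Sub-skills in the order they are listed for each pipeline type.
SUB_SKILLS: Dict[str, List[str]] = {
    "spark": [
        "spark-lightning-optimization",
        "zero-copy-ingestion",
        "model-evaluation",
        "unified-model-registry",
    ],
    "bigquery": [
        "zero-copy-ingestion",
        "bigframes-bqml",
        "model-evaluation",
        "unified-model-registry",
    ],
}

NOTEBOOK_PREFIX: Dict[str, str] = {"spark": "spark_centric", "bigquery": "bq_centric"}


def generation_plan(pipeline: str, use_case: str) -> dict:
    """Return the notebook filename and the sub-skills to apply, in order."""
    if pipeline not in SUB_SKILLS:
        raise ValueError(f"unknown pipeline: {pipeline!r}")
    return {
        "notebook": f"{NOTEBOOK_PREFIX[pipeline]}_{use_case}.ipynb",
        "sub_skills": list(SUB_SKILLS[pipeline]),
    }
```

For example, `generation_plan("spark", "churn")` names the notebook `spark_centric_churn.ipynb` and lists the four Spark sub-skills.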

Batch Prediction Script

If the Spark-centric pipeline is chosen, also generate predict_job_<use_case>.py (PySpark). The script must load the model from GCS and write predictions back to BigQuery.
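
A rough outline of what such a script could look like is shown below. It rests on several assumptions that the generated code may not share: the model was saved as a Spark ML `PipelineModel`, the Spark BigQuery connector is available on the cluster, the prediction column uses the default name `prediction`, and the command-line flags (`--model-path`, `--input-table`, `--output-table`, `--key-column`) are hypothetical. A model exported as Lean JSON or ONNX would need a different loading step.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the (hypothetical) job arguments."""
    parser = argparse.ArgumentParser(description="Batch prediction job (sketch)")
    parser.add_argument("--model-path", required=True, help="gs:// path of the saved model")
    parser.add_argument("--input-table", required=True, help="BigQuery table to score")
    parser.add_argument("--output-table", required=True, help="BigQuery table for predictions")
    parser.add_argument("--key-column", required=True, help="row identifier to keep in the output")
    return parser.parse_args(argv)


def run(args: argparse.Namespace) -> None:
    """Load the model from GCS, score the input table, write results to BigQuery."""
    # Imported here so argument parsing can be exercised without a Spark install.
    from pyspark.ml import PipelineModel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("predict_job").getOrCreate()

    # Read the rows to score through the Spark BigQuery connector.
    rows = spark.read.format("bigquery").option("table", args.input_table).load()

    # Assumption: the model is a Spark ML PipelineModel stored under model_path.
    model = PipelineModel.load(args.model_path)
    scored = model.transform(rows)

    # Keep only scalar columns so the output maps cleanly onto BigQuery types.
    output = scored.select(args.key_column, "prediction")

    (output.write.format("bigquery")
        .option("writeMethod", "direct")
        .mode("append")
        .save(args.output_table))

    spark.stop()
```

The script's entry point would call `run(parse_args())`. Any bucket and table names passed in are placeholders for the values gathered during the interview.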