# Data Ingestion Pipeline for RAG
The Agent Starter Pack simplifies incorporating data ingestion into your agent projects. This is especially useful for agents requiring document processing and retrieval, such as Retrieval Augmented Generation (RAG) applications.
## Overview
The data ingestion approach depends on the datastore type:
- Vertex AI Search: Uses a GCS Data Connector with built-in scheduling. Upload documents to a GCS bucket and the connector automatically syncs them to the search engine.
- Vertex AI Vector Search 2.0: Uses a Kubeflow pipeline that loads data, chunks documents, and ingests them into a Vector Search 2.0 Collection. Embeddings are auto-generated by the Collection's configured embedding model.
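To make the chunking step concrete, here is a minimal, illustrative sketch of overlapping fixed-size chunking. The actual Kubeflow pipeline component and its parameters may differ; the `chunk_size` and `overlap` values below are assumptions chosen for illustration, not the pipeline's defaults.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping, character-based chunks.

    Overlap keeps context that straddles a chunk boundary retrievable
    from both neighboring chunks.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be greater than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


# A 1500-character stand-in for a loaded document.
doc = "word " * 300
chunks = chunk_text(doc)
print(len(chunks), len(chunks[0]))  # prints: 4 500
```

Each chunk would then be ingested into the Collection, where the configured embedding model generates its vector automatically.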
## When to Include Data Ingestion
Consider data ingestion if:
- Your agent needs to search or reference extensive documentation.
- You're developing a RAG-based application.
- Your agent's knowledge base requires periodic updates.
- You want to keep your agent's content fresh and searchable.
## Usage
### Project Creation
Include data ingestion during project creation in one of two ways:

1. **Automatic for `agentic_rag`**: The `agentic_rag` agent automatically includes data ingestion. You will be prompted to select a datastore (`vertex_ai_search` or `vertex_ai_vector_search`) if it is not specified via `--datastore`.

2. **Via the `--datastore` flag**: For any agent, specify the datastore with `--datastore` (or `-ds`) to enable data ingestion:

   ```bash
   # Using Vertex AI Search
   agent-starter-pack create my-agent-project -ds vertex_ai_search

   # Using Vertex AI Vector Search
   agent-starter-pack create my-agent-project -ds vertex_ai_vector_search
   ```
### Infrastructure Setup
The Terraform IaC configures the necessary infrastructure based on your chosen datastore:
- Vertex AI Search: GCS bucket, Data Connector, and Search Engine.
- Vertex AI Vector Search 2.0: Collection with auto-embedding configuration, GCS bucket for pipeline artifacts.
- For both: the required service accounts and IAM permissions.
### Getting Started
1. Create your project with data ingestion, specifying your datastore:

   ```bash
   # Example with Vertex AI Search
   agent-starter-pack create my-project -ds vertex_ai_search

   # Example with Vertex AI Vector Search
   agent-starter-pack create my-project -ds vertex_ai_vector_search
   ```

2. Set up the datastore and load sample data:

   ```bash
   make setup-datastore
   ```

   For Vertex AI Search, this creates the GCS bucket, data connector, and search engine, uploads sample data, and triggers an initial sync.

   For Vector Search 2.0, this creates the Collection and the GCS bucket for pipeline artifacts, and configures service account permissions.

3. Run data ingestion:

   For Vertex AI Search, sync new data after modifying the `sample_data/` directory:

   ```bash
   make sync-data
   ```

   For Vector Search 2.0, run the data ingestion pipeline locally:

   ```bash
   make data-ingestion
   ```
## Vertex AI Search: Data Format
The GCS Data Connector is configured by default to ingest unstructured content (PDF, HTML, TXT) using `data_schema: "content"`. Each file in the GCS bucket becomes a separate document in the data store.
If your data is in a different format, set the `data_connector_data_schema` Terraform variable in `deployment/terraform/dev/vars/env.tfvars`:
```hcl
data_connector_data_schema = "document" # for NDJSON/JSONL files
```

| `data_schema` | Format | Description |
|---|---|---|
| `content` (default) | PDF, HTML, TXT | Unstructured files. Each file becomes a document. |
| `document` | NDJSON / JSONL | One JSON document per line with a valid `id` field. |
| `csv` | CSV | CSV with a header row conforming to the data store schema. |
| `custom` | JSON | Custom JSON format conforming to the data store schema. |
**Note**: When changing `data_schema`, you must delete and recreate the data connector (delete the collection via the Cloud Console or API, then re-run `make setup-datastore`).
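For the `document` schema, a valid input file is one JSON object per line, each with an `id` field. The sketch below generates such a file; the `title` and `body` fields are hypothetical examples, since the structure beyond `id` must conform to your own data store schema.

```python
import json

# Hypothetical records to ingest. Only `id` is required by the `document`
# schema; the remaining fields must match your data store's schema.
records = [
    {"id": "doc-001", "title": "Getting started", "body": "How to install the agent."},
    {"id": "doc-002", "title": "Configuration", "body": "Datastore options explained."},
]

with open("documents.ndjson", "w") as f:
    for rec in records:
        # One JSON object per line -- no enclosing array, no trailing commas.
        f.write(json.dumps(rec) + "\n")

# Sanity-check: every line parses as JSON and carries an `id`.
with open("documents.ndjson") as f:
    for line in f:
        assert "id" in json.loads(line)
```

Upload the resulting file to the GCS bucket as you would any other source document; the connector ingests each line as a separate document.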
For full API details, see the Discovery Engine API reference.
## Learn More
- Vertex AI Search documentation for search capabilities.
- Vertex AI Vector Search documentation for vector database capabilities.
- Discovery Engine API reference for the underlying GCS Data Connector API.