Data Ingestion Pipeline for RAG

The Agent Starter Pack simplifies incorporating data ingestion into your agent projects. This is especially useful for agents requiring document processing and retrieval, such as Retrieval Augmented Generation (RAG) applications.

Overview

The data ingestion approach depends on the datastore type:

  • Vertex AI Search: Uses a GCS Data Connector with built-in scheduling. Upload documents to a GCS bucket and the connector automatically syncs them to the search engine.
  • Vertex AI Vector Search 2.0: Uses a Kubeflow pipeline that loads data, chunks documents, and ingests them into a Vector Search 2.0 Collection. Embeddings are auto-generated by the Collection's configured embedding model.
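
To build intuition for the chunking step, here is a minimal fixed-size chunking sketch with overlap. It is illustrative only: the actual chunker, chunk size, and overlap used by the generated pipeline live in your project's ingestion code, and the parameters below are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks (illustrative sketch)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping `overlap` chars
    return chunks

doc = "word " * 300  # a 1,500-character stand-in for a loaded document
chunks = chunk_text(doc)
print(len(chunks))  # → 4 chunks of at most 500 characters each
```

Overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of some duplicated text in the index.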

When to Include Data Ingestion

Consider data ingestion if:

  • Your agent needs to search or reference extensive documentation.
  • You're developing a RAG-based application.
  • Your agent's knowledge base requires periodic updates.
  • You want to keep your agent's content fresh and searchable.

Usage

Project Creation

There are two ways to include data ingestion during project creation:

  1. Automatic for agentic_rag: The agentic_rag agent automatically includes data ingestion. You will be prompted to select a datastore (vertex_ai_search or vertex_ai_vector_search) if not specified via --datastore.

  2. Via --datastore flag: For any agent, specify the datastore with --datastore (or -ds) to enable data ingestion:

    ```bash
    # Using Vertex AI Search
    agent-starter-pack create my-agent-project -ds vertex_ai_search

    # Using Vertex AI Vector Search
    agent-starter-pack create my-agent-project -ds vertex_ai_vector_search
    ```

Infrastructure Setup

The Terraform IaC configures the necessary infrastructure based on your chosen datastore:

  • Vertex AI Search: GCS bucket, Data Connector, and Search Engine.
  • Vertex AI Vector Search 2.0: Collection with auto-embedding configuration, GCS bucket for pipeline artifacts.
  • In both cases: the necessary service accounts and IAM permissions.

Getting Started

  1. Create your project with data ingestion, specifying your datastore:

    ```bash
    # Example with Vertex AI Search
    agent-starter-pack create my-project -ds vertex_ai_search

    # Example with Vertex AI Vector Search
    agent-starter-pack create my-project -ds vertex_ai_vector_search
    ```
  2. Set up the datastore and load sample data:

    ```bash
    make setup-datastore
    ```

    For Vertex AI Search, this creates the GCS bucket, data connector, and search engine, uploads sample data, and triggers an initial sync.

    For Vector Search 2.0, this creates the Collection, GCS bucket for pipeline artifacts, and configures service account permissions.

  3. Run data ingestion:

    For Vertex AI Search, sync new data after modifying the sample_data/ directory:

    ```bash
    make sync-data
    ```

    For Vector Search 2.0, run the data ingestion pipeline locally:

    ```bash
    make data-ingestion
    ```

Vertex AI Search: Data Format

The GCS Data Connector is configured by default to ingest unstructured content (PDF, HTML, TXT) using data_schema: "content". Each file in the GCS bucket becomes a separate document in the data store.

If your data is in a different format, set the data_connector_data_schema Terraform variable in deployment/terraform/dev/vars/env.tfvars:

```hcl
data_connector_data_schema = "document"  # for NDJSON/JSONL files
```
| `data_schema` | Format | Description |
| --- | --- | --- |
| `content` (default) | PDF, HTML, TXT | Unstructured files. Each file becomes a document. |
| `document` | NDJSON / JSONL | One JSON document per line with a valid `id` field. |
| `csv` | CSV | CSV with a header row conforming to the data store schema. |
| `custom` | JSON | Custom JSON format conforming to the data store schema. |

Note: When changing data_schema, you must delete and recreate the data connector (delete the collection via the Cloud Console or API, then re-run make setup-datastore).
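
As an illustration of the `document` schema, the sketch below writes a small NDJSON file with one JSON object per line, each carrying the required `id` field. The `struct_data` payload and its fields are hypothetical; the actual fields must conform to your data store's schema.

```python
import json

# Hypothetical records: only the `id` field is required by the `document`
# schema; the remaining fields must match your data store's schema.
records = [
    {"id": "doc-001", "struct_data": {"title": "Getting started", "body": "..."}},
    {"id": "doc-002", "struct_data": {"title": "Data ingestion", "body": "..."}},
]

with open("documents.ndjson", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Every line must parse as a standalone JSON object with a valid `id`.
with open("documents.ndjson") as f:
    lines = f.readlines()
print(len(lines))  # → 2
```

Upload the resulting file to the GCS bucket as you would any other source file; with this schema the connector ingests one document per line rather than one per file.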

For full API details, see the Discovery Engine API reference.
