Data Ingestion Pipeline for RAG

The Agent Starter Pack simplifies incorporating data ingestion into your agent projects. This is especially useful for agents requiring document processing and retrieval, such as Retrieval Augmented Generation (RAG) applications.

Overview

The data ingestion approach depends on the datastore type:

  • Vertex AI Search: Uses a GCS Data Connector with built-in scheduling. Upload documents to a GCS bucket and the connector automatically syncs them to the search engine.
  • Vertex AI Vector Search 2.0: Uses a Kubeflow pipeline that loads data, chunks documents, and ingests them into a Vector Search 2.0 Collection. Embeddings are auto-generated by the Collection's configured embedding model.
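
To build intuition for the chunking step, here is a minimal fixed-size chunking sketch with overlap. It is illustrative only: the actual chunker, chunk size, and overlap used by the generated pipeline live in your project's ingestion code, and the parameters below are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks (illustrative sketch)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping `overlap` chars
    return chunks

doc = "word " * 300  # a 1,500-character stand-in for a loaded document
chunks = chunk_text(doc)
print(len(chunks))  # → 4 chunks of at most 500 characters each
```

Overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of some duplicated text in the index.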

When to Include Data Ingestion

Consider data ingestion if:

  • Your agent needs to search or reference extensive documentation.
  • You're developing a RAG-based application.
  • Your agent's knowledge base requires periodic updates.
  • You want to keep your agent's content fresh and searchable.

Usage

Project Creation

There are two ways to include data ingestion during project creation:

  1. Automatic for agentic_rag: The agentic_rag agent automatically includes data ingestion. You will be prompted to select a datastore (vertex_ai_search or vertex_ai_vector_search) if not specified via --datastore.

  2. Via --datastore flag: For any agent, specify the datastore with --datastore (or -ds) to enable data ingestion:

    ```bash
    # Using Vertex AI Search
    agent-starter-pack create my-agent-project -ds vertex_ai_search

    # Using Vertex AI Vector Search
    agent-starter-pack create my-agent-project -ds vertex_ai_vector_search
    ```

Infrastructure Setup

The Terraform IaC configures the necessary infrastructure based on your chosen datastore:

  • Vertex AI Search: GCS bucket, Data Connector, and Search Engine.
  • Vertex AI Vector Search 2.0: Collection with auto-embedding configuration, GCS bucket for pipeline artifacts.
  • In both cases: the necessary service accounts and IAM permissions.

Getting Started

  1. Create your project with data ingestion, specifying your datastore:

    ```bash
    # Example with Vertex AI Search
    agent-starter-pack create my-project -ds vertex_ai_search

    # Example with Vertex AI Vector Search
    agent-starter-pack create my-project -ds vertex_ai_vector_search
    ```
  2. Set up the datastore and load sample data:

    ```bash
    make setup-datastore
    ```

    For Vertex AI Search, this creates the GCS bucket, data connector, and search engine, uploads sample data, and triggers an initial sync.

    For Vector Search 2.0, this creates the Collection, GCS bucket for pipeline artifacts, and configures service account permissions.

  3. Run data ingestion:

    For Vertex AI Search, sync new data after modifying the sample_data/ directory:

    ```bash
    make sync-data
    ```

    For Vector Search 2.0, run the data ingestion pipeline locally:

    ```bash
    make data-ingestion
    ```

Vertex AI Search: Data Format

The GCS Data Connector is configured by default to ingest unstructured content (PDF, HTML, TXT) using data_schema: "content". Each file in the GCS bucket becomes a separate document in the data store.

If your data is in a different format, set the data_connector_data_schema Terraform variable in deployment/terraform/dev/vars/env.tfvars:

```hcl
data_connector_data_schema = "document"  # for NDJSON/JSONL files
```
| `data_schema` | Format | Description |
| --- | --- | --- |
| `content` (default) | PDF, HTML, TXT | Unstructured files. Each file becomes a document. |
| `document` | NDJSON / JSONL | One JSON document per line with a valid `id` field. |
| `csv` | CSV | CSV with a header row conforming to the data store schema. |
| `custom` | JSON | Custom JSON format conforming to the data store schema. |

Note: When changing data_schema, you must delete and recreate the data connector (delete the collection via the Cloud Console or API, then re-run make setup-datastore).
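
As an illustration of the `document` schema, the sketch below writes a small NDJSON file with one JSON object per line, each carrying the required `id` field. The `struct_data` payload and its fields are hypothetical; the actual fields must conform to your data store's schema.

```python
import json

# Hypothetical records: only the `id` field is required by the `document`
# schema; the remaining fields must match your data store's schema.
records = [
    {"id": "doc-001", "struct_data": {"title": "Getting started", "body": "..."}},
    {"id": "doc-002", "struct_data": {"title": "Data ingestion", "body": "..."}},
]

with open("documents.ndjson", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Every line must parse as a standalone JSON object with a valid `id`.
with open("documents.ndjson") as f:
    lines = f.readlines()
print(len(lines))  # → 2
```

Upload the resulting file to the GCS bucket as you would any other source file; with this schema the connector ingests one document per line rather than one per file.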

For full API details, see the Discovery Engine API reference.
