
Build and Deploy Dataflow Flex Template for Avro to Cloud Spanner with Slowly Changing Dimensions (SCD)

The following are instructions to build and deploy this Dataflow Flex Template.

This article provides instructions for building a Dataflow Flex Template for the pipeline and then deploying a Dataflow job that inserts Avro records into a Cloud Spanner database using one of the supported SCD Types.

Cloud costs

This solution uses billable components of Google Cloud, including Artifact Registry, Cloud Build, Compute Engine, Cloud Spanner, Cloud Storage, and Dataflow.

Consider cleaning up the resources when they are no longer needed.

Getting started

Steps to set up your project

  1. Open the Cloud Console for your project.

  2. Activate Cloud Shell. At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.

  3. Enable the required Google Cloud services. An optional check that these services are enabled follows this list.

    gcloud services enable \
      artifactregistry.googleapis.com \
      cloudbuild.googleapis.com \
      cloudresourcemanager.googleapis.com \
      compute.googleapis.com \
      dataflow.googleapis.com \
      servicenetworking.googleapis.com \
      spanner.googleapis.com \
      storage.googleapis.com
    
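To confirm that the services were enabled, you can optionally list the enabled services and filter for the ones above:

    gcloud services list --enabled | grep -E 'dataflow|spanner|artifactregistry|cloudbuild'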

This tutorial assumes that you already have:

  * Avro files that you want to write to Cloud Spanner, stored in a Cloud Storage bucket.
  * A Cloud Spanner instance, database, and table that match the schema of those files.

Set these up before proceeding. Consider using the files under src/test/resources/AvroToSpannerScdPipelineITTest if you do not have a specific schema or Avro files with which to test this pipeline.
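For example, once you have cloned the repository in the build steps below, you could copy those sample files into a Cloud Storage bucket of your own. The bucket name and prefix here are placeholders:

    # Placeholder bucket and prefix; replace with your own input location.
    gsutil cp -r src/test/resources/AvroToSpannerScdPipelineITTest/* \
      gs://YOUR_BUCKET/avro-input/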

Steps to build the template

This is a Dataflow Flex template, which means the pipeline code is packaged into a container image, and that image is then used to launch the Dataflow pipeline.

  1. Access this project folder.

    git clone https://github.com/GoogleCloudPlatform/cloud-solutions.git
    cd cloud-solutions/projects/dataflow-gcs-avro-to-spanner/
    
  2. Configure the environment variables by editing the file dataflow_template_variables.sh using the text editor of your choice. An illustrative sketch of these variables follows this list.

  3. Set the environment variables.

    source dataflow_template_variables.sh
    gcloud config set project ${PROJECT_ID}
    
  4. Create the Artifact Registry repository where the template image will be uploaded.

    gcloud artifacts repositories create $REPOSITORY_NAME \
      --repository-format=docker \
      --location=$REGION
    
  5. Build the Dataflow Flex template.

    gcloud builds submit \
      --substitutions=_DATAFLOW_TEMPLATE_GCS_PATH="${DATAFLOW_TEMPLATE_GCS_PATH}",_DATAFLOW_TEMPLATE_IMAGE="${DATAFLOW_TEMPLATE_IMAGE}" \
      .
    
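As referenced in step 2, the following is a minimal, illustrative sketch of the build-related variables in dataflow_template_variables.sh. The values shown are placeholders; the file in the repository defines the authoritative set, including the pipeline parameters used in the next section.

    # Placeholder values; adjust for your project and naming conventions.
    export PROJECT_ID="my-project-id"
    export REGION="us-central1"
    export REPOSITORY_NAME="dataflow-templates"
    export DATAFLOW_TEMPLATE_IMAGE="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY_NAME}/avro-to-spanner-scd:latest"
    export DATAFLOW_TEMPLATE_GCS_PATH="gs://my-bucket/templates/avro-to-spanner-scd.json"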

Steps to launch and run the template

Once the template is deployed, it can be launched from the Dataflow UI, which lets you see the configuration parameters and their descriptions.

Alternatively, the template can be run from the same Cloud Shell. The parameters below will depend on your configuration and pipeline requirements.
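Before launching, you can optionally confirm that the template spec file was uploaded during the build. This assumes the environment variables from the build steps are still set in your shell:

    gsutil cat "${DATAFLOW_TEMPLATE_GCS_PATH}"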

  1. If you have not done so already, configure the environment variables by editing the file dataflow_template_variables.sh using the text editor of your choice.

  2. Set the environment variables.

    source dataflow_template_variables.sh
    
  3. Launch the Dataflow Flex Template.

    gcloud dataflow flex-template run "${JOB_NAME}" \
      --project "${PROJECT_ID}" \
      --region "${REGION}" \
      --template-file-gcs-location "${DATAFLOW_TEMPLATE_GCS_PATH}" \
      --parameters "inputFilePattern=${INPUT_FILE_PATTERN}" \
      --parameters "spannerProjectId=${SPANNER_PROJECT_ID}" \
      --parameters "instanceId=${SPANNER_INSTANCE_ID}" \
      --parameters "databaseId=${SPANNER_DATABASE_ID}" \
      --parameters "spannerPriority=${SPANNER_PRIORITY}" \
      --parameters "spannerBatchSize=${SPANNER_BATCH_SIZE}" \
      --parameters "tableName=${SPANNER_TABLE_NAME}" \
      --parameters "scdType=${SCD_TYPE}" \
      --parameters "primaryKeyColumnNames=${SPANNER_PRIMARY_KEY_COLUMN_NAMES}" \
      --parameters "startDateColumnName=${SPANNER_START_DATE_COLUMN_NAME}" \
      --parameters "endDateColumnName=${SPANNER_END_DATE_COLUMN_NAME}" \
      --parameters "orderByColumnName=${SPANNER_ORDER_BY_COLUMN_NAME}" \
      --parameters "sortOrder=${SPANNER_SORT_ORDER}"
    

    Remove any optional pipeline parameters that you do not require. Check metadata.json or dataflow_template_variables.sh for more details on these configuration variables.
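After the job is launched, you can monitor its status from Cloud Shell, or open the job in the Dataflow UI:

    gcloud dataflow jobs list \
      --project="${PROJECT_ID}" \
      --region="${REGION}" \
      --status=active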