Tune Dataflow (Optional)

This is especially relevant if you want to run Dataflow inside a VPC.

For minimal-downtime migrations, the Dataflow jobs launched by the Spanner Migration Tool can optionally be tuned with custom runtime parameters such as MaxWorkers and NumWorkers, and by specifying networks, subnetworks, etc. Tuning refers to tweaking these parameters to run Dataflow in a custom configuration.

Tuning use cases

By default, SMT launches Dataflow with a preset configuration. However, this may not suit all use cases. Some scenarios where you may want to tweak the jobs are:

  • Running Dataflow workers inside a VPC.
  • Running Dataflow and Spanner in separate projects for cost tracking.
  • Launching the Dataflow job with a custom service account.
  • Applying labels to the jobs for better cost tracking.

To tune Dataflow, first specify the target database in the ‘Configure Spanner Database’ step. This enables the configure button for the remaining steps.

Table of contents
  1. Tuning use cases
    1. VPC Host ProjectId
    2. VPC Network
    3. VPC Subnetwork
    4. Max Workers
    5. Number of Workers
    6. Machine Type
    7. Service Account Email
    8. Additional User Labels
    9. KMS Key Name
  2. Preset Flags
    1. Dataflow Project
    2. Dataflow Location
    3. GCS Template Path

SMT exposes the most frequently changed Dataflow configurations to the user. Please reach out to us if you have a use case that is not satisfied by the provided configurations.

VPC Host ProjectId

Specify the project ID of the VPC that you want to use. This is required for private connectivity. By default, this is assumed to be the same as the Spanner project. Ensure this is specified if you are also specifying a network and subnetwork.

If using a shared VPC, a common practice is to have it in a separate project. Ensure this field specifies the correct host project for shared VPC use cases.

Present under the Networking section of the form.

VPC Network

Specify the name of the VPC network to use. For private connectivity, specify both the VPC network and subnetwork. If no network or subnetwork is provided, the default network is used.

Present under the Networking section of the form.

VPC Subnetwork

Specify the name of the VPC subnetwork to use. For private connectivity, specify both the VPC network and subnetwork. If no network or subnetwork is provided, the default network is used.

Present under the Networking section of the form.

SMT sets the IP configuration based on the VPC network and subnetwork. If either a network or a subnetwork is provided (running inside a VPC), public IPs are disabled (the IP configuration is private). If neither is provided, the IP configuration is set to public.
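The selection logic described above can be sketched as follows. This is an illustrative sketch, not SMT's actual source code; the function name is hypothetical, though the two enum values are the ones used by the public Dataflow API.

```python
# Hypothetical sketch of how the Dataflow IP configuration is derived
# from the VPC settings described above (not SMT's actual code).

def ip_configuration(network: str = "", subnetwork: str = "") -> str:
    """Return the Dataflow IP configuration for the given VPC settings.

    If either a network or a subnetwork is supplied, the job is assumed
    to run inside a VPC and public IPs are disabled; otherwise public
    IPs are used.
    """
    if network or subnetwork:
        return "WORKER_IP_PRIVATE"
    return "WORKER_IP_PUBLIC"
```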

Max Workers

Specify the maximum number of workers for the Dataflow job(s). Defaults to 50.

Present under the Performance section of the form.

Number of Workers

Specify the initial number of workers for the Dataflow job(s). Defaults to 1.

Present under the Performance section of the form.

Machine Type

The machine type to use for the job, e.g. n1-standard-2. The default machine type is used if not specified.

Present under the Performance section of the form.
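Taken together, the performance settings above map onto the Dataflow Flex Template runtime environment. A minimal sketch, assuming the field names of the public FlexTemplateRuntimeEnvironment API and the SMT defaults described above (the helper function itself is hypothetical):

```python
# Hypothetical helper assembling the performance-related fields of a
# Flex Template runtime environment; defaults mirror SMT's presets.

def build_performance_environment(max_workers: int = 50,
                                  num_workers: int = 1,
                                  machine_type: str = "") -> dict:
    """Build the performance-related part of a Flex Template environment."""
    env = {"maxWorkers": max_workers, "numWorkers": num_workers}
    if machine_type:
        # Omit machineType entirely to let Dataflow pick its default.
        env["machineType"] = machine_type
    return env
```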

Service Account Email

Specify a custom service account email to run the job as. The default Compute Engine service account is used if not specified. For more details, click here.

Additional User Labels

Additional user labels to be specified for the job, supplied as a JSON string. Example: { "name": "wrench", "mass": "1kg", "count": "3" }.

KMS Key Name

Name of the Cloud KMS key for the job. The key format is: projects/my-project/locations/us-central1/keyRings/keyring-name/cryptoKeys/key-name. Omit this field to use Google-managed encryption keys.
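The key name must follow the fully qualified Cloud KMS resource format shown above. A quick format check as an illustration (the helper is hypothetical; the path structure matches the example above):

```python
import re

# Pattern for the fully qualified Cloud KMS key name described above:
# projects/<project>/locations/<location>/keyRings/<ring>/cryptoKeys/<key>
_KMS_KEY_RE = re.compile(
    r"^projects/[^/]+/locations/[^/]+/keyRings/[^/]+/cryptoKeys/[^/]+$")

def is_valid_kms_key_name(name: str) -> bool:
    """Return True if the name matches the Cloud KMS key format."""
    return bool(_KMS_KEY_RE.match(name))
```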

Preset Flags

These flags are set by SMT by default and SHOULD NOT be modified unless you are running Dataflow in a non-standard configuration. To edit these parameters, click the edit button next to the Preset Flags header in the form.

Dataflow Project

Specify the project to run the Dataflow job in.

Dataflow Location

Specify the region to run the Dataflow job in. It is recommended to keep this the same as the Spanner region for performance. Example: us-central1

GCS Template Path

Cloud Storage path to the template spec. Use this to launch Dataflow with custom templates. Example: gs://my-bucket/path/to/template

Check out how to build the Datastream to Spanner template here.
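Putting the tunable and preset flags together, a Flex Template launch request assembled from these values might look like the sketch below. The request shape follows the public Dataflow `flexTemplates.launch` REST API; the project, bucket, and job name are placeholders, and the helper function is hypothetical:

```python
# Hypothetical sketch of assembling a Dataflow flexTemplates.launch
# request body from the flags described in this page.

def build_launch_request(template_path: str, environment: dict) -> dict:
    """Assemble the body of a Dataflow flexTemplates.launch request."""
    return {
        "launchParameter": {
            "jobName": "smt-migration-job",  # placeholder job name
            "containerSpecGcsPath": template_path,  # GCS Template Path
            "environment": environment,
        }
    }

# The Dataflow Project and Dataflow Location preset flags form the
# request URL rather than the body:
project, region = "my-project", "us-central1"  # placeholders
url = (f"https://dataflow.googleapis.com/v1b3/projects/{project}"
       f"/locations/{region}/flexTemplates:launch")

request = build_launch_request(
    template_path="gs://my-bucket/path/to/template",
    environment={"maxWorkers": 50, "numWorkers": 1},
)
```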