Error handling
This section describes how to track errors that occur during a minimal downtime migration via metrics, and how to retry the errored records.
Minimal Downtime Migration
The Dataflow job that handles minimal downtime migration runs in two modes:
- regular: This is the default mode. Events streamed by Datastream are picked up, converted to Spanner-compatible data types, and applied to Spanner. Retryable errors are retried automatically; once retries are exhausted, the events are moved to a dead letter queue (DLQ) directory in GCS. Permanent errors are also moved to the dead letter queue.
- retryDLQ: This mode reads events from the DLQ and retries them; it does not read Datastream output. It is ideal to run once all the permanent and/or retryable errors have been fixed - for example, after a bug fix or once a dependent data migration is complete. The sketch below shows how the mode is selected.
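The mode is selected through the template's runMode parameter; the full commands appear under 'Re-run commands' below. As a minimal illustration (placeholders in angle brackets, remaining parameters elided), only runMode differs between the two launches:

```sh
# Regular mode (the default): runMode can be omitted or set explicitly.
gcloud dataflow flex-template run <jobName> \
  --region=<region-name> \
  --template-file-gcs-location=gs://dataflow-templates/latest/flex/Cloud_Datastream_to_Spanner \
  --parameters <other parameters>,runMode=regular

# retryDLQ mode: identical launch, but the job reads only from the DLQ.
gcloud dataflow flex-template run <jobName> \
  --region=<region-name> \
  --template-file-gcs-location=gs://dataflow-templates/latest/flex/Cloud_Datastream_to_Spanner \
  --parameters <other parameters>,runMode=retryDLQ
```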
Current error scenarios
The following error scenarios are currently possible when doing minimal downtime migration:
1. A foreign key constraint exists on a table and was applied successfully on Spanner, but because Datastream processes events in an unordered fashion, a child table record can arrive before its parent table record and fail with a foreign key constraint violation.
2. Similarly, for interleaved tables, unordered processing by Datastream means child table records can arrive before the parent table record and fail with a "parent record not found" error.
3. Spanner can return intermittent errors, such as deadline exceeded, due to a temporary resource constraint.
4. Other SpannerExceptions that are marked as retryable.
5. In addition, severe errors can occur that require manual intervention - for example, an error during transformation.
Points 1 to 4 above are retryable errors: the Dataflow job automatically retries them 500 times at 10-minute intervals. In most cases this is enough for the retryable records to succeed; however, if they are still unsuccessful after all retries are exhausted, the records are moved to the ‘severe’ error category. Such ‘severe’ errors can be retried later by running the Dataflow job in ‘retryDLQ’ mode (discussed below).
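For a rough sense of the retry budget: 500 retries at 10-minute intervals amount to 500 × 10 = 5,000 minutes, or roughly 3.5 days, so a dependent fix or backfill completed within that window lets the parked records succeed automatically.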
The following scenarios result in records being skipped; they are not really errors:
- Records read from the Datastream output that have an invalid structure
- Tables that existed in the source but were dropped during schema conversion
Note that there can also be exceptions such as invalid arguments to the Dataflow pipeline; these cause the pipeline to halt.
Metrics
Migration progress can be tracked by monitoring the Dataflow job; the following custom metrics are exposed:
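These counters are visible in the Dataflow console, and can also be read from the command line. As a sketch (the job ID and region are placeholders), the user-defined counters of a job can be listed with:

```sh
# List the job's user-defined (custom) counters; --source=user filters out
# Dataflow's built-in service metrics.
gcloud dataflow metrics list <job-id> --region=<region> --source=user
```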
Metrics for regular run
| Metric Name | Description |
|---|---|
| Successful events | Total number of events successfully processed and applied to the Spanner database. |
| Retryable errors | The count of events that errored out but will be retried. |
| Total permanent errors | The number of events that errored out with non-retryable errors, plus the number of errors remaining after retries were exhausted. |
| Conversion errors | Number of events that could not be converted to Spanner. This is a permanent error category. |
| Skipped events | The events that are skipped from migration because the table was dropped from the migration. |
| Other permanent errors | The remaining permanent errors. |
| Transformed events | The number of events that were successfully transformed, including retries and permanent errors. |
| Filtered events | The number of events that were filtered out as part of custom transformation. |
| Custom Transformation Exceptions | The number of events that errored out due to an exception in the custom transformation jar. |
| Total events processed | The number of events that were tried for forward migration, including retries and permanent errors. |
| apply_custom_transformation_impl_latency_ms | Latency of applying the custom transformation to an event. |
| elementsReconsumedFromDeadLetterQueue | The total number of events consumed from the DLQ for retry. |
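These counters are also exported to Cloud Monitoring, where dashboards and alerts can be built on them. As a sketch - assuming Dataflow's standard export of user counters under the dataflow.googleapis.com/job/user_counter metric type with a metric_name label (verify the exact type and labels in Metrics Explorer) - a Monitoring filter for one counter could look like:

```
metric.type = "dataflow.googleapis.com/job/user_counter"
metric.labels.metric_name = "Successful events"
resource.labels.job_name = "<job-name>"
```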
Metrics for retryDLQ run
| Metric Name | Description |
|---|---|
| Successful events | Total number of events successfully processed and applied to the Spanner database. |
| elementsReconsumedFromDeadLetterQueue | The total number of events consumed from the DLQ for retry. |
| Elements requeued for retry | The total number of events that were re-queued for retry. |
| Conversion errors | Number of events that could not be converted to Spanner. This is a permanent error category. |
| Skipped events | The events that are skipped from migration because the table was dropped from the migration. |
| Other permanent errors | The remaining permanent errors. |
| Transformed events | The number of events that were successfully transformed, including retries and permanent errors. |
| Filtered events | The number of events that were filtered out as part of custom transformation. |
| Custom Transformation Exceptions | The number of events that errored out due to an exception in the custom transformation jar. |
| Total events processed | The number of events that were tried for forward migration, including retries and permanent errors. |
| apply_custom_transformation_impl_latency_ms | Latency of applying the custom transformation to an event. |
It can happen that in retryDLQ mode, permanent errors still remain. To verify that all the retryable errors have been processed and only permanent errors remain for reprocessing, watch the ‘Successful events’ count: it stays constant after every retry iteration, while ‘elementsReconsumedFromDeadLetterQueue’ increments with each iteration (a polling sketch follows the note below).
Dataflow metrics are approximate. If a Dataflow worker restarts, the same set of events might be reprocessed and the counters may show higher or lower values than the true counts. In such scenarios, counters like ‘Successful events’ can exceed the number of records actually written to Spanner. Similarly, ‘Retryable errors’ can turn negative when the same retry record is successfully processed by different workers.
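As a rough sketch of the check described above (job ID, region, and polling interval are placeholders; the output format varies across gcloud versions, so adjust the grep context as needed), the two counters can be polled between retry iterations:

```sh
# Re-check the user counters every 10 minutes; when "Successful events" stops
# growing while elementsReconsumedFromDeadLetterQueue keeps increasing, only
# permanent errors remain in the DLQ.
while true; do
  date
  gcloud dataflow metrics list <job-id> --region=<region> --source=user \
    | grep -B1 -A3 -E 'Successful events|elementsReconsumedFromDeadLetterQueue'
  sleep 600
done
```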
Re-run commands
To re-run the regular flow
To re-run the regular flow, run the same command as the original job. Note: this only works when not using the Pub/Sub subscriptions for GCS files. Processing starts all over again, meaning the same Datastream output gets reprocessed.
```sh
gcloud dataflow flex-template run <jobName> \
  --project=<project-name> --region=<region-name> \
  --template-file-gcs-location=gs://dataflow-templates-southamerica-west1/2023-09-12-00_RC00/flex/Cloud_Datastream_to_Spanner \
  --num-workers 1 --max-workers 50 \
  --enable-streaming-engine \
  --parameters databaseId=<database id>,deadLetterQueueDirectory=<GCS location of the DLQ directory>,gcsPubSubSubscription=<pubsub subscription being used in a gcs notification policy>,dlqGcsPubSubSubscription=<pubsub subscription being used in a dlq gcs notification policy>,instanceId=<spanner-instance-id>,sessionFilePath=<GCS location of the session json>,streamName=<data stream name>,transformationContextFilePath=<path to transformation context json>
```
These job parameters can be taken from the original job.
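If the original command is no longer at hand, the parameter values can usually be recovered by describing the original job. As a sketch (the job ID and region are placeholders, and where the options appear in the output can vary by SDK version):

```sh
# Dump the original job definition; the launch parameters show up in the
# job's pipeline options / display data section of the output.
gcloud dataflow jobs describe <job-id> --region=<region> --format=yaml
```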
To re-run for reprocessing the DLQ directory
This reprocesses the records marked with the ‘severe’ error category in the DLQ.
Before running the Dataflow job in this mode, check whether the main Dataflow job has a non-zero retryable error count. If there are referential error records, check that the dependent table data has been completely populated from the source database.
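It can also help to inspect the DLQ directory itself before launching the retry. As a sketch - assuming the template parks permanently errored records under a severe/ subdirectory of the DLQ path, which should be verified against the actual bucket layout - the pending records can be listed with:

```sh
# List the records currently parked as severe errors in the DLQ.
# (The severe/ subdirectory is an assumption; check your DLQ layout.)
gcloud storage ls -r <GCS path to the DLQ>/severe/
```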
A sample command to run the Dataflow job in retryDLQ mode:
```sh
gcloud dataflow flex-template run <jobname> \
  --region=<the region where the dataflow job must run> \
  --template-file-gcs-location=gs://dataflow-templates/latest/flex/Cloud_Datastream_to_Spanner \
  --additional-experiments=use_runner_v2 \
  --parameters gcsPubSubSubscription=<pubsub subscription being used in a gcs notification policy>,streamName=<Datastream name>,instanceId=<Spanner Instance Id>,databaseId=<Spanner Database Id>,sessionFilePath=<GCS path to session file>,dlqGcsPubSubSubscription=<pubsub subscription being used in a dlq gcs notification policy>,deadLetterQueueDirectory=<GCS path to the DLQ>,runMode=retryDLQ
```
The following parameters can be taken from the regular forward migration Dataflow job:
- region
- gcsPubSubSubscription
- streamName
- instanceId
- databaseId
- sessionFilePath
- deadLetterQueueDirectory
- dlqGcsPubSubSubscription