Deployment and Configuration of Database Archival and Pruning¶
Deployment¶
These instructions and Terraform code are for a demo deployment to test the Database Archival solution. For a production-ready deployment, check our recommendations in the Productionization section.
Costs¶
This demo uses billable components of Google Cloud, including Cloud SQL, Datastream, BigQuery, Cloud Composer, Cloud Run functions, Cloud Storage and Secret Manager.
Consider cleaning up the resources when you no longer need them.
Steps to Deploy the demo¶
- Create a Google Cloud project with billing enabled.
- Open the Cloud Console.
- Activate Cloud Shell. At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.
- Set the active project to your desired project, where $PROJECT_ID is the name of the project you just created.
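  A minimal example, assuming you use the gcloud CLI that is preinstalled in Cloud Shell:

  gcloud config set project $PROJECT_ID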
- Set the Application Default Credentials.
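  With the gcloud CLI, this is typically done with:

  gcloud auth application-default login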
- Update the Organization Policy so that these constraints are set as follows:
  - compute.restrictVpcPeering: Allowed All
  - compute.requireShieldedVm: Not enforced
  - cloudbuild.disableCreateDefaultServiceAccount: Not enforced
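  One possible way to relax the two boolean constraints from the command line, assuming you have the required Organization Policy permissions (the list constraint compute.restrictVpcPeering can be set to Allow All in the Console under IAM & Admin > Organization Policies):

  gcloud resource-manager org-policies disable-enforce compute.requireShieldedVm --project=$PROJECT_ID
  gcloud resource-manager org-policies disable-enforce cloudbuild.disableCreateDefaultServiceAccount --project=$PROJECT_ID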
- Manually add roles/iam.serviceAccountTokenCreator to the user that runs this script.
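  For example, with the gcloud CLI (replace the member with the email of the user running this deployment):

  gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="user:YOUR_EMAIL" \
    --role="roles/iam.serviceAccountTokenCreator"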
- Enable cloudresourcemanager.googleapis.com manually.
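  For example:

  gcloud services enable cloudresourcemanager.googleapis.com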
- Clone this repository.
- Go to the project Terraform folder for the Demo deployment.
- Run the Terraform script.

  terraform init

  terraform apply \
    -var "project_id=<project_id>" \
    -var "region=<region>" \
    -var "service_account_name=<service_account_name>" \
    -var "database_user_name=<sql_user_name>" \
    -var "database_user_password=<sql_user_password>"
  Where:
  - project_id is the project where the resources will be created. Example: my-database-archival-project.
  - region is the Cloud region where the resources will be created. Example: asia-southeast1.
  - service_account_name is the name of the service account that will be used to run this database archival project. Example: database-archival-service-account.
  - database_user_name is the name of the user which will be created as admin of the Cloud SQL instance.
  - database_user_password is the password of the user which will be created as admin of the Cloud SQL instance.
- If needed, edit the configuration file that was created on the bucket called <project-id>_database_archival_config. See the Configuration section below for more details.
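  One way to edit that file from Cloud Shell, assuming the configuration object inside the bucket is named database_archival_config.json (the actual object name may differ in your deployment):

  gcloud storage cp gs://<project-id>_database_archival_config/database_archival_config.json .
  # Edit the file locally and validate it (see "Validate the configuration"), then upload it back:
  gcloud storage cp database_archival_config.json gs://<project-id>_database_archival_config/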
- Once you are done with the demo, consider removing the resources.

  terraform destroy \
    -var "project_id=<project_id>" \
    -var "region=<region>" \
    -var "service_account_name=<service_account_name>" \
    -var "database_user_name=<sql_user_name>" \
    -var "database_user_password=<sql_user_password>"

  Use the same variable values as in the previous step.
Next steps after deployment¶
After running the Terraform, the following main components will be created:
- A Cloud SQL for MySQL database with sample data.
- A BigQuery dataset, which contains a live copy of the Cloud SQL data.
- A Datastream job which continuously copies the Cloud SQL data to BigQuery.
- A Cloud Run Function, which can prune the required data from the Cloud SQL database.
- A Cloud Storage bucket to host the Composer Directed Acyclic Graphs (DAGs) and the configuration file.
- A secret in Secret Manager to store the database password.
- A configuration file, stored on Google Cloud Storage, which configures two jobs (in the same Composer pipeline):
  - An archival and pruning job for the Cloud SQL Historical Data table called Transaction.
  - An archival-only job for the Cloud SQL Main Data table called User.
- A Composer instance, and its DAG (pipeline) using the above configuration. The Composer DAG is set to run manually only.
- Networking and IAM to allow all these components to interact with each other.
In order to run the Database Archival solution:
- Access your Cloud Console on the project where the solution was deployed.
- (Optional) Verify the data on Cloud SQL and BigQuery to see the data state before running the solution. You should see:
  - Two Cloud SQL tables called Transaction and User.
  - Two BigQuery tables called db_archival_demo.Transaction and db_archival_demo.User with the same data.
- Run the Composer DAG manually:
  - Go to the Composer UI.
  - Click the name of the Composer instance (e.g. db-archival-composer-env).
  - Click DAGS.
  - Click the name of the DAG (database_archival_dag).
  - Click Trigger DAG.
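  Alternatively, the DAG can be triggered from the command line. A minimal sketch, assuming the demo environment name db-archival-composer-env and your deployment region:

  gcloud composer environments run db-archival-composer-env \
    --location <region> \
    dags trigger -- database_archival_dag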
- The DAG will start running. Monitor the progress in the Run History until the newly created run has a green checkmark indicating its successful completion.
- After the DAG has completed its run, verify the data on Cloud SQL and BigQuery to see the data state after running the solution. You should see:
  - Two Cloud SQL tables:
    - The User table will contain the same number of rows as it originally had, as it was set to archive only.
    - The Transaction table will contain fewer rows, as the rows whose transaction_date was older than 730 days were pruned.
  - Five BigQuery tables:
    - db_archival_demo.Transaction and db_archival_demo.User will have the same data as in Cloud SQL, with the changes described above.
    - db_archival_demo.User_snapshot will contain a full copy of the data of the User table.
    - db_archival_demo.Transaction_snapshot will contain the data from Transaction that was marked for deletion - that is, the data whose transaction_date was older than 730 days.
    - db_archival_demo.Transaction_snapshot_prune_progress will contain the primary keys of the data that was deleted, plus metadata indicating the run, the date and its confirmed pruned status.
- If you want to move the solution to work on your own data and on a schedule:
  - Read how to configure Database Archival for your own data, including how to run it on a schedule and other variables.
  - Read how to productionize the solution.
Configuration¶
Composer per-table workflow configuration¶
The configuration for the tables is created as a list of JSON objects and stored in Google Cloud Storage. Each object contains the following elements:
For both Historical and Main Data¶
- database_table_name: string, the name of the table that will be archived and, if required, pruned. Example: flight.
- bigquery_location: string, the Google Cloud location where the BigQuery data is located and jobs will be created. Example: us-central1.
- bigquery_table_name: string, the name of the table in BigQuery that is a copy of the database_table_name, created by Datastream. Example: project.dataset.flight.
- bigquery_days_to_keep: number, the number of days for which the newly copied data will be stored in BigQuery. The partitions get deleted after this number of days. Example: 3650 days = ~10 years.
- database_prune_data: boolean, whether the data in the database needs to be pruned (deleted). Set to true if this is a Historical Data table which needs pruning, otherwise set to false.
For Historical Data only¶
- database_prune_batch_size: number, the size of the batches for data pruning. When deleting data from the database, each database transaction will delete this number of rows. Example: 1000. Recommended to be kept between 100 and 10000 rows. This field is optional. Default: 1000 rows.
- table_primary_key_columns: list of strings, the names of the columns of all the primary keys for this table. Must have at least one primary key, but can have as many as required by the schema. Example: ["flight_id"].
- table_date_column: string, the name of the date column which determines when the data should be pruned. When the date in this column is older than database_days_to_keep, the data gets archived and pruned. Example: flight_date.
- table_date_column_data_type: enum, as a string, the type of column that table_date_column is. Must be one of DATE, DATETIME or TIMESTAMP.
- database_days_to_keep: number, the number of days for which the data should be stored in the database. Data older than this will be archived and pruned (deleted). Example: 365 for one year.
- database_type: enum, as a string, the type of the database where this table is located. Must be one of the following: MYSQL, POSTGRES, SQL_SERVER, ALLOY_DB or SPANNER. See Database Support for the list of supported databases.
- database_host: string, the database host (or IP) where the database is located. Example: 10.8.0.32. It can be an internal or external IP, provided that the right networking is configured. We recommend using an internal IP where possible. If the database uses a non-standard port number, the port must be added to the database_host. Example: 10.8.0.32:9432. Only one of database_host or database_instance_name must be provided. When database_host is provided, the connection to the database is done via TCP.
- database_instance_name: string, the database connection name, which represents the instance. Only one of database_host or database_instance_name must be provided. When database_instance_name is provided, the connection to the database is done via the Python connector using the Private IP. If you must use the Public IP, use database_host instead.
  - AlloyDB example: projects/<project_id>/locations/<region_id>/clusters/<cluster_id>/instances/<instance_id>.
  - Cloud SQL example: project:region:database-sample.
- database_name: string, the name of the database where the table is located. Example: dataset.
- database_username: string, the name of the user to use to connect to the database. Example: user. The user must have permission to read and remove records on the given database and table.
- database_password: string, the password for the user which will be used to connect to the database. Only one of database_password or database_password_secret must be provided. We recommend using database_password_secret instead.
- database_password_secret: string, the full name of the secret stored in Secret Manager, which contains the password for the user which will be used to connect to the database. Example: projects/project-id/secrets/secret-key-db-password/versions/1. Only one of database_password or database_password_secret must be provided.
Sample per-table workflow configuration¶
[
    {
        "bigquery_location": "us-central1",
        "bigquery_table_name": "project.dataset.flight",
        "bigquery_days_to_keep": 3000,
        "database_table_name": "flight",
        "database_prune_data": true,
        "database_prune_batch_size": 1000,
        "table_primary_key_columns": ["flight_id"],
        "table_date_column": "departure",
        "table_date_column_data_type": "DATE",
        "database_days_to_keep": 365,
        "database_type": "MYSQL",
        "database_instance_name": "project:region:database-sample",
        "database_name": "dataset",
        "database_username": "user",
        "database_password_secret": "projects/project-id/secrets/secret-key-db-password/versions/1"
    },
    {
        "bigquery_location": "us-central1",
        "bigquery_table_name": "project.dataset.airline",
        "bigquery_days_to_keep": 3000,
        "database_table_name": "airline",
        "database_prune_data": false
    }
]
The first example (flight
) represents a Historical Data table, which needs to
be archived and pruned. The data has been continuously replicated by Datastream
into project.dataset.flight
which is a dataset located in us-central1
. The
archived data and metadata will be kept for 3000
days. The data needs to be
pruned (true
) and will be pruned in batches of size 1000
. The table has one
primary key (flight_id
) and needs to be pruned on the basis of the column
departure
which is a DATE
column. Data older than 365
days will be
archived and deleted from the database. The database is a MYSQL
database on
Cloud SQL (instance: project:region:database-sample
). The tables are hosted
on the database dataset
which can be accessed using user user
whose password
is stored on Secret Manager under the secret
projects/project-id/secrets/secret-key-db-password/versions/1
.
The second example (airline
) represents a Main Data table, which needs to be
archived but does not need to be pruned. The data has been continuously
replicated by Datastream into project.dataset.airline
which is a dataset
located in us-central1
.
The tables in the configuration can belong to multiple databases, instances and database types, hosted in Cloud SQL or self-hosted - provided that the right permissions (IAM) and networking are configured to enable access from Composer and the Cloud Run Function to the database.
Validate the configuration¶
If the JSON configuration file is not successfully parsed or contains wrong values, the DAG will not be built and none of the tables will run. To avoid any issues, validate your configuration file before uploading a new version to Google Cloud Storage. There is a tool in this repository to achieve that. To validate your configuration:
- Access this project repository.
- Install the project and its required dependencies.
- Run the configuration tool.

  python3 tools/config_validator/config_validator.py \
    --filename <path_to_config_file_to_validate>.json

  Where:
  - <path_to_config_file_to_validate>.json is the path to the JSON config file that you want to validate.
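A possible sketch of the install step, assuming the project dependencies are listed in a requirements.txt at the repository root (the file name and location are assumptions):

pip3 install -r requirements.txt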
Other variables and configuration¶
When deploying the Composer pipeline (also referred to as a Directed Acyclic Graph or DAG), you must configure some variables. These can be configured as environment variables or by editing src/database_archival/dag/config.py directly.
Required configuration variables¶
- DATA_ARCHIVAL_CONFIG_PATH: string, the Google Cloud Storage URI of the JSON file which contains the per-table configuration. Example: gs://bucket/file.json.
- CLOUD_FUNCTION_URL_DATA_DELETION: string, the URL to the Cloud Run Function deployed to prune the data. Example: https://region-project.cloudfunctions.net/prune-data.
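For example, a sketch of setting these as environment variables on the Composer environment with the gcloud CLI, assuming the demo environment name db-archival-composer-env (the values shown are illustrative):

gcloud composer environments update db-archival-composer-env \
  --location <region> \
  --update-env-variables=DATA_ARCHIVAL_CONFIG_PATH=gs://bucket/file.json,CLOUD_FUNCTION_URL_DATA_DELETION=https://region-project.cloudfunctions.net/prune-data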
Optional configuration variables¶
These variables are optional; you may configure them, but they are not required to run the pipeline (DAG):
- DAG_NAME: string, name of the DAG. This will be used in the Airflow and Composer UI. Default: database_archival.
- DAG_SCHEDULE_INTERVAL: string, the schedule interval for the DAG. See Scheduling & Triggers for more details. Default: None, which means that the pipeline must be run manually.
- TIME_BETWEEN_DELETE_BATCHES_IN_SECONDS: number, the time in seconds to wait between the deletion of two batches for the same table. Default value: 120 seconds (2 minutes).
- DATABASE_TIMEOUT: number, the timeout in seconds for the Cloud Run Function to establish a database connection and perform data deletion. Default: 3000 seconds.
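For example, to run the pipeline on a schedule instead of manually, DAG_SCHEDULE_INTERVAL accepts the usual Airflow schedule expressions; the exact values below are illustrative:

DAG_SCHEDULE_INTERVAL="0 3 * * 0"   # cron expression: every Sunday at 03:00
DAG_SCHEDULE_INTERVAL="@weekly"     # Airflow schedule preset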
Common Troubleshooting Issues¶
- Error on the private_vpc_connection Terraform resource step. Solution: manually run the gcloud command to update the VPC peerings after private-ip-alloc is created. Follow this article for the command to be used.
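A sketch of that command, assuming the default VPC network and that the allocated range created by Terraform is named private-ip-alloc (adjust both to your deployment):

gcloud services vpc-peerings update \
  --service=servicenetworking.googleapis.com \
  --network=default \
  --ranges=private-ip-alloc \
  --force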
Productionization¶
The following steps are recommended for productionizing deployment of the Database Archival solution:
- Begin by deploying the demo, move towards a dev/test environment, and progress your use of Database Archival safely towards your Production environments.
  - As you are testing the solution, run it first on test instances and databases.
  - Ensure you have backups and point-in-time recovery of the data while you use Database Archival, to minimize any accidental data loss.
- Incorporate the relevant portions of the supplied Terraform demo configuration into your own Terraform codebase. You may choose to use part of the supplied modules with the relevant changes, or select portions of the modules to use in your own projects. For more details, check the Terraform code and/or read the Architecture section. For a full deployment, you will need to:
  - Have an existing Cloud SQL or AlloyDB database.
  - Have or deploy a live copy of the data, with continuous Change Data Capture (CDC), to BigQuery. You may choose to use Datastream or an alternative solution for this.
  - Deploy the Database Archival Cloud Run Function which deletes/prunes data from the database.
  - Deploy the Composer pipeline which coordinates the workflow, including the Google Cloud Storage bucket where the code lives.
  - Configure the Composer pipeline to act on your desired instances, databases and tables. Host the configuration file on Google Cloud Storage (GCS), and create a new GCS bucket if necessary.
  - Configure networking and IAM so that:
    - The Cloud Run Function can connect to the database to read and remove rows.
    - The Cloud Run Function can connect to the BigQuery dataset to read and update rows.
    - Composer can connect to the BigQuery dataset to read and update rows, and create new tables.
    - Composer can connect to the Cloud Run Function.
- Decouple the lifecycle of the Database Archival components from the lifecycle of the database instances being archived and pruned. In particular, it should be possible to completely tear down and redeploy all components of the Database Archival solution without affecting your database instances.
- Pay particular attention to the management and permissions of the service accounts you configure Database Archival to use. We recommend assigning minimally permissioned service accounts.
- Store your Database Archival configuration files in your source control system, along with the Terraform and application codebase.
- Automate updating the Database Archival configuration using a deployment pipeline separate from deploying the solution itself. This will allow you to incorporate policy and other checks according to your organizational requirements (e.g. change freeze periods), as well as decoupling updates to the configuration from updates to the solution code itself.
- Define alerts to be notified of archival and pruning events that may affect your platform or your application. You can use log-based alerts to configure alerts that will notify you whenever a specific message appears in the logs.