Development Documentation

This documentation is targeted at developers. The guide below helps you run common tasks for contributors and project maintainers.

Run Unit Tests

When making any changes, run the unit tests to ensure the pipeline still works as intended. You may want to extend these tests if your changes affect the underlying pipeline logic or its use cases.

Run Tests Directly via Python

Prerequisites to Run Tests

  1. Python 3.7 or higher, with pip installed.

  2. JDK 17 or higher.
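
You can optionally confirm the tool versions first. A quick check, assuming python3, pip, and java are on your PATH:

    python3 --version          # expect Python 3.7 or higher
    python3 -m pip --version
    java -version              # expect JDK 17 or higher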

Run Tests with Python

  1. Clone this repository.

    git clone https://github.com/GoogleCloudPlatform/cloud-solutions.git
    
  2. Access this project folder.

    cd cloud-solutions/projects/dataflow-gcs-to-alloydb/
    
  3. Install the dependencies from requirements_dev.txt. Optionally, use a virtual environment (see the sketch after this list).

    python3 -m pip install --upgrade pip && \
    python3 -m pip install --require-hashes -r ./requirements_dev.txt
    
  4. Run the unit tests.

    python3 ./src/dataflow_gcs_to_alloydb_test.py
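
If you want to isolate the Python dependencies, a minimal virtual-environment setup (an optional sketch, using Python's built-in venv module) looks like this:

    python3 -m venv .venv && \
    source .venv/bin/activate

The same environment can be reused when installing requirements.txt for the local pipeline run described below.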
    

Run Tests via Container

Prerequisites to Run the Test Container

  1. Docker Engine or equivalent installed.

  2. Docker Compose or equivalent installed.
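
A quick check that both are available (a sketch, assuming the Docker CLI and the Compose v2 plugin; with the standalone binary, use docker-compose --version instead):

    docker --version
    docker compose version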

Build and Run the Test Container

  1. Clone this repository.

    git clone https://github.com/GoogleCloudPlatform/cloud-solutions.git
    
  2. Access this project folder.

    cd cloud-solutions/projects/dataflow-gcs-to-alloydb/
    
  3. Build the test container image.

    docker build -f Dockerfile.dev . -t test_container
    
  4. Run the test container. The --net host flag and the Docker socket mount give the tests access to the host network and to the host's Docker daemon.

    docker run --net host -v /var/run/docker.sock:/var/run/docker.sock test_container
    

Run Pipeline Locally

For development purposes, you may want to run the pipeline locally using Apache Beam's Direct Runner.

Prerequisites

  1. Python 3.7 or higher, with pip installed.

  2. JDK 17 or higher.

  3. An AlloyDB (or PostgreSQL-compatible) database and a data file to load into it. If you do not have them, follow the steps under Run Dataflow Template first.

Run the Dataflow Pipeline Locally

  1. Clone this repository.

    git clone https://github.com/GoogleCloudPlatform/cloud-solutions.git
    
  2. Access this project folder.

    cd cloud-solutions/projects/dataflow-gcs-to-alloydb/
    
  3. Set the following variables to your project's values.

    BUCKET_NAME=""
    ALLOYDB_IP=""
    ALLOYDB_PASSWORD=""
    

    The variables mean the following:

    - BUCKET_NAME is the name of the Google Cloud Storage bucket from which the data files are read.

    - ALLOYDB_IP is the IP address or hostname of the AlloyDB instance. Your machine needs to be able to reach this IP; you may need to use a public IP for this. A quick connectivity check is sketched at the end of this section.

    - ALLOYDB_PASSWORD is the password for the AlloyDB instance.

  4. Install the dependencies from requirements.txt. Optionally, use a virtual environment (see the sketch under Run Tests with Python).

    python3 -m pip install --upgrade pip && \
    python3 -m pip install --require-hashes -r ./requirements.txt

  5. Run the pipeline locally.

    python3 ./src/dataflow_gcs_to_alloydb.py \
      --input_file_format=csv \
      --input_file_pattern "gs://$BUCKET_NAME/dataflow-template/*.csv" \
      --input_schema "id:int64;first_name:string;last_name:string;department:string;salary:float;hire_date:string" \
      --alloydb_ip "$ALLOYDB_IP" \
      --alloydb_password "$ALLOYDB_PASSWORD" \
      --alloydb_table "employees"
    

    To learn how to customize these flags, read the Configuration section.
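
If the pipeline cannot reach the database, first verify connectivity from your machine. A minimal check with psql (a sketch, assuming the psql client is installed and the instance's default postgres user and database; adjust user and dbname to your setup):

    psql "host=$ALLOYDB_IP port=5432 user=postgres password=$ALLOYDB_PASSWORD dbname=postgres" -c 'SELECT 1;'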