Skip to content

Accelerating the Path from Lab to Life

Demonstrations that illustrate leveraging Google Cloud for Life Sciences Applications

Overview

The drug discovery process involves a complex workflow called "Target and Lead Identification". Target and lead identification is the foundational, early-stage drug discovery process that identifies a specific biological molecule causing a disease (target) and finds a promising small molecule (lead) that interacts with it to provide therapeutic benefits. It involves rigorous validation of the target followed by high-throughput screening of chemical libraries, typically narrowing down hundreds of potential "hits" into a single, optimized drug candidate. The Target Identification process involves the collection of genomic data and translates that into gene expression data. This data is relational and very large, which creates a challenge for scientists that need to access which particular genes can be targeted for treatment, a process called "Biomarker Identification". After the relevant biomarkers are identified, further exploration is done at the molecular level. At this stage, a process called "Protein Docking" is used to simulate how the target protein binds to smaller moleculed called "ligands". In a following phase, molecular simulation is used to determine the stability of candidate targets.

Use Cases

Biomarker Identification

This repository addresses the Biomarker Identification process, using BigQuery and the Data Science Agent for Colab Enterprise. This repository provides notebooks that download and interpret the dataset from the clinical trial of the renal cancer drug Avelumab. This clinical is known in the literature as the JAVELIN trial, as described in the paper below: Motzer et. al.

Protein Docking and Molecular Simulation

After the main Biomarkers are identified, the relevant protein segments are used in subsequent molecular docking and simulation processes. These processes involve the deployment of the protein folding models such as Alphafold and molecular simulation applications such as GROMACS. These applications can be efficiently deployed using the Google Cloud Cluster Development toolkit, as described in the links below:

Alphafold Blueprint for Google Cloud Cluster Toolkit

GROMACS Blueprint for Google Cloud Cluster Toolkit

Pre-requisites

The following are the pre-requisites to deploy the Biomarker Identification use case:

A Project ID of an existing or new Google Cloud project The user must have the IAM role BigQuery User in order to create and use BigQuery tables. It is recommended to have a Vertex AI Workbench instance to download and process the clinical trial data.

Quota Requirements

The notebooks proviuded in this repo use only a fraction of the default quota for BigQuery utilization of 200 Tebibytes (TiB) of data processed per project per day.

Project Structure

This repository provides two Notebooks as follows:

Loading the JAVELIN Clinical Trial Data Into Big Query

Biomarker Identification with the Data Scientist Agent

Getting Started

Obtain the Project ID of the project that will host the BigQuery tables that will be created Next, use the Loading the JAVELIN Clinical Trial Data Into Big Query notebook to download the JAVELIN clinical trial data and upload into BigQuery. Because this notebook will download a large amount of data and requires significant disk space it is recommended to run this Notebook in a Vertex AI Workbench instance. After executing the above step, use the Biomarker Identification with the Data Scientist Agent to run the Biomarker Identificaton notebook. Optional: After running the above notebook, you can download the notebook to PDF format and then provide it as input to other AI Agents such as NotebookLM, which can produce a report and a video outline of the Biomarker Identification findings.

Reference Architecture

The main component used in the Biomarker Identification use case is the Data Science Agent for Colab Enteprise. The following reference architecture illustrates how this tool relates to other components of the larger Target and Lead Identification workflow:

Reference Architecture

Disclaimers

This is not an officially supported Google Service. The use of this solution is on an “as-is” basis, and is not a Service offered under the Google Cloud Terms of Service.

This solution is under active development. Interfaces and functionality may change at any time.

License

This repository is licensed under the Apache License, Version 2.0 (see LICENSE). The solution includes declarative markdown files that are interpretable by certain third-party technologies (e.g., Terraform and DBT). These files are for informational use only and do not constitute an endorsement of those technologies, including any warranties, representations, or other guarantees as to their security, reliability, or suitability for purpose.