In [ ]:

Copied!





# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Multimodal Prompting with Gemini: Working with Audio¶

Run in Colab

Open in Colab Enterprise

Open in Vertex AI Workbench

View on GitHub


Author(s)	Michael Chertushkin
Reviewer(s)	Rajesh Thallam, Skander Hannachi
Last updated	2024-09-16

Overview¶

Gemini 2.0 models supports adding image, audio, video, and PDF files in text or chat prompts for a text or code response. Gemini 2.0 Flash supports up to 1 Million input tokens with up to 8.4 hours length of audio per prompt. You can add audio to Gemini requests to perform audio analysis tasks such as transcribing audio, audio chapterization (or localization), key event detection, audio translation and more.

In this notebook we cover prompting recipes and strategies for working with Gemini on audio files and show some examples on the way. This notebook is organized as follows:

Audio Understanding
Effective prompting
Key event detection
Using System instruction
Generating structured output

Getting Started¶

The following steps are necessary to run this notebook, no matter what notebook environment you're using.

If you're entirely new to Google Cloud, get started here.

Google Cloud Project Setup¶

Select or create a Google Cloud project. When you first create an account, you get a $300 free credit towards your compute/storage costs.
Make sure that billing is enabled for your project.
Enable the Service Usage API
Enable the Vertex AI API.
Enable the Cloud Storage API.

Google Cloud Permissions¶

To run the complete Notebook, including the optional section, you will need to have the Owner role for your project.

If you want to skip the optional section, you need at least the following roles:

roles/serviceusage.serviceUsageAdmin to enable APIs
roles/iam.serviceAccountAdmin to modify service agent permissions
roles/aiplatform.user to use AI Platform components
roles/storage.objectAdmin to modify and delete GCS buckets

Install Vertex AI SDK for Python and other dependencies (If Needed)¶

The list packages contains tuples of package import names and install names. If the import name is not found then the install name is used to install quitely for the current user.## Install Vertex AI SDK for Python and other dependencies (If Needed)

The list packages contains tuples of package import names and install names. If the import name is not found then the install name is used to install quitely for the current user.

In [ ]:

Copied!

! pip install google-cloud-aiplatform --upgrade --quiet --user
! pip install google-cloud-aiplatform --upgrade --quiet --user

Restart Runtime¶

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [ ]:

Copied!

# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Authenticate¶

If you're using Colab, run the code in the next cell. Follow the popups and authenticate with an account that has access to your Google Cloud project.

If you're running this notebook somewhere besides Colab, make sure your environment has the right Google Cloud access. If that's a new concept to you, consider looking into Application Default Credentials for your local environment and initializing the Google Cloud CLI. In many cases, running gcloud auth application-default login in a shell on the machine running the notebook kernel is sufficient.

More authentication options are discussed here.

In [ ]:

Copied!





# Colab authentication.
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()
    print("Authenticated")
# Colab authentication.
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()
    print("Authenticated")

Set Google Cloud project information and Initialize Vertex AI SDK¶

To get started using Vertex AI, you must have an existing Google Cloud project and enable the Vertex AI API.

Learn more about setting up a project and a development environment.

Make sure to change PROJECT_ID in the next cell. You can leave the values for REGION unless you have a specific reason to change them.

In [ ]:

Copied!





import vertexai

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}

vertexai.init(project=PROJECT_ID, location=REGION)
print("Vertex AI SDK initialized.")
print(f"Vertex AI SDK version = {vertexai.__version__}")
import vertexai

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}

vertexai.init(project=PROJECT_ID, location=REGION)
print("Vertex AI SDK initialized.")
print(f"Vertex AI SDK version = {vertexai.__version__}")

Import Libraries¶

In [ ]:

Copied!

from vertexai.generative_models import (GenerationConfig, GenerativeModel,
                                        HarmBlockThreshold, HarmCategory, Part)
from vertexai.generative_models import (GenerationConfig, GenerativeModel,
                                        HarmBlockThreshold, HarmCategory, Part)

Define Utility functions¶

In [ ]:

Copied!





import http.client
import textwrap
import typing
import urllib.request

from google.cloud import storage
from IPython import display
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"


def wrap(string, max_width=80):
    return textwrap.fill(string, max_width)


def get_bytes_from_url(url: str) -> bytes:
    with urllib.request.urlopen(url) as response:
        response = typing.cast(http.client.HTTPResponse, response)
        bytes = response.read()
    return bytes


def get_bytes_from_gcs(gcs_path: str):
    bucket_name = gcs_path.split("/")[2]
    object_prefix = "/".join(gcs_path.split("/")[3:])
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.get_blob(object_prefix)
    return blob.download_as_bytes()


def display_image(image_url: str, width: int = 300, height: int = 200):
    if image_url.startswith("gs://"):
        image_bytes = get_bytes_from_gcs(image_url)
    else:
        image_bytes = get_bytes_from_url(image_url)
    display.display(display.Image(data=image_bytes, width=width, height=height))


def display_video(video_url: str, width: int = 300, height: int = 200):
    if video_url.startswith("gs://"):
        video_bytes = get_bytes_from_gcs(video_url)
    else:
        video_bytes = get_bytes_from_url(video_url)
    display.display(
        display.Video(
            data=video_bytes,
            width=width,
            height=height,
            embed=True,
            mimetype="video/mp4",
        )
    )


def display_audio(audio_url: str, width: int = 300, height: int = 200):
    if audio_url.startswith("gs://"):
        audio_bytes = get_bytes_from_gcs(audio_url)
    else:
        audio_bytes = get_bytes_from_url(audio_url)
    display.display(display.Audio(data=audio_bytes, embed=True))


def print_prompt(contents: list[str | Part]):
    for content in contents:
        if isinstance(content, Part):
            if content.mime_type.startswith("image"):
                display_image(image_url=content.file_data.file_uri)
            elif content.mime_type.startswith("video"):
                display_video(video_url=content.file_data.file_uri)
            elif content.mime_type.startswith("audio"):
                display_audio(audio_url=content.file_data.file_uri)
            else:
                print(content)
        else:
            print(content)
import http.client
import textwrap
import typing
import urllib.request

from google.cloud import storage
from IPython import display
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"


def wrap(string, max_width=80):
    return textwrap.fill(string, max_width)


def get_bytes_from_url(url: str) -> bytes:
    with urllib.request.urlopen(url) as response:
        response = typing.cast(http.client.HTTPResponse, response)
        bytes = response.read()
    return bytes


def get_bytes_from_gcs(gcs_path: str):
    bucket_name = gcs_path.split("/")[2]
    object_prefix = "/".join(gcs_path.split("/")[3:])
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.get_blob(object_prefix)
    return blob.download_as_bytes()


def display_image(image_url: str, width: int = 300, height: int = 200):
    if image_url.startswith("gs://"):
        image_bytes = get_bytes_from_gcs(image_url)
    else:
        image_bytes = get_bytes_from_url(image_url)
    display.display(display.Image(data=image_bytes, width=width, height=height))


def display_video(video_url: str, width: int = 300, height: int = 200):
    if video_url.startswith("gs://"):
        video_bytes = get_bytes_from_gcs(video_url)
    else:
        video_bytes = get_bytes_from_url(video_url)
    display.display(
        display.Video(
            data=video_bytes,
            width=width,
            height=height,
            embed=True,
            mimetype="video/mp4",
        )
    )


def display_audio(audio_url: str, width: int = 300, height: int = 200):
    if audio_url.startswith("gs://"):
        audio_bytes = get_bytes_from_gcs(audio_url)
    else:
        audio_bytes = get_bytes_from_url(audio_url)
    display.display(display.Audio(data=audio_bytes, embed=True))


def print_prompt(contents: list[str | Part]):
    for content in contents:
        if isinstance(content, Part):
            if content.mime_type.startswith("image"):
                display_image(image_url=content.file_data.file_uri)
            elif content.mime_type.startswith("video"):
                display_video(video_url=content.file_data.file_uri)
            elif content.mime_type.startswith("audio"):
                display_audio(audio_url=content.file_data.file_uri)
            else:
                print(content)
        else:
            print(content)

Initialize Gemini¶

In [ ]:

Copied!





# Gemini Config
GENERATION_CONFIG = {
    "max_output_tokens": 8192,
    "temperature": 0.1,
    "top_p": 0.95,
}

SAFETY_CONFIG = {
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
}

gemini = GenerativeModel(model_name="gemini-2.0-flash-001")
audio_path_prefix = (
    "gs://public-aaie-genai-samples/gemini/prompting_recipes/multimodal/audio"
)


def generate(
    model,
    contents,
    safety_settings=SAFETY_CONFIG,
    generation_config=GENERATION_CONFIG,
    as_markdown=False,
):
    responses = model.generate_content(
        contents=contents,
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )
    if isinstance(responses, list):
        for response in responses:
            if as_markdown:
                display.display(display.Markdown(response.text))
            else:
                print(wrap(response.text), end="")
    else:
        if as_markdown:
            display.display(display.Markdown(responses.text))
        else:
            print(wrap(responses.text), end="")
# Gemini Config
GENERATION_CONFIG = {
    "max_output_tokens": 8192,
    "temperature": 0.1,
    "top_p": 0.95,
}

SAFETY_CONFIG = {
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
}

gemini = GenerativeModel(model_name="gemini-2.0-flash-001")
audio_path_prefix = (
    "gs://public-aaie-genai-samples/gemini/prompting_recipes/multimodal/audio"
)


def generate(
    model,
    contents,
    safety_settings=SAFETY_CONFIG,
    generation_config=GENERATION_CONFIG,
    as_markdown=False,
):
    responses = model.generate_content(
        contents=contents,
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )
    if isinstance(responses, list):
        for response in responses:
            if as_markdown:
                display.display(display.Markdown(response.text))
            else:
                print(wrap(response.text), end="")
    else:
        if as_markdown:
            display.display(display.Markdown(responses.text))
        else:
            print(wrap(responses.text), end="")

In [ ]:

Copied!

display_audio(
    audio_url="gs://public-aaie-genai-samples/gemini/prompting_recipes/multimodal/audio/sound_1.mp3"
)
display_audio(
    audio_url="gs://public-aaie-genai-samples/gemini/prompting_recipes/multimodal/audio/sound_1.mp3"
)

Prompt #1. Audio Understanding¶

This task requires the input to be presented in two different modalities: text and audio. The example of the API call is below, however this is non-optimal prompt and we can make it better.

In [ ]:

Copied!





audio_path = f"{audio_path_prefix}/sound_1.mp3"
audio_content = Part.from_uri(uri=audio_path, mime_type="audio/mp3")
prompt = """Provide a description of the audio.
The description should also contain anything important which people say in the audio."""

contents = [audio_content, prompt]
# print_prompt(contents)
audio_path = f"{audio_path_prefix}/sound_1.mp3"
audio_content = Part.from_uri(uri=audio_path, mime_type="audio/mp3")
prompt = """Provide a description of the audio.
The description should also contain anything important which people say in the audio."""

contents = [audio_content, prompt]
# print_prompt(contents)

In [6]:

Copied!

generate(gemini, contents, as_markdown=True)
generate(gemini, contents, as_markdown=True)

This is an audio recording of CD2, an audio program to accompany English in Action 1, second edition, by Barbara H. Foley and Elizabeth R. Neblet. The copyright is 2018, National Geographic Learning, a part of Cengage Learning. The audio contains a section titled "A. Listen and Repeat." The audio then lists the following sentences:

He is eating.
He is washing the car.
She is listening to the radio.
They are studying.
He is cooking.
She is sleeping.
He is reading.
She is drinking.
They are talking.
They are watching TV.
He is doing his homework.
She is cleaning the house.
She is driving.
They are walking.
She is making lunch.
He is doing the laundry.

As we see the model correctly picked that this is a lesson in English, however we can improve the level of details.

Prompt #2. Crafting an effective prompt¶

To get the best results from Gemini for a task, think about both what you tell it and how you tell it.

What: Include all the necessary information to solve the task, like instructions, examples, and background details.
How: Structure this information clearly.
- Order: Organize prompt in a logical sequence.
- Delimiters/Separators: Use headings or keywords to highlight key information. XML tags or Markdown headers are a good way to format.

A well-structured prompt is easier for the model to understand and process, leading to more accurate and relevant responses.

Let's rewrite the prompt and add a persona (or role), give clear goals, use XML tags as prompt separators.

In [7]:

Copied!





prompt = """You are an audio analyzer. You receive an audio and produce the 
detailed description about what happens in the audio.

<INSTRUCTIONS>
- Determine what happens in the audio
- Understand the hidden meaning of the audio
- If there are dialogues, identify the talking personas
- Make sure the description is clear and helpful
</INSTRUCTIONS>

Now analyse the following audio
"""

contents = [audio_content, prompt]
generate(gemini, contents, as_markdown=True)
prompt = """You are an audio analyzer. You receive an audio and produce the 
detailed description about what happens in the audio.


- Determine what happens in the audio
- Understand the hidden meaning of the audio
- If there are dialogues, identify the talking personas
- Make sure the description is clear and helpful


Now analyse the following audio
"""

contents = [audio_content, prompt]
generate(gemini, contents, as_markdown=True)

Okay, here is a detailed description of the audio:

The audio is an audio program to accompany English in Action 1, second edition, by Barbara H. Foley and Elizabeth R. Neblet. It is copyrighted in 2018 by National Geographic Learning, a part of Cengage Learning.

The audio contains a series of sentences that are read aloud. The sentences describe various actions that people are doing. The listener is instructed to listen and repeat each sentence.

Here is a list of the sentences that are read aloud:

He is eating.
He is washing the car.
She is listening to the radio.
They are studying.
He is cooking.
She is sleeping.
He is reading.
She is drinking.
They are talking.
They are watching TV.
He is doing his homework.
She is cleaning the house.
She is driving.
They are walking.
She is making lunch.
He is doing the laundry.

With the updated prompt, we are able to capture much more details, although this prompt is rather generic and can be used for other audio files. Now let's add these changes as system instruction and see.

Prompt #3. Using system instruction¶

System Instruction (SI) is an effective way to steer Gemini's behavior and shape how the model responds to your prompt. SI can be used to describe model behavior such as persona, goal, tasks to perform, output format / tone / style, any constraints etc.

SI behaves more "sticky" (or consistent) during multi-turn behavior. For example, if you want to achieve a behavior that the model will consistently follow, then system instruction is the best way to put this instruction.

In this example, we will move the task rules to system instruction.

In [8]:

Copied!





system_prompt = """You are an audio analyzer. You receive an audio and produce 
the detailed description about what happens in the audio.

<INSTRUCTIONS>
- Determine what happens in the audio
- Understand the hidden meaning of the audio
- If there are dialogues, identify the talking personas
- Make sure the description is clear and helpful
</INSTRUCTIONS>
"""

prompt = "Now analyze the audio"
system_prompt = """You are an audio analyzer. You receive an audio and produce 
the detailed description about what happens in the audio.


- Determine what happens in the audio
- Understand the hidden meaning of the audio
- If there are dialogues, identify the talking personas
- Make sure the description is clear and helpful

"""

prompt = "Now analyze the audio"

In [9]:

Copied!





gemini_si = GenerativeModel(
    model_name="gemini-2.0-flash-001", system_instruction=system_prompt
)

contents = [audio_content, prompt]
generate(gemini_si, contents, as_markdown=True)
gemini_si = GenerativeModel(
    model_name="gemini-2.0-flash-001", system_instruction=system_prompt
)

contents = [audio_content, prompt]
generate(gemini_si, contents, as_markdown=True)

Okay, here is the analysis of the audio:

General Description: The audio is an educational recording designed to accompany the "English in Action 1" textbook, second edition. It seems to be focused on teaching basic English vocabulary and grammar, specifically related to actions and present continuous tense.

Content Breakdown:

Introduction: A narrator introduces the audio program, mentioning the textbook it accompanies, the authors (Barbara H. Foley and Elizabeth R. Neblet), and the copyright information (2018, National Geographic Learning, a part of Cengage Learning).
Instructions: A voice instructs the listener to "Listen and repeat."
Vocabulary/Grammar Practice: A series of numbered sentences are presented, each describing an action. The sentences are simple and use the present continuous tense (e.g., "He is eating," "She is washing the car").

Talking Personas:

Narrator: A voice introduces the audio program and provides copyright information.
Instructor: A voice gives instructions to the listener (e.g., "Listen and repeat").
Speakers: Different voices (male and female) read the sentences describing the actions.

Hidden Meaning/Purpose: The audio aims to help learners:

Improve their listening comprehension skills.
Practice pronunciation by repeating the sentences.
Learn and reinforce vocabulary related to everyday actions.
Understand and use the present continuous tense correctly.

Overall: The audio is a straightforward and practical tool for English language learners, particularly beginners. It focuses on building foundational skills in listening, speaking, and grammar through repetition and simple sentence structures.

Prompt #4. Audio Understanding: Get structured outputs¶

Gemini models can generate structured outputs such as JSON, providing a blueprint for the model's output. This feature is also referred to as controlled generation.

In this example, we demonstrate Gemini to return structured output (JSON) from a audio analysis. One of the ways to achieve better understanding of audio (or any multimodal) content is to prompt the model to explain its "reasoning" about the response. This has proven to be very effective method, however it can increase the latency.

Vertex AI Gemini API makes it easy to return JSON output by configuring response MIME type as application/json. Optionally, you can also configure response_schema with the JSON schema for the model to generate output as per the schema.

In [10]:

Copied!





response_schema = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "harmfulness_reasoning": {
                "type": "STRING",
                "description": "Step-by-step detailed reasoning about how harmful is the audio",
            },
            "harmfulness_score": {
                "type": "INTEGER",
                "description": "Number between 0 and 5 indicating how harmful is the audio",
            },
        },
        "required": ["harmfulness_reasoning", "harmfulness_score"],
    },
}
response_schema = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "harmfulness_reasoning": {
                "type": "STRING",
                "description": "Step-by-step detailed reasoning about how harmful is the audio",
            },
            "harmfulness_score": {
                "type": "INTEGER",
                "description": "Number between 0 and 5 indicating how harmful is the audio",
            },
        },
        "required": ["harmfulness_reasoning", "harmfulness_score"],
    },
}

In [11]:

Copied!





structured_prompt = """You are an audio analyzer. You receive an audio and 
produce harmfulness score - how harmful this audio can be for kids."""

contents = [audio_content, structured_prompt]

generate(
    gemini,
    contents,
    generation_config=GenerationConfig(
        response_mime_type="application/json", response_schema=response_schema
    ),
)
structured_prompt = """You are an audio analyzer. You receive an audio and 
produce harmfulness score - how harmful this audio can be for kids."""

contents = [audio_content, structured_prompt]

generate(
    gemini,
    contents,
    generation_config=GenerationConfig(
        response_mime_type="application/json", response_schema=response_schema
    ),
)

[   {     "harmfulness_reasoning": "The audio primarily contains instructional
content for English language learning, featuring clear and neutral speech. There
are no elements that could be considered harmful to children; it lacks any
offensive language, violence, or suggestive themes. The overall tone is
educational and safe for children of all ages.",     "harmfulness_score": 1   }
]

The model returned the correct score for the audio by asking the model to output "reasoning" along with the score. Adding "reasoning" field before the "score" gives a consistent and correct score. The intuition is that LLM can generate "reasoning" first and rely on the thoughts to properly produce the score.

Conclusion¶

This demonstrated various examples of working with Gemini using audio files. Following are general prompting strategies when working with Gemini on multimodal prompts, that can help achieve better performance from Gemini:

Craft clear and concise instructions.
Add your video or any media first for single-media prompts.
Add few-shot examples to the prompt to show the model how you want the task done and the expected output.
Break down the task step-by-step.
Specify the output format.
Ask Gemini to include reasoning in its response along with decision or scores
Use context caching for repeated queries.

Specifically, when working with audio following may help:

Ask Gemini to avoid summarizing for transcription.
Add examples for effective speaker diarization.