# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Multimodal Prompting with Gemini 1.5: Working with Audio¶
| | |
| --- | --- |
| Author(s) | Michael Chertushkin |
| Reviewer(s) | Rajesh Thallam, Skander Hannachi |
| Last updated | 2024-09-16 |
Overview¶
Gemini 1.5 Pro and Flash models support adding image, audio, video, and PDF files to text or chat prompts for a text or code response. Gemini 1.5 Pro supports up to 2 million input tokens, corresponding to up to 19 hours of audio per prompt. You can add audio to Gemini requests to perform audio analysis tasks such as transcribing audio, audio chapterization (or localization), key event detection, audio translation, and more.
In this notebook we cover prompting recipes and strategies for working with Gemini on audio files and show some examples along the way. This notebook is organized as follows:
- Audio Understanding
- Effective prompting
- Key event detection
- Using system instruction
- Generating structured output
Getting Started¶
The following steps are necessary to run this notebook, no matter what notebook environment you're using.
If you're entirely new to Google Cloud, get started here.
Google Cloud Project Setup¶
- Select or create a Google Cloud project. When you first create an account, you get a $300 free credit towards your compute/storage costs.
- Make sure that billing is enabled for your project.
- Enable the Service Usage API.
- Enable the Vertex AI API.
- Enable the Cloud Storage API.
Google Cloud Permissions¶
To run the complete notebook, including the optional section, you will need the Owner role for your project.
If you want to skip the optional section, you need at least the following roles:
- `roles/serviceusage.serviceUsageAdmin` to enable APIs
- `roles/iam.serviceAccountAdmin` to modify service agent permissions
- `roles/aiplatform.user` to use AI Platform components
- `roles/storage.objectAdmin` to modify and delete GCS buckets
Install Vertex AI SDK for Python and other dependencies (If Needed)¶
The list `packages` contains tuples of package import names and install names. If the import name is not found, the install name is used to install the package quietly for the current user.
! pip install google-cloud-aiplatform --upgrade --quiet --user
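The tuple-based install loop described above is not spelled out in this notebook; the cell above simply upgrades the SDK directly. As a minimal sketch, assuming `packages` holds `(import_name, install_name)` tuples, such a loop might look like:

import importlib
import subprocess
import sys

# Illustrative only: (import name, install name) pairs.
packages = [("vertexai", "google-cloud-aiplatform")]

for import_name, install_name in packages:
    try:
        importlib.import_module(import_name)
    except ImportError:
        # Install quietly for the current user when the import is missing.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--quiet", "--user", install_name]
        )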
Restart Runtime¶
To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.
# Restart kernel after installs so that your environment can access the new packages
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
Authenticate¶
If you're using Colab, run the code in the next cell. Follow the popups and authenticate with an account that has access to your Google Cloud project.
If you're running this notebook somewhere besides Colab, make sure your environment has the right Google Cloud access. If that's a new concept to you, consider looking into Application Default Credentials for your local environment and initializing the Google Cloud CLI. In many cases, running `gcloud auth application-default login` in a shell on the machine running the notebook kernel is sufficient.
More authentication options are discussed here.
# Colab authentication.
import sys
if "google.colab" in sys.modules:
from google.colab import auth
auth.authenticate_user()
print("Authenticated")
Set Google Cloud project information and Initialize Vertex AI SDK¶
To get started using Vertex AI, you must have an existing Google Cloud project and enable the Vertex AI API.
Learn more about setting up a project and a development environment.
Make sure to change `PROJECT_ID` in the next cell. You can leave the value for `REGION` unless you have a specific reason to change it.
import vertexai
PROJECT_ID = "[your-project-id]" # @param {type:"string"}
REGION = "us-central1" # @param {type:"string"}
vertexai.init(project=PROJECT_ID, location=REGION)
print("Vertex AI SDK initialized.")
print(f"Vertex AI SDK version = {vertexai.__version__}")
Vertex AI SDK initialized. Vertex AI SDK version = 1.65.0
Import Libraries¶
from vertexai.generative_models import (
    GenerationConfig,
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
    Part,
)
Define Utility functions¶
import http.client
import textwrap
import typing
import urllib.request
from google.cloud import storage
from IPython import display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
def wrap(string, max_width=80):
    return textwrap.fill(string, max_width)


def get_bytes_from_url(url: str) -> bytes:
    with urllib.request.urlopen(url) as response:
        response = typing.cast(http.client.HTTPResponse, response)
        bytes = response.read()
    return bytes


def get_bytes_from_gcs(gcs_path: str):
    bucket_name = gcs_path.split("/")[2]
    object_prefix = "/".join(gcs_path.split("/")[3:])
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.get_blob(object_prefix)
    return blob.download_as_bytes()


def display_image(image_url: str, width: int = 300, height: int = 200):
    if image_url.startswith("gs://"):
        image_bytes = get_bytes_from_gcs(image_url)
    else:
        image_bytes = get_bytes_from_url(image_url)
    display.display(display.Image(data=image_bytes, width=width, height=height))


def display_video(video_url: str, width: int = 300, height: int = 200):
    if video_url.startswith("gs://"):
        video_bytes = get_bytes_from_gcs(video_url)
    else:
        video_bytes = get_bytes_from_url(video_url)
    display.display(
        display.Video(
            data=video_bytes,
            width=width,
            height=height,
            embed=True,
            mimetype="video/mp4",
        )
    )


def display_audio(audio_url: str, width: int = 300, height: int = 200):
    if audio_url.startswith("gs://"):
        audio_bytes = get_bytes_from_gcs(audio_url)
    else:
        audio_bytes = get_bytes_from_url(audio_url)
    display.display(display.Audio(data=audio_bytes, embed=True))


def print_prompt(contents: list[str | Part]):
    for content in contents:
        if isinstance(content, Part):
            if content.mime_type.startswith("image"):
                display_image(image_url=content.file_data.file_uri)
            elif content.mime_type.startswith("video"):
                display_video(video_url=content.file_data.file_uri)
            elif content.mime_type.startswith("audio"):
                display_audio(audio_url=content.file_data.file_uri)
            else:
                print(content)
        else:
            print(content)
Initialize Gemini¶
# Gemini Config
GENERATION_CONFIG = {
    "max_output_tokens": 8192,
    "temperature": 0.1,
    "top_p": 0.95,
}

SAFETY_CONFIG = {
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
}

gemini_pro = GenerativeModel(model_name="gemini-1.5-pro-001")
gemini_flash = GenerativeModel(model_name="gemini-1.5-flash-001")

audio_path_prefix = (
    "gs://public-aaie-genai-samples/gemini/prompting_recipes/multimodal/audio"
)
def generate(
    model,
    contents,
    safety_settings=SAFETY_CONFIG,
    generation_config=GENERATION_CONFIG,
    as_markdown=False,
):
    responses = model.generate_content(
        contents=contents,
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )
    if isinstance(responses, list):
        for response in responses:
            if as_markdown:
                display.display(display.Markdown(response.text))
            else:
                print(wrap(response.text), end="")
    else:
        if as_markdown:
            display.display(display.Markdown(responses.text))
        else:
            print(wrap(responses.text), end="")
display_audio(
    audio_url="gs://public-aaie-genai-samples/gemini/prompting_recipes/multimodal/audio/sound_1.mp3"
)
Prompt #1. Audio Understanding¶
This task requires the input to be presented in two different modalities: text and audio. An example API call is shown below; however, this prompt is suboptimal, and we can make it better.
audio_path = f"{audio_path_prefix}/sound_1.mp3"
audio_content = Part.from_uri(uri=audio_path, mime_type="audio/mp3")
prompt = """Provide a description of the audio.
The description should also contain anything important which people say in the audio."""
contents = [audio_content, prompt]
# print_prompt(contents)
generate(gemini_pro, contents, as_markdown=True)
The audio is a language learning track, specifically for English learners. It focuses on practicing the present continuous tense.
A voice announces "Listen and repeat." Then, a speaker describes an action using the present continuous tense (e.g., "He is eating," "She is washing the car," "They are studying"). After each sentence, there is a pause for the listener to repeat the phrase. This pattern continues with various actions being described.
As we can see, the model correctly identified that this is an English lesson; however, we can improve the level of detail.
Prompt #2. Crafting an effective prompt¶
To get the best results from Gemini for a task, think about both what you tell it and how you tell it.
- What: Include all the necessary information to solve the task, like instructions, examples, and background details.
- How: Structure this information clearly.
- Order: Organize the prompt in a logical sequence.
- Delimiters/Separators: Use headings or keywords to highlight key information. XML tags or Markdown headers are a good way to format.
A well-structured prompt is easier for the model to understand and process, leading to more accurate and relevant responses.
Let's rewrite the prompt by adding a persona (or role), giving clear goals, and using XML tags as prompt separators.
prompt = """You are an audio analyzer. You receive an audio and produce the
detailed description about what happens in the audio.
<INSTRUCTIONS>
- Determine what happens in the audio
- Understand the hidden meaning of the audio
- If there are dialogues, identify the talking personas
- Make sure the description is clear and helpful
</INSTRUCTIONS>
Now analyse the following audio
"""
contents = [audio_content, prompt]
generate(gemini_pro, contents, as_markdown=True)
The audio is an English language learning exercise, specifically focusing on the present continuous tense.
Here's a breakdown:
- Narrator: The narrator sets up the exercise with the phrase "Listen and repeat."
- Speakers: Two speakers, one male and one female, alternate reading sentences in the present continuous tense. Each sentence describes an action currently in progress.
- Content: The sentences describe everyday activities like eating, washing the car, listening to the radio, studying, cooking, sleeping, reading, drinking, talking, watching TV, doing homework, cleaning the house, driving, walking, making lunch, and doing laundry.
Purpose:
The purpose of this audio is to help English language learners practice their pronunciation and comprehension of the present continuous tense. By listening to the speakers and repeating the sentences, learners can improve their fluency and accuracy in using this important grammatical structure.
With the updated prompt, we are able to capture much more detail. Note that this prompt is fairly generic and can be reused for other audio files. Next, let's move these instructions into a system instruction and see how the model responds.
Prompt #3. Using system instruction¶
System Instruction (SI) is an effective way to steer Gemini's behavior and shape how the model responds to your prompt. SI can be used to describe model behavior such as persona, goal, tasks to perform, output format / tone / style, any constraints etc.
SI behaves more "sticky" (or consistent) during multi-turn behavior. For example, if you want to achieve a behavior that the model will consistently follow, then system instruction is the best way to put this instruction.
In this example, we will move the task rules to system instruction.
system_prompt = """You are an audio analyzer. You receive an audio and produce
the detailed description about what happens in the audio.
<INSTRUCTIONS>
- Determine what happens in the audio
- Understand the hidden meaning of the audio
- If there are dialogues, identify the talking personas
- Make sure the description is clear and helpful
</INSTRUCTIONS>
"""
prompt = "Now analyze the audio"
gemini_pro_si = GenerativeModel(
    model_name="gemini-1.5-pro-001", system_instruction=system_prompt
)
contents = [audio_content, prompt]
generate(gemini_pro_si, contents, as_markdown=True)
The audio is an English language learning exercise for beginners.
The audio begins with a narrator introducing the audio program "CD 2" for the book "English in Action 1, Second Edition" by Barbara H. Foley and Elizabeth R. Nebleck. The copyright information is then given, stating that the copyright is held by National Geographic Learning, a part of Cengage Learning, in 2018.
The audio then transitions into a listening and repetition exercise. A narrator, likely male, instructs the listener to "Listen and repeat." What follows are 16 numbered sentences, each spoken by a different voice, alternating between a male and a female speaker. The sentences describe simple actions in the present continuous tense.
Here are the sentences:
- He is eating.
- He is washing the car.
- She is listening to the radio.
- They are studying.
- He is cooking.
- She is sleeping.
- He is reading.
- She is drinking.
- They are talking.
- They are watching TV.
- He is doing his homework.
- She is cleaning the house.
- She is driving.
- They are walking.
- She is making lunch.
- He is doing the laundry.
The purpose of this audio is to help English language learners practice listening comprehension and pronunciation of basic sentences and vocabulary related to everyday activities.
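Because the system instruction persists across turns, a chat session is a natural fit here. Below is a minimal sketch (not run in this notebook) that reuses the SI-configured model from above; the follow-up question is illustrative:

# Minimal sketch: the system instruction keeps steering follow-up turns
# without being restated in each message.
chat = gemini_pro_si.start_chat()
first_response = chat.send_message([audio_content, "Now analyze the audio"])
follow_up = chat.send_message("List only the sentences spoken by the female speaker.")
print(wrap(follow_up.text))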
Prompt #4. Audio Understanding: Get structured outputs¶
Gemini 1.5 Pro and Flash models can generate structured outputs such as JSON, providing a blueprint for the model's output. This feature is also referred to as controlled generation.
In this example, we prompt Gemini to return structured output (JSON) from an audio analysis task. One way to achieve a better understanding of audio (or any multimodal) content is to prompt the model to explain its "reasoning" about the response. This has proven to be a very effective method; however, it can increase latency.
The Vertex AI Gemini API makes it easy to return JSON output by setting the response MIME type to `application/json`. Optionally, you can also configure `response_schema` with the JSON schema for the model to generate output that conforms to the schema.
response_schema = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "harmfulness_reasoning": {
                "type": "STRING",
                "description": "Step-by-step detailed reasoning about how harmful is the audio",
            },
            "harmfulness_score": {
                "type": "INTEGER",
                "description": "Number between 0 and 5 indicating how harmful is the audio",
            },
        },
        "required": ["harmfulness_reasoning", "harmfulness_score"],
    },
}
structured_prompt = """You are an audio analyzer. You receive an audio and
produce harmfulness score - how harmful this audio can be for kids."""
contents = [audio_content, structured_prompt]
generate(
    gemini_pro,
    contents,
    generation_config=GenerationConfig(
        response_mime_type="application/json", response_schema=response_schema
    ),
)
[{"harmfulness_reasoning": "The audio contains simple phrases related to everyday activities, entirely appropriate and harmless for children.", "harmfulness_score": 0}]
The model returned the correct score for the audio. Asking the model to output its "reasoning" along with the score, and placing the "reasoning" field before the "score" field, yields a consistent and correct score. The intuition is that the LLM generates the reasoning first and can then rely on those thoughts to produce the score.
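To see the effect of the reasoning field yourself, you could run a hypothetical ablation (not part of the original notebook) with a score-only schema and compare the results:

# Hypothetical ablation: the same prompt, but a schema without the
# "harmfulness_reasoning" field. Scores may become less consistent.
score_only_schema = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "harmfulness_score": {
                "type": "INTEGER",
                "description": "Number between 0 and 5 indicating how harmful is the audio",
            },
        },
        "required": ["harmfulness_score"],
    },
}

generate(
    gemini_pro,
    contents,
    generation_config=GenerationConfig(
        response_mime_type="application/json", response_schema=score_only_schema
    ),
)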
Conclusion¶
This notebook demonstrated various examples of working with Gemini on audio files. The following are general prompting strategies for multimodal prompts that can help you get better performance from Gemini:
- Craft clear and concise instructions.
- Add your audio (or other media) before the text for single-media prompts.
- Add few-shot examples to the prompt to show the model how you want the task done and the expected output.
- Break down the task step-by-step.
- Specify the output format.
- Ask Gemini to include reasoning in its response along with decisions or scores.
- Use context caching for repeated queries over the same large media files (see the sketch after this list).
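As a minimal sketch of context caching (not run in this notebook, and assuming the `system_prompt` and `audio_content` defined earlier), you can cache the large audio part once and reuse it across queries via the Vertex AI SDK's preview caching API. Note that cached content must meet a minimum token count, so this pays off mainly for long audio files:

# Sketch: cache the audio plus system instruction, then query the cache repeatedly.
import datetime

from vertexai.preview import caching
from vertexai.preview.generative_models import GenerativeModel as PreviewGenerativeModel

cached_content = caching.CachedContent.create(
    model_name="gemini-1.5-pro-001",
    system_instruction=system_prompt,
    contents=[audio_content],
    ttl=datetime.timedelta(minutes=60),
)
cached_model = PreviewGenerativeModel.from_cached_content(cached_content=cached_content)
response = cached_model.generate_content("Now analyze the audio")
print(wrap(response.text))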
Specifically, when working with audio, the following may help:
- For transcription tasks, ask Gemini not to summarize and to transcribe verbatim.
- Add examples to the prompt for effective speaker diarization (see the sketch below).
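Illustrating both tips, here is a minimal sketch (not run above) of a verbatim transcription prompt with a diarization format example, reusing the `generate` helper and `audio_content` from earlier; the speaker labels and example lines are illustrative:

# Minimal sketch: verbatim transcription with speaker labels.
transcription_prompt = """Transcribe the audio verbatim. Do not summarize or paraphrase.
<INSTRUCTIONS>
- Label each speaker consistently as SPEAKER_1, SPEAKER_2, ...
- Output one line per utterance, for example:
  SPEAKER_1: Listen and repeat.
  SPEAKER_2: He is eating.
</INSTRUCTIONS>"""

contents = [audio_content, transcription_prompt]
generate(gemini_pro, contents)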