# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Multimodal Prompting with Gemini: Working with Videos¶
Author(s) | Michael Chertushkin |
Reviewer(s) | Rajesh Thallam, Skander Hannachi |
Last updated | 2024-09-16 |
Overview¶
Gemini models support adding image, audio, video, and PDF files to text or chat prompts and return a text or code response. Gemini 2.0 Flash supports up to 1 million input tokens, with up to 1 hour of video per prompt. Gemini can also analyze the audio embedded in a video. You can add videos to Gemini requests to perform video analysis tasks such as video summarization, video chapterization (or localization), key event detection, scene analysis, captioning, transcription, and more.
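To get a feel for how video length relates to the input token limit, here is a rough back-of-the-envelope sketch. The per-frame and per-second rates are assumptions that do not come from this notebook (roughly 258 tokens per frame sampled at 1 frame per second, plus roughly 32 tokens per second of audio), so check the current Gemini documentation before relying on them.

```python
# Assumed rates, NOT taken from this notebook: verify against current docs.
TOKENS_PER_FRAME = 258          # assumed tokens per sampled video frame
FRAMES_PER_SECOND = 1           # assumed sampling rate
AUDIO_TOKENS_PER_SECOND = 32    # assumed tokens per second of audio


def estimate_video_tokens(duration_s: int, with_audio: bool = True) -> int:
    """Roughly estimate the input tokens consumed by a video of duration_s seconds."""
    per_second = TOKENS_PER_FRAME * FRAMES_PER_SECOND
    if with_audio:
        per_second += AUDIO_TOKENS_PER_SECOND
    return duration_s * per_second


# One hour of video with audio under the assumed rates.
print(estimate_video_tokens(60 * 60))  # prints 1044000
```

Under these assumed rates, one hour of video with audio lands close to the 1 million token limit, which leaves little room for a long text prompt.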
In this notebook we cover prompting recipes and strategies for working with Gemini on videos and show some examples along the way. This notebook is organized as follows:
- Video Understanding
- Key event detection
- Using System instruction
- Analyzing videos with step-by-step reasoning
- Generating structured output
- Using context caching for repeated queries
Getting Started¶
The following steps are necessary to run this notebook, no matter what notebook environment you're using.
If you're entirely new to Google Cloud, get started here.
Google Cloud Project Setup¶
- Select or create a Google Cloud project. When you first create an account, you get a $300 free credit towards your compute/storage costs.
- Make sure that billing is enabled for your project.
- Enable the Service Usage API
- Enable the Vertex AI API.
- Enable the Cloud Storage API.
Google Cloud Permissions¶
To run the complete Notebook, including the optional section, you will need to have the Owner role for your project.
If you want to skip the optional section, you need at least the following roles:
- roles/serviceusage.serviceUsageAdmin to enable APIs
- roles/iam.serviceAccountAdmin to modify service agent permissions
- roles/aiplatform.user to use AI Platform components
- roles/storage.objectAdmin to modify and delete GCS buckets
Install Vertex AI SDK for Python and other dependencies (If Needed)¶
The list packages contains tuples of package import names and install names. If the import name is not found, the install name is used to install the package quietly for the current user.
! pip install google-cloud-aiplatform --upgrade --quiet --user
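The install-if-missing logic described above could look roughly like the following sketch. The `packages` list contents and the helper names (`is_importable`, `missing_install_names`, `install_missing`) are illustrative assumptions, not code from the original notebook.

```python
import importlib.util
import subprocess
import sys

# (import name, pip install name) pairs; this list is illustrative.
packages = [
    ("google.cloud.aiplatform", "google-cloud-aiplatform"),
]


def is_importable(import_name: str) -> bool:
    """Return True if the module can be located in the current environment."""
    try:
        return importlib.util.find_spec(import_name) is not None
    except (ImportError, ValueError):
        # Raised, for example, when a parent package such as "google.cloud" is missing.
        return False


def missing_install_names(pairs: list[tuple[str, str]]) -> list[str]:
    """Return the install names whose import names cannot be found."""
    return [install for name, install in pairs if not is_importable(name)]


def install_missing(pairs: list[tuple[str, str]]) -> None:
    """Quietly install every missing package for the current user."""
    for install_name in missing_install_names(pairs):
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install",
             "--upgrade", "--quiet", "--user", install_name]
        )
```

Calling `install_missing(packages)` would then install only what is absent, so rerunning the cell does little work when everything is already in place.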
Restart Runtime¶
To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.
# Restart kernel after installs so that your environment can access the new packages
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
Authenticate¶
If you're using Colab, run the code in the next cell. Follow the popups and authenticate with an account that has access to your Google Cloud project.
If you're running this notebook somewhere besides Colab, make sure your environment has the right Google Cloud access. If that's a new concept to you, consider looking into Application Default Credentials for your local environment and initializing the Google Cloud CLI. In many cases, running gcloud auth application-default login in a shell on the machine running the notebook kernel is sufficient.
More authentication options are discussed here.
# Colab authentication.
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()
    print("Authenticated")
Set Google Cloud project information and Initialize Vertex AI SDK¶
To get started using Vertex AI, you must have an existing Google Cloud project and enable the Vertex AI API.
Learn more about setting up a project and a development environment.
Make sure to change PROJECT_ID in the next cell. You can leave the value for REGION unchanged unless you have a specific reason to change it.
import vertexai
PROJECT_ID = "[your-project-id]" # @param {type:"string"}
REGION = "us-central1" # @param {type:"string"}
vertexai.init(project=PROJECT_ID, location=REGION)
print("Vertex AI SDK initialized.")
print(f"Vertex AI SDK version = {vertexai.__version__}")
Import Libraries¶
from vertexai.generative_models import (GenerationConfig, GenerativeModel,
HarmBlockThreshold, HarmCategory, Part)
Define Utility functions¶
import http.client
import textwrap
import typing
import urllib.request
from google.cloud import storage
from IPython import display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
def wrap(string, max_width=80):
    return textwrap.fill(string, max_width)

def get_bytes_from_url(url: str) -> bytes:
    with urllib.request.urlopen(url) as response:
        response = typing.cast(http.client.HTTPResponse, response)
        data = response.read()
    return data

def get_bytes_from_gcs(gcs_path: str):
    # gs://<bucket>/<object path>: element 2 is the bucket, the rest is the object name
    bucket_name = gcs_path.split("/")[2]
    object_prefix = "/".join(gcs_path.split("/")[3:])
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.get_blob(object_prefix)
    return blob.download_as_bytes()

def display_image(image_url: str, width: int = 300, height: int = 200):
    if image_url.startswith("gs://"):
        image_bytes = get_bytes_from_gcs(image_url)
    else:
        image_bytes = get_bytes_from_url(image_url)
    display.display(display.Image(data=image_bytes, width=width, height=height))

def display_video(video_url: str, width: int = 300, height: int = 200):
    if video_url.startswith("gs://"):
        video_bytes = get_bytes_from_gcs(video_url)
    else:
        video_bytes = get_bytes_from_url(video_url)
    display.display(
        display.Video(
            data=video_bytes,
            width=width,
            height=height,
            embed=True,
            mimetype="video/mp4",
        )
    )

def display_audio(audio_url: str, width: int = 300, height: int = 200):
    if audio_url.startswith("gs://"):
        audio_bytes = get_bytes_from_gcs(audio_url)
    else:
        audio_bytes = get_bytes_from_url(audio_url)
    display.display(display.Audio(data=audio_bytes, embed=True))

def print_prompt(contents: list[str | Part]):
    # Render media parts inline and print everything else as text
    for content in contents:
        if isinstance(content, Part):
            if content.mime_type.startswith("image"):
                display_image(image_url=content.file_data.file_uri)
            elif content.mime_type.startswith("video"):
                display_video(video_url=content.file_data.file_uri)
            elif content.mime_type.startswith("audio"):
                display_audio(audio_url=content.file_data.file_uri)
            else:
                print(content)
        else:
            print(content)
Initialize Gemini¶
# Gemini Config
GENERATION_CONFIG = {
    "max_output_tokens": 8192,
    "temperature": 0.1,
    "top_p": 0.95,
}
SAFETY_CONFIG = {
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
}
gemini = GenerativeModel(model_name="gemini-2.0-flash-001")
videos_path_prefix = (
    "gs://public-aaie-genai-samples/gemini/prompting_recipes/multimodal/videos"
)
def generate(
    model,
    contents,
    safety_settings=SAFETY_CONFIG,
    generation_config=GENERATION_CONFIG,
    as_markdown=False,
):
    responses = model.generate_content(
        contents=contents,
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )
    if isinstance(responses, list):
        for response in responses:
            if as_markdown:
                display.display(display.Markdown(response.text))
            else:
                print(wrap(response.text), end="")
    else:
        if as_markdown:
            display.display(display.Markdown(responses.text))
        else:
            print(wrap(responses.text), end="")
display_video(
video_url="gs://public-aaie-genai-samples/gemini/prompting_recipes/multimodal/videos/video_1.mp4"
)
Prompt #1. Video Understanding¶
This task requires input in two modalities: text and video. An example API call is shown below; however, this prompt is not optimal and we can improve it.
video_path = f"{videos_path_prefix}/video_1.mp4"
video_content = Part.from_uri(uri=video_path, mime_type="video/mp4")
prompt = """Provide a description of the video. The description should also
contain anything important which people say in the video."""
contents = [video_content, prompt]
# print_prompt(contents)
generate(gemini, contents)
Here is a description of the video: The video shows a person tossing a pink collapsible cup in the air and catching it. The background is a white curtain. The person's arm and hand are visible. The cup is the main focus of the video. The video is shot in a bright, minimalist style. There is no audio in the video.
As we can see, the model correctly identified what happens in the video, but it did not provide much detail. Let's modify the prompt.
Video Understanding. Advanced Prompt¶
prompt = """You are an expert video analyzer. Your task is to analyze the video
and produce the detailed description about what happens on the video.
Key Points:
- Use timestamps (in MM:SS format) to output key events from the video.
- Add information about what happens at each timestamp.
- Add information about entities in the video and capture the relationship between them.
- Highlight the central theme or focus of the video.
Remember:
- Try to recover hidden meaning from the scene. For example, some hidden humor
or some hidden context.
"""
contents = [video_content, prompt]
generate(gemini, contents, as_markdown=True)
Here's a detailed analysis of the video:
- 00:00 A hand throws a pink collapsible cup into the air against a white curtain backdrop.
- 00:01 The hand catches the cup.
- 00:02 The hand throws the cup into the air again, this time in its collapsed form.
- 00:03 The hand catches the cup in its expanded form.
- 00:04 The hand shakes the cup.
- 00:05 The hand holds the cup still.
- 00:06 The hand moves the cup around.
- 00:07 The hand throws the cup into the air again.
- 00:08 The hand holds the cup still.
- 00:09 The hand shakes the cup.
- 00:10 The hand holds the cup still.
The central theme of the video is showcasing a pink collapsible cup. The hand interacts with the cup by throwing it in the air, catching it, shaking it, and holding it still. The white curtain backdrop provides a clean and simple background, drawing attention to the cup and the hand's interaction with it.
The response with the updated prompt captures many more details. Although this prompt is rather generic and can be used for other videos, let's add specifics to it, for example to find out when a certain event happened.
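Because the prompt fixes the timestamp format to MM:SS, the response can be post-processed with ordinary string handling. The sketch below is one possible way to turn such a response into (seconds, description) pairs; the helper name `parse_events` and the regular expression are assumptions for illustration, and the sample text is abridged from the output above.

```python
import re

# Matches lines such as "- 00:02 The hand throws the cup again." or "[00:07] ..."
EVENT_LINE = re.compile(r"^\s*[-*]?\s*\[?(\d{1,2}):(\d{2})\]?\s+(.*\S)\s*$")


def parse_events(text: str) -> list[tuple[int, str]]:
    """Return (seconds, description) pairs for each MM:SS line in text."""
    events = []
    for line in text.splitlines():
        match = EVENT_LINE.match(line)
        if match:
            minutes, seconds, description = match.groups()
            events.append((int(minutes) * 60 + int(seconds), description))
    return events


sample = """Here's a detailed analysis of the video:
- 00:00 A hand throws a pink collapsible cup into the air against a white curtain backdrop.
- 00:01 The hand catches the cup.
- 00:07 The hand throws the cup into the air again.
"""
for seconds, description in parse_events(sample):
    print(seconds, description)
```

Lines without a leading timestamp, such as the introductory sentence, are skipped.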
Prompt #2. Video Understanding: Key events detection¶
prompt = """You are an expert video analyzer. Your task is to analyze the video
and produce the detailed description about what happens on the video.
Key Points:
- Use timestamps (in MM:SS format) to output key events from the video.
- Add information about what happens at each timestamp.
- Add information about entities in the video and capture the relationship between them.
- Highlight the central theme or focus of the video.
Remember:
- Try to recover hidden meaning from the scene. For example, some hidden humor
or some hidden context.
At which moment the cup was thrown for the second time?
"""
contents = [video_content, prompt]
generate(gemini, contents, as_markdown=True)
Here's a detailed analysis of the video:
- 00:00 A hand throws a pink collapsible cup into the air against a white curtain backdrop.
- 00:01 The hand catches the cup.
- 00:02 The hand throws the cup again.
- 00:03 The hand catches the cup again.
- 00:04 The hand shakes the cup.
- 00:07 The hand throws the cup again.
- 00:08 The hand catches the cup again.
- 00:09 The hand shakes the cup.
The central theme of the video is a person playing with a pink collapsible cup, repeatedly throwing it into the air and catching it.
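Note that the response lists the events but does not answer the question about the second throw explicitly. The answer can still be read off the event list, as this small sketch shows. The `events` list is transcribed by hand from the response above, and `nth_event` is a hypothetical helper name.

```python
def nth_event(events, keyword, n):
    """Return the MM:SS timestamp of the n-th event mentioning keyword, or None."""
    matches = [ts for ts, description in events if keyword in description.lower()]
    return matches[n - 1] if 0 < n <= len(matches) else None


# Events transcribed from the response above.
events = [
    ("00:00", "A hand throws a pink collapsible cup into the air against a white curtain backdrop."),
    ("00:01", "The hand catches the cup."),
    ("00:02", "The hand throws the cup again."),
    ("00:03", "The hand catches the cup again."),
    ("00:04", "The hand shakes the cup."),
    ("00:07", "The hand throws the cup again."),
    ("00:08", "The hand catches the cup again."),
    ("00:09", "The hand shakes the cup."),
]
print(nth_event(events, "throws", 2))  # prints 00:02
```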
Prompt #3. Video Understanding: Using System instruction¶
System Instruction (SI) is an effective way to steer Gemini's behavior and shape how the model responds to your prompt. SI can be used to describe model behavior such as persona, goal, tasks to perform, output format / tone / style, and any constraints.
SI is more "sticky" (consistent) across multi-turn conversations. If you want behavior that the model follows consistently, the system instruction is the best place to put that instruction.
In this example, we will move the task rules to the system instruction and keep the question about a specific event in the user prompt.
system_prompt = """You are an expert video analyzer. Your task is to analyze the video
and produce the detailed description about what happens on the video.
Key Points:
- Use timestamps (in MM:SS format) to output key events from the video.
- Add information about what happens at each timestamp.
- Add information about entities in the video and capture the relationship between them.
- Highlight the central theme or focus of the video.
Remember:
- Try to recover hidden meaning from the scene. For example, some hidden humor
or some hidden context.
"""
prompt = "At which moment the cup was thrown for the second time?"
gemini_si = GenerativeModel(
model_name="gemini-2.0-flash-001", system_instruction=system_prompt
)
contents = [video_content, prompt]
generate(gemini_si, contents, as_markdown=True)
[00:07] The cup was thrown for the second time.
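Since the system instruction applies to every turn, it pairs naturally with a chat session. The following is a minimal sketch, assuming the model object exposes `start_chat()` and the chat object exposes `send_message()` returning a response with a `.text` attribute, as the Vertex AI SDK's `GenerativeModel` does. The helper name `ask_in_sequence` and the follow-up question are hypothetical.

```python
def ask_in_sequence(model, first_turn, follow_ups):
    """Send several turns through one chat session and collect the reply texts.

    `model` is expected to be a GenerativeModel created with `system_instruction`,
    so the same rules apply to every turn without repeating them in each prompt.
    """
    chat = model.start_chat()
    answers = [chat.send_message(first_turn).text]
    for question in follow_ups:
        answers.append(chat.send_message(question).text)
    return answers


# Usage in this notebook (requires the Vertex AI setup from the earlier cells):
# answers = ask_in_sequence(
#     gemini_si,
#     [video_content, prompt],
#     ["At which moment is the cup caught for the first time?"],
# )
```

The video is sent only in the first turn; later turns rely on the chat history.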
Prompt #4. Video Understanding: Step-by-step reasoning¶
The previous answer is wrong: the model reported 00:07, which is when the cup is thrown for the third time, because it did not list every throw before answering. Let's fix this with "step-by-step reasoning".
step_by_step_prompt = """Describe the video. Analyze the video step-by-step.
Output all times when the cup is thrown with timestamps.
After that output the timestamp, when the cup is thrown for the second time.
"""
contents = [video_content, step_by_step_prompt]
generate(gemini_si, contents, as_markdown=True)
Here's a breakdown of the video:
The central theme of the video revolves around a person playfully tossing and catching a pink, collapsible cup. The background is a simple white curtain, keeping the focus entirely on the cup and the hand interacting with it.
Here's a step-by-step analysis with timestamps:
- 00:00 The person throws the pink cup into the air.
- 00:01 The person catches the pink cup.
- 00:02 The person throws the pink cup into the air again.
- 00:03 The person catches the pink cup.
- 00:07 The person throws the pink cup into the air for the third time.
The cup is thrown for the second time at 00:02.
Prompt #5. Video Understanding: Get structured outputs¶
Gemini models can generate structured outputs such as JSON, providing a blueprint for the model's output. This feature is also referred to as controlled generation.
In this example, we demonstrate how to make Gemini return structured output (JSON) from a video analysis. One way to achieve a better understanding of video (or any multimodal) content is to prompt the model to explain its "reasoning" about the response. This has proven to be a very effective method; however, it can increase latency.
The Vertex AI Gemini API makes it easy to return JSON output by configuring the response MIME type as application/json. Optionally, you can also configure response_schema with a JSON schema so that the model generates output that follows the schema.
response_schema = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "harmfulness_reasoning": {
                "type": "STRING",
                "description": "Step-by-step detailed reasoning about how harmful is the video",
            },
            "harmfulness_score": {
                "type": "INTEGER",
                "description": "Number between 0 and 5 indicating how harmful is the video",
            },
        },
        "required": ["harmfulness_reasoning", "harmfulness_score"],
    },
}
structured_prompt = """You are an expert video analyzer. Your task is to analyze the video
and produce a harmfulness score - how harmful this video can be for kids."""
contents = [video_content, structured_prompt]
generate(
gemini,
contents,
generation_config=GenerationConfig(
response_mime_type="application/json", response_schema=response_schema
),
)
[ { "harmfulness_reasoning": "The video features a hand tossing and catching a pink cup. There is no indication of any harmful or dangerous content, nor does it contain any themes or visuals that would be considered inappropriate for children. The scene is simple and does not present any risk of promoting negative behaviors.", "harmfulness_score": 0 } ]
The model returned the correct score for the video because we asked it to output "reasoning" along with the score. Placing the "reasoning" field before the "score" field helps produce a consistent and correct score. The intuition is that the LLM generates the reasoning first and can then rely on it to produce the score.
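Once the response is JSON, it can be loaded and checked in code. The 0 to 5 range appears only in the schema's description text, so it is worth verifying on the client side. The sketch below is one way to do that; `parse_harmfulness` is a hypothetical helper, and the sample string is shortened from the response above.

```python
import json


def parse_harmfulness(response_text: str) -> list[dict]:
    """Parse the JSON response and check the required fields and score range."""
    items = json.loads(response_text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array")
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("expected each array element to be an object")
        for key in ("harmfulness_reasoning", "harmfulness_score"):
            if key not in item:
                raise ValueError(f"missing required field: {key}")
        score = item["harmfulness_score"]
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 5:
            raise ValueError(f"score must be an integer from 0 to 5, got {score!r}")
    return items


sample = (
    '[{"harmfulness_reasoning": "The video features a hand tossing and '
    'catching a pink cup.", "harmfulness_score": 0}]'
)
print(parse_harmfulness(sample)[0]["harmfulness_score"])  # prints 0
```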
Prompt #6. Video Understanding: Context Caching¶
Context caching is a method to reduce the cost of requests that contain repeated content with a high input token count. It can potentially reduce latency, at the cost of storing the objects in the cache. You can specify an expiration time (TTL) that controls how long the object stays in the cache.
Context caching helps a lot when we want:
- to repeatedly ask questions about a long video
- to reduce costs and latency
long_video_path = f"{videos_path_prefix}/long_video_1.mp4"
long_video_content = Part.from_uri(uri=long_video_path, mime_type="video/mp4")
prompt = """Describe what happens in the beginning, in the middle and in the
end of the video. Also, list the name of the main character and any problems
they face."""
contents = [long_video_content, prompt]
# print_prompt(contents)
# Time the call without context caching
from timeit import default_timer as timer
start = timer()
generate(gemini, contents)
end = timer()
print(f"\nTime elapsed: {end - start} seconds")
Here's a breakdown of the video: **Beginning (0:00 - 1:25):** * The video opens with the title card for "Sherlock Jr." starring Buster Keaton, presented by Joseph M. Schenck. * Credits for the story, photography, art direction, and electrician are shown. * Copyright information for 1924 is displayed. * A proverb appears: "Don't try to do two things at once and expect to do justice to both." * The opening narration sets the scene: a boy working as a moving picture operator in a small-town theater is also studying to be a detective. **Middle (1:26 - 38:59):** * The video shows the main character, a young man with a mustache, reading a book titled "How-To-Be-A-Detective" in an empty theater. * His boss tells him to clean the theater instead of reading detective books. * The young man is shown sweeping the theater and then walking to a confectionery store. * He sees a girl in the store and wants to buy her chocolates, but he doesn't have enough money. * The girl's father hires a man to help him. * The girl is seen with a dog, and a man steals her ring. * The girl's father is seen with the man he hired, and he discovers that his watch has been stolen. * The young man is called to investigate the theft. * The young man searches everyone, but the watch is not found. * The young man is told to leave the house and never come back. * The young man returns to the theater and starts the movie. * The movie is "Hearts and Pearls" and the young man falls asleep. * The young man dreams that he is in the movie. * The young man is seen in the movie, and he is trying to steal the pearls. * The young man is seen in the movie, and he is trying to escape. * The young man is seen in the movie, and he is being chased by the police. * The young man is seen in the movie, and he is trying to get away on a motorcycle. * The young man is seen in the movie, and he is trying to get away in a car. * The young man is seen in the movie, and he is trying to get away in a boat. 
* The young man is seen in the movie, and he is swimming away. **End (39:00 - 44:06):** * The young man wakes up and is back in the projection booth. * The girl comes to the projection booth and tells him that her father made a mistake. * The girl shows the young man the pawn ticket for the watch. * The young man is seen in the movie, and he is now a detective. * The detective is seen with his assistant, Gillette. * The detective is seen in the movie, and he is trying to solve the case. * The detective is seen in the movie, and he is trying to catch the thief. * The detective is seen in the movie, and he is trying to save the girl. * The detective is seen in the movie, and he is reunited with the girl. * The movie ends. **Main Character and Problems:** * **Main Character:** The young man working as a movie projectionist (played by Buster Keaton). He is also studying to be a detective. * **Problems:** * He is trying to balance his job with his dream of becoming a detective. * He is not taken seriously as a detective. * He is accused of stealing a watch. * He is kicked out of the girl's house. * He is trying to save the girl from the thief. Time elapsed: 40.79677771499996 seconds
import datetime
from vertexai.preview import caching
from vertexai.preview.generative_models import GenerativeModel
cached_content = caching.CachedContent.create(
model_name="gemini-2.0-flash-001",
contents=[long_video_content],
ttl=datetime.timedelta(hours=1),
display_name="long video cache",
)
model_cached = GenerativeModel.from_cached_content(cached_content=cached_content)
# Call with context caching
start = timer()
responses = model_cached.generate_content(
prompt,
generation_config=GENERATION_CONFIG,
safety_settings=SAFETY_CONFIG,
stream=False,
)
end = timer()
print(wrap(responses.text), end="")
print(f"\nTime elapsed: {end - start} seconds")
Here's a breakdown of the video: **Beginning (0:00-1:25):** * The video starts with the title card for the film "Sherlock Jr." starring Buster Keaton, presented by Joseph M. Schenck. * Credits are shown, including director, writers, photography, art director, and electrician. * Copyright information is displayed, indicating the film was copyrighted in 1924. * A proverb is presented: "Don't try to do two things at once and expect to do justice to both." * The introduction explains that the story is about a boy who tried to do two things at once: work as a moving picture operator and study to be a detective. **Middle (1:26-38:59):** * The scene shifts to a movie theater where a young man (Buster Keaton) is reading a book titled "How-To-Be-A- Detective." * His boss tells him to clean the theater instead of reading. * The young man is shown working at the theater, but he is distracted by his detective studies. * He tries to buy a box of chocolates for a girl he likes, but he doesn't have enough money. * The girl is seen with another man, who buys her a more expensive box of chocolates. * The young man is called to a house where a crime has been committed. * He tries to investigate, but he is clumsy and makes mistakes. * He is accused of stealing a watch and is kicked out of the house. * He returns to the theater and falls asleep in the projection booth. * While asleep, he dreams that he enters the movie screen and becomes the detective in the film. * He interacts with the characters and tries to solve the crime, but he is clumsy and makes mistakes. * He is chased by the villains and ends up in a series of dangerous situations. * He is chased by the police and ends up falling into a river. **End (38:59-44:06):** * The young man wakes up from his dream and realizes that he is late for work. * He rushes to the projection booth and starts the movie. * He sees the girl he likes in the audience and realizes that she is in danger. 
* He enters the movie screen and saves her from the villains. * The film ends with the young man and the girl together. **Main Character and Problems:** * **Main Character:** The young man, played by Buster Keaton, who works as a movie projectionist and aspires to be a detective. * **Problems:** * Balancing his job with his detective studies. * Not having enough money to impress the girl he likes. * Being accused of stealing a watch. * Being clumsy and making mistakes as a detective. * Being unable to save the girl he likes in the real world. Time elapsed: 29.962379157999976 seconds
As we can see, the call with context caching was faster than the call without it (about 30 seconds versus about 41 seconds). The cost of the request is also lower, because we did not need to send the video again with the prompt.
Context caching is therefore ideal for repeated questions against the same long file, such as a video, document, or audio recording.
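To compare the two runs more systematically, the timing code can be wrapped in a small helper. This is a sketch: `timed` and `relative_saving` are hypothetical helper names, and the numbers are the two timings printed above. A single run is noisy, so treat the result as indicative only.

```python
from contextlib import contextmanager
from timeit import default_timer as timer


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = timer()
    try:
        yield
    finally:
        results[label] = timer() - start


def relative_saving(baseline, improved):
    """Fraction of the baseline time saved by the improved call."""
    return (baseline - improved) / baseline


# Timings reported by the two runs above.
saving = relative_saving(40.79677771499996, 29.962379157999976)
print(f"{saving:.1%}")  # prints 26.6%

# Usage in this notebook (requires the cached model from the earlier cells):
# results = {}
# with timed("cached", results):
#     model_cached.generate_content(prompt)
```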
Conclusion¶
This notebook demonstrated various examples of working with Gemini on videos. The following general prompting strategies can help achieve better performance from Gemini on multimodal prompts:
- Craft clear and concise instructions.
- Add your video or any media first for single-media prompts.
- Add few-shot examples to the prompt to show the model how you want the task done and the expected output.
- Break down the task step-by-step.
- Specify the output format.
- Ask Gemini to include reasoning in its response along with decisions or scores.
- Use context caching for repeated queries.
Specifically, when working with videos, the following may help:
- Specify timestamp format when localizing videos.
- Ask Gemini to focus on visual content for well-known video clips.
- Process long videos in segments for dense outputs.
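As an illustration of the last tip, the sketch below splits a video's duration into consecutive MM:SS windows and builds one prompt per window. The helper name `split_into_segments`, the 15-minute window size, and the prompt wording are assumptions; the 44:06 duration is taken from the model's description of the long video above.

```python
def split_into_segments(duration_s: int, segment_s: int) -> list[tuple[str, str]]:
    """Split [0, duration_s] into consecutive (start, end) MM:SS windows."""
    if duration_s <= 0 or segment_s <= 0:
        raise ValueError("duration_s and segment_s must be positive")

    def fmt(seconds: int) -> str:
        return f"{seconds // 60:02d}:{seconds % 60:02d}"

    return [
        (fmt(start), fmt(min(start + segment_s, duration_s)))
        for start in range(0, duration_s, segment_s)
    ]


# The long video above runs to about 44:06 (2646 seconds).
segments = split_into_segments(2646, 900)
print(segments)
# prints [('00:00', '15:00'), ('15:00', '30:00'), ('30:00', '44:06')]

# One prompt per window, to be sent together with the (cached) video.
prompts = [
    f"Describe in detail what happens between {start} and {end}."
    for start, end in segments
]
```

Combined with context caching, each per-window prompt can reuse the same cached video instead of uploading it again.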