# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Multimodal Prompting with Gemini 1.5: Working with Videos

| Author(s) | Michael Chertushkin |
| --- | --- |
| Reviewer(s) | Rajesh Thallam, Skander Hannachi |
| Last updated | 2024-09-16 |
Overview
Gemini 1.5 Pro and Flash models support adding image, audio, video, and PDF files to text or chat prompts for a text or code response. Gemini 1.5 Pro supports up to 2 million input tokens, with up to two hours of video per prompt, and Gemini can also analyze the audio embedded within a video. You can add videos to Gemini requests to perform video analysis tasks such as video summarization, video chapterization (or localization), key event detection, scene analysis, captioning, transcription, and more.
In this notebook, we cover prompting recipes and strategies for working with Gemini on videos and show examples along the way. This notebook is organized as follows:
- Video Understanding
- Key event detection
- Using System instruction
- Analyzing videos with step-by-step reasoning
- Generating structured output
- Using context caching for repeated queries
Getting Started
The following steps are necessary to run this notebook, no matter what notebook environment you're using.
If you're entirely new to Google Cloud, get started here.
Google Cloud Project Setup
- Select or create a Google Cloud project. When you first create an account, you get a $300 free credit towards your compute/storage costs.
- Make sure that billing is enabled for your project.
- Enable the Service Usage API.
- Enable the Vertex AI API.
- Enable the Cloud Storage API. (A shell sketch for enabling all three follows this list.)
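If you prefer, you can enable all three APIs from a shell instead of the console. A minimal sketch using the gcloud CLI (this is not part of the original setup steps; it assumes the CLI is installed and authenticated with permission to enable services on your project):
# Sketch: enable the three required APIs in one command.
! gcloud services enable serviceusage.googleapis.com aiplatform.googleapis.com storage.googleapis.com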
Google Cloud Permissions
To run the complete notebook, including the optional section, you will need to have the Owner role for your project.
If you want to skip the optional section, you need at least the following roles (a sample grant command follows this list):
- roles/serviceusage.serviceUsageAdmin to enable APIs
- roles/iam.serviceAccountAdmin to modify service agent permissions
- roles/aiplatform.user to use AI Platform components
- roles/storage.objectAdmin to modify and delete GCS buckets
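Granting these roles can be done in the console or from a shell. A hypothetical sketch with the gcloud CLI (the project ID and member email below are placeholders, not values from this notebook):
# Sketch: grant one of the required roles to your user account.
! gcloud projects add-iam-policy-binding your-project-id --member="user:you@example.com" --role="roles/aiplatform.user"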
Install Vertex AI SDK for Python and other dependencies (If Needed)
The list `packages` contains tuples of package import names and install names. If the import name is not found, the install name is used to install the package quietly for the current user.
! pip install google-cloud-aiplatform --upgrade --quiet --user
Restart Runtime
To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.
# Restart kernel after installs so that your environment can access the new packages
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
Authenticate
If you're using Colab, run the code in the next cell. Follow the popups and authenticate with an account that has access to your Google Cloud project.
If you're running this notebook somewhere besides Colab, make sure your environment has the right Google Cloud access. If that's a new concept to you, consider looking into Application Default Credentials for your local environment and initializing the Google Cloud CLI. In many cases, running `gcloud auth application-default login` in a shell on the machine running the notebook kernel is sufficient.
More authentication options are discussed here.
# Colab authentication.
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()
    print("Authenticated")
Set Google Cloud project information and Initialize Vertex AI SDK
To get started using Vertex AI, you must have an existing Google Cloud project and enable the Vertex AI API.
Learn more about setting up a project and a development environment.
Make sure to change `PROJECT_ID` in the next cell. You can leave the value for `REGION` unless you have a specific reason to change it.
import vertexai
PROJECT_ID = "[your-project-id]" # @param {type:"string"}
REGION = "us-central1" # @param {type:"string"}
vertexai.init(project=PROJECT_ID, location=REGION)
print("Vertex AI SDK initialized.")
print(f"Vertex AI SDK version = {vertexai.__version__}")
Vertex AI SDK initialized. Vertex AI SDK version = 1.65.0
Import Libraries
from vertexai.generative_models import (
    GenerationConfig,
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
    Part,
)
Define Utility functions
import http.client
import textwrap
import typing
import urllib.request
from google.cloud import storage
from IPython import display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
def wrap(string, max_width=80):
    return textwrap.fill(string, max_width)


def get_bytes_from_url(url: str) -> bytes:
    # Download raw bytes over HTTP(S).
    with urllib.request.urlopen(url) as response:
        response = typing.cast(http.client.HTTPResponse, response)
        return response.read()


def get_bytes_from_gcs(gcs_path: str) -> bytes:
    # Download raw bytes from a gs:// path.
    bucket_name = gcs_path.split("/")[2]
    object_prefix = "/".join(gcs_path.split("/")[3:])
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.get_blob(object_prefix)
    return blob.download_as_bytes()


def display_image(image_url: str, width: int = 300, height: int = 200):
    if image_url.startswith("gs://"):
        image_bytes = get_bytes_from_gcs(image_url)
    else:
        image_bytes = get_bytes_from_url(image_url)
    display.display(display.Image(data=image_bytes, width=width, height=height))


def display_video(video_url: str, width: int = 300, height: int = 200):
    if video_url.startswith("gs://"):
        video_bytes = get_bytes_from_gcs(video_url)
    else:
        video_bytes = get_bytes_from_url(video_url)
    display.display(
        display.Video(
            data=video_bytes,
            width=width,
            height=height,
            embed=True,
            mimetype="video/mp4",
        )
    )


def display_audio(audio_url: str, width: int = 300, height: int = 200):
    if audio_url.startswith("gs://"):
        audio_bytes = get_bytes_from_gcs(audio_url)
    else:
        audio_bytes = get_bytes_from_url(audio_url)
    display.display(display.Audio(data=audio_bytes, embed=True))


def print_prompt(contents: list[str | Part]):
    # Render a multimodal prompt: media parts are displayed inline,
    # everything else is printed as text.
    for content in contents:
        if isinstance(content, Part):
            if content.mime_type.startswith("image"):
                display_image(image_url=content.file_data.file_uri)
            elif content.mime_type.startswith("video"):
                display_video(video_url=content.file_data.file_uri)
            elif content.mime_type.startswith("audio"):
                display_audio(audio_url=content.file_data.file_uri)
            else:
                print(content)
        else:
            print(content)
Initialize Gemini
# Gemini Config
GENERATION_CONFIG = {
    "max_output_tokens": 8192,
    "temperature": 0.1,
    "top_p": 0.95,
}
SAFETY_CONFIG = {
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
}
gemini_pro = GenerativeModel(model_name="gemini-1.5-pro-001")
gemini_flash = GenerativeModel(model_name="gemini-1.5-flash-001")
videos_path_prefix = (
    "gs://public-aaie-genai-samples/gemini/prompting_recipes/multimodal/videos"
)
def generate(
    model,
    contents,
    safety_settings=SAFETY_CONFIG,
    generation_config=GENERATION_CONFIG,
    as_markdown=False,
):
    # Call the model once (no streaming) and display the response as wrapped
    # plain text or rendered Markdown.
    responses = model.generate_content(
        contents=contents,
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )
    if isinstance(responses, list):
        for response in responses:
            if as_markdown:
                display.display(display.Markdown(response.text))
            else:
                print(wrap(response.text), end="")
    else:
        if as_markdown:
            display.display(display.Markdown(responses.text))
        else:
            print(wrap(responses.text), end="")
display_video(
    video_url="gs://public-aaie-genai-samples/gemini/prompting_recipes/multimodal/videos/video_1.mp4"
)
Prompt #1. Video Understanding
This task requires input in two different modalities: text and video. An example API call is shown below; however, this is a non-optimal prompt, and we can make it better.
video_path = f"{videos_path_prefix}/video_1.mp4"
video_content = Part.from_uri(uri=video_path, mime_type="video/mp4")
prompt = """Provide a description of the video. The description should also
contain anything important which people say in the video."""
contents = [video_content, prompt]
# print_prompt(contents)
generate(gemini_pro, contents)
The video shows a hand holding a pink collapsible cup. The hand opens and closes the cup several times. There is no sound in the video.
As we can see, the model correctly identified what happens in the video, but it did not provide much detail. Let's modify the prompt.
Video Understanding: Advanced Prompt
prompt = """You are an expert video analyzer. Your task is to analyze the video
and produce a detailed description of what happens in the video.
Key Points:
- Use timestamps (in MM:SS format) to output key events from the video.
- Add information about what happens at each timestamp.
- Add information about entities in the video and capture the relationship between them.
- Highlight the central theme or focus of the video.
Remember:
- Try to recover hidden meaning from the scene. For example, some hidden humor
or some hidden context.
"""
contents = [video_content, prompt]
generate(gemini_pro, contents, as_markdown=True)
The video showcases a person playfully tossing and catching a pink collapsible cup against a backdrop of pristine white curtains.
Detailed Breakdown:
- 00:00: The video begins with the person tossing the cup upwards. The cup is partially collapsed, showcasing its flexibility.
- 00:01: The person catches the cup effortlessly, demonstrating its lightweight and easy-to-handle design.
- 00:02 - 00:10: This sequence repeats the tossing and catching action, emphasizing the cup's portability and fun aspect. The repetitive motion suggests a sense of enjoyment and leisure.
Entities and Relationships:
- Person: The video focuses on the hand and arm of a person, suggesting their interaction with the cup.
- Collapsible Cup: The central object is a bright pink collapsible cup, highlighting its vibrant color and unique feature.
- White Curtains: The plain white curtains serve as a neutral background, drawing attention solely to the cup and its movement.
Central Theme:
The video aims to showcase the collapsible cup's practicality and playful nature. The bright color, combined with the tossing action, suggests a product designed for an active, on-the-go lifestyle. The white background further emphasizes the cup's aesthetic appeal and versatility.
The response with the updated prompt captures much more detail. Although this prompt is rather generic and can be reused for other videos, let's add specifics to it, for example, to capture at which time a certain event happened.
Prompt #2. Video Understanding: Key event detection
prompt = """You are an expert video analyzer. Your task is to analyze the video
and produce a detailed description of what happens in the video.
Key Points:
- Use timestamps (in MM:SS format) to output key events from the video.
- Add information about what happens at each timestamp.
- Add information about entities in the video and capture the relationship between them.
- Highlight the central theme or focus of the video.
Remember:
- Try to recover hidden meaning from the scene. For example, some hidden humor
or some hidden context.
At which moment was the cup thrown for the second time?
"""
contents = [video_content, prompt]
generate(gemini_pro, contents, as_markdown=True)
The video showcases a hand playfully tossing and catching a pink collapsible cup against a backdrop of pristine white curtains.
Here's a breakdown:
- 00:00 The video begins with the hand already in motion, tossing the cup upwards.
- 00:01 The hand deftly catches the cup as it descends, momentarily pausing before sending it airborne again.
- 00:02 This marks the second throw of the cup, demonstrating the ease with which it can be caught and tossed due to its lightweight and collapsible design.
The video's central theme revolves around the portability and fun aspect of the collapsible cup. The simple act of tossing and catching emphasizes its lightweight nature, while the vibrant pink color adds a playful touch.
Prompt #3. Video Understanding: Using System instruction
A system instruction (SI) is an effective way to steer Gemini's behavior and shape how the model responds to your prompt. An SI can describe model behavior such as persona, goal, tasks to perform, output format/tone/style, constraints, and so on.
System instructions are also "stickier" (more consistent) across multi-turn conversations. If you want a behavior that the model should follow consistently, a system instruction is the best place to put it.
In this example, we move the task rules into the system instruction and keep the question about a specific event in the user prompt.
system_prompt = """You are an expert video analyzer. You task is to analyze the video
and produce the detailed description about what happens on the video.
Key Points:
- Use timestamps (in MM:SS format) to output key events from the video.
- Add information about what happens at each timestamp.
- Add information about entities in the video and capture the relationship between them.
- Highlight the central theme or focus of the video.
Remember:
- Try to recover hidden meaning from the scene. For example, some hidden humor
or some hidden context.
"""
prompt = "At which moment the cup was thrown for the second time?"
gemini_pro_si = GenerativeModel(
model_name="gemini-1.5-pro-001", system_instruction=system_prompt
)
contents = [video_content, prompt]
generate(gemini_pro_si, contents, as_markdown=True)
The video showcases a hand playfully tossing and catching a collapsible pink cup against a backdrop of pristine white curtains. The cup's flexibility and the hand's dexterity are emphasized throughout the short clip.
Here's a breakdown:
- 0:00: The video begins with the hand launching the cup upwards.
- 0:01: The hand deftly catches the cup as it descends.
- 0:02: The cup is thrown for the second time. The toss is gentle, almost like a light bounce.
The video doesn't explicitly convey a deeper narrative or humor. It seems to focus on the simple satisfaction of effortless tossing and catching, highlighting the object's properties.
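Because the task rules live in the system instruction, they remain in effect on every turn of a conversation. The following is a minimal multi-turn sketch (not part of the original notebook) that uses the SDK's start_chat API to illustrate this stickiness:
# Sketch: the system instruction applies to both turns without being repeated.
chat = gemini_pro_si.start_chat()
first = chat.send_message([video_content, "Summarize the video."])
second = chat.send_message("At which moment was the cup thrown for the second time?")
print(wrap(second.text))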
Prompt #4. Video Understanding: Step-by-step reasoning
Notice that the model made a mistake analyzing the video: it does not list all the timestamps at which the cup is thrown. Let's fix this with "step-by-step reasoning".
step_by_step_prompt = """Describe the video. Analyze the video step-by-step.
Output all times when the cup is thrown with timestamps.
After that, output the timestamp when the cup is thrown for the second time.
"""
contents = [video_content, step_by_step_prompt]
generate(gemini_pro_si, contents, as_markdown=True)
The video showcases a person playfully tossing a pink collapsible cup against a white curtain backdrop. The cup's flexibility is evident as it expands and collapses with each toss.
Here's a breakdown of the key moments:
- 0:00: The video begins with the person tossing the cup upwards.
- 0:01: The person catches the cup with their right hand.
- 0:02: The cup is thrown again.
- 0:03: The person catches the cup again.
The cup is thrown for the second time at the timestamp 0:02.
The video highlights the functionality and portability of the collapsible cup, subtly emphasizing its convenience for those constantly on the move. The playful tossing adds a touch of lightheartedness, suggesting the product is not just practical but also fun to use.
Prompt #5. Video Understanding: Get structured outputs
Gemini 1.5 Pro and Flash models can generate structured outputs such as JSON, providing a blueprint for the model's output. This feature is also referred to as controlled generation.
In this example, we prompt Gemini to return structured output (JSON) from video analysis. One way to achieve better understanding of video (or any multimodal) content is to prompt the model to explain its "reasoning" about the response. This has proven to be a very effective method; however, it can increase latency.
The Vertex AI Gemini API makes it easy to return JSON output by configuring the response MIME type as `application/json`. Optionally, you can also configure `response_schema` with the JSON schema for the model to generate output that conforms to the schema.
response_schema = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "harmfulness_reasoning": {
                "type": "STRING",
                "description": "Step-by-step detailed reasoning about how harmful the video is",
            },
            "harmfulness_score": {
                "type": "INTEGER",
                "description": "Number between 0 and 5 indicating how harmful the video is",
            },
        },
        "required": ["harmfulness_reasoning", "harmfulness_score"],
    },
}
structured_prompt = """You are an expert video analyzer. You task is to analyze the video
and produce a harmfulness score - how harmful this video can be for kids."""
contents = [video_content, structured_prompt]
generate(
    gemini_pro,
    contents,
    generation_config=GenerationConfig(
        response_mime_type="application/json", response_schema=response_schema
    ),
)
[{"harmfulness_reasoning": "The video features a person playing with a collapsible cup. There are no elements of violence, sexual content, drugs, or harmful activities. The person handles the cup gently.", "harmfulness_score": 0}]
By asking the model to output "reasoning" along with the score, we got the correct score for the video. Placing the "reasoning" field before the "score" field yields a consistent and correct score: the intuition is that the LLM generates its reasoning first and then relies on those thoughts to produce the score.
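Since the response is JSON conforming to the schema, it can be parsed directly. Here is a small sketch (a direct generate_content call instead of the generate() helper, so we can capture the response object; it reuses the contents and response_schema defined above):
import json

# Sketch: call the model directly and parse the structured JSON response.
response = gemini_pro.generate_content(
    contents=contents,
    generation_config=GenerationConfig(
        response_mime_type="application/json", response_schema=response_schema
    ),
)
for item in json.loads(response.text):
    print(wrap(item["harmfulness_reasoning"]))
    print("Harmfulness score:", item["harmfulness_score"])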
Prompt #6. Video Understanding: Context Caching
Context caching is a way to reduce the cost of requests that repeat the same high-token-count content. It can also reduce latency, at the cost of storing the objects in the cache. You can specify a cache expiration time (TTL) that controls how long the content is kept in the cache.
Context caching helps a lot when we want:
- to repeatedly ask questions about the same long video
- to reduce costs and latency
long_video_path = f"{videos_path_prefix}/long_video_1.mp4"
long_video_content = Part.from_uri(uri=long_video_path, mime_type="video/mp4")
prompt = """Describe what happens in the beginning, in the middle and in the
end of the video. Also, list the name of the main character and any problems
they face."""
contents = [long_video_content, prompt]
# print_prompt(contents)
# Time the call without context caching
from timeit import default_timer as timer
start = timer()
generate(gemini_pro, contents)
end = timer()
print(f"\nTime elapsed: {end - start} seconds")
The video is a silent film called "Sherlock Jr." starring Buster Keaton. In the beginning, Buster is a movie projectionist who is studying to be a detective. He is in love with a girl, but her father doesn't approve of him. Buster is framed for stealing the girl's father's watch, and he is kicked out of the house. In the middle, Buster falls asleep while projecting a movie and dreams that he is a detective investigating the theft of a pearl necklace. He uses his detective skills to solve the case, but he is constantly thwarted by the villain. In the end, Buster wakes up from his dream and realizes that he has been framed for stealing the watch. He goes to the pawn shop where the watch was pawned and finds the real thief. He clears his name and wins the girl's heart. The main character is Buster Keaton, and he faces the problems of being framed for stealing a watch, being kicked out of the house, and trying to win the girl's heart. Time elapsed: 65.75050516799092 seconds
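Long videos like this one consume a large number of input tokens, which is exactly the situation where context caching pays off. You can check the token count with the SDK's count_tokens method (a quick sketch, not part of the original notebook):
# Sketch: count the input tokens consumed by the long-video prompt.
token_info = gemini_pro.count_tokens(contents)
print(f"Total input tokens: {token_info.total_tokens}")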
import datetime

from vertexai.preview import caching
from vertexai.preview.generative_models import GenerativeModel

cached_content = caching.CachedContent.create(
    model_name="gemini-1.5-pro-001",
    contents=[long_video_content],
    ttl=datetime.timedelta(hours=1),
    display_name="long video cache",
)
model_cached = GenerativeModel.from_cached_content(cached_content=cached_content)
# Call with context caching
start = timer()
responses = model_cached.generate_content(
    prompt,
    generation_config=GENERATION_CONFIG,
    safety_settings=SAFETY_CONFIG,
    stream=False,
)
end = timer()
print(wrap(responses.text), end="")
print(f"\nTime elapsed: {end - start} seconds")
The video is a silent film called "Sherlock Jr." starring Buster Keaton. In the beginning, Buster is a movie projectionist who is studying to be a detective. He is in love with a girl, but her father doesn't approve of him. A rival for the girl's affections frames Buster for stealing her father's watch. In the middle, Buster is kicked out of the girl's house and tries to follow his rival to prove his innocence. He gets into a series of misadventures, including being chased by a train and falling into a river. In the end, Buster returns to the movie theater and falls asleep while watching a movie. He dreams that he is a detective in the movie and solves the case. He wakes up and realizes that he has solved the case in real life as well. He is reunited with the girl and her father, and his rival is arrested. The main character is Buster Keaton. He faces the problems of being framed for a crime he didn't commit, being kicked out of the girl's house, and being chased by a train. He also has to deal with a series of misadventures that happen to him while he is trying to prove his innocence. Time elapsed: 60.3449609875679 seconds
As we can see, the request with context caching was faster than the one without. Moreover, the cost of the request is lower, because we did not need to send the video again as part of the prompt.
Context caching is therefore ideal for repeated questions against the same long file, whether video, document, or audio.
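When you no longer need the cache, you can delete it before its TTL expires to stop incurring storage costs. A minimal sketch using the same preview caching API (the attributes printed here are just one reasonable choice of what to inspect):
# Sketch: inspect existing caches, then delete the one created above.
for cc in caching.CachedContent.list():
    print(cc.name, cc.expire_time)
cached_content.delete()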
Conclusion
This notebook demonstrated various examples of working with Gemini on videos. The following general prompting strategies can help you get better performance from Gemini on multimodal prompts:
- Craft clear and concise instructions.
- Add your video or any media first for single-media prompts.
- Add few-shot examples to the prompt to show the model how you want the task done and the expected output.
- Break down the task step-by-step.
- Specify the output format.
- Ask Gemini to include reasoning in its response along with decisions or scores.
- Use context caching for repeated queries.
Specifically, when working with videos, the following may help:
- Specify timestamp format when localizing videos.
- Ask Gemini to focus on visual content for well-known video clips.
- Process long videos in segments for dense outputs.