# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Using Gemini Long Context Window for Video¶
Author(s): Vijay Reddy
Reviewer(s): Rajesh Thallam, Skander Hannachi
Overview¶
Gemini 1.5 Pro supports up to 2 million input tokens. This is roughly equivalent to:
- ~2000 pages of text
- ~19 hours of audio
- ~2 hours of video
- ~60K lines of code
This long context window (LCW) opens up possibilities for new use cases and for optimizing standard use cases, such as:
- Analyzing video(s) and identifying key moments
- Incident analysis in videos to identify policy violations
- Transcribing and summarizing conversations such as podcasts
In this notebook, we demonstrate Gemini's long context window (LCW) capabilities using the video modality.
Getting Started¶
The following steps are necessary to run this notebook, no matter what notebook environment you're using.
If you're entirely new to Google Cloud, get started here.
Google Cloud Project Setup¶
- Select or create a Google Cloud project. When you first create an account, you get a $300 free credit towards your compute/storage costs.
- Make sure that billing is enabled for your project.
- Enable the Service Usage API.
- Enable the Vertex AI API.
- Enable the Cloud Storage API.
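If you prefer the command line over the console, the same APIs can be enabled with gcloud. This is a minimal sketch; replace [your-project-id] with your own project ID:
! gcloud services enable serviceusage.googleapis.com aiplatform.googleapis.com storage.googleapis.com --project=[your-project-id]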
Google Cloud Permissions¶
To run the complete Notebook, including the optional section, you will need to have the Owner role for your project.
If you want to skip the optional section, you need at least the following roles:
- roles/serviceusage.serviceUsageAdmin to enable APIs
- roles/iam.serviceAccountAdmin to modify service agent permissions
- roles/aiplatform.user to use AI Platform components
- roles/storage.objectAdmin to modify and delete GCS buckets
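These roles can be granted in the console or from the command line. A minimal sketch of the gcloud equivalent; the project ID and email below are placeholders to replace with your own:
! gcloud projects add-iam-policy-binding [your-project-id] --member="user:[your-email]" --role="roles/aiplatform.user"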
Install Vertex AI SDK for Python and other dependencies (If Needed)¶
The following cell installs the Vertex AI SDK for Python (upgrading it if already present), quietly and for the current user.
! pip install google-cloud-aiplatform --upgrade --quiet --user
Restart Runtime¶
To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.
# Restart kernel after installs so that your environment can access the new packages
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
Authenticate¶
If you're using Colab, run the code in the next cell. Follow the popups and authenticate with an account that has access to your Google Cloud project.
If you're running this notebook somewhere besides Colab, make sure your environment has the right Google Cloud access. If that's a new concept to you, consider looking into Application Default Credentials for your local environment and initializing the Google Cloud CLI. In many cases, running gcloud auth application-default login
in a shell on the machine running the notebook kernel is sufficient.
More authentication options are discussed here.
# Colab authentication.
import sys
if "google.colab" in sys.modules:
from google.colab import auth
auth.authenticate_user()
print("Authenticated")
Set Google Cloud project information and Initialize Vertex AI SDK¶
To get started using Vertex AI, you must have an existing Google Cloud project and enable the Vertex AI API.
Learn more about setting up a project and a development environment.
Make sure to change PROJECT_ID
in the next cell. You can leave the value for REGION
unless you have a specific reason to change it.
import vertexai
PROJECT_ID = "[your-project-id]" # @param {type:"string"}
PROJECT_ID = "rthallam-demo-project" # @param {type:"string"}
REGION = "us-central1" # @param {type:"string"}
vertexai.init(project=PROJECT_ID, location=REGION)
print("Vertex AI SDK initialized.")
print(f"Vertex AI SDK version = {vertexai.__version__}")
Vertex AI SDK initialized. Vertex AI SDK version = 1.64.0
Import Libraries¶
import datetime
from IPython.display import Markdown
from vertexai.preview import caching
from vertexai.preview.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
    Part,
)
Initialize Gemini¶
# Gemini Config
GENERATION_CONFIG = dict(temperature=0, seed=1)
SAFETY_CONFIG = {
HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
}
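As a quick illustration of where these settings plug in: the cached models created below receive GENERATION_CONFIG the same way, and SAFETY_CONFIG can also be passed when constructing a model or per generate_content() call. A minimal, non-essential sketch:
# Illustrative sketch: attaching the configs to a regular (non-cached) model instance.
model = GenerativeModel(
    "gemini-1.5-pro-001",
    generation_config=GENERATION_CONFIG,
    safety_settings=SAFETY_CONFIG,
)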
Long Context for Video Analysis¶
To demonstrate Gemini's long context capabilities in the video modality we will use two videos from I/O 2024, Google's annual developer conference.
- The opening keynote. It is 21 minutes long and ~370K tokens.
- The DeepMind keynote. It is 17 minutes long and ~300K tokens.
We will start with some single-video questions, then demonstrate multi-video prompting by including both videos as context, for a total of ~670K tokens.
These videos are publicly available on YouTube; however, since the Gemini API requires video content to be staged in Google Cloud Storage, we use copies stored there.
OPENING_URI = "gs://gen-ai-assets-public/Google_IO_2024_Keynote_Opening.mp4"
DEEPMIND_URI = "gs://gen-ai-assets-public/Google_IO_2024_Keynote_Deepmind.mp4"
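Optionally, you can sanity-check the approximate token counts quoted above using count_tokens. A minimal sketch; exact counts vary with the model version:
# Optional: check the approximate token count of each video.
model = GenerativeModel("gemini-1.5-pro-001")
for name, uri in [("Opening keynote", OPENING_URI), ("DeepMind keynote", DEEPMIND_URI)]:
    token_count = model.count_tokens([Part.from_uri(uri, mime_type="video/mp4")])
    print(f"{name}: ~{token_count.total_tokens} tokens")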
Single Video Prompts¶
Caching context for repeated long context prompts¶
For repeated prompts over the same long context, it is best practice to cache that context first. Caching large inputs significantly reduces cost by avoiding reprocessing the same input on every request. For a more detailed analysis of the cost savings from caching, see this notebook.
%%time
system_instruction = """
Here is the opening keynote from Google I/O 2024. Based on the video answer the following questions.
"""
contents = [
Part.from_uri(OPENING_URI, mime_type="video/mp4"),
]
# create cache
cached_content = caching.CachedContent.create(
model_name="gemini-1.5-pro-001",
system_instruction=system_instruction,
contents=contents,
ttl=datetime.timedelta(minutes=30),
)
cached_content = caching.CachedContent(cached_content_name=cached_content.name)
# configure model to read from cache
model_cached = GenerativeModel.from_cached_content(
cached_content=cached_content,
generation_config=GENERATION_CONFIG,
)
CPU times: user 37.2 ms, sys: 11.7 ms, total: 48.9 ms Wall time: 20.2 s
Prompt #1: Video analysis¶
%%time
response = model_cached.generate_content(
"Describe the setting in which the video takes place"
)
Markdown(response.text)
CPU times: user 55.6 ms, sys: 1.5 ms, total: 57.1 ms Wall time: 1min 13s
The video takes place at the Google I/O 2024 keynote, held at the Shoreline Amphitheatre in Mountain View, California. The CEO of Google, Sundar Pichai, is giving the opening keynote speech. The stage has a large screen displaying the Google logo and various presentations. The audience consists of thousands of developers, with millions more joining virtually around the world.
Analysis¶
This response demonstrates Gemini's use of both audio and visual signals in the video.
- 'The stage has a large screen displaying the Google logo and various presentations.' This is a purely visual cue.
- 'The audience consists of thousands of developers, with millions more joining virtually around the world.' This is an audio cue, as the speaker says this aloud.
Prompt #2: Key event detection from video¶
%%time
response = model_cached.generate_content(
"Give me the timestamps of all applauses in the video with a start and end time (MM:SS)."
)
Markdown(response.text)
CPU times: user 66 ms, sys: 15.2 ms, total: 81.2 ms Wall time: 1min 33s
Sure, here are the timestamps of all the applauses in the video:
- 01:31-01:47
- 05:45-05:51
- 06:50-06:54
- 07:43-07:49
- 11:04-11:11
- 11:35-11:41
- 12:08-12:15
- 16:53-16:58
Let me know if you have any other questions.
Analysis¶
This response demonstrates Gemini's retrieval accuracy over the full span of the video, and this kind of output could be used to streamline video editing.
Prompt #3: Focus on visual content¶
%%time
response = model_cached.generate_content(
"Describe the hand gesture the speaker uses most frequently."
)
Markdown(response.text)
CPU times: user 34.7 ms, sys: 1.94 ms, total: 36.7 ms Wall time: 31.4 s
The speaker most frequently uses a gesture where he brings his hands together in front of his chest, with his palms facing each other and fingers loosely interlocked. He often moves his hands slightly apart and back together while speaking.
Analysis¶
This response illustrates Gemini's attention to subtle visual details.
Prompt #4: Attention to text and visual details¶
%%time
response = model_cached.generate_content("Who presented the live demo?")
Markdown(response.text)
CPU times: user 32.9 ms, sys: 28.8 ms, total: 61.8 ms Wall time: 37.8 s
The live demo was presented by Josh Woodward.
Analysis¶
In the video, Josh is only introduced by his first name, while his full name is briefly shown on a slide. Gemini is able to pick up on this on-screen text and associate it with the speaker. It is also able to differentiate the demo portion of the talk from the portions presented by the main speaker (Sundar Pichai).
Multi Video Prompts¶
Now let's include multiple videos in the prompt. The Gemini 1.5 Pro model currently supports up to 10 videos per prompt, with a total video length of roughly 2 hours.
Caching videos in the long context¶
%%time
system_instruction = """
Here are two videos from Google I/O 2024.
The first is the opening keynote and the second is the Google DeepMind keynote.
"""
contents = [
Part.from_uri(OPENING_URI, mime_type="video/mp4"),
Part.from_uri(DEEPMIND_URI, mime_type="video/mp4"),
"Based on the videos answer the following questions.",
]
# create cache
cached_content = caching.CachedContent.create(
model_name="gemini-1.5-pro-001",
system_instruction=system_instruction,
contents=contents,
ttl=datetime.timedelta(minutes=30),
)
cached_content = caching.CachedContent(cached_content_name=cached_content.name)
# configure model to read from cache
model_cached = GenerativeModel.from_cached_content(
cached_content=cached_content,
generation_config=GENERATION_CONFIG,
)
CPU times: user 91.7 ms, sys: 14.1 ms, total: 106 ms Wall time: 54.8 s
Prompt #5: Analyzing and comparing two videos¶
%%time
res = model_cached.generate_content("How do the videos differ?")
Markdown(res.text)
CPU times: user 67.1 ms, sys: 23.2 ms, total: 90.2 ms Wall time: 54 s
The first video is the Google I/O 2024 opening keynote, presented by Sundar Pichai. It focuses on the advancements in AI, particularly the Gemini model, and its integration into various Google products like Search, Photos, and Workspace. The video highlights the capabilities of Gemini, including its multimodal reasoning, long context window, and ability to handle complex queries. It also showcases the potential of AI agents in simplifying everyday tasks.
The second video is the Google DeepMind keynote, presented by Demis Hassabis. It delves deeper into the research and development behind Gemini, emphasizing its foundation in neuroscience and the goal of achieving artificial general intelligence (AGI). The video showcases specific examples of DeepMind's work, including AlphaFold 3 for protein structure prediction, Project Astra for AI agents, Imagen 3 for image generation, and Veo for generative video.
Here's a table summarizing the key differences:
Feature | Google I/O Keynote | Google DeepMind Keynote |
---|---|---|
Focus | Gemini's integration into Google products and its impact on users | DeepMind's research and development efforts in AI, particularly Gemini |
Speaker | Sundar Pichai | Demis Hassabis |
Key Highlights | Gemini's capabilities, AI agents, user-focused applications | Technical advancements, AGI, specific projects like AlphaFold, Astra, Imagen, and Veo |
Target Audience | General audience, developers, users | Researchers, developers, AI enthusiasts |
In essence, the Google I/O keynote provides a broader overview of Gemini and its applications, while the DeepMind keynote offers a more technical and research-oriented perspective.
Analysis¶
This response demonstrates comparative analysis across two videos. It requires first understanding the contents of each individual video, then reasoning about how they differ.
Prompt #6: Information retrieval across videos¶
%%time
res = model_cached.generate_content(
"What new features were launched? Format your response as a bulleted list."
)
Markdown(res.text)
CPU times: user 50.3 ms, sys: 41.2 ms, total: 91.5 ms Wall time: 1min 16s
Sure, here are the new features launched based on the video provided:
- AI Overviews - A new search experience that allows users to ask longer and more complex questions, even searching with photos.
- Ask Photos - A new feature in Google Photos that allows users to search their memories in a deeper way by asking questions about their photos.
- 2 Million Tokens Context Window - An expansion of the context window in Gemini 1.5 Pro to 2 million tokens, opening up new possibilities for developers.
- Audio Overviews - A new feature in NotebookLM that allows users to listen to a lively science discussion personalized for them based on the text material they provide.
- Gemini 1.5 Flash - A lighter-weight model compared to Gemini 1.5 Pro, designed to be fast and cost-efficient to serve at scale while still featuring multimodal reasoning capabilities and breakthrough long context.
- Project Astra - A universal AI agent that can be truly helpful in everyday life.
- Imagen 3 - Google's most capable image generation model yet, featuring stronger evaluations, extensive red teaming, and state-of-the-art watermarking with SynthID.
- Music AI Sandbox - A suite of professional music AI tools that can create new instrumental sections from scratch, transfer styles between tracks, and more.
- Veo - Google's newest and most capable generative video model, capable of creating high-quality 1080p videos from text, image, and video prompts.
Please note that some of these features are still in development and may not be available to the public yet.
Analysis¶
This response illustrates retrieval across multiple videos.
Prompt #7: Targeted video analysis and relevant detail extraction¶
%%time
res = model_cached.generate_content(
"What technologies were introduced that can help artists?"
)
Markdown(res.text)
CPU times: user 40 ms, sys: 32.9 ms, total: 72.9 ms Wall time: 46.6 s
The video shows two technologies that can help artists:
- Music AI Sandbox: This is a suite of professional music AI tools that can create new instrumental sections from scratch, transfer styles between tracks, and more.
- Veo: This is a generative video model that can create high-quality 1080p videos from text, image, and video prompts. It can capture the details of your instructions in different visual and cinematic styles. You can prompt for things like aerial shots of a landscape or a timelapse and further edit your videos using additional prompts.
Both of these technologies are powered by Google's Gemini AI model.
Analysis¶
The artist collaborations are shown in the second video only. Gemini is able to isolate this video and pick out the relevant technologies mentioned.
Conclusion¶
This notebook demonstrated how to combine Gemini's long context window and multimodal capabilities to analyze videos of considerable length. Gemini showed strong performance on retrieval, description, and reasoning tasks for both single- and multi-video prompts.
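One housekeeping note: the context caches created above continue to incur storage charges until their TTL expires. A minimal cleanup sketch, assuming cached_content still references the multi-video cache created earlier:
# Optional cleanup: delete the context cache instead of waiting for the 30-minute TTL to expire.
cached_content.delete()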