In [ ]:

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Multimodal Prompting with Gemini: Working with Images¶


Author(s)	Michael Chertushkin
Reviewer(s)	Rajesh Thallam, Skander Hannachi
Last updated	2024-09-16

Overview¶

Gemini 2.0 models support adding image, audio, video, and PDF files to text or chat prompts to get a text, image, or code response. Gemini 2.0 Flash supports up to 1 million input tokens, with up to 3600 images per prompt. You can add images to Gemini requests to perform image understanding tasks such as image captioning, visual question answering, image comparison, object or text detection, and more.

In this notebook we cover prompting recipes and strategies for working with Gemini on image files and show examples along the way. This notebook is organized as follows:

Image Understanding
Using system instruction
Structuring prompt with images
Adding few-shot examples to the image prompt
Document understanding
Math understanding

This notebook does not cover the image generation task. Imagen on Vertex AI lets you quickly generate high-quality images from simple text descriptions. Refer to this notebook for image generation.

Getting Started¶

The following steps are necessary to run this notebook, no matter what notebook environment you're using.

If you're entirely new to Google Cloud, get started here.

Google Cloud Project Setup¶

Select or create a Google Cloud project. When you first create an account, you get a $300 free credit towards your compute/storage costs.
Make sure that billing is enabled for your project.
Enable the Service Usage API
Enable the Vertex AI API.
Enable the Cloud Storage API.

Google Cloud Permissions¶

To run the complete Notebook, including the optional section, you will need to have the Owner role for your project.

If you want to skip the optional section, you need at least the following roles:

roles/serviceusage.serviceUsageAdmin to enable APIs
roles/iam.serviceAccountAdmin to modify service agent permissions
roles/aiplatform.user to use AI Platform components
roles/storage.objectAdmin to modify and delete GCS buckets

Install Vertex AI SDK for Python and other dependencies (If Needed)¶

The cell below installs or upgrades the Vertex AI SDK for Python (google-cloud-aiplatform) quietly for the current user.

In [ ]:

! pip install google-cloud-aiplatform --upgrade --quiet --user

Restart Runtime¶

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [ ]:

# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Authenticate¶

If you're using Colab, run the code in the next cell. Follow the popups and authenticate with an account that has access to your Google Cloud project.

If you're running this notebook somewhere besides Colab, make sure your environment has the right Google Cloud access. If that's a new concept to you, consider looking into Application Default Credentials for your local environment and initializing the Google Cloud CLI. In many cases, running gcloud auth application-default login in a shell on the machine running the notebook kernel is sufficient.

More authentication options are discussed here.

In [ ]:

# Colab authentication.
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()
    print("Authenticated")

Set Google Cloud project information and Initialize Vertex AI SDK¶

To get started using Vertex AI, you must have an existing Google Cloud project and enable the Vertex AI API.

Learn more about setting up a project and a development environment.

Make sure to change PROJECT_ID in the next cell. You can leave the value of REGION as is unless you have a specific reason to change it.

In [ ]:

import vertexai

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}

vertexai.init(project=PROJECT_ID, location=REGION)
print("Vertex AI SDK initialized.")
print(f"Vertex AI SDK version = {vertexai.__version__}")
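A common mistake is to run the cell above with the placeholder project ID still in place. The following is a minimal sketch of a guard you could run before `vertexai.init`. The helper name `resolve_project_id` is hypothetical, and falling back to the `GOOGLE_CLOUD_PROJECT` environment variable is an assumption about your environment, not something this notebook sets up.

```python
import os
from typing import Mapping, Optional

PLACEHOLDER = "[your-project-id]"  # the placeholder value used in the cell above


def resolve_project_id(configured: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a usable project ID, or raise a descriptive error (hypothetical helper)."""
    env = os.environ if env is None else env
    # Prefer the value typed into the notebook, unless it is still the placeholder.
    if configured and configured != PLACEHOLDER:
        return configured
    # Assumption: GOOGLE_CLOUD_PROJECT may be set in your environment; it is not guaranteed.
    from_env = env.get("GOOGLE_CLOUD_PROJECT", "")
    if from_env:
        return from_env
    raise ValueError("Set PROJECT_ID to your Google Cloud project ID before initializing the SDK.")


print(resolve_project_id("my-project", env={}))
```

You could then pass the returned value as the `project` argument when initializing the SDK.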

Import Libraries¶

In [ ]:

from vertexai.generative_models import (GenerativeModel, HarmBlockThreshold,
                                        HarmCategory, Image, Part)

Define Utility functions¶

In [ ]:

import http.client
import textwrap
import typing
import urllib.request

from google.cloud import storage
from IPython import display
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"


def wrap(string, max_width=80):
    return textwrap.fill(string, max_width)


def get_bytes_from_url(url: str) -> bytes:
    with urllib.request.urlopen(url) as response:
        response = typing.cast(http.client.HTTPResponse, response)
        data = response.read()  # named `data` to avoid shadowing the built-in `bytes`
    return data


def get_bytes_from_gcs(gcs_path: str):
    bucket_name = gcs_path.split("/")[2]
    object_prefix = "/".join(gcs_path.split("/")[3:])
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.get_blob(object_prefix)
    return blob.download_as_bytes()


def display_image(image_url: str, width: int = 300, height: int = 200):
    if image_url.startswith("gs://"):
        image_bytes = get_bytes_from_gcs(image_url)
    else:
        image_bytes = get_bytes_from_url(image_url)
    display.display(display.Image(data=image_bytes, width=width, height=height))


def display_video(video_url: str, width: int = 300, height: int = 200):
    if video_url.startswith("gs://"):
        video_bytes = get_bytes_from_gcs(video_url)
    else:
        video_bytes = get_bytes_from_url(video_url)
    display.display(
        display.Video(
            data=video_bytes,
            width=width,
            height=height,
            embed=True,
            mimetype="video/mp4",
        )
    )

def display_audio(audio_url: str, width: int = 300, height: int = 200):
    if audio_url.startswith("gs://"):
        audio_bytes = get_bytes_from_gcs(audio_url)
    else:
        audio_bytes = get_bytes_from_url(audio_url)
    display.display(display.Audio(data=audio_bytes, embed=True))


def print_prompt(contents: list[str | Part]):
    for content in contents:
        if isinstance(content, Part):
            if content.mime_type.startswith("image"):
                display_image(image_url=content.file_data.file_uri)
            elif content.mime_type.startswith("video"):
                display_video(video_url=content.file_data.file_uri)
            elif content.mime_type.startswith("audio"):
                display_audio(audio_url=content.file_data.file_uri)
            else:
                print(content)
        else:
            print(content)
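The `get_bytes_from_gcs` helper above splits the path on "/" and assumes it always receives a well-formed `gs://bucket/object` URI. As a rough sketch, the same split can be done with explicit validation. The function name `split_gcs_uri` is hypothetical and is not part of the notebook or of the Cloud Storage client library.

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"Not a gs:// URI: {gcs_uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the object name.
    bucket, _, object_name = gcs_uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"URI must include a bucket and an object name: {gcs_uri!r}")
    return bucket, object_name


print(split_gcs_uri("gs://my-bucket/images/example_1.jpg"))
```

Failing early with a clear message can be easier to debug than an `IndexError` or a `None` blob further down the call chain.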

Initialize Gemini¶

In [ ]:

# Gemini Config
GENERATION_CONFIG = {
    "max_output_tokens": 8192,
    "temperature": 0.1,
    "top_p": 0.95,
}

SAFETY_CONFIG = {
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
}

gemini = GenerativeModel(model_name="gemini-2.0-flash-001")
image_path_prefix = (
    "gs://public-aaie-genai-samples/gemini/prompting_recipes/multimodal/images"
)


def generate(
    model,
    contents,
    safety_settings=SAFETY_CONFIG,
    generation_config=GENERATION_CONFIG,
    as_markdown=False,
):
    responses = model.generate_content(
        contents=contents,
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )
    if isinstance(responses, list):
        for response in responses:
            if as_markdown:
                display.display(display.Markdown(response.text))
            else:
                print(wrap(response.text), end="")
    else:
        if as_markdown:
            display.display(display.Markdown(responses.text))
        else:
            print(wrap(responses.text), end="")
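Note that `wrap` passes the whole response through `textwrap.fill`, which by default replaces newlines with spaces. This is why list-style and JSON responses printed by `generate` appear as a single flowing paragraph. If you would rather keep the model's line breaks, a minimal sketch is to wrap each line separately. The name `wrap_preserving_lines` is hypothetical and is not used elsewhere in this notebook.

```python
import textwrap


def wrap_preserving_lines(text: str, max_width: int = 80) -> str:
    """Wrap each line on its own so existing line breaks survive."""
    wrapped_lines = []
    for line in text.splitlines():
        # textwrap.fill("") returns "", so blank lines are kept as blank lines.
        wrapped_lines.append(textwrap.fill(line, max_width))
    return "\n".join(wrapped_lines)


sample = "* Vehicles:\n    * Cars\n    * Motorcycle"
print(textwrap.fill(sample, 80))      # newlines are replaced by spaces
print(wrap_preserving_lines(sample))  # original line structure is kept
```

Passing `as_markdown=True` to `generate` is another way to keep the structure, since it bypasses `wrap` entirely.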

Prompt #1. Image Understanding¶

This task requires input in two modalities: text and image. An example API call is shown below; however, the prompt is not optimal, and we can improve it.

In [7]:

image_path = f"{image_path_prefix}/example_1.jpg"
image_content = Part.from_uri(uri=image_path, mime_type="image/jpeg")
display_image(image_path)

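The cell above hard-codes the `mime_type` passed to `Part.from_uri`. When you loop over many files, one option is to derive it from the file extension with the standard library. This is only a sketch: `guess_image_mime_type` is a hypothetical helper, and it trusts the extension rather than inspecting the file contents.

```python
import mimetypes


def guess_image_mime_type(path: str) -> str:
    """Guess an image MIME type from the file extension of a path or URI."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Could not determine an image MIME type for {path!r}")
    return mime_type


print(guess_image_mime_type("gs://my-bucket/images/example_1.jpg"))
print(guess_image_mime_type("city_street.png"))
```

The result can then be supplied as the `mime_type` argument instead of a literal string.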

In [15]:

prompt = "Describe what is depicted on the image"
contents = [image_content, prompt]
generate(gemini, contents)

The image shows a group of men in suits standing in a hallway with a green and
white checkered floor. The hallway appears to be in a school or institutional
setting, as there are lockers visible in the background.  In the foreground, a
man in a gray suit is standing on a scale, adjusting the height measurement bar.
Next to him, former President Barack Obama is standing with a slight smile,
looking towards the man on the scale. Several other men in suits are standing
behind them, some smiling and looking in the same direction.  There are mirrors
on the walls, reflecting the scene and adding depth to the image. The overall
atmosphere seems lighthearted and jovial.

As we can see, the model did not pick up on the dynamics of the situation (the humor in President Obama's joke).

Let's change the prompt asking Gemini to add more details and see what happens.

In [17]:

prompt = """You are good at looking at pictures and uncovering the full story within a visual scene.
Your task is to provide a rich and insightful description of the image.

Key Points:
- Decipher the visual puzzle.
- Uncover hidden meanings.
- Navigate complex dynamics.
- Spotlight the heart of the matter.
- Craft a captivating narrative.

Remember:
- The most compelling descriptions not only capture what's visible but also hint at what lies beneath the surface.
- Try to recover hidden meaning from the scene, for example some hidden humor.
"""

# updated description with prompt changes
contents = [image_content, prompt]
generate(gemini, contents)

In a brightly lit hallway with a distinctive green and white checkered floor, a
group of men in suits are gathered, creating a scene of both formality and
amusement. The focal point is a man in a gray suit standing on a scale,
seemingly having his height measured by another man in a similar suit who is
adjusting the measuring device.  Former President Barack Obama, dressed in a
dark suit, stands nearby with a playful expression, his body language suggesting
he's about to step onto the scale as well. His stance is casual, with one leg
extended, adding a touch of levity to the otherwise serious attire of the group.
The hallway is lined with lockers and mirrors, which reflect the scene and
multiply the number of people visible, creating a sense of depth and activity.
The mirrors also offer different perspectives of the individuals, adding to the
visual complexity of the image.  The overall atmosphere is lighthearted, with
smiles and laughter evident on the faces of some of the men. The presence of
Obama and the informal poses suggest a relaxed and jovial environment, perhaps a
moment of camaraderie amidst official duties. The scene captures a blend of
professionalism and playfulness, hinting at the human side of these individuals
in their formal attire.

After changing the prompt, Gemini was able to capture the humor and playful interaction.

We followed a few tips when rewriting the prompt:

Give the model a persona or role to adopt (you are good at looking at pictures)
Specify a mission or goal (your task is to provide a rich description)
Be specific about the instructions and structure them with bullet points or prompt separators (markdown headers or XML tags)
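The tips above can be sketched as a small prompt builder that keeps the persona, the task, and the bulleted instructions as separate inputs. The function `build_prompt` and its parameter names are hypothetical. The section labels mirror the ones used in the prompt above.

```python
from typing import Sequence


def build_prompt(persona: str, task: str, key_points: Sequence[str], reminders: Sequence[str]) -> str:
    """Assemble a structured prompt: persona, task, then bulleted sections."""
    lines = [persona, task, "", "Key Points:"]
    lines += [f"- {point}" for point in key_points]
    lines += ["", "Remember:"]
    lines += [f"- {reminder}" for reminder in reminders]
    return "\n".join(lines) + "\n"


prompt = build_prompt(
    persona="You are good at looking at pictures and uncovering the full story within a visual scene.",
    task="Your task is to provide a rich and insightful description of the image.",
    key_points=["Decipher the visual puzzle.", "Uncover hidden meanings."],
    reminders=["Try to recover hidden meaning from the scene, for example some hidden humor."],
)
print(prompt)
```

Keeping the parts separate makes it easier to vary one element, such as the persona, while holding the rest of the prompt fixed.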

Prompt #2. Image Understanding: Using System instruction¶

System Instruction (SI) is an effective way to steer Gemini's behavior and shape how the model responds to your prompt. SI can describe model behavior such as the persona, goal, tasks to perform, output format, tone, style, and any constraints.

SI is more "sticky" (consistent) across multi-turn conversations. If you want the model to follow a behavior consistently, the system instruction is the best place to put that instruction.

In [ ]:

system_prompt = """You are good at looking at pictures and uncovering the full story within a visual scene.
Your task is to provide a rich and insightful description of the image.

Key Points:
- Decipher the visual puzzle.
- Uncover hidden meanings.
- Navigate complex dynamics.
- Spotlight the heart of the matter.
- Craft a captivating narrative.

Remember:
- The most compelling descriptions not only capture what's visible but also hint at what lies beneath the surface.
- Try to recover hidden meaning from the scene, for example some hidden humor.
"""

In [20]:

gemini_si = GenerativeModel(
    model_name="gemini-2.0-flash-001", system_instruction=system_prompt
)
simple_prompt = "Describe what is depicted on the image"

contents = [image_content, simple_prompt]
generate(gemini_si, contents)

In a brightly lit hallway with a retro green and white checkered floor, a group
of men in suits are gathered, seemingly in good spirits. The setting appears to
be a locker room or a similar institutional space, given the presence of lockers
and mirrors along the walls.  The focal point of the image is a man in a gray
suit standing on a scale, adjusting the measuring bar. He holds a black folder,
suggesting he might be a staff member or someone involved in an official
capacity.  Former President Barack Obama is prominently featured, walking with a
slight swagger and a smile, looking towards the man on the scale. His presence
draws attention and suggests that this is an event of some significance or
perhaps a lighthearted moment during a formal occasion.  The other men in the
hallway are also dressed in suits and ties, some smiling and engaged in the
scene, while others are partially visible, adding depth to the composition. The
mirrors on either side of the hallway create reflections that multiply the
number of people and add to the sense of activity and camaraderie.  The overall
atmosphere is one of levity and good humor, with the men appearing relaxed and
enjoying the moment. The presence of the scale and the act of measuring height
add a touch of the unexpected to the formal attire and setting, creating a
memorable and engaging image.

In this example, we achieved the same level of description as in Prompt #1, but by using a system instruction (or system prompt):

Added the persona, instructions, and mission to the system instruction
Used the same simple prompt as in Prompt #1

Prompt #3. Image Understanding: Structuring and order of images and texts¶

Gemini works well with images and text in any order. For single-image prompts, placing the image first and the text second may improve performance. If your prompt needs images and text mixed together, use the order that feels most natural.

That being said, this isn't a hard and fast rule, and your results may vary. To illustrate, we've included examples of both image-first and text-first prompts below so you can compare the responses.

In [8]:

image_path = f"{image_path_prefix}/city_street.png"
image_content = Part.from_uri(uri=image_path, mime_type="image/png")
display_image(image_path)

Let's run with image first and then text in the prompt.

In [9]:

prompt_3 = (
    "Analyze the image and list the physical objects you can detect from the image."
)

contents = [image_content, prompt_3]
generate(gemini, contents)

Here are the bounding box detections: ```json [   {"box_2d": [434, 606, 478,
695], "label": "sign"},   {"box_2d": [286, 766, 330, 823], "label": "sign"},
{"box_2d": [389, 893, 411, 921], "label": "sign"},   {"box_2d": [434, 873, 468,
923], "label": "sign"},   {"box_2d": [565, 424, 730, 592], "label": "car"},
{"box_2d": [580, 187, 674, 290], "label": "car"},   {"box_2d": [588, 278, 665,
337], "label": "car"},   {"box_2d": [588, 660, 652, 712], "label": "car"},
{"box_2d": [594, 698, 665, 742], "label": "car"},   {"box_2d": [548, 724, 708,
923], "label": "taxi"},   {"box_2d": [594, 622, 643, 666], "label": "car"},
{"box_2d": [604, 0, 771, 57], "label": "car"},   {"box_2d": [592, 387, 665,
427], "label": "motorcycle"},   {"box_2d": [594, 893, 907, 1000], "label":
"car"} ] ```

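The response above embeds a JSON array of `box_2d` detections inside a json code fence. Below is a minimal sketch of how you might pull that array out of the text and convert a box to pixel coordinates. It assumes the values are ordered `[y_min, x_min, y_max, x_max]` and normalized to a 0-1000 range, and it uses a made-up image size of 2000 x 1000 pixels. Verify both assumptions against the model documentation and your actual image before relying on this. The helper names are hypothetical.

```python
import json

SCALE = 1000  # assumption: box_2d values are normalized to the range 0-1000


def extract_json_array(response_text: str) -> list:
    """Parse the outermost [...] span of a response that wraps JSON in extra text."""
    start = response_text.index("[")
    end = response_text.rindex("]")
    return json.loads(response_text[start : end + 1])


def to_pixel_box(box_2d, image_width: int, image_height: int):
    """Convert a normalized box to (left, top, right, bottom) in pixels."""
    y_min, x_min, y_max, x_max = box_2d  # assumption: y comes before x
    return (
        round(x_min * image_width / SCALE),
        round(y_min * image_height / SCALE),
        round(x_max * image_width / SCALE),
        round(y_max * image_height / SCALE),
    )


# Two detections copied from the response above; the surrounding text is shortened.
response_text = (
    "Here are the bounding box detections: "
    '[{"box_2d": [434, 606, 478, 695], "label": "sign"}, '
    '{"box_2d": [565, 424, 730, 592], "label": "car"}]'
)

for detection in extract_json_array(response_text):
    # Hypothetical image size; use the real width and height of your image.
    print(detection["label"], to_pixel_box(detection["box_2d"], image_width=2000, image_height=1000))
```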
Let's run with text first and then image in the prompt.

In [10]:

contents = [prompt_3, image_content]
generate(gemini, contents)

Here's a breakdown of the physical objects I can detect in the image:  *
**Vehicles:**     *   Cars (multiple, including a blue sedan, a yellow taxi, and
several others)     *   Motorcycle *   **Buildings:**     *   Various buildings
(different heights, styles, and materials like brick and glass) *   **Street
Infrastructure:**     *   Traffic lights     *   Street signs (including "Bus
Lane," "One Way," and street name signs like "E 83 St")     *   Street
lamps/light poles     *   Crosswalk markings     *   Road markings (lane
dividers, arrows) *   **Vegetation:**     *   Trees *   **People:**     *
People (visible on the sidewalks, some near outdoor seating) *   **Outdoor
Seating:**     *   Tables and chairs (indicating outdoor dining areas) *
**Sidewalks:**     *   Sidewalks on either side of the street. *   **Taxi
sign:**     *   Taxi sign on top of the yellow taxi.

In this particular example, we see a better response with image-first-then-text than with text-first-then-image. Your mileage may vary depending on the use case.
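Since the effect of ordering depends on the use case, it can help to build both orderings from the same parts and compare the responses side by side. A small sketch follows. `prompt_orderings` is a hypothetical helper, and a string stands in for the image `Part` so that the snippet runs without the SDK.

```python
def prompt_orderings(image_part, text):
    """Return both content orderings for the same image and text."""
    return {
        "image_first": [image_part, text],
        "text_first": [text, image_part],
    }


variants = prompt_orderings(
    "<image part>",  # placeholder; in the notebook this would be image_content
    "Analyze the image and list the physical objects you can detect from the image.",
)
for name, contents in variants.items():
    print(name, "->", contents)
    # In the notebook, you could call generate(gemini, contents) here.
```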

Prompt #4. Image Understanding: Adding few-shot examples¶

You can add multiple images in the prompt that Gemini can use as examples to understand the output you want. Adding these few-shot examples can help the model identify the patterns and apply the relationship between the given images and responses to the new example. Let's examine how to use few-shot examples for the image understanding task.

This prompt uses Gemini to count the number of blocks in an image of the Transformer architecture. To help the model, we add three example images of different architectures: RNN, GRU, and LSTM.

In [25]:

# Transformer architecture
# Image source: https://aiml.com/compare-the-different-sequence-models-rnn-lstm-gru-and-transformers/
display_image(f"{image_path_prefix}/example_5.png")

In [26]:

display_image(f"{image_path_prefix}/example_2.png")

To construct an effective prompt with examples, enumerate the example images with labels such as EXAMPLE# 1, as in the prompt below.

In [27]:

prompt_4 = "Analyze the model architecture in the image and count the number of blocks. Use following examples as reference when analyzing the image and returning the response."
image_content = Part.from_uri(
    uri=f"{image_path_prefix}/example_5.png", mime_type="image/png"
)

contents = [
    prompt_4,
    "EXAMPLE# 1",
    Part.from_uri(uri=f"{image_path_prefix}/example_2.png", mime_type="image/png"),
    '"response": {"name": "RNN", "number_of_blocks": 1}',
    "EXAMPLE# 2",
    Part.from_uri(uri=f"{image_path_prefix}/example_3.png", mime_type="image/png"),
    '"response": {"name": "GRU", "number_of_blocks": 3}',
    "EXAMPLE# 3",
    Part.from_uri(uri=f"{image_path_prefix}/example_4.png", mime_type="image/png"),
    '"response": {"name": "LSTM", "number_of_blocks": 5}',
    "ARCHITECTURE:",
    image_content,
    '"response":',
]

print_prompt(contents)

Analyze the model architecture in the image and count the number of blocks. Use following examples as reference when analyzing the image and returning the response.
EXAMPLE# 1

"response": {"name": "RNN", "number_of_blocks": 1}
EXAMPLE# 2

"response": {"name": "GRU", "number_of_blocks": 3}
EXAMPLE# 3

"response": {"name": "LSTM", "number_of_blocks": 5}
ARCHITECTURE:

"response":

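The interleaved layout printed above (label, image, expected response, repeated for each example, then the query image and an open "response" key) can also be generated from a list of examples. The following is a sketch of that idea only. `build_few_shot_contents` is a hypothetical helper, and strings stand in for the image parts so that the snippet runs without the SDK.

```python
import json


def build_few_shot_contents(instruction, examples, query_image, query_label="ARCHITECTURE:"):
    """Interleave labels, example images, and expected responses, then add the query."""
    contents = [instruction]
    for index, (image, expected) in enumerate(examples, start=1):
        contents.append(f"EXAMPLE# {index}")
        contents.append(image)
        contents.append(f'"response": {json.dumps(expected)}')
    # End with the query image and an open response key for the model to complete.
    contents += [query_label, query_image, '"response":']
    return contents


examples = [
    ("<image: example_2.png>", {"name": "RNN", "number_of_blocks": 1}),
    ("<image: example_3.png>", {"name": "GRU", "number_of_blocks": 3}),
    ("<image: example_4.png>", {"name": "LSTM", "number_of_blocks": 5}),
]
contents = build_few_shot_contents(
    "Analyze the model architecture in the image and count the number of blocks.",
    examples,
    "<image: example_5.png>",
)
for item in contents:
    print(item)
```

In the notebook, the placeholders would be replaced by `Part` objects such as those created with `Part.from_uri`.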
In [28]:

generate(
    gemini,
    contents,
    generation_config=dict(**GENERATION_CONFIG, response_mime_type="application/json"),
)

{ "response": { "name": "Transformers", "number_of_blocks": 10 } }
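Because the response was requested as `application/json`, it can be parsed and checked before any downstream use. The sketch below parses the exact output shown above. `parse_block_count` is a hypothetical helper. Note also that the `generate` helper in this notebook prints the text rather than returning it, so you would need to obtain the response text yourself.

```python
import json


def parse_block_count(response_text: str):
    """Parse the JSON response and return (name, number_of_blocks) after basic checks."""
    payload = json.loads(response_text)
    result = payload["response"]
    name = result["name"]
    count = result["number_of_blocks"]
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if not isinstance(name, str) or isinstance(count, bool) or not isinstance(count, int):
        raise ValueError(f"Unexpected response shape: {payload!r}")
    return name, count


raw = '{ "response": { "name": "Transformers", "number_of_blocks": 10 } }'
print(parse_block_count(raw))
```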

Prompt #5. Document understanding¶

Let's examine the task of document understanding using Gemini.

In [29]:

image_path = f"{image_path_prefix}/order_1.png"
image_content = Part.from_uri(uri=image_path, mime_type="image/png")
display_image(image_path)

In [30]:

prompt_5 = "Describe the image"

contents = [image_content, prompt_5]
generate(gemini, contents, as_markdown=True)

Here's a description of the image:

Overall:

The image is a purchase order document. It's formatted with clear sections for different types of information.

Sections:

Header: The title "PURCHASE ORDER" is prominently displayed at the top.
Company Information:
- Buyer: Includes the buyer's company name (ACME, INC), address, phone, fax, and email.
- Vendor: Includes the vendor's company name (LLM, INC), address, phone, fax, and email.
Shipping Information:
- "Ship To" section with attention to Bert Simpson, company name (ACME, INC), address, phone, fax, and email.
Order Details:
- Date and PO Number.
- Department and Requester.
- Payment Terms and Delivery Date.
Item List:
- A table with columns for Item #, Description, Quantity (Qty), Unit Price, and Total.
- Two items are listed: "15" LED Monitor" and "Vertical Mounting Stand".
Totals:
- Tax Rate, Taxes, Shipping & Handling, and Total Due are listed at the bottom of the item list.

Key Details:

The purchase order is from ACME, INC to LLM, INC.
The order includes a 15" LED Monitor and Vertical Mounting Stands.
The total due is $440.00.
The delivery date is 9/25/2023.
The date of the purchase order is 9/1/2023.
The PO number is PO-2023-A123.

As we can see, the model successfully extracted the main information, but it did not pick up all the values from the table. Let's fix that with the same approach we used for the first task.

In [31]:

system_prompt_5 = """You are an expert at document understanding and highly 
capable of extracting all relevant information from bills, receipts, and 
various documents.

Your task is to process the given document and identify all pertinent details 
such as the vendor/merchant name, date, transaction details (items, quantities, 
prices, etc.), total amount, payment method, and any other noteworthy information.

# INSTRUCTIONS
- Analyze Document Structure
- Identify Key Sections
- Extract Data:
  - Vendor/Merchant Name
  - Date
  - Transaction Details:
    - Items
    - Quantities
    - Prices
    - Subtotals
    - Total Amount
    - Payment Method
  - Other Information
- Present the extracted information in a clear and structured format, using appropriate headings and labels.

# CONSTRAINTS:
- Handle Variations
- Prioritize Accuracy
- Handle Ambiguity
- Maintain Confidentiality"""

In [33]:

gemini_si = GenerativeModel(
    model_name="gemini-2.0-flash-001", system_instruction=system_prompt_5
)
contents = [image_content, "DOCUMENT:"]
generate(gemini_si, contents, as_markdown=True)

Here's the extracted information from the purchase order:

Vendor Information:

Vendor Name: LLM, INC
Address: 123 Bison Street, Gecko City, ST 12345
Phone: (123) 456-7890
Fax: (123) 456 - 7800
Email: langchain@llminc.com

Customer Information:

Company Name: ACME, INC
Address: 456 Model Garden, Codey City, BY, 67890
Phone: (222) - 345 - 6666
Fax: (222) - 345 - 6000
Email: buyer1@acmeinc.com

Purchase Order Details:

PO Number: PO-2023-A123
Date: 9/1/2023
Delivery Date: 9/25/2023
Department: Engineering
Requested By: Bert Simpson
Payment Terms: Net 15 Days
Ship To: Attn: BERT SIMPSON, ACME, INC, 456 Model Garden St, Codey City, BY, 67890

Transaction Details:

Item #	Description	Qty	Unit Price	Total
A233	15" LED Monitor	1	$200.00	$200.00
B124	Vertical Mounting Stand	2	$100.00	$200.00

Summary:

Tax Rate: 10%
Taxes: $40
Shipping & Handling: $0
Total Due: $440.00

As we can see, by modifying the prompt and moving the task description into the system instruction, the model was able to extract the entities from the table the way we wanted.
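Extracted numbers are easy to sanity-check once they are in structured form. The following is a minimal sketch, not part of the original notebook: it copies the values from the output above into a hand-written dictionary layout (an assumption, since the model returned Markdown rather than this structure) and checks that line totals, taxes, and the total due agree with each other. `check_order` is a hypothetical helper name.

```python
from decimal import Decimal

# Values copied by hand from the model output above; the dict layout is illustrative.
line_items = [
    {"item": "A233", "qty": 1, "unit_price": Decimal("200.00"), "total": Decimal("200.00")},
    {"item": "B124", "qty": 2, "unit_price": Decimal("100.00"), "total": Decimal("200.00")},
]
tax_rate = Decimal("0.10")
taxes = Decimal("40")
shipping = Decimal("0")
total_due = Decimal("440.00")


def check_order(items, tax_rate, taxes, shipping, total_due):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    for row in items:
        # Each line total should equal quantity times unit price.
        if row["qty"] * row["unit_price"] != row["total"]:
            problems.append(f"line total mismatch for {row['item']}")
    subtotal = sum((row["total"] for row in items), Decimal("0"))
    if subtotal * tax_rate != taxes:
        problems.append("tax amount does not match tax rate")
    if subtotal + taxes + shipping != total_due:
        problems.append("total due does not match subtotal + taxes + shipping")
    return problems


problems = check_order(line_items, tax_rate, taxes, shipping, total_due)
print(problems or "extracted values are internally consistent")
```

A check like this only confirms that the extracted values are consistent with each other; it cannot tell whether the model misread a value that still happens to add up.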

Prompt #6. Math Understanding¶

In this prompt, let's examine Gemini's math understanding capabilities by uploading a screenshot of a math problem and asking Gemini to solve it.

In [34]:

image_path = f"{image_path_prefix}/math_1.png"
image_content = Part.from_uri(uri=image_path, mime_type="image/png")
display_image(image_path)

In [36]:

prompt_6 = "Solve the mathematical problem"

contents = [image_content, prompt_6]
generate(gemini, contents, as_markdown=True)

To find the solutions of the equation $x^2 + 7x + 12 = 0$, we need to factor the quadratic expression. We are looking for two numbers that multiply to 12 and add up to 7. These numbers are 3 and 4. So, we can factor the quadratic as $(x + 3)(x + 4) = 0$.

Now, we set each factor equal to zero and solve for x:

$x + 3 = 0 \Rightarrow x = -3$

$x + 4 = 0 \Rightarrow x = -4$

The solutions are $x = -3$ and $x = -4$.

Comparing our solutions to the given options, we see that option A matches our solutions.

Final Answer: The final answer is $\boxed{x = -3; x = -4}$
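The model's answer can be checked independently of the image. The sketch below is an illustration added here, not part of the original notebook: it solves $x^2 + 7x + 12 = 0$ with the quadratic formula and substitutes the roots back into the equation. `solve_quadratic` is a hypothetical helper name.

```python
import math


def solve_quadratic(a, b, c):
    """Return the distinct real roots of a*x^2 + b*x + c = 0 in ascending order."""
    discriminant = b * b - 4 * a * c
    if discriminant < 0:
        return []  # no real roots
    root = math.sqrt(discriminant)
    # A set collapses the repeated root when the discriminant is zero.
    return sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)})


roots = solve_quadratic(1, 7, 12)
print(roots)  # [-4.0, -3.0]

# Substituting each root back should give exactly zero for these integer-valued roots.
residuals = [x * x + 7 * x + 12 for x in roots]
print(residuals)  # [0.0, 0.0]
```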

Let's now switch to a different problem and update the prompt with better instructions.

In [37]:

image_path = f"{image_path_prefix}/math_2.png"
image_content = Part.from_uri(uri=image_path, mime_type="image/png")
display_image(image_path)

In [38]:

prompt_6 = """Please provide a detailed, step-by-step solution, clearly 
outlining the reasoning behind each step. Show all intermediate results and 
calculations, ensuring a comprehensive and easy-to-follow explanation.

If the equation involves any specific mathematical concepts or techniques, 
please identify and explain them as part of the solution.

If there are multiple solutions or special cases, please address them comprehensively.

Finally, present the final answer or answers in a clear and concise manner. """

contents = [image_content, prompt_6]
generate(gemini, contents, as_markdown=True)

Here's a step-by-step solution to the equation π^(x+1) = e:

1. Take the natural logarithm of both sides:

To solve for x, we need to get rid of the exponent. Taking the natural logarithm (ln) of both sides of the equation will help us do this. The natural logarithm is the logarithm to the base e.

ln(π^(x+1)) = ln(e)

2. Apply the power rule of logarithms:

The power rule of logarithms states that ln(a^b) = b * ln(a). Applying this rule to the left side of the equation, we get:

(x + 1) * ln(π) = ln(e)

3. Simplify ln(e):

The natural logarithm of e is always 1, because e raised to the power of 1 is e.

(x + 1) * ln(π) = 1

4. Isolate (x + 1):

Divide both sides of the equation by ln(π):

x + 1 = 1 / ln(π)

5. Solve for x:

Subtract 1 from both sides of the equation:

x = (1 / ln(π)) - 1

6. Simplify (optional):

We can combine the terms on the right side into a single fraction:

x = (1 - ln(π)) / ln(π)

Final Answer:

The solution to the equation π^(x+1) = e is:

x = (1 / ln(π)) - 1 or x = (1 - ln(π)) / ln(π)

Here we ask Gemini to reason step by step and to output its intermediate steps as well. This lets us check the derivation and be more confident in the final answer. Asking the model to return its reasoning and intermediate steps also helps the LLM arrive at a better answer.
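Because the intermediate steps are written out, the final expression can also be checked numerically. This is a small sketch added for illustration, using only Python's standard `math` module: it plugs $x = 1/\ln(\pi) - 1$ back into $\pi^{x+1}$ and confirms that the two forms of the answer agree.

```python
import math

ln_pi = math.log(math.pi)

# Final answer from the model output above.
x = 1 / ln_pi - 1

# Equivalent single-fraction form from step 6.
x_alt = (1 - ln_pi) / ln_pi

# Substitute back into the left-hand side of pi^(x+1) = e.
lhs = math.pi ** (x + 1)

print(f"x = {x}")
print(f"pi^(x+1) = {lhs}, e = {math.e}")

# Floating-point arithmetic is inexact, so compare with a tolerance.
solution_ok = math.isclose(lhs, math.e, rel_tol=1e-9)
forms_agree = math.isclose(x, x_alt, rel_tol=1e-9)
print(solution_ok, forms_agree)
```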

Conclusion¶

This notebook demonstrated various examples of working with Gemini using images. The following general prompting strategies can help achieve better performance from Gemini on multimodal prompts:

Craft clear and concise instructions.
Add your image first for single-image prompts.
Add few-shot examples to the prompt to show the model how you want the task done and the expected output.
Break down the task step-by-step.
Specify the output format.
Ask Gemini to include reasoning in its response along with its decisions or scores.
Use context caching for repeated queries.

Specifically, when working with images, the following may help:

Enumerate the images when the prompt has multiple images.
Use a single image for optimal text detection.
Detect objects in images with bounding boxes.
Guide the model's attention by adding hints.
Ask for a detailed analysis to optimize the output.
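As an illustration of the first tip, enumerating images can be as simple as interleaving text labels with the image parts before the instruction. The sketch below is an assumption about how one might do this, not code from the notebook: `enumerate_images` is a hypothetical helper, and plain strings stand in for the `Part` objects that `Part.from_uri` would return.

```python
def enumerate_images(images, instruction):
    """Interleave 'Image N:' labels with image parts, then append the instruction."""
    contents = []
    for index, image in enumerate(images, start=1):
        contents.append(f"Image {index}:")
        contents.append(image)
    contents.append(instruction)
    return contents


# Placeholder strings stand in for real image parts so the sketch runs on its own.
images = ["<part for first image>", "<part for second image>"]
contents = enumerate_images(images, "Compare Image 1 and Image 2.")
print(contents)
```

In the notebook's setting, the resulting list would be passed as `contents` to the `generate` helper in the same way as the single-image prompts above, so the instruction can refer to each image by its label.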