Prompt with visual data

Gemma release 3 and later can be used to understand and process information from both images and text. This capability enables it to perform complex tasks that require a comprehensive understanding of the world.

Specifically, this section explores how you can use visual data for prompts. Using Gemma to interpret and respond to images, videos, and other visual inputs, you can unlock powerful new applications, including:

  • Image Interpretation: Gemma can be instructed to analyze and understand the content of images.
  • Content Creation: Incorporating visual data into prompts allows Gemma to produce more creative and contextually appropriate content.

Visual data

Visual data can come in many formats and levels of resolution. The actual visual formats you can use with Gemma, such as JPEG and PNG formats, are determined by the framework you choose to convert visual data into tensors. Here are some specific considerations for preparing visual data for processing with Gemma:

  • Token cost: Each image typically uses 256 tokens. PaliGemma image token costs vary depending on the model you select.
  • Resolution: The interpreted resolution for images, meaning the number of pixels encoded into tokens and interpreted by the model, depends on the Gemma version you are using:
    • Gemma 3: (4B and higher) 896x896 resolution, with pan and scan options for larger images.
    • Gemma 3n: 256x256, 512x512, or 768x768 resolution
    • PaliGemma 2: 224x224, 448x448, or 896x896 resolution

Lower-resolution images are typically processed faster at the cost of having fewer interpretable visual details. If you want to optimize processing speed for image data, you should aim to provide visual data at one of the interpreted resolution sizes of the Gemma model you are using.

Do's

Here are some best practices to follow when prompting Gemma with visual data.

  • Be specific: If you have any specific tasks, provide sufficient context and guidance. Instead of "describe this image", try "describe the scene in this image, focusing on the relationship between the people and the objects."

  • Provide constraints: To achieve a particular style or tone, be sure to specify it in your prompt. For example, instead of a general story request, ask Gemma to "Write a short story about this image in the style of a film noir."

  • Iterative Refinement: Getting the intended output often requires experimentation and refining the prompts. Begin with a basic prompt and gradually add complexity.

Don'ts

Here are some things to avoid when prompting Gemma with visual data.

  • Expect Pixel-Perfect Precision from Gemma: Tasks requiring precise pixel-level analysis, such as detailed object detection and OCR, are best handled by dedicated computer vision models. Gemma, for example, cannot accurately count individual blades of grass in an image, only provide an approximation.

  • Vague or Ambiguous Prompts: Instead of general prompts like "Generate something based on this image", provide specific instructions to achieve intended outputs. Clearly define what "something" is. For example, a poem, recipe, or code snippet.

  • Ignore Model Limitations: Understanding Gemma's limitations is vital for effective use. Asking it to "Analyze this X-ray image and tell me the patient's exact medical condition" is a clear example of misuse, potentially leading to harmful medical misinformation.