Gemini models are able to process images and videos, enabling many frontier developer use cases that would have historically required domain specific models. Some of Gemini's vision capabilities include the ability to:
- Caption and answer questions about images
- Transcribe and reason over PDFs, including up to 2 million tokens
- Describe, segment, and extract information from videos up to 90 minutes long
- Detect objects in an image and return bounding box coordinates for them
Gemini was built to be multimodal from the ground up and we continue to push the frontier of what is possible.
What's next
This guide shows how to upload image and video files using the File API and then generate text outputs from image and video inputs. To learn more, see the following resources:
- File prompting strategies: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
- System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.
- Safety guidance: Sometimes generative AI models produce unexpected outputs, such as outputs that are inaccurate, biased, or offensive. Post-processing and human evaluation are essential to limit the risk of harm from such outputs.