Gemini 模型可處理圖片,因此開發人員可利用許多前衛的開發案例,而這些案例在過去需要專屬領域的模型。Gemini 的視覺功能包括:
- 為圖片加上說明文字,並回答圖片相關問題
- 轉錄及分析 PDF,包括最多 200 萬個符記
- 偵測圖片中的物件,並傳回物件的定界框座標
- 區分圖片中的物件
Gemini 從一開始就以多模態為設計宗旨,我們會持續突破 AI 技術的極限。本指南說明如何使用 Gemini API,根據圖片輸入內容產生文字回覆,並執行常見的圖像理解工作。
事前準備
呼叫 Gemini API 前,請確認您已安裝所選 SDK,並設定 Gemini API 金鑰,以便使用。
圖片輸入
您可以透過下列方式,將圖片做為 Gemini 的輸入內容:
- 請先使用 File API 上傳圖片檔案,再向
generateContent
提出要求。請在檔案大小超過 20 MB 或您想在多個要求中重複使用檔案時,使用這個方法。 - 透過要求傳遞內嵌圖片資料至
generateContent
。請針對較小的檔案 (總要求大小小於 20 MB) 或直接從網址擷取的圖片使用此方法。
上傳圖片檔案
您可以使用 Files API 上傳圖片檔案。如果要求總大小 (包括檔案、文字提示、系統指示等) 超過 20 MB,或是您打算在多個提示中使用相同圖片,請務必使用 Files API。
以下程式碼會上傳圖片檔案,然後在對 generateContent
的呼叫中使用該檔案。
Python
from google import genai
client = genai.Client(api_key="GOOGLE_API_KEY")
myfile = client.files.upload(file="path/to/sample.jpg")
response = client.models.generate_content(
model="gemini-2.0-flash",
contents=[myfile, "Caption this image."])
print(response.text)
JavaScript
import {
GoogleGenAI,
createUserContent,
createPartFromUri,
} from "@google/genai";
const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });
async function main() {
const myfile = await ai.files.upload({
file: "path/to/sample.jpg",
config: { mimeType: "image/jpeg" },
});
const response = await ai.models.generateContent({
model: "gemini-2.0-flash",
contents: createUserContent([
createPartFromUri(myfile.uri, myfile.mimeType),
"Caption this image.",
]),
});
console.log(response.text);
}
await main();
Go
file, err := client.UploadFileFromPath(ctx, "path/to/sample.jpg", nil)
if err != nil {
log.Fatal(err)
}
defer client.DeleteFile(ctx, file.Name)
model := client.GenerativeModel("gemini-2.0-flash")
resp, err := model.GenerateContent(ctx,
genai.FileData{URI: file.URI},
genai.Text("Caption this image."))
if err != nil {
log.Fatal(err)
}
printResponse(resp)
REST
IMAGE_PATH="path/to/sample.jpg"
MIME_TYPE=$(file -b --mime-type "${IMAGE_PATH}")
NUM_BYTES=$(wc -c < "${IMAGE_PATH}")
DISPLAY_NAME=IMAGE
tmp_header_file=upload-header.tmp
# Initial resumable request defining metadata.
# The upload url is in the response headers dump them to a file.
curl "https://generativelanguage.googleapis.com/upload/v1beta/files?key=${GOOGLE_API_KEY}" \
-D upload-header.tmp \
-H "X-Goog-Upload-Protocol: resumable" \
-H "X-Goog-Upload-Command: start" \
-H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPE}" \
-H "Content-Type: application/json" \
-d "{'file': {'display_name': '${DISPLAY_NAME}'}}" 2> /dev/null
upload_url=$(grep -i "x-goog-upload-url: " "${tmp_header_file}" | cut -d" " -f2 | tr -d "\r")
rm "${tmp_header_file}"
# Upload the actual bytes.
curl "${upload_url}" \
-H "Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Offset: 0" \
-H "X-Goog-Upload-Command: upload, finalize" \
--data-binary "@${IMAGE_PATH}" 2> /dev/null > file_info.json
file_uri=$(jq ".file.uri" file_info.json)
echo file_uri=$file_uri
# Now generate content using that file
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{"file_data":{"mime_type": "${MIME_TYPE}", "file_uri": '$file_uri'}},
{"text": "Caption this image."}]
}]
}' 2> /dev/null > response.json
cat response.json
echo
jq ".candidates[].content.parts[].text" response.json
如要進一步瞭解如何使用媒體檔案,請參閱 Files API。
內嵌傳遞圖片資料
您可以改為在要求中傳遞內嵌圖片資料,而非上傳圖片檔案。generateContent
這類做法適合用於較小的圖片 (總要求大小小於 20 MB),或直接從網址擷取的圖片。
您可以以 Base64 編碼字串的形式提供圖片資料,也可以直接讀取本機檔案 (視 SDK 而定)。
本機圖片檔案:
Python
from google.genai import types
with open('path/to/small-sample.jpg', 'rb') as f:
img_bytes = f.read()
response = client.models.generate_content(
model='gemini-2.0-flash',
contents=[
types.Part.from_bytes(
data=img_bytes,
mime_type='image/jpeg',
),
'Caption this image.'
]
)
print(response.text)
JavaScript
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });
const base64ImageFile = fs.readFileSync("path/to/small-sample.jpg", {
encoding: "base64",
});
const contents = [
{
inlineData: {
mimeType: "image/jpeg",
data: base64ImageFile,
},
},
{ text: "Caption this image." },
];
const response = await ai.models.generateContent({
model: "gemini-2.0-flash",
contents: contents,
});
console.log(response.text);
Go
model := client.GenerativeModel("gemini-2.0-flash")
bytes, err := os.ReadFile("path/to/small-sample.jpg")
if err != nil {
log.Fatal(err)
}
prompt := []genai.Part{
genai.Blob{MIMEType: "image/jpeg", Data: bytes},
genai.Text("Caption this image."),
}
resp, err := model.GenerateContent(ctx, prompt...)
if err != nil {
log.Fatal(err)
}
for _, c := range resp.Candidates {
if c.Content != nil {
fmt.Println(*c.Content)
}
}
REST
IMG_PATH=/path/to/your/image1.jpg
if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
B64FLAGS="--input"
else
B64FLAGS="-w0"
fi
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{
"inline_data": {
"mime_type":"image/jpeg",
"data": "'\$(base64 \$B64FLAGS \$IMG_PATH)'"
}
},
{"text": "Caption this image."},
]
}]
}' 2> /dev/null
圖片網址:
Python
from google import genai
from google.genai import types
import requests
image_path = "https://goo.gle/instrument-img"
image = requests.get(image_path)
client = genai.Client(api_key="GOOGLE_API_KEY")
response = client.models.generate_content(
model="gemini-2.0-flash-exp",
contents=["What is this image?",
types.Part.from_bytes(data=image.content, mime_type="image/jpeg")])
print(response.text)
JavaScript
import { GoogleGenAI } from "@google/genai";
async function main() {
const ai = new GoogleGenAI({ apiKey: process.env.GOOGLE_API_KEY });
const imageUrl = "https://goo.gle/instrument-img";
const response = await fetch(imageUrl);
const imageArrayBuffer = await response.arrayBuffer();
const base64ImageData = Buffer.from(imageArrayBuffer).toString('base64');
const result = await ai.models.generateContent({
model: "gemini-2.0-flash",
contents: [
{
inlineData: {
mimeType: 'image/jpeg',
data: base64ImageData,
},
},
{ text: "Caption this image." }
],
});
console.log(result.text);
}
main();
Go
func main() {
ctx := context.Background()
client, err := genai.NewClient(ctx, option.WithAPIKey(os.Getenv("GOOGLE_API_KEY")))
if err != nil {
log.Fatal(err)
}
defer client.Close()
model := client.GenerativeModel("gemini-2.0-flash")
// Download the image.
imageResp, err := http.Get("https://goo.gle/instrument-img")
if err != nil {
panic(err)
}
defer imageResp.Body.Close()
imageBytes, err := io.ReadAll(imageResp.Body)
if err != nil {
panic(err)
}
// Create the request.
req := []genai.Part{
genai.ImageData("jpeg", imageBytes),
genai.Text("Caption this image."),
}
// Generate content.
resp, err := model.GenerateContent(ctx, req...)
if err != nil {
panic(err)
}
// Handle the response of generated text.
for _, c := range resp.Candidates {
if c.Content != nil {
fmt.Println(*c.Content)
}
}
}
REST
IMG_URL="https://goo.gle/instrument-img"
MIME_TYPE=$(curl -sIL "$IMG_URL" | grep -i '^content-type:' | awk -F ': ' '{print $2}' | sed 's/\r$//' | head -n 1)
if [[ -z "$MIME_TYPE" || ! "$MIME_TYPE" == image/* ]]; then
MIME_TYPE="image/jpeg"
fi
if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
B64FLAGS="--input"
else
B64FLAGS="-w0"
fi
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{
"inline_data": {
"mime_type":"'"$MIME_TYPE"'",
"data": "'$(curl -sL "$IMG_URL" | base64 $B64FLAGS)'"
}
},
{"text": "Caption this image."}
]
}]
}' 2> /dev/null
關於內嵌圖片資料,請注意以下幾點:
- 總要求大小上限為 20 MB,其中包括文字提示、系統操作說明和所有內嵌檔案。如果檔案大小會導致總要求大小超過 20 MB,請使用 Files API 上傳圖片檔案,以便在要求中使用。
- 如果您要多次使用圖片樣本,建議使用 File API 上傳圖片檔案,這樣會更有效率。
使用多張圖片提示
您可以在 contents
陣列中加入多個圖片 Part
物件,在單一提示中提供多張圖片。這些資料可以是內嵌資料 (本機檔案或網址) 和 File API 參照的組合。
Python
from google import genai
from google.genai import types
client = genai.Client(api_key="GOOGLE_API_KEY")
# Upload the first image
image1_path = "path/to/image1.jpg"
uploaded_file = client.files.upload(file=image1_path)
# Prepare the second image as inline data
image2_path = "path/to/image2.png"
with open(image2_path, 'rb') as f:
img2_bytes = f.read()
# Create the prompt with text and multiple images
response = client.models.generate_content(
model="gemini-2.0-flash",
contents=[
"What is different between these two images?",
uploaded_file, # Use the uploaded file reference
types.Part.from_bytes(
data=img2_bytes,
mime_type='image/png'
)
]
)
print(response.text)
JavaScript
import {
GoogleGenAI,
createUserContent,
createPartFromUri,
} from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });
async function main() {
// Upload the first image
const image1_path = "path/to/image1.jpg";
const uploadedFile = await ai.files.upload({
file: image1_path,
config: { mimeType: "image/jpeg" },
});
// Prepare the second image as inline data
const image2_path = "path/to/image2.png";
const base64Image2File = fs.readFileSync(image2_path, {
encoding: "base64",
});
// Create the prompt with text and multiple images
const response = await ai.models.generateContent({
model: "gemini-2.0-flash",
contents: createUserContent([
"What is different between these two images?",
createPartFromUri(uploadedFile.uri, uploadedFile.mimeType),
{
inlineData: {
mimeType: "image/png",
data: base64Image2File,
},
},
]),
});
console.log(response.text);
}
await main();
Go
+ // Upload the first image
image1Path := "path/to/image1.jpg"
uploadedFile, err := client.UploadFileFromPath(ctx, image1Path, nil)
if err != nil {
log.Fatal(err)
}
defer client.DeleteFile(ctx, uploadedFile.Name)
// Prepare the second image as inline data
image2Path := "path/to/image2.png"
img2Bytes, err := os.ReadFile(image2Path)
if err != nil {
log.Fatal(err)
}
// Create the prompt with text and multiple images
model := client.GenerativeModel("gemini-2.0-flash")
prompt := []genai.Part{
genai.Text("What is different between these two images?"),
genai.FileData{URI: uploadedFile.URI},
genai.Blob{MIMEType: "image/png", Data: img2Bytes},
}
resp, err := model.GenerateContent(ctx, prompt...)
if err != nil {
log.Fatal(err)
}
printResponse(resp)
REST
# Upload the first image
IMAGE1_PATH="path/to/image1.jpg"
MIME1_TYPE=$(file -b --mime-type "${IMAGE1_PATH}")
NUM1_BYTES=$(wc -c < "${IMAGE1_PATH}")
DISPLAY_NAME1=IMAGE1
tmp_header_file1=upload-header1.tmp
curl "https://generativelanguage.googleapis.com/upload/v1beta/files?key=${GOOGLE_API_KEY}" \
-D upload-header1.tmp \
-H "X-Goog-Upload-Protocol: resumable" \
-H "X-Goog-Upload-Command: start" \
-H "X-Goog-Upload-Header-Content-Length: ${NUM1_BYTES}" \
-H "X-Goog-Upload-Header-Content-Type: ${MIME1_TYPE}" \
-H "Content-Type: application/json" \
-d "{'file': {'display_name': '${DISPLAY_NAME1}'}}" 2> /dev/null
upload_url1=$(grep -i "x-goog-upload-url: " "${tmp_header_file1}" | cut -d" " -f2 | tr -d "\r")
rm "${tmp_header_file1}"
curl "${upload_url1}" \
-H "Content-Length: ${NUM1_BYTES}" \
-H "X-Goog-Upload-Offset: 0" \
-H "X-Goog-Upload-Command: upload, finalize" \
--data-binary "@${IMAGE1_PATH}" 2> /dev/null > file_info1.json
file1_uri=$(jq ".file.uri" file_info1.json)
echo file1_uri=$file1_uri
# Prepare the second image (inline)
IMAGE2_PATH="path/to/image2.png"
MIME2_TYPE=$(file -b --mime-type "${IMAGE2_PATH}")
if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
B64FLAGS="--input"
else
B64FLAGS="-w0"
fi
IMAGE2_BASE64=$(base64 $B64FLAGS $IMAGE2_PATH)
# Now generate content using both images
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{"text": "What is different between these two images?"},
{"file_data":{"mime_type": "'"${MIME1_TYPE}"'", "file_uri": '$file1_uri'}},
{
"inline_data": {
"mime_type":"'"${MIME2_TYPE}"'",
"data": "'"$IMAGE2_BASE64"'"
}
}
]
}]
}' 2> /dev/null > response.json
cat response.json
echo
jq ".candidates[].content.parts[].text" response.json
取得物件的邊界框
Gemini 模型經過訓練,可識別圖片中的物件,並提供定界框座標。系統會根據圖片尺寸傳回座標,並將其縮放至 [0, 1000]。您必須根據原始圖片大小縮減這些座標。
Python
prompt = "Detect the all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."
JavaScript
const prompt = "Detect the all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000.";
Go
prompt := []genai.Part{
genai.FileData{URI: sampleImage.URI},
genai.Text("Detect the all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."),
}
REST
PROMPT="Detect the all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."
您可以使用定界框偵測圖片和影片中的物件,並進行本地化。透過準確辨識物件並使用定界框來劃分物件,您就能發揮多種應用程式,並提升專案的智慧程度。
主要優點
- 簡單:無論您是否具備電腦視覺專業知識,都能輕鬆將物件偵測功能整合至應用程式。
- 可自訂:根據自訂指示 (例如「我想查看這張圖片中所有綠色物件的邊界框」) 產生邊界框,無須訓練自訂模型。
技術詳細資料
- 輸入內容:提示內容和相關圖片或影片影格。
- 輸出:定界框,格式為
[y_min, x_min, y_max, x_max]
。左上角是原點。x
和y
軸分別為水平和垂直。每張圖片的座標值都會正規化為 0 到 1000。 - 視覺化:AI Studio 使用者會在 UI 中看到邊界框。
Python 開發人員可試試 2D 空間理解筆記本或實驗性 3D 指標筆記本。
將座標標準化
模型會以 [y_min, x_min, y_max, x_max]
格式傳回定界框座標。如要將這些標準化座標轉換為原始圖片的像素座標,請按照下列步驟操作:
- 將每個輸出座標除以 1000。
- 將 x 座標乘以原始圖片寬度。
- 將 y 座標乘以原始圖片高度。
如要進一步瞭解如何產生邊界框座標並在圖片上顯示,請參閱 物件偵測食譜範例。
圖片區隔
從 Gemini 2.5 模型開始,Gemini 模型的訓練目標不僅是偵測項目,還要分割項目並提供輪廓遮罩。
模型會預測 JSON 清單,其中每個項目代表一個區隔遮罩。每個項目都有一個定界框 ("box_2d
"),格式為 [y0, x0, y1, x1]
,其規範化座標介於 0 和 1000 之間,標籤 ("label
") 可識別物件,最後是定界框內的區隔遮罩,以 base64 編碼的 png 為基礎,這是值介於 0 和 255 之間的機率圖。遮罩的大小必須與邊界框尺寸相符,然後以可信度門檻 (中點為 127) 進行二值化。
Python
prompt = """
Give the segmentation masks for the wooden and glass items.
Output a JSON list of segmentation masks where each entry contains the 2D
bounding box in the key "box_2d", the segmentation mask in key "mask", and
the text label in the key "label". Use descriptive labels.
"""
JavaScript
const prompt = `
Give the segmentation masks for the wooden and glass items.
Output a JSON list of segmentation masks where each entry contains the 2D
bounding box in the key "box_2d", the segmentation mask in key "mask", and
the text label in the key "label". Use descriptive labels.
`;
Go
prompt := []genai.Part{
genai.FileData{URI: sampleImage.URI},
genai.Text(`
Give the segmentation masks for the wooden and glass items.
Output a JSON list of segmentation masks where each entry contains the 2D
bounding box in the key "box_2d", the segmentation mask in key "mask", and
the text label in the key "label". Use descriptive labels.
`),
}
REST
PROMPT='''
Give the segmentation masks for the wooden and glass items.
Output a JSON list of segmentation masks where each entry contains the 2D
bounding box in the key "box_2d", the segmentation mask in key "mask", and
the text label in the key "label". Use descriptive labels.
'''

如需更詳細的範例,請參閱食譜指南中的區隔範例。
支援的圖片格式
Gemini 支援下列圖片格式的 MIME 類型:
- PNG -
image/png
- JPEG -
image/jpeg
- WEBP -
image/webp
- HEIC -
image/heic
- HEIF -
image/heif
圖片的技術細節
- 檔案限制:Gemini 2.5 Pro、2.0 Flash、1.5 Pro 和 1.5 Flash 每個要求最多支援 3,600 個圖片檔案。
- 符記計算:
- Gemini 1.5 Flash 和 Gemini 1.5 Pro:如果兩個尺寸均小於 384 像素,則為 258 個符記。較大的圖片會以平鋪方式顯示 (最小圖塊 256 像素,最大 768 像素,並調整為 768x768 像素),每個圖塊的符記費用為 258 個。
- Gemini 2.0 Flash:如果兩個尺寸均小於 384 像素,則為 258 個符記。較大的圖片會分割成 768x768 像素的圖塊,每個圖塊的符記費用為 258 個。
- 最佳做法:
- 確保圖片旋轉正確。
- 使用清晰且不模糊的圖片。
- 使用單張含文字圖片時,請將文字提示放在
contents
陣列中的圖片部分後方。
後續步驟
本指南說明如何上傳圖片檔案,並從圖片輸入內容產生文字輸出內容。如要進一步瞭解相關內容,請參閱下列資源: