Run Gemma with Kubernetes Engine

Google Kubernetes Engine (GKE) provides a range of deployment options for running Gemma models with high performance and low latency using your preferred development framework. The following guides cover serving on GPUs with Hugging Face TGI, vLLM, and TensorRT-LLM, serving on TPUs with JetStream, plus data analysis and fine-tuning:

Deploy and serve

Serve Gemma on GPUs with Hugging Face TGI: Deploy Gemma models on GKE using GPUs and the Hugging Face Text Generation Inference (TGI) framework.
Serve Gemma on GPUs with vLLM: Deploy Gemma with vLLM for convenient model load management and high throughput.
Serve Gemma on GPUs with TensorRT-LLM: Deploy Gemma with NVIDIA TensorRT-LLM to maximize model operation efficiency.
Serve Gemma on TPUs with JetStream: Deploy Gemma with JetStream on TPUs for high performance and low latency.
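
Once a serving deployment is running, clients usually reach the model server over HTTP. The following is a minimal sketch of such a client, not taken from the guides themselves. It assumes a vLLM deployment that exposes vLLM's OpenAI-compatible completions endpoint and has been made reachable at localhost:8000 (for example through a port-forward). The base URL, the model name, and the helper function names are placeholders chosen for illustration.

```python
import json
import urllib.request

# Assumptions (not from the guides): the model server is reachable at
# localhost:8000, and the served model name is a placeholder to replace.
BASE_URL = "http://localhost:8000"
MODEL_NAME = "your-gemma-model-name"  # hypothetical placeholder


def build_completion_request(prompt, max_tokens=128, temperature=0.7):
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt):
    """Send the request and return the first generated completion text."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("Why is the sky blue?")` requires a live server. Building the request separately from sending it keeps the payload easy to inspect and to adapt if your deployment uses a different serving framework or endpoint path.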

Analyze data

Analyze data on GKE using BigQuery, Cloud Run, and Gemma: Build a data analysis pipeline with BigQuery and Gemma.

Fine-tune

Fine-tune Gemma open models using multiple GPUs: Customize the behavior of Gemma based on your own dataset.