Audio and spoken language are rich sources of data for capturing user intents, recording information about the world around us, and understanding specific problems to be solved. Starting with Gemma 3n, you can use audio data in your prompting and generation tasks with Gemma. You can use it for a variety of audio analysis and interpretation tasks, and the model is trained to handle the following speech processing tasks in over 100 spoken languages:
- Speech to text (STT): Also known as automated speech recognition (ASR), this task takes spoken language audio data and transcribes it to text output in the same language.
- Automated speech translation (AST): Also known as speech to text translation (S2TT), this task takes spoken audio data in one language and translates it to text in another language.
You can use these features in a variety of applications, such as:
- Building voice-controlled application interfaces
- Creating transcription services for meetings or lectures
- Enabling voice search functionality in multilingual environments
This guide provides an overview of audio processing capabilities of Gemma 3n, including data considerations, example uses, and best practices.
Audio data
Digital audio data can come in many formats and levels of resolution. The actual audio formats you can use with Gemma, such as MP3 and WAV formats, are determined by the framework you choose to convert sound data into tensors. Here are some specific considerations for preparing audio data for processing with Gemma:
- Token cost: Each second of audio is 6.25 tokens.
- Audio channels: Audio data is processed as a single audio channel. If you are using multi-channel audio, such as left and right channels, consider reducing the data to a single channel by removing channels or combining the sound data into a single channel.
- Clip length: Audio clips of up to 30 seconds are recommended, but you can process longer clips, up to the size of the model's context window minus the output tokens you request.
- Sample rate: The tokenizer processes audio at 16 kHz, using 32 millisecond frames.
- Bit depth: The audio tokenizer uses 32-bit floating point data, with each audio sample represented as a value in the range [-1, 1].
If the audio data you plan to process differs significantly from these input requirements, particularly in channel count, sample rate, or bit depth, consider resampling or trimming your audio data to match the data resolution handled by the model.
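As a quick worked example of the token cost, the following sketch estimates how many tokens an audio clip consumes and how many remain for the rest of the prompt and the output. The 6.25 tokens-per-second rate comes from the list above; the context window value is a placeholder assumption that you should replace with your model's actual limit.

# Rough token-budget estimate for audio prompts (illustrative sketch only).
AUDIO_TOKENS_PER_SECOND = 6.25  # token cost per second of audio, from the list above

def estimate_audio_tokens(clip_seconds: float) -> int:
    """Approximate number of input tokens consumed by an audio clip."""
    return round(clip_seconds * AUDIO_TOKENS_PER_SECOND)

# Placeholder assumptions: replace with your model's context window size and
# the number of output tokens you plan to request.
CONTEXT_WINDOW_TOKENS = 32_000
max_new_tokens = 64

clip_tokens = estimate_audio_tokens(30)  # a 30-second clip is roughly 188 tokens
budget_left = CONTEXT_WINDOW_TOKENS - clip_tokens - max_new_tokens
print(f"Audio tokens: {clip_tokens}, remaining budget: {budget_left}")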
Audio encoding
When encoding audio data with your own code implementation for use with Gemma 3n, you should follow the recommended conversion process. If you are working with audio files encoded in a specific format, such as MP3 or WAV encoded data, you must first decode these to samples using a library such as ffmpeg. Once the data is decoded, convert the audio into mono-channel, 16 kHz float32 waveforms in the range [-1, 1]. For example, if you are working with stereo signed 16-bit PCM integer WAV files at 44.1 kHz, follow these steps:
- Resample the audio data to 16 kHz
- Downmix from stereo to mono by averaging the 2 channels
- Convert from int16 to float32, and divide by 32768.0 to scale to the range [-1, 1]
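The following is a minimal sketch of that conversion using NumPy and SciPy, assuming the source is an uncompressed stereo 16-bit PCM WAV file at 44.1 kHz that scipy.io.wavfile can read; input.wav is a placeholder path, and for compressed formats such as MP3 you would first decode with a tool like ffmpeg. The operations are applied in the numerically convenient order of scaling, downmixing, then resampling, which yields an equivalent result to the ordering above.

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

# Placeholder path: a stereo, signed 16-bit PCM WAV file recorded at 44.1 kHz.
source_rate, samples = wavfile.read("input.wav")  # samples: int16, shape (num_frames, 2)

# Convert from int16 to float32 and divide by 32768.0 to scale to [-1, 1].
waveform = samples.astype(np.float32) / 32768.0

# Downmix from stereo to mono by averaging the two channels.
if waveform.ndim == 2:
    waveform = waveform.mean(axis=1)

# Resample from the source rate (44.1 kHz here) to the 16 kHz rate the model expects.
target_rate = 16000
waveform_16k = resample_poly(waveform, up=target_rate, down=source_rate)

The resulting waveform_16k array matches the mono, 16 kHz, float32 format described above; how you then pass it to the model depends on the framework you use, for example by writing it back out as a 16 kHz WAV file that your processing pipeline accepts.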
Speech to text
Gemma 3n is trained for multilingual speech recognition, allowing you to transcribe audio input in various languages into text. The following code example shows how to prompt the model to transcribe audio files to text using Hugging Face Transformers:
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

GEMMA_MODEL_ID = "google/gemma-3n-E4B-it"

# Load the processor and the Gemma 3n model
processor = AutoProcessor.from_pretrained(GEMMA_MODEL_ID, device_map="auto")
model = AutoModelForImageTextToText.from_pretrained(
    GEMMA_MODEL_ID, torch_dtype="auto", device_map="auto")

# Build a chat prompt that combines audio inputs with a text instruction
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "http://localhost/recording_01.wav"},
            {"type": "audio", "audio": "http://localhost/recording_02.wav"},
            {"type": "audio", "audio": "http://localhost/recording_03.wav"},
            {"type": "text", "text": "Transcribe these audio files in order"},
        ]
    }
]

# Tokenize the prompt and move the inputs to the model's device and dtype
input_ids = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True, return_dict=True,
    return_tensors="pt",
)
input_ids = input_ids.to(model.device, dtype=model.dtype)

# Generate the transcription and decode it back to text
outputs = model.generate(**input_ids, max_new_tokens=64)
text = processor.batch_decode(
    outputs,
    skip_special_tokens=False,
    clean_up_tokenization_spaces=False
)

print(text[0])
For a more complete code example, including library installation, see the audio section of the Run Gemma with Hugging Face Transformers documentation.
Automated speech translation
Gemma 3n is trained for multilingual speech translation tasks, allowing you to translate spoken audio directly into text in another language. The following code example shows how to prompt the model to translate spoken audio into text using Hugging Face Transformers:
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

GEMMA_MODEL_ID = "google/gemma-3n-E4B-it"

# Load the processor and the Gemma 3n model
processor = AutoProcessor.from_pretrained(GEMMA_MODEL_ID, device_map="auto")
model = AutoModelForImageTextToText.from_pretrained(
    GEMMA_MODEL_ID, torch_dtype="auto", device_map="auto")

# Ask the model to transcribe the audio and then translate the transcription
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://ai.google.dev/gemma/docs/audio/roses-are.wav"},
            {"type": "text", "text": "Transcribe this audio into English, and then translate it into French."},
        ]
    }
]

# Tokenize the prompt and move the inputs to the model's device and dtype
input_ids = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True, return_dict=True,
    return_tensors="pt",
)
input_ids = input_ids.to(model.device, dtype=model.dtype)

# Generate the translation and decode it back to text
outputs = model.generate(**input_ids, max_new_tokens=64)
text = processor.batch_decode(
    outputs,
    skip_special_tokens=False,
    clean_up_tokenization_spaces=False
)

print(text[0])
For a more complete code example, including library installation, see the audio section of the Run Gemma with Hugging Face Transformers documentation.