Nexa AI built its OmniAudio generative AI model for edge applications using Gemma.
Nexa AI is a company that specializes in building AI tools for the edge hardware and software market. To meet its mission of bringing AI to everyone, on any device, the company offers production-ready “tiny models,” model architecture optimization and compression, and edge inference acceleration services.
Nexa AI developers used Gemma as the foundation for one of the company’s innovative AI solutions: OmniAudio, an audio-language model. OmniAudio’s strength lies in its unique architecture that maximizes performance for edge applications. Thanks to Gemma, the model launched at a compact size with low latency, high accuracy, and enhanced privacy.
The challenge
Nexa AI wanted to build a new audio-language model to add to its inventory of AI tools. Unlike more traditional audio-language models, theirs would work entirely on-device for greater accessibility. Avoiding calls to a cloud-based model also reduced privacy concerns and latency for end users, and cut costs for the developers.
After extensive testing, Nexa AI developers found that the available commercial models were ill-suited to on-device deployment, so the team needed a smaller, more efficient model that could run on-device with best-in-class performance. That’s when the team turned to Google’s Gemma open models. Nexa AI developers had worked with Gemma before to build the company’s highly regarded Octopus v2 model, a generative large language model (LLM) also built for edge applications. With that experience in mind, they knew Gemma would be the right foundation for their OmniAudio language model.
“Gemma is a game-changer for edge AI development, offering unparalleled efficiency and accuracy to create powerful, resource-friendly models. Its scalability and ease of integration also make it ideal for experimentation and gradual implementation.”
The solution
OmniAudio is a 2.6B-parameter audio-language multimodal model that combines Gemma-2-2b, the automatic speech recognition model WhisperTurbo, and a custom projector module to unify speech recognition and LLM capabilities in one architecture. The model can summarize recordings, generate audio content, perform voice quality assurance, and more. Using Gemma 2 as its foundation enabled the Nexa AI team to meet its privacy and performance priorities, thanks to the model’s diverse on-device inference capabilities.
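To make this architecture concrete, here is a minimal PyTorch sketch of the projector-based fusion pattern described above. The module names, the two-layer MLP design, and the example shapes are illustrative assumptions rather than Nexa AI’s actual implementation; the dimensions follow Whisper’s 1280-dimensional encoder output and Gemma-2-2b’s 2304-dimensional hidden size.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps speech-encoder features into the LLM's embedding space.

    A hypothetical stand-in for OmniAudio's custom projector module.
    """
    def __init__(self, audio_dim: int = 1280, llm_dim: int = 2304):
        super().__init__()
        # Two-layer MLP projector, a common choice in multimodal models.
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, n_frames, audio_dim) from a Whisper-style encoder
        return self.proj(audio_feats)  # (batch, n_frames, llm_dim)

def fuse_modalities(audio_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    # Prepend projected audio frames to the text-token embeddings so the LLM
    # attends over speech and text as one sequence.
    return torch.cat([audio_embeds, text_embeds], dim=1)

# Example: ~1.5 s of audio (75 encoder frames) plus a 12-token text prompt.
projector = AudioProjector()
audio_embeds = projector(torch.randn(1, 75, 1280))
text_embeds = torch.randn(1, 12, 2304)
print(fuse_modalities(audio_embeds, text_embeds).shape)  # torch.Size([1, 87, 2304])
```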
“Gemma’s strong language understanding and content generation capabilities made it easy to fine-tune the model for audio-language capabilities,” said Zack Li, CTO of Nexa AI. In addition to using functional tokens to enhance function calling in OmniAudio, Nexa AI developers integrated Gemma 2 with WhisperTurbo for seamless audio-text processing. The team used Nexa SDK, Nexa AI’s own edge inference engine, to run OmniAudio inference on-device.
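Functional tokens, a technique Nexa AI introduced with its Octopus v2 work, give each callable function a dedicated special token so the model can select a function by emitting a single token rather than spelling out its name. The sketch below shows how such tokens could be registered with a Hugging Face tokenizer and model ahead of function-calling fine-tuning; the function names are hypothetical placeholders, and the <nexa_i> token format follows the Octopus v2 paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical on-device functions; OmniAudio's actual function set may differ.
FUNCTIONS = ["summarize_recording", "set_timer", "send_message"]

def add_functional_tokens(model, tokenizer, functions):
    # One dedicated special token per function: "<nexa_0>", "<nexa_1>", ...
    special = [f"<nexa_{i}>" for i in range(len(functions))]
    tokenizer.add_special_tokens({"additional_special_tokens": special})
    # Grow the embedding table so each new token gets a trainable embedding,
    # learned later during function-calling fine-tuning.
    model.resize_token_embeddings(len(tokenizer))
    return dict(zip(special, functions))

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
token_map = add_functional_tokens(model, tokenizer, FUNCTIONS)
print(token_map)  # {'<nexa_0>': 'summarize_recording', ...}
```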
According to the team, Gemma’s efficient design significantly reduces the cost per inference. Its on-device capabilities also minimize energy consumption and eliminate the need for constant cloud connectivity, providing scalable, cost-effective solutions for multimodal use cases. All of this, combined with Gemma’s compact architecture, supported Nexa AI’s development of OmniAudio, which boasts impressive inference speed with minimal latency.

The impact
With Gemma’s pretrained architecture, Nexa AI’s engineers achieved significant performance gains while maintaining efficiency, making for “smooth development,” said Zack. “The Gemma 2 model is lightweight and has attracted a large developer community, which motivates us to use Gemma as our LLM backbone,” said Alex. The team also credited Gemma’s excellent documentation, which helped them tremendously during development.
5.5-10.3x
faster performance on consumer hardware*
31k+
downloads on Hugging Face**
- *Across FP16 GGUF and Q4_K_M quantized GGUF versions
- **Number of downloads from December 1 to December 31, 2024
What’s next
According to the Nexa AI team, Gemma is instrumental in making AI accessible on devices where latency, privacy, and energy efficiency matter the most. “Gemma-based models maintain exceptional accuracy for specific in-domain tasks while being small enough for edge deployment,” said Zack. The team is excited to see more developers join the journey of creating impactful and sustainable solutions.
The Nexa AI team plans to continue refining OmniAudio to improve accuracy and reduce latency on edge devices. They also want to expand the use of their Gemma-based models in on-device AI applications such as conversational agents, multimodal processing, and function calling, transforming how users interact with their devices. Moving forward, the team plans to rely on Gemma for building enhanced multimodal and action-oriented AI models.