This is an automated archive made by the Lemmit Bot.
The original was posted on /r/homeassistant by /u/InternationalNebula7 on 2025-05-20 19:50:23+00:00.
Gemma 3n sounds like the perfect low-latency model for HA voice. I wonder whether users will be able to skip the STT step of the pipeline entirely for a seamless experience. Anyone playing with this idea?
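To make the idea concrete, here is a minimal sketch of what "skipping STT" could look like: instead of the usual STT → LLM → intent chain, the raw voice clip goes straight to a multimodal model that returns the intent in one pass. Everything model-facing here is an assumption, including the endpoint URL, the `audio` field name, and the response schema; it's a placeholder for whatever a local Gemma 3n server would actually expose, not a real API.

```python
import base64
import requests

# Hypothetical local inference endpoint for a Gemma 3n server.
GEMMA_URL = "http://localhost:8000/v1/generate"


def audio_to_intent(wav_path: str) -> str:
    """Send a voice clip plus an instruction prompt; get the intent back as text."""
    with open(wav_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    payload = {
        "prompt": (
            "You are a smart-home assistant. Listen to the attached audio "
            "and reply with only the action, e.g. 'turn_on light.kitchen'."
        ),
        "audio": audio_b64,  # assumed field name for inline audio input
    }
    resp = requests.post(GEMMA_URL, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["text"]  # assumed response schema


if __name__ == "__main__":
    print(audio_to_intent("kitchen_command.wav"))
```

The point of collapsing the pipeline like this is latency: one model call replaces the STT hop plus the LLM hop, and the model hears the actual audio instead of a lossy transcript.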
Gemma 3n can understand and process audio, text, and images, and offers significantly enhanced video understanding. Its audio capabilities enable the model to perform high-quality Automatic Speech Recognition (transcription) and Translation (speech to translated text). Additionally, the model accepts interleaved inputs across modalities, enabling understanding of complex multimodal interactions. (Public implementation coming soon)
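Since the public implementation is still "coming soon," here is only a hedged sketch of what transcription and speech translation might look like through a Hugging Face transformers-style interface. The model id, the audio message schema, and the processor behavior are all assumptions, not a confirmed API.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical/placeholder model id; substitute whatever Google publishes.
MODEL_ID = "google/gemma-3n-E2B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Interleaved audio + text in one user turn, per the multimodal chat format.
messages = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": "clip.wav"},  # assumed content type for audio
        {"type": "text", "text": "Transcribe this recording, then translate it to English."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
)
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```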
Gemma 3n leverages a Google DeepMind innovation called Per-Layer Embeddings (PLE) that delivers a significant reduction in RAM usage. While the raw parameter counts are 5B and 8B, this innovation lets you run larger models on mobile devices or live-stream from the cloud with a memory overhead comparable to 2B and 4B models, meaning the models can operate with a dynamic memory footprint of just 2GB and 3GB. Learn more in our documentation.