Home Assistant Voice should not be reduced to "local is always better" or "cloud is always smarter." The useful decision is to split the voice path into five stages: wake word, speech-to-text, intent handling, text-to-speech, and device execution. Homes that prioritize privacy and offline control should keep wake word detection, intent handling, and device execution local. Homes that need better multilingual recognition, noisy-room handling, or natural voices can selectively use cloud STT or TTS. For households with children, older users, or safety-sensitive devices, the best default is local-first control with cloud-assisted experience, not a fully outsourced voice path.
| Pipeline stage | Local-first fit | Cloud fit | Main tradeoff |
|---|---|---|---|
| Wake word | Privacy, low latency, offline availability | Consistent experience across managed devices | Local hardware and false-wake tuning |
| STT | Fixed commands, fewer languages, acceptable hardware cost | Multilingual speech, long sentences, noisy rooms | Cloud dependency and privacy boundary |
| Intent handling | Local device control and predictable commands | Broad Q&A or complex dialogue | Local intent coverage may be narrower |
| TTS | Short status feedback and offline prompts | Natural voices and multilingual announcements | Cloud latency and availability |
| Device execution | Lights, switches, covers, climate, sensors | Not recommended: core control should not depend on cloud | Confirmation, permission, and audit design |
In short, the core control path of Home Assistant Voice should be local-first, while experience-enhancing stages can use cloud services when the tradeoff is acceptable. If a voice setup cannot turn on lights, stop an automation, or close a cover when the internet is down, it is weak as a home control interface. Conversely, if a fully local setup performs poorly with real noise, accents, or multiple languages, usability should not be sacrificed just to keep the setup purely local.

1. Treat Home Assistant Voice as a pipeline
Home Assistant Assist is not a single feature. It is a pipeline from sound to action. After the user speaks, the system usually handles wake word detection, recording, STT, intent parsing, service calls, and TTS feedback. Each stage can have a different implementation. Some may run on a voice satellite, some on the Home Assistant host, and some through an external service.
This is where many voice projects get misdiagnosed. "Voice does not work well" may mean four different things:
- the microphone side has poor pickup, so wake word or recording quality is unstable
- STT mishears device names, room names, or short commands
- intent handling understands the text but cannot map it to Home Assistant entities
- TTS or device response is slow enough that users repeat the command
Before choosing local or cloud, define what you are optimizing for: privacy, offline availability, recognition accuracy, response time, or maintenance cost. The answer changes depending on the target.
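One way to make the diagnosis above concrete is to record a result per stage for a single interaction and report the earliest stage that went wrong. The following is a minimal sketch, not Home Assistant code: the stage labels, the `StageResult` type, and the 1.5 second threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Stage names are illustrative labels for the stages described above,
# not identifiers from Home Assistant.
STAGES = ("wake_word", "stt", "intent", "execution", "tts")


@dataclass
class StageResult:
    stage: str
    ok: bool
    seconds: float
    note: str = ""


def first_problem(results: list[StageResult], slow_after: float = 1.5) -> Optional[StageResult]:
    """Return the earliest stage that is missing, failed, or slower than slow_after."""
    by_stage = {r.stage: r for r in results}
    for name in STAGES:
        result = by_stage.get(name)
        if result is None:
            return StageResult(name, False, 0.0, "stage never ran")
        if not result.ok or result.seconds > slow_after:
            return result
    return None


# Example interaction: speech was transcribed, but no entity matched.
run = [
    StageResult("wake_word", True, 0.2),
    StageResult("stt", True, 0.6),
    StageResult("intent", False, 0.1, "text understood, no matching entity"),
]
problem = first_problem(run)
print(problem.stage, "-", problem.note)
```

Reporting the earliest failing stage keeps the discussion on the right component: an intent-mapping failure is not fixed by swapping the STT engine.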
2. What should run locally
The parts most directly tied to home control safety should be local-first.
Wake word detection is a strong local candidate. It is effectively an always-listening audio entry point, so keeping it local reduces unnecessary audio exposure and keeps the first interaction fast.
Intent handling and device execution should usually stay local. Turning on lights, stopping a fan, adjusting climate, or closing covers should be deterministic and low-latency. Home Assistant's strength is the local entity model, device graph, and automation system. If every control action must round-trip through a cloud account, internet outages and service limits become home-control risks.
Status feedback needs a local fallback. Even if the preferred TTS voice is cloud-based, critical status should remain available locally: "garage door opened," "automation stopped," or "bedroom AC is unavailable" should not depend entirely on an external service.
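The fallback idea can be sketched as a small wrapper that tries the preferred voice and switches to a local one on any failure. This is an illustration under assumptions: `speak_with_fallback` and the two stand-in engines are hypothetical and do not correspond to any Home Assistant API.

```python
from typing import Callable

# A synthesizer here is any callable that turns text into audio bytes.
Synth = Callable[[str], bytes]


def speak_with_fallback(text: str, preferred: Synth, fallback: Synth) -> tuple[bytes, str]:
    """Try the preferred (for example cloud) voice; on any failure use the fallback."""
    try:
        return preferred(text), "preferred"
    except Exception:  # broad on purpose: status feedback should never be blocked
        return fallback(text), "fallback"


# Stand-ins for real engines, for demonstration only.
def cloud_voice_offline(text: str) -> bytes:
    raise ConnectionError("no route to TTS service")


def local_voice(text: str) -> bytes:
    return f"[local audio] {text}".encode()


audio, source = speak_with_fallback("garage door opened", cloud_voice_offline, local_voice)
print(source)
```

In a real deployment the preferred call would also need a timeout, since a slow cloud response blocks feedback just as effectively as a failed one.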
Decision Block
If a voice command affects lighting, covers, locks, climate, safety, or automation stop actions, keep intent handling and device execution on the Home Assistant side. Cloud services can improve recognition and voice quality, but they should not be the only path for core control.
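The rule above can be written down as a simple check on the entity's domain. The sketch below relies on the fact that Home Assistant entity IDs take the form `domain.object_id`; the set of domains counted as core control is an assumption for illustration and should be adapted to the devices in the home.

```python
# Domains treated as core control in this sketch. The set is an assumption
# drawn from the rule above; adjust it to the devices in the home.
CORE_CONTROL_DOMAINS = {"light", "cover", "lock", "climate", "alarm_control_panel", "automation"}


def must_stay_local(entity_id: str) -> bool:
    """True when the entity's domain is part of core control."""
    domain = entity_id.split(".", 1)[0]
    return domain in CORE_CONTROL_DOMAINS


for entity_id in ("lock.front_door", "cover.garage_door", "weather.home"):
    print(entity_id, must_stay_local(entity_id))
```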
3. What can use cloud services
Cloud services are not the enemy of a voice architecture. They are useful for stages where local models are costly, fast-moving, or hard to tune.
STT can be cloud-assisted when the environment demands it. For short fixed commands, local STT can be predictable. For multilingual homes, longer utterances, accents, background noise, and far-field microphones, cloud STT may produce more stable recognition. The key is to document what audio is uploaded, when it is uploaded, and what happens when the cloud path fails.
TTS is often an experience layer. Local TTS is good for short offline feedback. Cloud TTS is useful for more natural voices, multilingual responses, and longer announcements. For "living room light is on," local is often enough. For natural assistant behavior or multilingual household prompts, cloud TTS may be worth the dependency.
Open-ended conversation should not share the same trust path as device control. It is fine to use cloud models for Q&A or broad dialogue. It is risky to let them bypass Home Assistant's entity permissions and confirmation rules. For device control, an LLM should produce candidate intent or explanation, while execution remains an auditable Home Assistant service call.
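A minimal sketch of that separation: the model only proposes a candidate action, and a local review step decides whether it may become a service call, recording the outcome either way. `CandidateAction`, `review`, and the two policy sets are hypothetical names for illustration; the sketch does not call Home Assistant and does not model how entities are actually exposed to Assist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateAction:
    domain: str
    service: str
    entity_id: str


# Hypothetical policy data: which entities are exposed to voice and which
# services a model-produced candidate may request.
EXPOSED_ENTITIES = {"light.living_room", "cover.bedroom_blinds"}
ALLOWED_SERVICES = {("light", "turn_on"), ("light", "turn_off"), ("cover", "close_cover")}


def review(candidate: CandidateAction, audit_log: list[str]) -> bool:
    """Approve a candidate only if it passes every local check, and log the outcome."""
    reasons = []
    if candidate.entity_id not in EXPOSED_ENTITIES:
        reasons.append("entity not exposed")
    if (candidate.domain, candidate.service) not in ALLOWED_SERVICES:
        reasons.append("service not allowed")
    if not candidate.entity_id.startswith(candidate.domain + "."):
        reasons.append("domain and entity mismatch")
    verdict = "approved" if not reasons else "rejected: " + ", ".join(reasons)
    audit_log.append(f"{candidate.domain}.{candidate.service} {candidate.entity_id} -> {verdict}")
    return not reasons


log: list[str] = []
ok = review(CandidateAction("light", "turn_on", "light.living_room"), log)
blocked = review(CandidateAction("lock", "unlock", "lock.front_door"), log)
print(log)
```

The point of the design is that the model never holds execution authority: a rejected candidate leaves an audit entry but no device state changes.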
4. Local, cloud, or hybrid?
| Architecture | Best fit | Poor fit | Recommended boundary |
|---|---|---|---|
| Fully local | Privacy-first homes, fixed commands, offline control | Complex multilingual speech or weak local hardware | Make voice a reliable control entry point |
| Fully cloud | Natural language experience and existing cloud ecosystem | Locks, safety, critical automations, unreliable networks | Keep it to non-critical control and Q&A |
| Hybrid | Most real homes and integrator projects | Teams that cannot maintain fallback behavior | Local control, cloud-enhanced recognition and speech |
Most homes and deployments are best served by a hybrid model: wake word, local intent handling, device execution, and critical status feedback remain local; STT, TTS, or natural-language extensions use cloud services only where they improve the user experience. This keeps basic control available during internet outages while still allowing better recognition and more natural feedback when cloud services are healthy.
The cost is operational discipline. For each pipeline, you need to know which STT and TTS engines it uses, whether they require internet access, what each stage falls back to, and which commands need confirmation. Without those boundaries, "hybrid" turns into "hard to debug."
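One lightweight way to keep those boundaries visible is a written inventory per pipeline that can be checked mechanically. The sketch below is an assumption about how such an inventory might look; `PipelineProfile`, its fields, and the engine labels are hypothetical and are not read from Home Assistant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineProfile:
    name: str
    stt_engine: str
    tts_engine: str
    stt_needs_internet: bool
    tts_needs_internet: bool
    stt_fallback: Optional[str] = None
    tts_fallback: Optional[str] = None


def offline_gaps(profile: PipelineProfile) -> list[str]:
    """List every internet-dependent stage that has no documented fallback."""
    gaps = []
    if profile.stt_needs_internet and profile.stt_fallback is None:
        gaps.append(f"{profile.name}: STT has no offline fallback")
    if profile.tts_needs_internet and profile.tts_fallback is None:
        gaps.append(f"{profile.name}: TTS has no offline fallback")
    return gaps


# Engine labels are placeholders, not product names.
kitchen = PipelineProfile(
    "kitchen", "cloud-stt", "cloud-tts",
    stt_needs_internet=True, tts_needs_internet=True,
    stt_fallback="local-stt",
)
print(offline_gaps(kitchen))
```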
5. Test these five things before trusting the setup
First, test end-to-end latency. Do not measure only STT or TTS in isolation. Measure the time from wake word to device action and feedback. Short home-control commands need to feel fast enough that users do not repeat themselves.
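A small sketch of the measurement, assuming each test run is logged as two timestamps in seconds. The event names `wake_detected` and `feedback_done` are hypothetical labels for this example, not Home Assistant event names.

```python
import statistics


def end_to_end_seconds(run: dict[str, float]) -> float:
    """Time from wake word detection until spoken feedback has finished."""
    return run["feedback_done"] - run["wake_detected"]


def summarize(runs: list[dict[str, float]]) -> dict[str, float]:
    """Report the typical and the worst end-to-end time across runs."""
    totals = [end_to_end_seconds(r) for r in runs]
    return {"median": statistics.median(totals), "worst": max(totals)}


runs = [
    {"wake_detected": 0.0, "feedback_done": 1.5},
    {"wake_detected": 0.0, "feedback_done": 2.0},
    {"wake_detected": 0.0, "feedback_done": 1.0},
]
summary = summarize(runs)
print(summary)
```

Tracking the worst case alongside the median matters here, because the slow runs are the ones that cause users to repeat a command.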
Second, test false wakes and missed wakes. TV audio, kitchen noise, children speaking, and distance from the microphone should all be part of the test.
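Both failure modes can be reduced to two numbers per test condition, which makes conditions such as "TV on" and "quiet room" comparable. The function and parameter names below are assumptions for illustration; the counts would come from a manual test session.

```python
def wake_rates(intended: int, detected: int, false_wakes: int, listening_hours: float) -> dict[str, float]:
    """Missed-wake rate over intended attempts, and false wakes per listening hour."""
    if intended <= 0 or listening_hours <= 0:
        raise ValueError("need at least one intended attempt and a positive listening time")
    return {
        "missed_rate": (intended - detected) / intended,
        "false_wakes_per_hour": false_wakes / listening_hours,
    }


# Example session with the TV on: 50 deliberate attempts, 45 detected,
# and 3 unintended activations over 6 hours of listening.
tv_on = wake_rates(intended=50, detected=45, false_wakes=3, listening_hours=6.0)
print(tv_on)
```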
Third, test entity and room names. Voice failures are often naming failures. If "living room lamp," "main lamp," and "light strip" are inconsistent, both local and cloud recognition will struggle.
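A quick consistency check is to collect every spoken name (friendly name plus aliases) and flag any name that points at more than one entity. This is a sketch under assumptions: the input mapping is hand-written here, and the entity IDs and names are made up for the example.

```python
from collections import defaultdict


def ambiguous_names(entities: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each spoken name shared by several entities to the entities that use it.

    `entities` maps an entity ID to its spoken names. Names are compared
    case-insensitively with whitespace collapsed.
    """
    owners = defaultdict(set)
    for entity_id, names in entities.items():
        for name in names:
            owners[" ".join(name.lower().split())].add(entity_id)
    return {name: sorted(ids) for name, ids in owners.items() if len(ids) > 1}


names = {
    "light.living_room_lamp": ["Living Room Lamp", "main lamp"],
    "light.floor_lamp": ["Main  Lamp"],
    "light.tv_strip": ["light strip"],
}
print(ambiguous_names(names))
```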
Fourth, test without internet. Disconnect the WAN path and list which commands still execute, which lose only speech feedback, and which fail completely. This is more useful than simply calling the system "local-first."
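The three outcomes named above map naturally onto a small report. In this sketch the observations are entered by hand after a WAN-disconnected test session; the function names and category labels are assumptions chosen for the example.

```python
from collections import defaultdict


def classify(action_ok: bool, feedback_ok: bool) -> str:
    """Bucket one offline observation into the three outcomes described above."""
    if action_ok and feedback_ok:
        return "works offline"
    if action_ok:
        return "executes, no speech feedback"
    return "fails"


def offline_report(results: dict[str, tuple[bool, bool]]) -> dict[str, list[str]]:
    """Group commands by outcome; each value is (action_ok, feedback_ok)."""
    report = defaultdict(list)
    for command, (action_ok, feedback_ok) in results.items():
        report[classify(action_ok, feedback_ok)].append(command)
    return dict(report)


observed = {
    "turn on kitchen light": (True, True),
    "close bedroom blinds": (True, False),
    "what's the weather": (False, False),
}
report = offline_report(observed)
print(report)
```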
Fifth, test high-risk commands. Locks, garage doors, covers, heating devices, and security actions need confirmation, permission boundaries, time windows, and audit logs.
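As a rough illustration of how confirmation, time windows, and audit logging can combine, the sketch below decides what to do with a command before any service call is made. `RiskPolicy`, `decide`, and the example policy are hypothetical; per-user permission boundaries are not modeled, and the time window is assumed not to cross midnight.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class RiskPolicy:
    needs_confirmation: bool
    allowed_from: time
    allowed_until: time  # assumed to be later than allowed_from (no midnight wrap)


def decide(entity_id: str, now: time, confirmed: bool,
           policies: dict[str, RiskPolicy], audit_log: list[str]) -> str:
    """Return 'execute', 'ask_confirmation', or 'deny', and record the decision."""
    policy = policies.get(entity_id)
    if policy is None:
        decision = "execute"  # no policy registered: not treated as high-risk
    elif not (policy.allowed_from <= now <= policy.allowed_until):
        decision = "deny"
    elif policy.needs_confirmation and not confirmed:
        decision = "ask_confirmation"
    else:
        decision = "execute"
    audit_log.append(f"{now.isoformat()} {entity_id} {decision}")
    return decision


policies = {"cover.garage_door": RiskPolicy(True, time(6, 0), time(22, 0))}
audit: list[str] = []
late = decide("cover.garage_door", time(23, 30), False, policies, audit)
first_try = decide("cover.garage_door", time(9, 0), False, policies, audit)
after_yes = decide("cover.garage_door", time(9, 0), True, policies, audit)
print(audit)
```

Checking the time window before asking for confirmation avoids prompting the user for an action that would be denied anyway.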
6. When voice should not be the only control path
Voice is convenient, but it should not be the only interface. Keep physical switches, dashboards, app controls, or automation fallbacks when:
- older users, children, or guests use the home and command habits are inconsistent
- the network or Home Assistant host is not maintained reliably
- controlled devices include locks, security, heaters, gas-related devices, garage doors, or medical-adjacent equipment
- the household uses multiple languages or accents and no one has time to tune STT
- the project will be maintained by non-technical users after delivery
Voice should be a low-friction entry point, not the only control plane. The more critical the action is, the more visible state, physical fallback, and auditability it needs.
7. Conclusion: local control, cloud enhancement
The practical rule is simple: keep the certainty of home control local, and use cloud services only to enhance recognition and expression where the tradeoff is acceptable. Fully local is best for privacy and reliable fixed commands. Fully cloud can be acceptable for non-critical Q&A and natural dialogue. Hybrid is the most realistic option for many households.
The label matters less than the boundary. Where does wake word detection run? Does STT upload audio? Who parses intent? Are device commands executed locally? Does TTS failure block status feedback? If those answers are clear, Home Assistant Voice can move from a demo to a dependable daily interface.
References
- Home Assistant Voice Control: https://www.home-assistant.io/voice_control/
- Home Assistant Assist documentation: https://www.home-assistant.io/voice_control/voice_remote_local_assistant/
- Home Assistant Assist pipeline integration: https://www.home-assistant.io/integrations/assist_pipeline/
- ESPHome Voice Assistant component: https://esphome.io/components/voice_assistant.html
- Wyoming protocol project: https://github.com/rhasspy/wyoming