Home Assistant Voice Preview Edition Review: Is the $59 Box Ready for Daily Use?

After more than a year of living with it, the Home Assistant Voice Preview Edition is ready for daily use, with caveats. It is the only $59 smart speaker on the market with zero cloud dependency, and for anyone who already runs Home Assistant it slots into existing automations with almost no friction. On the plus side you get fully local wake word detection, sub-second response on common commands, a capable far-field mic array, and a privacy story Alexa and Google cannot touch. The frustrations have been equally consistent: wake word accuracy drops in noisy rooms, the built-in speaker is too quiet for a kitchen, custom wake words require a training pipeline most users will not bother with, and anything beyond “turn the lights on” still needs either a local LLM or a cloud model piped through Assist.
What the Voice Preview Edition Actually Is
The Voice PE is a reference hardware design from Nabu Casa, the company behind Home Assistant. It was positioned from day one as the canonical voice satellite for any HA install, released in late 2024 at a $59 street price and still shipping at roughly that price in 2026 (MSRP lists $69 USD / €59 EUR depending on region).
Inside the 84x84x21 mm polycarbonate shell is a deliberately split-brain architecture. An ESP32-S3 with 16 MB of flash and 8 MB of octal PSRAM handles networking and ESPHome orchestration, while an XMOS XU316 audio DSP takes on the echo cancellation, stationary noise suppression, auto-gain, and beamforming. Keeping those two jobs separate is what lets the little ESP32 avoid drowning in audio math the way earlier DIY satellites did.

The sensing and output layout follows the same “do one thing well” logic:
- Dual MEMS microphones with XMOS-driven beamforming for far-field pickup.
- A single amplified mono speaker with a TI AIC3204 DAC running at 48 kHz. It is fine for confirmations and timers, and underpowered as a music source. That tradeoff is intentional, since the goal is a voice satellite rather than a Sonos competitor.
- A 3.5 mm stereo output jack for hooking it into better speakers when you do want music to land somewhere loud.
- A hardware mute switch that physically cuts microphone power, wired as an actual electrical disconnect rather than a software flag.
- A rotary volume dial wrapped around a programmable top button, plus a translucent case with an RGB ring underneath for visual feedback.
The firmware stack is as interesting as the silicon. The device ships with stock ESPHome and the voice_assistant component, talking to a Home Assistant instance over the Wyoming protocol. There is no proprietary firmware, no companion app, and no vendor account. The unit is flashed and configured entirely through HA, and both the ESP32 firmware and the XMOS firmware are open source.
The “Preview Edition” label is the other half of the story. Nabu Casa shipped this deliberately as a hackable reference device, not a polished consumer product. Firmware updates through 2025 and into 2026 have closed most of the launch-day gaps. The 2025.5.0 ESPHome release alone cut CPU usage from 72% to 35% during music-with-announcement scenarios, but the name still reflects the intent that the device evolve alongside the HA voice stack.
Compared to DIY alternatives, the pitch is blunt. You can build a Wyoming satellite on an M5Stack Atom Echo or a Raspberry Pi Zero for less money, and you will get considerably worse mic pickup and no XMOS DSP. The Voice PE is the first time the “just buy the reference” option has been the smart move for most people. For another approach to fully local HA voice control, the Willow firmware for the ESP32-S3 Box covers similar ground with a different software stack.
Unboxing and First-Time Setup
Setup is where most smart speakers either delight or infuriate, and this is where the Voice PE has improved the most since launch. What follows is the actual experience of getting a sealed unit to the “Okay Nabu, turn on the office lights” state on a non-trivial HA install.
The prerequisites are unglamorous but firm. You need a running Home Assistant instance (Core, Supervised, or OS), an Assist pipeline configured at least once, and a 2.4 GHz Wi-Fi network the device can reach. No HA means no Voice PE, since the device has no standalone cloud mode and that is by design. One more thing to know: the box does not include a USB-C cable or charger, a small sustainability decision that has annoyed more than a few buyers.
Out of the box, the flow is: plug in USB-C, the device boots into Bluetooth provisioning mode, open the Home Assistant mobile app, tap the “new device discovered” notification, and pair. The pairing ritual is handled through Improv over BLE rather than through a captive portal or a separate app install. On a reasonable network, plug-in to “the ring turns blue and it is ready” is 2-4 minutes, most of which is the initial firmware update.
The setup wizard then asks which Assist pipeline to use. There are three honest options:
- Fully local, using Piper for TTS, Whisper (typically faster-whisper) for STT, and either the built-in intent matcher or a local LLM through Ollama. Nothing leaves the LAN. Needs an N100-class mini-PC or better on the HA host. For a full walkthrough of assembling this stack independently of Home Assistant, see the guide on building a private local AI voice assistant.
- Home Assistant Cloud, where STT and TTS are handled by Nabu Casa’s servers and intents are resolved locally. Best latency on a Raspberry Pi, but audio does leave your network.
- An LLM conversation agent, either a cloud model such as OpenAI or Claude or a local Ollama model, with either local or cloud STT/TTS in front of it. Most flexible, and also the most variable on privacy posture.
The first “Okay Nabu” moment is deliberately satisfying: the ring animates, a confirmation chime plays, and the TTS response comes back through the onboard speaker. In practice, setup snags tend to cluster around networking. The three repeat offenders are mDNS issues on segmented VLANs, the Wyoming integration failing to auto-discover across subnets, and the awkward first experience when the default pipeline is cloud but the user wanted local. The fix for the last one is buried under Settings, then Voice Assistants: pick a different pipeline as the default. That is worth knowing before you plug in.
A short post-setup checklist pays off later: run the hardware mute test so you can see the mic sensor toggle in HA, assign a friendly name per device if you plan to run multiple satellites, and set an Area so “turn off the lights” resolves to the room the device sits in by default.
Wake Word Accuracy and the Phrases That Actually Work
Wake word and intent recognition is the biggest differentiator between a voice assistant people keep using and one they unplug after a week. After a year of daily triggers across three units (office, kitchen, living room), the picture is clearer than any launch review could have given.
The default wake word, “Okay Nabu,” runs entirely on-device through microWakeWord, a TensorFlow Lite Micro pipeline in ESPHome. There is no network round-trip for the wake stage. In a quiet room at a normal speaking volume the false-reject rate sits comfortably under 5% at three meters. With a range hood running at medium in the kitchen, that climbs toward 20-25%, which is noticeable: you learn to speak a little louder or step a meter closer. By comparison, a 5th-gen Echo Dot placed in the same kitchen had a lower miss rate under noise, because Amazon has spent a decade tuning exactly this.
Activation distance in a 4x5 meter office at normal volume hits around 4 meters reliably. That is below Echo-class, but clearly ahead of any DIY ESP32 satellite that lacks the XMOS beamformer.

Two alternative built-in wake words are available: “Hey Jarvis” and “Hey Mycroft,” both drawn from the openWakeWord project. These run server-side on the HA host rather than on-device, which adds roughly 80-120 ms of latency versus microWakeWord. Unless you have a specific reason to switch, stay on “Okay Nabu.”
Custom wake words are supported through openWakeWord, and the training pipeline is non-trivial. You need a few hundred synthesized positive samples, a few thousand negative samples, a GPU or a patient afternoon on CPU, and the stamina to iterate on thresholds. Most users will never do this. I have trained exactly one custom model in a year and it was mostly to prove the pipeline works.
Intent matching versus LLM conversation is the other split that matters. Home Assistant’s built-in intent system handles the core patterns well:
| Query pattern | Built-in intent matcher | LLM conversation agent |
|---|---|---|
| “Turn on the office lights” | Excellent | Excellent |
| “Set brightness to 40%” | Excellent | Excellent |
| “What is the temperature in the living room?” | Very good | Very good |
| “Is it cold outside?” | Misses - no intent | Works, with latency cost |
| “Turn off the lights and lock the door” | Brittle, often partial | Works |
| “Start the coffee when I’m in the kitchen” | Not supported | Works |
Anything conversational or compound needs an LLM, and the moment you wire one in the latency story changes. A local Ollama with a 7B model on a consumer GPU adds 600-1200 ms to the end-to-end round trip; a cloud model adds 400-800 ms plus whatever your network contributes.
Phrases that reliably break even with an LLM: ambiguous room names with no Area context, numbers-with-units that Whisper mis-transcribes (especially hyphenated time ranges), and anything where the STT confidence is low and the LLM confidently hallucinates an entity name. Treating entity names as short and spoken-friendly from the start saves a lot of grief.
Compared with the big two, the split looks like this: Alexa still wins on casual phrasing tolerance, Google still wins on general-knowledge questions, and the Voice PE wins on “I said a thing and a light turned on in 400 ms without anything leaving my LAN.” Those are the actual trade-offs and they matter more than any marketing parity claim.
Integration with Home Assistant Automations and Assist Pipelines
The whole reason to buy a Voice PE over something cheaper is that it plugs directly into the Home Assistant automation graph as a first-class citizen. The mechanism is the Assist pipeline: an ordered chain of wake word, speech-to-text, intent or conversation, and text-to-speech stages, each independently swappable. The Voice PE is just the audio I/O at the edge of that pipeline.
The glue underneath is the Wyoming protocol, a simple TCP-based format for shipping audio chunks and metadata between services. It is what makes the stack modular. You can swap Whisper for faster-whisper, or Piper for Kokoro TTS, without touching the satellite, or run your STT on a remote GPU box while the Voice PE stays pinned to Wi-Fi on a battery-backed wall socket.
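To make "audio chunks and metadata" concrete, here is a minimal sketch of the general framing idea: a one-line JSON header followed by raw payload bytes. The field names (`type`, `data`, `payload_length`) follow my reading of the Wyoming format and should be treated as an assumption, the helper names are hypothetical, and the real protocol carries details (such as versioning) that this simplified version leaves out.

```python
import io
import json


def encode_event(event_type: str, data: dict, payload: bytes = b"") -> bytes:
    """Frame one event: a JSON header line, then the raw payload bytes."""
    header = {"type": event_type, "data": data}
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode("utf-8") + b"\n" + payload


def decode_event(stream: io.BufferedIOBase) -> tuple[str, dict, bytes]:
    """Read one framed event back from a byte stream."""
    header = json.loads(stream.readline().decode("utf-8"))
    payload = stream.read(header.get("payload_length", 0))
    return header["type"], header.get("data", {}), payload


# 20 ms of 16 kHz, 16-bit mono silence: 320 samples * 2 bytes each.
chunk = bytes(640)
frame = encode_event(
    "audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, chunk
)
event_type, data, payload = decode_event(io.BytesIO(frame))
print(event_type, data["rate"], len(payload))
```

Because the header states how many payload bytes follow, a receiver can read events off a plain TCP stream without any other delimiter, which is part of why services on either end can be swapped independently.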
Three integration patterns are worth the upfront effort. The first is TTS-back announcements: calling tts.speak with the Voice PE media player entity as the output lets any automation talk back through the device with lines like “laundry is done,” “front door unlocked at 3:14am,” or “guest Wi-Fi password is on the fridge.” This turns the satellite into a targeted notification channel that follows presence.
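As a rough illustration, the same announcement can be triggered from outside HA through its REST API. This sketch only builds the request; the host name, both entity IDs, and the token are placeholders, not values from a real install. Note that in `tts.speak` the TTS engine entity is the target, while the speaker is passed as `media_player_entity_id`.

```python
import json
import urllib.request

HA_URL = "http://homeassistant.local:8123"   # assumption: default HA address
TTS_ENTITY = "tts.piper"                     # hypothetical TTS engine entity
SPEAKER = "media_player.kitchen_voice_pe"    # hypothetical Voice PE entity


def build_announcement(message: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the tts.speak service endpoint."""
    body = {
        "entity_id": TTS_ENTITY,
        "media_player_entity_id": SPEAKER,
        "message": message,
    }
    return urllib.request.Request(
        f"{HA_URL}/api/services/tts/speak",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_announcement("Laundry is done", token="YOUR_LONG_LIVED_TOKEN")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=10)
```

Inside Home Assistant itself the equivalent is an automation action calling `tts.speak` with the same three fields.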
The second is the programmable top button. It exposes a button entity in HA, which means it can trigger anything you want: a “good night” scene, a coffee routine, a panic toggle that flashes every light in the house. Short-press, long-press, and double-press each fire a different event.

The third is the LED ring as a status light. The ring is exposed as a light entity, and it carries non-audio notifications well. A blue pulse can signal that a porch camera detected a package, red hold can flag a door that has been unlocked past the threshold, slow amber can count down the last few minutes of a laundry cycle. The same state-driven logic powers a dashboard layout where cards appear only when something needs attention, surfacing those alerts the moment they matter.
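The state-driven idea can be sketched as a small priority table that maps whichever alerts are active to a single light action. The entity ID, alert names, priority order, and brightness are all hypothetical; pulsing or slow-fade behavior would need a light effect whose name depends on the firmware, so this version sets only a solid color.

```python
RING_ENTITY = "light.office_voice_pe_led_ring"  # hypothetical entity id

# Highest priority first; colors follow the examples in the text.
ALERT_COLORS = [
    ("door_unlocked_too_long", [255, 0, 0]),  # red
    ("package_detected", [0, 0, 255]),        # blue
    ("laundry_almost_done", [255, 191, 0]),   # amber
]


def ring_action(active_alerts: set[str]) -> tuple[str, dict]:
    """Return the (service, service_data) pair the ring should receive."""
    for alert, rgb in ALERT_COLORS:
        if alert in active_alerts:
            data = {
                "entity_id": RING_ENTITY,
                "rgb_color": rgb,
                "brightness_pct": 60,
            }
            return "light.turn_on", data
    # Nothing needs attention, so the ring goes dark.
    return "light.turn_off", {"entity_id": RING_ENTITY}


print(ring_action({"package_detected", "laundry_almost_done"}))
print(ring_action(set()))
```

Keeping the priority order in one list means a new alert type is a one-line change rather than another automation competing for the same light.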
Multi-satellite setups are where the design pays off most clearly. Running three or four Voice PEs across a house, each assigned to an Area, means “turn off the lights” resolves to the room the command was spoken in by default. A single central Echo cannot do this, and it removes a surprising amount of daily friction.
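A toy model of that Area behavior: when the spoken command names no room, fall back to the Area of the satellite that heard it. This is an illustration of the idea only, not Home Assistant's actual intent matcher, and every entity and Area name is made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    entity_id: str
    domain: str
    area: str


ENTITIES = [
    Entity("light.office_desk", "light", "office"),
    Entity("light.office_ceiling", "light", "office"),
    Entity("light.kitchen_counter", "light", "kitchen"),
    Entity("lock.front_door", "lock", "hallway"),
]


def resolve_targets(
    domain: str, satellite_area: str, named_area: Optional[str] = None
) -> list[str]:
    """Pick entities in the named Area, else in the satellite's own Area."""
    area = named_area or satellite_area
    return [
        e.entity_id
        for e in ENTITIES
        if e.domain == domain and e.area == area
    ]


# "Turn off the lights" spoken to the office satellite.
print(resolve_targets("light", satellite_area="office"))
# "Turn off the kitchen lights" spoken to the same satellite.
print(resolve_targets("light", satellite_area="office", named_area="kitchen"))
```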
The realistic latency budget in 2026 looks roughly like this for a local-only pipeline on an Intel N100 mini-PC:
| Pipeline stage | Typical latency |
|---|---|
| On-device wake detection (microWakeWord) | ~20 ms |
| Audio capture and stream to HA | 40-80 ms |
| STT with faster-whisper base model | 250-500 ms |
| Built-in intent resolution | 30-80 ms |
| Action execution | 50-200 ms (device-dependent) |
| Piper TTS response | 150-400 ms |
| End-to-end | ~600-1200 ms |
Swap in HA Cloud STT and TTS and the round-trip drops closer to 500-900 ms on a good connection. Wire in a cloud LLM as the conversation agent and it climbs to 1.5-3 seconds depending on the provider and the prompt. There is a regression worth knowing about: the HA 2026.1 release introduced a several-second STT lag for some users on Voice PE and Wyoming satellites that took a couple of point releases to fully resolve.
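The per-stage figures in the local-only table can be summed as a quick sanity check. The short stage keys are labels I made up for the table rows; the numbers are copied from the table as-is.

```python
# Per-stage (low, high) latency in milliseconds, copied from the table.
STAGES_MS = {
    "wake": (20, 20),
    "capture": (40, 80),
    "stt": (250, 500),
    "intent": (30, 80),
    "action": (50, 200),
    "tts": (150, 400),
}


def budget(stage_names) -> tuple[int, int]:
    """Sum the low and high bounds over the given stages."""
    low = sum(STAGES_MS[name][0] for name in stage_names)
    high = sum(STAGES_MS[name][1] for name in stage_names)
    return low, high


full_round_trip = budget(STAGES_MS)
until_action = budget(["wake", "capture", "stt", "intent", "action"])
print("full round trip (ms):", full_round_trip)
print("until the light changes (ms):", until_action)
```

The raw sum is 540-1280 ms, which brackets the table's rounded ~600-1200 ms. Stopping at action execution gives 390-880 ms, so the "light turned on in 400 ms" experience corresponds to the fast end of the path before TTS speaks the confirmation.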
Fallback behavior when HA is down is clean. The device goes offline, the LED ring shows a fault color, and it does not try to phone anything else. For a privacy-first product that is the correct failure mode.
Privacy Architecture: The Only Fully-Local Smart Speaker You Can Buy
Every other mainstream smart speaker sends audio to a vendor cloud at some point in the pipeline, even when the marketing suggests otherwise. The Voice PE can be configured so no audio, no transcripts, no wake events, and no telemetry ever leave the LAN.
The threat model framing matters. “Local” here means a specific thing: when configured with a local pipeline, the device never opens a connection outside your network for any part of voice handling, covering the wake stage, STT, intent resolution, and TTS alike. The only outbound traffic during normal operation is NTP and the occasional firmware check, both of which are optional and auditable.
Stage by stage:
- Wake word. microWakeWord runs on the ESP32-S3 itself. It never ships continuous audio anywhere. This is the same pattern Alexa and Google use at the wake stage, but the difference is what happens next.
- Speech-to-text. Whisper via the Wyoming Whisper add-on runs on the HA host and stays on the LAN. HA Cloud STT sends audio to Nabu Casa’s servers (not Amazon or Google, which is a meaningful improvement, but still off-LAN). OpenAI Whisper API ships audio to OpenAI. Pick one.
- Conversation agent. The moment you wire a cloud LLM as the conversation agent, “local” stops being true for any query that falls through to it. The mitigation is either a local Ollama or llama.cpp backend with a 7B-14B model, or disciplined use of the built-in intent matcher with no LLM at all.
- Text-to-speech. Piper runs locally. HA Cloud TTS ships text to Nabu Casa. Either way, the TTS stage only sees the response text, so it sends nothing off-LAN beyond what you already chose upstream.
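The stage-by-stage rule above amounts to a simple audit: a pipeline is only "local" if every stage is. A sketch of that check follows, where the engine labels are informal names I chose for this example and not Home Assistant identifiers.

```python
# Engines that run on the satellite or the HA host (informal labels).
LOCAL_ENGINES = {
    "microwakeword",
    "whisper-addon",
    "faster-whisper",
    "builtin-intents",
    "ollama",
    "llama.cpp",
    "piper",
}


def off_lan_stages(pipeline: dict[str, str]) -> list[str]:
    """Return the stages whose engine sends data outside the LAN."""
    return [
        stage
        for stage, engine in pipeline.items()
        if engine not in LOCAL_ENGINES
    ]


fully_local = {
    "wake": "microwakeword",
    "stt": "faster-whisper",
    "conversation": "builtin-intents",
    "tts": "piper",
}
mixed = dict(fully_local, conversation="cloud-llm", tts="ha-cloud")

print(off_lan_stages(fully_local))
print(off_lan_stages(mixed))
```

An empty result is the only outcome that matches the "nothing leaves the LAN" claim; a single cloud stage is enough to break it for any query that reaches that stage.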
The hardware mute is a last line of defense worth calling out. The switch disconnects the microphone circuit electrically. That is verifiable with a multimeter, which is a claim neither Amazon nor Google has ever made about a consumer device. When the mute is engaged, the LED ring turns solid red, and the mic sensor in HA flips off.

No account is required to use the device. There is no Nabu Casa login and no forced registration. A Home Assistant Cloud subscription is optional and only affects remote access and cloud STT/TTS. The Voice PE works on a fully offline HA install.
Firmware transparency is the last piece. The ESPHome YAML and the microWakeWord models are open. You can read what the device is doing, diff it against upstream between releases, and reflash your own build if you prefer. That is not a claim any other smart speaker can make in 2026. With the state of cloud telemetry, recent enforcement actions against voice assistant vendors for storing child voice data, and the practical reality that most users have given up trying to audit Alexa, that openness matters more than it did a few years ago.
Long-Term Reliability and the Ecosystem Around It
A one-week review of a $59 box is easy. A one-year review is where the honest verdict lives. Three units running since launch give a reasonable picture of what holds up.
Hardware durability after roughly 15 months of always-on operation has been uneventful. The semi-transparent polycarbonate has not yellowed noticeably on any of the three units, the speaker has not developed rattle, and the USB-C ports are still snug. None of the three has failed or required a reflash. The rotary volume dial on the kitchen unit shows the most wear under bright light, but it still turns cleanly.
Thermal behavior is equally unremarkable. Idle case temperature sits a few degrees above ambient, and sustained Wyoming streaming with music playback pushes it up another 8-10 degrees C. No throttling, no audible creaks from the enclosure. The passive cooling is fine for a voice satellite and would not be for anything heavier.
Firmware update cadence has been steady. Nabu Casa has shipped ESPHome releases for the Voice PE roughly every 4-6 weeks since launch, all delivered OTA through Home Assistant with no app involved. The 2026.2.x line landed cleanly across all three of mine. No update has bricked a unit, though the 2026.1 release did introduce the STT lag regression mentioned earlier before it was resolved in the point releases that followed.
Power draw is modest enough that PoE via a splitter is viable. Community measurements put idle consumption around 0.5W and active streaming closer to 1.5-2W, so a 5V/2A USB-C PoE splitter handles it comfortably with headroom for the speaker during announcements.
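The headroom claim is easy to check with back-of-the-envelope arithmetic. The wattages are the community figures quoted above; treating the unit as idle all year is my simplifying assumption for the energy estimate.

```python
SUPPLY_VOLTS = 5.0
SUPPLY_AMPS = 2.0
IDLE_WATTS = 0.5
ACTIVE_WATTS = 2.0  # upper end of the 1.5-2 W streaming range


def headroom_factor(load_watts: float) -> float:
    """How many times over the splitter can supply the given load."""
    return (SUPPLY_VOLTS * SUPPLY_AMPS) / load_watts


def annual_kwh(average_watts: float) -> float:
    """Energy used in a year of continuous operation at this draw."""
    return average_watts * 24 * 365 / 1000


print("splitter budget (W):", SUPPLY_VOLTS * SUPPLY_AMPS)
print("headroom at peak streaming:", headroom_factor(ACTIVE_WATTS))
print("kWh per year if mostly idle:", annual_kwh(IDLE_WATTS))
```

A 10 W supply against a roughly 2 W peak leaves about five times the measured draw in reserve, and a year of idle operation comes to a little over 4 kWh.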
The ecosystem that has grown around the device may be more valuable than the hardware itself. faster-whisper Wyoming containers have largely replaced the original Whisper add-on for anyone running on modest hardware. Piper voice packs keep expanding, and community-maintained custom wake word packs cover an increasing range of phrases. The upstream ESPHome voice_assistant integration point keeps getting faster and more configurable.
What Nabu Casa has added since launch makes the day-to-day experience noticeably better: timer support, media controls with proper queue handling, multi-room audio routing, and a rebuilt Assist pipeline UI that makes switching between configurations quick enough to experiment with. For synced music in every room, the 3.5 mm jack also feeds a multi-room audio rig the built-in speaker cannot match. Voice chapter 10 and chapter 11 in October 2025 rolled most of these in, along with multilingual assistant improvements.
Competing hardware in 2026 exists but has not displaced the Voice PE as the default recommendation. A handful of third-party Wyoming satellites have appeared, including a few DIY ESP32-S3 reference boards and one or two commercial clones. None of them match the combination of mic array, XMOS DSP, hardware mute, and first-party HA integration at the $59 price.
The honest buy/don’t-buy verdict at $59 in 2026:
- Get one if you already run Home Assistant, privacy is a real requirement, you want multi-satellite Areas that route commands locally, or you are the kind of person who reflashes firmware for fun.
- Skip it if you want a music speaker first and a voice assistant second, you are not running HA and have no plans to, you need best-in-class general-knowledge answering, or you expect Alexa-grade casual conversation tolerance.
What a Voice PE 2 would need to justify itself is straightforward: a louder speaker, better far-field performance in real domestic noise, and ideally an on-device small-LLM intent parser so the cloud conversation agent is no longer the obvious upgrade path for anything beyond “turn on the lights.” Until then, the current device stays on the shortlist, and at $59 it is the one smart speaker I can still recommend without caveats around where the audio goes.
Botmonster Tech