nixos/docs/superpowers/specs/2026-05-10-zbt2-thread-otbr-design.md
marthsincemelee dbeda276e1 docs(home-assistant): design spec for ZBT-2 Thread + OTBR setup
Captures the architecture, operator workflow, and verification for
running the Connect ZBT-2 as an OpenThread Border Router on jupiter
(via nixos-unstable's services.openthread-border-router module),
with HA's otbr + thread integrations driving the Thread network
and the existing matter-server consuming credentials for
Matter-over-Thread device commissioning.

Supersedes the ZHA-direction commit on this branch (e8d09f4),
which will be reverted at the start of implementation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 15:29:21 +02:00

# ZBT-2 as a Thread Border Router for Home Assistant on `jupiter`
**Date:** 2026-05-10
**Branch:** `feature/ha-zbt-2-thread`
**Status:** Design — pending implementation plan
## Context
Home Assistant on `jupiter` already runs natively (`services.home-assistant`) with the Matter integration and `services.matter-server` enabled, but has no Zigbee or Thread radio. The user has acquired a **Home Assistant Connect ZBT-2** (Nabu Casa's Silicon Labs EFR32MG24-based USB Zigbee/Thread radio).
The user wants the dongle running as an **OpenThread Border Router (OTBR)** — Thread only, not Zigbee — so Matter-over-Thread devices can be onboarded through the existing HA Matter integration.
A previous iteration of this work shipped `zha` enablement on the same branch (commit `e8d09f4`). That commit will be reverted as part of implementation; this design supersedes it.
## Goals
- Bring up `otbr-agent` on jupiter against the ZBT-2.
- Have Home Assistant auto-discover the OTBR via mDNS and use its REST API to manage the Thread network.
- Have `services.matter-server` (already enabled) consume Thread credentials from HA so Matter-over-Thread devices commission through the ZBT-2.
- One-time, manual firmware flash from Zigbee NCP to OpenThread RCP via `universal-silabs-flasher` (option B from brainstorming — no HA-driven update flow).
## Non-goals
- **Multipan / multiprotocol** (Zigbee + Thread on one radio). Out of scope; the dongle will be Thread-only.
- **Falling back to ZHA** if Thread misbehaves. Thread-only by choice; if it fails the response is to debug, not to dual-stack.
- **HA-UI-driven firmware updates.** The HAOS "Silicon Labs Multiprotocol" add-on workflow doesn't translate to native NixOS without faking a supervisor; the user explicitly accepted CLI-only flashing.
- **Thread network credential backups.** HA owns the dataset; standard HA backup hygiene (separate concern) covers it.
## Architecture
```
             ┌───────────────────────── jupiter (NixOS) ──────────────────────────┐
             │                                                                    │
ZBT-2 USB ──►│  /dev/serial/by-id/usb-Nabu_Casa_..._ZBT-2_<serial>-...            │
             │      │                                                             │
             │      │ spinel+hdlc+uart, 115200 baud                               │
             │      ▼                                                             │
             │  ┌───────────────┐  REST :8081 (loopback)  ┌──────────────────┐    │
             │  │ otbr-agent    │ ◄─────────────────────► │ home-assistant   │    │
             │  │ (systemd)     │                         │ + matter-server  │    │
             │  │ wpan0 ────────┼── advertises via ─┐     │ extraComponents: │    │
             │  └───────────────┘  avahi (_meshcop) │     │   matter,        │    │
             │                                      ▼     │   mobile_app,    │    │
             │                     enp3s0 (LAN, backbone) │   otbr, thread   │    │
             │                                            └──────────────────┘    │
             └──────────────────────────────────────┬─────────────────────────────┘
                                         home LAN ◄─┘
                                 (Matter-over-Thread devices join here)
```
### Components
1. **The radio.** ZBT-2, USB-attached, running OpenThread RCP firmware after a one-time flash.
2. **`otbr-agent`** (systemd). Managed by the unstable `services.openthread-border-router` NixOS module imported via `inputs.nixpkgs-unstable`. Owns `wpan0`, talks Spinel to the dongle, exposes the OTBR REST API on `127.0.0.1:8081`, advertises `_meshcop._udp` over `enp3s0` via avahi.
3. **Home Assistant** (already running). Gains the `otbr` and `thread` extra components. Discovers OTBR via mDNS, drives the REST API, supplies Thread operational datasets to `matter-server` during Matter commissioning.
### Data flows
- **OTBR ↔ ZBT-2:** Spinel-over-HDLC over UART. Built automatically by the module from `radio.device` as `spinel+hdlc+uart://<device>?uart-baudrate=115200`.
- **HA ↔ OTBR:** mDNS discovery (`_meshcop._udp`) → REST calls to `127.0.0.1:8081` for network management.
- **Matter commissioning:** HA scans QR → `matter-server` does BLE commissioning → asks HA for Thread dataset → HA fetches from OTBR → ships to device → device joins Thread mesh through the ZBT-2.
HA never opens the serial port directly; `matter-server` never talks to OTBR directly. HA brokers between them — that's why all four extra components are needed.
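As a small illustration of the OTBR ↔ ZBT-2 flow above, the radio URL format can be reproduced in shell. This is only a sketch of the string shape quoted in this spec: the helper name `radio_url` and the `/dev/ttyACM0` path are hypothetical, and the real URL is built by the NixOS module in Nix, not by a script.

```shell
# Illustrative only: mirrors the URL format quoted in this spec.
# radio_url is a hypothetical helper, not part of the NixOS module.
radio_url() {
  local device="$1"
  local baud="${2:-115200}"   # default matches the baud rate named above
  printf 'spinel+hdlc+uart://%s?uart-baudrate=%s\n' "$device" "$baud"
}

# Example with a placeholder device path.
radio_url /dev/ttyACM0
```

Comparing this string against the `radio.url` value from the eval-only check is one way to spot a mistyped device path before deploying.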
## NixOS-side changes
All changes live in **`modules/environments/home-assistant/default.nix`**. No host-level changes in `machines/jupiter/` (the existing profile activation handles that), no flake-level changes (the existing `_module.args.self = self;` wiring is sufficient).
### Edited module sketch
```nix
{ config, lib, pkgs, self, ... }:
let
  cfg = config.my.profiles.home-assistant;
  hostName = config.networking.hostName;
in
{
  imports = [
    # OTBR module isn't in 25.11 yet; use unstable's directly. Package
    # comes from the existing `unstable` overlay.
    "${self.inputs.nixpkgs-unstable}/nixos/modules/services/home-automation/openthread-border-router.nix"
  ];

  options.my.profiles.home-assistant.enable = lib.mkEnableOption "Home Automation";

  config = lib.mkIf cfg.enable {
    services.matter-server.enable = true;

    services.home-assistant = {
      enable = true;
      openFirewall = true;
      extraComponents = [
        "matter"
        "mobile_app"
        "otbr"
        "thread"
      ];
    };

    services.home-assistant.config = {
      name = "Home - Rechberg";
      unit_system = "metric";
      mobile_app = { };
    };

    services.openthread-border-router = {
      enable = true;
      package = pkgs.unstable.openthread-border-router;
      openFirewall = true;
      backboneInterfaces = [ "enp3s0" ]; # verify with `ip link` post-deploy
      radio.device = "/dev/serial/by-id/usb-Nabu_Casa_Home_Assistant_Connect_ZBT-2_<serial>-...";
      # web.enable left default (off); HA UI is the management surface
    };

    my.homepage.services = [
      {
        group = "Services";
        name = "Home Assistant";
        description = "Home automation";
        href = "http://${hostName}:8123";
        icon = "si-homeassistant";
      }
    ];
  };
}
```
### Reverts of the prior ZHA commit
Drop both lines from commit `e8d09f4`:
- `"zha"` from `extraComponents` (replaced by `"otbr"` + `"thread"`).
- `users.users.hass.extraGroups = [ "dialout" ];`: no longer needed, because `otbr-agent` runs as root and owns the device directly; HA never opens the serial port itself.
Done by `git revert e8d09f4` at the start of implementation, before applying the new diff.
### Decisions captured
- **No `universal-silabs-flasher` in `environment.systemPackages`.** Flashing is a once-or-twice-a-year operation; `nix shell nixpkgs#python313Packages.universal-silabs-flasher` is sufficient when needed and avoids a perma-dep on a tool that's idle most of the time.
- **No firmware pinning in the flake.** Consistent with option B (CLI-only manual flashing). The user fetches the `.gbl` from <https://github.com/NabuCasa/silabs-firmware-builder/releases> at update time.
- **`backboneInterfaces = [ "enp3s0" ]`** as a starting value (per `machines/jupiter/hardware-configuration.nix:64`). To be verified against `ip link` after first deploy; correctable in a follow-up commit if the actual primary interface differs.
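The `ip link` verification mentioned in the last decision could also be scripted on jupiter. The sketch below is an assumption-laden convenience, not part of the design: `iface_exists` is a hypothetical helper, and it relies on Linux exposing each network interface as an entry under `/sys/class/net`.

```shell
# iface_exists NAME [SYSFS_DIR]: succeeds if NAME appears as a network
# interface. SYSFS_DIR defaults to the Linux sysfs location and is only
# overridable to make the helper easy to try out.
iface_exists() {
  [ -e "${2:-/sys/class/net}/$1" ]
}

if iface_exists enp3s0; then
  echo "enp3s0 present"
else
  echo "enp3s0 missing; check 'ip link' and adjust backboneInterfaces"
fi
```

This only confirms that the name exists; whether it is the primary LAN interface still has to be judged from `ip link` or `ip route` output.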
## Operator workflow
The user runs every command themselves; nothing is SSH'd from the dev session.
### Step 0 — branch hygiene (dev Mac)
```
git switch feature/ha-zbt-2-thread # already renamed
git revert --no-edit e8d09f4 # drops ZHA + dialout commit
```
### Step 1 — apply the module changes (dev Mac)
Edit `modules/environments/home-assistant/default.nix` per the sketch above. Leave `<serial>` as a placeholder; fill after Step 3.
### Step 2 — eval-only sanity check (dev Mac)
```
nix flake check
```
or, to evaluate and dry-build only the `jupiter` configuration,
```
nixos-rebuild dry-build --flake .#jupiter
```
Catches: bad import path, option typos, version skew between unstable and stable.
### Step 3 — plug ZBT-2 into jupiter (still on stock Zigbee firmware)
On jupiter:
```
ls -l /dev/serial/by-id/
```
Then on dev Mac: copy the full `usb-Nabu_Casa_Home_Assistant_Connect_ZBT-2_<serial>-...` path into `radio.device`, commit on the feature branch.
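Picking the right entry out of the `ls` output can be narrowed down with a glob. A minimal sketch, assuming the by-id name contains the substring `ZBT-2` as in the placeholder path used throughout this document; `find_zbt2` is a hypothetical helper name.

```shell
# find_zbt2 [DIR]: print entries in DIR whose name contains "ZBT-2".
# DIR defaults to /dev/serial/by-id.
find_zbt2() {
  local dir="${1:-/dev/serial/by-id}"
  local entry
  for entry in "$dir"/*ZBT-2*; do
    # An unmatched glob stays literal, so check existence before printing.
    if [ -e "$entry" ]; then
      printf '%s\n' "$entry"
    fi
  done
  return 0
}

find_zbt2
```

If this prints exactly one line, that line is the value to paste into `radio.device`; no output means the dongle was not enumerated under the assumed name.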
### Step 4 — flash OpenThread RCP firmware (one-time, on jupiter)
```
nix shell nixpkgs#python313Packages.universal-silabs-flasher -c \
  universal-silabs-flasher \
    --device /dev/serial/by-id/usb-Nabu_Casa_Home_Assistant_Connect_ZBT-2_<serial>-... \
    flash --firmware ~/ot-rcp-zbt-2-<version>.gbl
```
Firmware download: latest ZBT-2 OpenThread RCP `.gbl` from <https://github.com/NabuCasa/silabs-firmware-builder/releases>.
OTBR isn't running yet at this point, so there's no contention on the device.
### Step 5 — rebuild (on jupiter)
```
sudo nixos-rebuild switch --flake .#jupiter
```
Brings up `otbr-agent.service`, opens TCP/8081, loads `otbr` + `thread` integrations in HA.
### Step 6 — confirm HA discovered it
- `http://jupiter:8123` → Settings → Devices & Services → "Open Thread Border Router" appears as auto-discovered within ~30 s.
- Click "Configure", form a new Thread network (or import an existing dataset).
- "Matter" integration page now shows Thread credentials available.
### Step 7 — Matter-over-Thread smoke test
Pair one Matter-over-Thread device end-to-end via the HA Companion app. Pairing should complete in 30 to 90 s. If it does, merge `feature/ha-zbt-2-thread` into `master`.
### Future updates
Same as Step 4, except that `otbr-agent.service` now holds the device and must be stopped first: stop the service, run the flasher with the new `.gbl`, then start the service again.
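The update sequence can be written down as a small dry-run helper that prints the commands in order rather than executing them. This is a sketch under assumptions: `reflash_plan` is a hypothetical name, the device and firmware arguments are placeholders, and the `sudo` on `systemctl` is assumed rather than specified by this design.

```shell
# reflash_plan DEVICE FIRMWARE: print the firmware-update steps in order.
# Nothing is executed; the output is meant to be reviewed and then run by hand.
reflash_plan() {
  local device="$1"
  local firmware="$2"
  cat <<EOF
sudo systemctl stop otbr-agent.service
nix shell nixpkgs#python313Packages.universal-silabs-flasher -c universal-silabs-flasher --device $device flash --firmware $firmware
sudo systemctl start otbr-agent.service
EOF
}

# Placeholder arguments; substitute the real by-id path and .gbl file.
reflash_plan /dev/serial/by-id/EXAMPLE-DEVICE ./EXAMPLE.gbl
```

Printing instead of executing keeps the operator in control, in line with the CLI-only flashing decision.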
## Failure modes
| Symptom | Likely cause | Mitigation |
|---|---|---|
| `otbr-agent.service` fails: "Failed to open device" | Dongle unplugged or `radio.device` path stale (e.g. after replacement) | Module sets `Restart = "on-failure"`; check `systemctl status otbr-agent`, re-check `/dev/serial/by-id/`, update path. |
| OTBR up but HA never discovers it | mDNS not propagating on `enp3s0` (most often: `backboneInterfaces` wrong) | `avahi-browse -r _meshcop._udp` should show one entry. If not: `ip link`, fix `backboneInterfaces`, rebuild. |
| HA shows OTBR but Matter pairing times out | Thread mesh prefix not routed to LAN, or matter-server can't reach the device's IPv6 ULA | `nft list ruleset` should show OTBR's forwarding rules; `ip -6 route` should include the Thread mesh prefix. |
| Dongle stuck after a half-completed flash | Flasher interrupted mid-write | Re-run the flash; bootloader stays addressable even if RCP firmware is corrupt. The tool detects bootloader-mode automatically. |
| `nixos-rebuild` fails: "option `services.openthread-border-router` does not exist" | Unstable module import path wrong / not in scope | Caught by Step 2 (eval-only). Fix before deploy. |
## Verification
### Eval-only (dev Mac, before deploy)
```
nix flake check
nix eval --json .#nixosConfigurations.jupiter.config.services.openthread-border-router.radio.url
nix eval --json .#nixosConfigurations.jupiter.config.services.home-assistant.extraComponents
```
Expected: flake check passes; `radio.url` is a `spinel+hdlc+uart://...` string built from the by-id path; `extraComponents` includes `"otbr"` and `"thread"`.
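The `extraComponents` expectation can be checked mechanically. The sketch below uses a plain substring match on the JSON text, which is a simplification (a JSON-aware tool would be more robust); `has_component` is a hypothetical helper and the sample value is illustrative, not captured output.

```shell
# has_component JSON NAME: succeeds if the JSON text contains "NAME"
# (including the quotes). A substring match, not a real JSON parser.
has_component() {
  case "$1" in
    *"\"$2\""*) return 0 ;;
    *) return 1 ;;
  esac
}

# Sample value for illustration; in practice this would be the output of the
# `nix eval --json ...extraComponents` command above.
components='["matter","mobile_app","otbr","thread"]'
for name in otbr thread; do
  if has_component "$components" "$name"; then
    echo "ok: $name"
  else
    echo "MISSING: $name"
  fi
done
```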
### Service-level (jupiter, after rebuild)
```
systemctl status otbr-agent.service
journalctl -u otbr-agent.service -n 50 --no-pager
ip link show wpan0
avahi-browse -r -t _meshcop._udp
curl -s http://127.0.0.1:8081/node/state
```
Expected: service active; `wpan0` exists (DOWN until HA forms a network — correct); one `_meshcop._udp` entry; REST returns a JSON state string.
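The "JSON state string" from the REST check can be interpreted with a small helper. This sketch assumes the endpoint returns one of the OpenThread device role names (`disabled`, `detached`, `child`, `router`, `leader`) as a quoted string; that response shape is an assumption here, and `thread_attached` is a hypothetical name.

```shell
# thread_attached STATE_JSON: succeeds when the role is child, router or
# leader, i.e. the node is attached to a Thread network.
thread_attached() {
  local state
  # Strip the JSON quotes and any surrounding whitespace.
  state=$(printf '%s' "$1" | tr -d '"[:space:]')
  case "$state" in
    child|router|leader) return 0 ;;
    *) return 1 ;;
  esac
}

# Sample inputs for illustration; on jupiter the argument would be
# "$(curl -s http://127.0.0.1:8081/node/state)".
for sample in '"leader"' '"detached"'; do
  if thread_attached "$sample"; then
    echo "$sample: attached"
  else
    echo "$sample: not attached"
  fi
done
```

Before HA forms a network the node would not be expected to report an attached role, which matches `wpan0` being DOWN at that stage.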
### Functional (HA UI)
- "Open Thread Border Router" appears under auto-discovered integrations.
- Forming a Thread network from the integration UI succeeds.
- Pairing one Matter-over-Thread device end-to-end succeeds.
## Open questions / risks
- **Unstable module ABI.** The `services.openthread-border-router` module is in `nixos-unstable` and may change shape before landing in 26.05. If options rename, the eval-only step catches it before deploy. Acceptable risk; we can pin the unstable input revision if churn becomes annoying.
- **Backbone interface name.** `enp3s0` is a best guess from `hardware-configuration.nix:64`'s commented-out line. Definitive answer comes from `ip link` on the actual host. Trivial to correct if wrong.
- **First-flash chicken-and-egg.** Deferred to `nix shell` rather than baked into the system, because the dongle must be flashed *before* `otbr-agent` claims it. This is documented in Step 4.