marliechorgan | 1 hour ago
Author here. The core idea is combining the latest low-latency STT/TTS/LLMs with structured output to separate OpenClaw tasks from the text destined for TTS. Most current voice agents either talk OR call tools, never both at once. With structured output, a single LLM call can stream text to TTS while simultaneously dispatching commands in parallel to an agentic backend (OpenClaw). A few very simple tricks, but it seems to work better than anything else I've seen!
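A rough sketch of the routing idea. Everything here is hypothetical (the event schema, field names, and the dispatch stand-in are mine, not OpenClaw's): structured-output events from one LLM call are split by type, with speech chunks fed to a TTS consumer running concurrently while commands dispatch immediately.

```python
import queue
import threading

# Hypothetical streamed structured-output events from a single LLM call.
# In a real system these would arrive incrementally from the model's API.
llm_events = [
    {"type": "speech", "text": "Sure, opening that file now."},
    {"type": "command", "tool": "open_file", "args": {"path": "notes.txt"}},
    {"type": "speech", "text": "Done. Anything else?"},
]

spoken = []      # text chunks handed to the TTS engine
dispatched = []  # commands handed to the agentic backend

def tts_worker(q):
    # Consume speech chunks as they stream in; a real system would
    # synthesize and play audio here.
    while True:
        text = q.get()
        if text is None:  # end-of-stream sentinel
            break
        spoken.append(text)

def route_events(events):
    # Route each structured event as it arrives: speech goes onto the
    # TTS queue (drained concurrently by the worker thread), commands
    # are dispatched without waiting for speech to finish.
    tts_queue = queue.Queue()
    worker = threading.Thread(target=tts_worker, args=(tts_queue,))
    worker.start()
    for event in events:
        if event["type"] == "speech":
            tts_queue.put(event["text"])
        elif event["type"] == "command":
            dispatched.append(event)  # stand-in for an OpenClaw dispatch call
    tts_queue.put(None)
    worker.join()

route_events(llm_events)
```

The key design point is that neither path blocks the other: the tool dispatch fires as soon as its event is parsed, even while earlier speech text is still being synthesized.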