Building a Local Voice AI Stack: Whisper + Ollama + Kokoro TTS on Apple Silicon

Source: DEV Community
By Xaden

Cloud voice APIs are convenient — until they're not. Latency adds up when every utterance round-trips to a datacenter. Privacy evaporates when your microphone stream leaves your machine. And monthly bills grow linearly with usage.

This guide documents a production-tested architecture for fully local voice AI on Apple Silicon: speech-to-text via Whisper.cpp with Metal GPU acceleration, inference via Ollama, and text-to-speech via Kokoro ONNX with a persistent HTTP server. Every component runs on-device. No API keys. No internet required. No per-token charges.

Target hardware: MacBook Pro M3 Pro (36GB unified memory). The architecture scales down to M1/8GB with smaller models.

Target latency budget:

- STT (Whisper): ~300-500ms
- LLM (Ollama): ~1000-2000ms
- TTS (Kokoro): ~200-500ms
- Audio I/O: ~100ms
- Total: < 3 seconds

Architecture Overview

┌─────────────────────────────────────────────┐
│              voice-chat-fast.sh             │
│          (orchestrator / main loop)         │
└─────────┬──────────┬──────────┬─────────
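As a quick sanity check (not from the original article), the per-stage estimates above can be summed in shell: the lower bounds total 1600 ms and the upper bounds 3100 ms, so typical turns land comfortably inside the 3-second target, while a turn in which every stage hits its upper bound would slightly exceed it.

```shell
# Sum the per-stage latency estimates (in ms) from the budget above.
# Stages in order: STT (Whisper), LLM (Ollama), TTS (Kokoro), audio I/O.
lo_total=$((300 + 1000 + 200 + 100))
hi_total=$((500 + 2000 + 500 + 100))
echo "best case:  ${lo_total} ms"
echo "worst case: ${hi_total} ms"
```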