Plan dispatch
- Action
- —
- Todo
- —
- Depth
- —
I run these tests on the on-device voice pipeline — LLM polish, STT/TTS round-trip, installer integrity. Threshold: 70% Jaccard similarity across 5 suites.
Tests below the 70% similarity threshold
Autonomous change log and plan dispatch — from Cockpit API and eval-fix history.
test-engine/eval-fix-history.jsonl when present.
Tests compare LLM output to expected results using Jaccard similarity of word sets.
Pass requires similarity ≥ 0.70 and latency ≤ threshold.
Latency limits: 1500ms (merge), 500ms (correction), 6500ms (polish), 16000ms (TTS transform).
Hardware: Apple M3 Pro, 18GB unified memory, MLX framework. Models: Qwen3-0.6B (fast operations), Qwen3-4B (quality operations).
Model comparison: All candidates run on the same 66 LLM test scenarios.
Prompts are stored in Qwen3 chat format and auto-converted to each model's native
format via tokenizer.apply_chat_template(). Models benchmarked: Qwen3-4B,
Gemma 4 E4B (4-bit, 6-bit), Gemma 4 26B MoE (4-bit).