Rift Evals v2026.01.24 · Jan 24, 2026 · Source · JSON

71.2% pass rate on 185 scenarios.

Above 70% threshold

Local MLX benchmarks · Apple Silicon

I run these tests on the on-device voice pipeline — LLM polish, STT/TTS round-trip, installer integrity. Threshold: 70% Jaccard similarity across 5 suites.

Pass rate

71.2%

Threshold 70%

Scenarios

185

Synthetic + E2E

Suites

LLM · context · TTS · silence · installer

Hardware

M3 Pro

18GB · MLX

Benchmarks

expandable suites

▸ What is not working

Tests below the 70% similarity threshold

What would fix these

Open fix prompt in Cursor

Self-Driving

autopilot · eval-fix

Autonomous change log and plan dispatch — from Cockpit API and eval-fix history.

Hypothesized

—

In Progress

—

Shipped

—

Reverted

—

Plan dispatch

Action: —
Todo: —
Depth: —

Open Cockpit →

Eval gate

Last run: —
Status: placeholder
Workflow: eval-gate.yml

Memory gate

Peak RSS: —
Tier: —
Model config: —

Autopilot metrics

Pass rate: —
Rollback rate: —
Kanban: —
Flat curve: —

Autonomous change log

loading…No entries yet — wired from test-engine/eval-fix-history.jsonl when present.

Run by Rift — loop not proven yet · awaiting first SDK merge

Methodology

grader

Tests compare LLM output to expected results using Jaccard similarity of word sets. Pass requires similarity ≥ 0.70 and latency ≤ threshold. Latency limits: 1500ms (merge), 500ms (correction), 6500ms (polish), 16000ms (TTS transform).

Hardware: Apple M3 Pro, 18GB unified memory, MLX framework. Models: Qwen3-0.6B (fast operations), Qwen3-4B (quality operations).

Model comparison: All candidates run on the same 66 LLM test scenarios. Prompts are stored in Qwen3 chat format and auto-converted to each model's native format via tokenizer.apply_chat_template(). Models benchmarked: Qwen3-4B, Gemma 4 E4B (4-bit, 6-bit), Gemma 4 26B MoE (4-bit).

Limitations

honest gaps

Tests hit Python servers directly, not Electron IPC integration
Clipboard paste/replace operations are mocked
No WER or MOS metrics for TTS audio quality
All test scenarios are synthetic, not real user inputs
Model comparison prompts use automatic format conversion — models trained on different data may need prompt tuning