feat(harbor): operator dashboard for watching optimization trials live #29
shehabyasser-scale wants to merge 3 commits into
vero harbor dashboard serves a local single-page UI (stdlib HTTP server, zero new dependencies) showing every trial on the machine: which experiments are running, each one's live ledger (commit, split, score, error rate per recorded eval, via the admin /experiments endpoint through docker exec), remaining budget (/status), pre-registered bars from an optional --meta JSON, finals scanned from harbor jobs dirs (reward.json), and a WIN/MISS verdict against the win bar. A scorecard section renders finished experiments from the meta file.

Phases are derived, not guessed: building (container up, no ledger rows), running, done (reward.json exists), gone (no container and no final, so a crashed launch is conspicuous rather than silently absent). Live trials the meta file does not mention still appear.

Binds 127.0.0.1 by default since ledger rows include hidden-split scores. The status-assembly layer (key derivation, phase, verdicts, meta/live/final join, reward.json scanning) is pure and unit-tested without docker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
def verdict(final: dict | None, bars: dict | None) -> str | None:
    """WIN / MISS against the pre-registered win bar, when both are known."""
    if not final or not bars or "win_bar" not in bars:
        return None
    scores = [v for v in final.values() if isinstance(v, (int, float))]
    if not scores:
        return None
    return "WIN" if scores[0] > bars["win_bar"] else "MISS"
**`verdict` picks the first numeric value regardless of which field it is**
Path: vero/src/vero/harbor/dashboard.py
Line: 71-78
`scores = [v for v in final.values() if isinstance(v, (int, float))]` then uses `scores[0]`. Dict iteration order is insertion order in Python 3.7+, so if `reward.json` ever contains multiple numeric fields, e.g. `{"total_tasks": 120, "accuracy": 0.47}`, `scores[0]` is `120` and the verdict is evaluated against the win bar with the wrong number. All tests use a single-key dict, so this case is not covered. Since the dashboard is decision-critical (WIN/MISS governs what the operator does next), a key field to compare against, or at minimum a tie-breaking policy, should be specified explicitly.
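One way to address this comment is to compare a single named field instead of whichever numeric value comes first. The sketch below is only illustrative: the `score_key` parameter and its `"accuracy"` default are hypothetical, since the actual `reward.json` schema is not shown in this PR.

```python
from __future__ import annotations


def verdict(
    final: dict | None, bars: dict | None, score_key: str = "accuracy"
) -> str | None:
    """WIN / MISS against the win bar, comparing one named field only.

    `score_key` is a hypothetical parameter; "accuracy" is a placeholder
    default, not a field name confirmed by the PR.
    """
    if not final or not bars or "win_bar" not in bars:
        return None
    score = final.get(score_key)
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return "WIN" if score > bars["win_bar"] else "MISS"
```

With this shape, a payload that lacks the named field yields no verdict at all rather than a verdict computed from an unrelated number such as a task count.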
finals: dict[str, dict] = {}
for jd in jobs_dirs:
    for rj in Path(jd).glob("**/reward.json"):
        key = None
        for part in rj.parts:
            if "__" in part:
                key = part.split("__")[0]
        if key is None:
            key = Path(jd).name
        try:
            finals[key] = json.loads(rj.read_text())
        except (OSError, json.JSONDecodeError):
            continue
**Silent last-writer-wins when a task has multiple reward.json files**
Path: vero/src/vero/harbor/dashboard.py
Line: 201-213
`finals[key] = json.loads(rj.read_text())` unconditionally overwrites the dict entry on every matching file. If the same task was run more than once (two trial dirs like `gaia-exp11-task__run1` and `gaia-exp11-task__run2`, each with a `reward.json`), the key resolves identically and whichever file `glob` yields last wins; `glob` order is not guaranteed to be chronological. The displayed final (and therefore the WIN/MISS verdict) can be arbitrary when re-trials exist under the same jobs dir.
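A possible deterministic policy is "newest file wins", using the file's modification time. This is a sketch under that assumption, not the PR's implementation; other policies (for example, surfacing all re-trials) would be equally defensible. The key derivation is copied from the reviewed snippet.

```python
from __future__ import annotations

import json
from pathlib import Path


def scan_finals(jobs_dirs):
    """Map experiment key -> reward.json payload; on collisions the newest file wins."""
    finals: dict[str, dict] = {}
    newest: dict[str, float] = {}  # key -> mtime of the file currently kept
    for jd in jobs_dirs:
        # sorted() makes ties on mtime resolve the same way on every run.
        for rj in sorted(Path(jd).glob("**/reward.json")):
            key = None
            for part in rj.parts:
                if "__" in part:
                    key = part.split("__")[0]
            if key is None:
                key = Path(jd).name
            try:
                mtime = rj.stat().st_mtime
                payload = json.loads(rj.read_text())
            except (OSError, json.JSONDecodeError):
                continue
            if key not in newest or mtime >= newest[key]:
                finals[key] = payload
                newest[key] = mtime
    return finals
```

Modification time is only a proxy for "most recent trial"; if the trial dir name or the payload carries a run timestamp, that would be a more reliable ordering.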
if token:
    raw = _docker(
        "exec", container, "curl", "-s",
        "-H", f"Authorization: Bearer {token}",
        "http://localhost:8000/experiments",
    )
**Admin token visible in container process table**
Path: vero/src/vero/harbor/dashboard.py
Line: 177-182
The token is passed as `-H "Authorization: Bearer <token>"` in the `docker exec curl` command line. Inside the container, the full argument list is readable via `/proc/<pid>/cmdline` (and by any process with access to `ps aux` inside the container). Curl supports reading headers from stdin with `-H @-`, which would avoid putting the secret on the command line entirely.
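The `-H @-` suggestion could look roughly like the following. It assumes the container's curl is new enough to support `@file` header arguments (7.55.0 or later) and relies on `docker exec -i` to forward stdin; `build_curl_exec` and `fetch` are hypothetical helper names, not functions from the PR.

```python
import subprocess


def build_curl_exec(container: str, url: str, token: str):
    """Return (argv, stdin_text) for a docker exec curl call that keeps the token off argv."""
    argv = [
        "docker", "exec", "-i", container,  # -i keeps stdin open for the exec'd process
        "curl", "-s", "-H", "@-", url,      # curl reads header lines from stdin
    ]
    stdin_text = f"Authorization: Bearer {token}\n"
    return argv, stdin_text


def fetch(container: str, url: str, token: str, timeout: float = 15) -> str:
    """Run the call; return stdout, or "" on timeout, missing docker, or non-zero exit."""
    argv, stdin_text = build_curl_exec(container, url, token)
    try:
        res = subprocess.run(
            argv, input=stdin_text, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return ""
    return res.stdout if res.returncode == 0 else ""
```

Splitting the argv construction from the subprocess call keeps the "token never appears in argv" property checkable in a unit test without docker.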
document.getElementById("grid").innerHTML = d.experiments.map(e => {
  const rounds = e.rounds.slice(-10).map(x =>
    `<tr><td>${x.commit}</td><td>${x.split ?? "-"}</td><td>${fmt(x.score)}</td><td>${fmt(x.error_rate)}</td></tr>`).join("");
  const bars = e.bars ? `baseline ${fmt(e.bars.baseline)} · win > ${fmt(e.bars.win_bar)}` +
    (e.bars.healthy !== undefined ? ` · healthy ${fmt(e.bars.healthy)}` : "") : "";
  const budget = (e.budget || []).map(s =>
    s.remaining_run_budget !== undefined && s.remaining_run_budget !== null
      ? `${s.split}: ${s.remaining_run_budget} evals left` : "").filter(Boolean).join(" · ");
  const final = e.final ? `<div class="final ${e.verdict ?? "noverdict"}">final ${
    Object.entries(e.final).map(([k,v]) => k+" = "+fmt(v)).join(", ")}${
    e.verdict ? " · " + e.verdict : ""}</div>` : "";
  return `<div class="card">
    <div class="hdr"><span class="name">${e.title}</span><span class="chip ${e.phase}">${e.phase}</span></div>
    ${e.question ? `<div class="q">${e.question}</div>` : ""}
    ${bars ? `<div class="bars">${bars}</div>` : ""}
    ${budget ? `<div class="bars">${budget}</div>` : ""}
    ${e.rounds.length ? `<table><tr><th>commit</th><th>split</th><th>score</th><th>err</th></tr>${rounds}</table>` : ""}
    ${final}
    ${e.note ? `<div class="note">${e.note}</div>` : ""}
  </div>`;
}).join("");
const h = d.history || [];
document.getElementById("hist").innerHTML = h.length ? `<h1>scorecard</h1><table>
  <tr><th>exp</th><th>question</th><th>baseline</th><th>bar</th><th>final</th><th>verdict</th></tr>` +
  h.map(x => `<tr><td>${x.exp}</td><td>${x.question}</td><td>${x.baseline ?? "-"}</td>
    <td>${x.bar ?? "-"}</td><td>${x.final ?? "-"}</td><td>${x.verdict ?? "-"}</td></tr>`).join("") + "</table>" : "";
**Unsanitized data injected into `innerHTML`**
Path: vero/src/vero/harbor/dashboard.py
Line: 271-296
Fields from the API response (`e.title`, `e.question`, `e.note`, and the history columns `x.exp`, `x.question`) are interpolated directly into HTML template literals that are assigned to `innerHTML`. Any `<script>` tags or event-handler attributes in those strings will execute. The meta file is operator-controlled and the server is localhost-only, so the practical risk is low, but a maliciously crafted (or accidentally munged) meta file would have full script execution in the operator's browser. Consider running these strings through a small escaping helper (`htmlEscape`) before insertion.
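The comment proposes a client-side `htmlEscape` helper. As an alternative that stays in the Python layer, the server could escape every string value before serializing the status JSON. The sketch below shows that swapped-in approach, not the reviewer's; `escape_strings` is a hypothetical name, and dict keys are deliberately left untouched so field lookups in the page keep working.

```python
import html


def escape_strings(value):
    """Recursively HTML-escape every string value in a JSON-like structure.

    Numbers, booleans, and None pass through unchanged; dict keys are not escaped.
    """
    if isinstance(value, str):
        return html.escape(value, quote=True)
    if isinstance(value, list):
        return [escape_strings(v) for v in value]
    if isinstance(value, dict):
        return {k: escape_strings(v) for k, v in value.items()}
    return value
```

The trade-off is that the JSON endpoint then returns display-ready text rather than raw values, so any other consumer of that endpoint would see entity-encoded strings. Keys rendered by the page (such as those of `e.final`) would still need client-side handling.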
…ns, legend) Feedback from first real use: terse verdicts mean nothing without campaign context. The scorecard now carries a one-sentence plain-language summary per experiment and a legend defining baseline / bar / final.
def _docker(*args: str, timeout: int = 15) -> str:
    res = subprocess.run(
        ["docker", *args], capture_output=True, text=True, timeout=timeout
    )
    return res.stdout if res.returncode == 0 else ""
**`subprocess.TimeoutExpired` is not caught, crashing the HTTP handler**
Path: vero/src/vero/harbor/dashboard.py
Line: 150-154
`subprocess.run(..., timeout=15)` raises `subprocess.TimeoutExpired` if the process doesn't finish in time. `_docker` catches neither this nor `FileNotFoundError` (docker not on PATH), so both propagate through `snapshot_sidecar` → `collect` → `do_GET`. `ThreadingHTTPServer` catches it at the thread boundary and closes the connection, leaving the browser with a network error rather than partial data. Any container that is alive but has a frozen shell (precisely the "crashed launch" scenario the `gone` phase targets) will repeatedly time out every 20 s and render the dashboard blank until manually cleared. Adding `except (subprocess.TimeoutExpired, FileNotFoundError): return ""` in `_docker` would match the existing non-zero-exit fallback.
…banners Computed verdicts render as PASSED/FAILED (was WIN/MISS); scorecard rows take a result field (passed / failed / no change / cancelled) shown as a color-coded chip in a dedicated verdict column.
What
`vero harbor dashboard` serves a local, self-refreshing status page for Harbor optimization trials: which experiments are running, what each is doing right now, and how finished ones scored against their pre-registered bars. Built while operating a 13-experiment campaign where "what's running and where is it?" was answered by hand-rolled shell monitors; this replaces them with one page.
How it works
- Discovers `*-eval-sidecar-1` containers via `docker ps`, then tails each trial's ledger through the admin `/experiments` endpoint (this PR stacks on the observability PR that added it), using the in-container admin token via `docker exec`. Remaining budget comes from `/status`.
- Scans jobs dirs (`--jobs`, repeatable) for `reward.json`, keyed by the trial dir's task name so finished trials stay visible after teardown.
- An optional `--meta` JSON supplies what containers cannot know: each experiment's question, pre-registered bars (baseline / win bar / healthy ceiling), notes, and a scorecard of completed experiments. Verdicts (WIN/MISS) are computed against the win bar.
- Phases are derived: `building` (container up, no ledger rows) -> `running` -> `done` (`reward.json`), and `gone` when there is no container and no final, so a crashed launch is conspicuous instead of silently absent. Live trials missing from the meta file still appear.
Design constraints
- Stdlib `http.server` + inline HTML/JS, `subprocess` for docker.
- Binds `127.0.0.1` by default: ledger rows include hidden-split scores, so the page is for the operator, not the network.
- (… the `/experiments` endpoint it consumes); it never reads agent volumes.
Usage
Tests
The status-assembly layer is pure and covered without docker (16 tests): container-name to experiment-key derivation, phase transitions, WIN/MISS verdicts, the meta/live/final join (including unlisted-live-trial and meta-only-gone cases), ledger normalization with null rows, and reward.json scanning.
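The phase rules described under "How it works" can be sketched as a small pure function of the kind these tests would exercise. The function name, its signature, and the precedence (a final always wins, then container presence, then ledger rows) are assumptions for illustration; the PR's actual implementation is not shown here.

```python
def derive_phase(container_up: bool, ledger_rows: int, has_final: bool) -> str:
    """Derive a trial's phase from observable facts only.

    Assumed precedence: reward.json present -> "done"; otherwise no
    container -> "gone"; otherwise "running" once the ledger has rows,
    else "building".
    """
    if has_final:
        return "done"
    if not container_up:
        return "gone"
    return "running" if ledger_rows > 0 else "building"
```

Because the function takes plain values rather than docker handles, each transition can be checked with a one-line assertion.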
🤖 Generated with Claude Code
Greptile Summary
This PR introduces `vero harbor dashboard`, a zero-dependency local status page for watching Harbor optimization trials live: discovering sidecar containers via `docker ps`, tailing ledgers through the admin `/experiments` endpoint, scanning jobs dirs for `reward.json` finals, and joining everything with an optional `--meta` context file. The pure status-assembly layer is well-separated and covered by 16 tests, but several issues identified in the review remain open.
- The frontend reads `x.result` for the verdict chip, but the test fixture (and therefore the effective meta-file documentation) uses `"verdict"`, so the column will silently show `"-"` for every historical experiment in any meta file following the example. Values like "WIN"/"MISS" also won't match the `vclass` lookup.
- `_docker`: `subprocess.TimeoutExpired` and `FileNotFoundError` are uncaught and propagate through `collect()` into `do_GET`, causing the browser to receive a network error for the entire poll cycle rather than partial data when a container shell stalls.
- The admin token is passed as a `-H` argument in the `docker exec curl` call, making it readable via `ps` / `/proc/<pid>/cmdline` inside the container.
- `verdict()` picks the first numeric value from `reward.json` regardless of field name; `scan_finals` silently overwrites on key collision when multiple reward files share the same task key.
Confidence Score: 3/5
Safe to run locally but the history scorecard will silently malfunction for any meta file written following the test example, and unhandled docker-exec timeouts can blank the dashboard entirely during active trials.
The history scorecard's verdict column will show "-" for every historical experiment in any meta file an operator writes using the "verdict" key demonstrated by the test fixture — the frontend reads x.result. Several other issues from prior review rounds (unhandled TimeoutExpired crashing the HTTP handler, admin token in the container process table, innerHTML injection, first-value verdict pick) are also still unresolved, compounding the overall risk for a decision-critical operator tool.
The history fixture in test_harbor_dashboard.py and the frontend template in dashboard.py (around the x.result / vclass lines) are the immediate focus; the _docker error-handling and verdict() key-selection issues in dashboard.py also warrant attention before this lands.
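One way to tolerate both key spellings is to normalize scorecard rows on the server before they reach the page. This is a sketch only: `normalize_history_row` is a hypothetical helper, and the WIN -> passed / MISS -> failed mapping is an assumption based on the commit note that computed verdicts were renamed from WIN/MISS to PASSED/FAILED.

```python
# Assumed mapping from the older verdict labels to the new result values.
_LEGACY = {"WIN": "passed", "MISS": "failed"}
# Result values listed in the commit message for scorecard rows.
_ALLOWED = {"passed", "failed", "no change", "cancelled"}


def normalize_history_row(row: dict) -> dict:
    """Return a copy of a scorecard row with a canonical `result` field when possible.

    Accepts either `result` or the older `verdict` key; unknown values are
    left out so the page falls back to its "-" placeholder.
    """
    out = dict(row)
    raw = out.get("result", out.get("verdict"))
    if isinstance(raw, str):
        value = _LEGACY.get(raw.strip().upper(), raw.strip().lower())
        if value in _ALLOWED:
            out["result"] = value
        else:
            out.pop("result", None)
    return out
```

Updating the test fixture to use `result` would fix the documentation side of the mismatch; a shim like this only helps meta files that were already written with `verdict`.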
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Browser
    participant Handler as HTTP Handler (ThreadingHTTPServer)
    participant Collect as collect()
    participant Docker as docker CLI
    participant FS as Filesystem
    Browser->>Handler: GET /api/status (every 20s)
    Handler->>Collect: collect(meta_path, jobs_dirs)
    Collect->>FS: read meta JSON (optional)
    Collect->>Docker: "docker ps --format {{.Names}}"
    Docker-->>Collect: container list
    loop for each sidecar (sequential)
        Collect->>Docker: docker exec python3 -c _TOKEN_SNIPPET
        Docker-->>Collect: admin token
        Collect->>Docker: docker exec curl /experiments (Bearer token)
        Docker-->>Collect: ledger JSON
        Collect->>Docker: docker exec curl /status
        Docker-->>Collect: budget JSON
    end
    Collect->>FS: "glob **/reward.json in jobs_dirs"
    FS-->>Collect: reward.json files
    Collect-->>Handler: merged status dict
    Handler-->>Browser: 200 application/json
    Browser->>Handler: GET /
    Handler-->>Browser: 200 text/html (inline PAGE template)

Reviews (3): Last reviewed commit: "feat(harbor): PASSED/FAILED verdict chip..."