feat(harbor): operator dashboard for watching optimization trials live by shehabyasser-scale · Pull Request #29 · scaleapi/vero

shehabyasser-scale · 2026-07-04T16:21:13Z

What

vero harbor dashboard serves a local, self-refreshing status page for Harbor optimization trials: which experiments are running, what each is doing right now, and how finished ones scored against their pre-registered bars. Built while operating a 13-experiment campaign where "what's running and where is it?" was answered by hand-rolled shell monitors; this replaces them with one page.

How it works

Live trials: discovers *-eval-sidecar-1 containers via docker ps, then tails each trial's ledger through the admin /experiments endpoint (this PR stacks on the observability PR that added it), using the in-container admin token via docker exec. Remaining budget comes from /status.
Finals: scans harbor jobs dirs (--jobs, repeatable) for reward.json, keyed by the trial dir's task name so finished trials stay visible after teardown.
Context: an optional --meta JSON supplies what containers cannot know: each experiment's question, pre-registered bars (baseline / win bar / healthy ceiling), notes, and a scorecard of completed experiments. Verdicts (WIN/MISS) are computed against the win bar.
Phases are derived, not guessed: building (container up, no ledger rows) -> running -> done (reward.json), and gone when there is no container and no final, so a crashed launch is conspicuous instead of silently absent. Live trials missing from the meta file still appear.
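
The phase rules above can be sketched as a small pure function. This is only an illustration under stated assumptions: the function name `derive_phase` and its three arguments are hypothetical, and letting a final take precedence over a still-running container is an assumption the description does not confirm.

```python
from __future__ import annotations


def derive_phase(has_container: bool, ledger_rows: int, has_final: bool) -> str:
    """Map observable facts to one of: building, running, done, gone."""
    if has_final:
        # Assumption: a reward.json means "done" even if a container lingers.
        return "done"
    if not has_container:
        # No container and no final: the launch crashed or was torn down early.
        return "gone"
    # Container is up; ledger rows distinguish a build from an active run.
    return "running" if ledger_rows > 0 else "building"
```

Keeping the derivation free of docker and filesystem calls is what would let it be unit-tested the way the Tests section describes.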

Design constraints

Zero new dependencies: stdlib http.server + inline HTML/JS, subprocess for docker.
Binds 127.0.0.1 by default: ledger rows include hidden-split scores, so the page is for the operator, not the network.
Host-side admin tool (same trust posture as the /experiments endpoint it consumes); it never reads agent volumes.
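
A minimal sketch of the "stdlib only, loopback by default" constraint, assuming a handler that serves a JSON status route and an inline page. The names `StatusHandler` and `make_server`, the placeholder payload, and the placeholder HTML are illustrative and not taken from the PR.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/status":
            # Placeholder payload; the real dashboard would call its collector here.
            body = json.dumps({"experiments": [], "history": []}).encode()
            ctype = "application/json"
        else:
            body = b"<!doctype html><title>dashboard</title><div id='grid'></div>"
            ctype = "text/html"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging so polling does not flood the terminal.
        pass


def make_server(host: str = "127.0.0.1", port: int = 8300) -> ThreadingHTTPServer:
    # Binding to the loopback address keeps hidden-split scores off the network.
    return ThreadingHTTPServer((host, port), StatusHandler)
```

Calling `make_server().serve_forever()` would serve the page on the loopback interface only; exposing it more widely would require passing a different host explicitly.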

Usage

vero harbor dashboard --meta exp-meta.json --jobs /tmp/gaia-exp10-jobs --jobs /tmp/gaia-exp11-jobs
# -> http://127.0.0.1:8300

Tests

The status-assembly layer is pure and covered without docker (16 tests): container-name to experiment-key derivation, phase transitions, WIN/MISS verdicts, the meta/live/final join (including unlisted-live-trial and meta-only-gone cases), ledger normalization with null rows, and reward.json scanning.
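
As an illustration of the first item, container-name to experiment-key derivation could look like the sketch below. It assumes the key is simply the container name with the `-eval-sidecar-1` suffix removed; the PR does not spell out the exact rule, and `experiment_key` is a hypothetical name.

```python
from __future__ import annotations

SIDECAR_SUFFIX = "-eval-sidecar-1"


def experiment_key(container_name: str) -> str | None:
    """Return the experiment key for a sidecar container, else None."""
    if not container_name.endswith(SIDECAR_SUFFIX):
        return None
    key = container_name[: -len(SIDECAR_SUFFIX)]
    # A bare suffix with no prefix carries no experiment identity.
    return key or None
```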

🤖 Generated with Claude Code

Greptile Summary

This PR introduces vero harbor dashboard, a zero-dependency local status page for watching Harbor optimization trials live — discovering sidecar containers via docker ps, tailing ledgers through the admin /experiments endpoint, scanning jobs dirs for reward.json finals, and joining everything with an optional --meta context file. The pure status-assembly layer is well-separated and covered by 16 tests, but several issues identified in the review remain open.

History scorecard field name mismatch: the frontend template reads x.result for the verdict chip, but the test fixture (and therefore the effective meta-file documentation) uses "verdict" — the column will silently show "-" for every historical experiment in any meta file following the example. Values like "WIN"/"MISS" also won't match the vclass lookup.
Error handling gaps in _docker: subprocess.TimeoutExpired and FileNotFoundError are uncaught and propagate through collect() into do_GET, causing the browser to receive a network error for the entire poll cycle rather than partial data when a container shell stalls.
Admin token on command line: the Bearer token is passed as a -H argument in the docker exec curl call, making it readable via ps or /proc/<pid>/cmdline inside the container.
verdict() picks the first numeric value from reward.json regardless of field name; scan_finals silently overwrites on key collision when multiple reward files share the same task key.
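
One possible way to close the `result` / `verdict` mismatch from the first item is to normalize scorecard rows on the server before they reach the template. The sketch below is an assumption-laden illustration, not the PR's code: `normalize_history_row` is a hypothetical helper, and mapping the legacy `WIN` / `MISS` values to `passed` / `failed` is a guess based on the later commit message.

```python
from __future__ import annotations

# Assumed mapping from the earlier computed verdicts to the later chip values.
LEGACY_VALUES = {"WIN": "passed", "MISS": "failed"}


def normalize_history_row(row: dict) -> dict:
    """Return a copy of a scorecard row that always uses the 'result' key."""
    out = dict(row)
    if "result" not in out and "verdict" in out:
        # Accept the key used by the test fixture as an alias.
        out["result"] = out.pop("verdict")
    value = out.get("result")
    if isinstance(value, str):
        out["result"] = LEGACY_VALUES.get(value.upper(), value.lower())
    return out
```

Aligning the test fixture with the template would be the simpler fix; the alias only helps if meta files using `verdict` already exist.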

Confidence Score: 3/5

Safe to run locally but the history scorecard will silently malfunction for any meta file written following the test example, and unhandled docker-exec timeouts can blank the dashboard entirely during active trials.

The history scorecard's verdict column will show "-" for every historical experiment in any meta file an operator writes using the "verdict" key demonstrated by the test fixture — the frontend reads x.result. Several other issues from prior review rounds (unhandled TimeoutExpired crashing the HTTP handler, admin token in the container process table, innerHTML injection, first-value verdict pick) are also still unresolved, compounding the overall risk for a decision-critical operator tool.

The history fixture in test_harbor_dashboard.py and the frontend template in dashboard.py (around the x.result / vclass lines) are the immediate focus; the _docker error-handling and verdict() key-selection issues in dashboard.py also warrant attention before this lands.

Important Files Changed

Filename	Overview
vero/src/vero/harbor/dashboard.py	New dashboard module: clean pure-layer / side-effect separation, but several flagged issues remain open — unhandled TimeoutExpired/FileNotFoundError crashes the handler thread, admin token exposed on container process list, innerHTML injection from operator-controlled strings, first-numeric-value pick in verdict(), and last-writer-wins in scan_finals().
vero/tests/test_harbor_dashboard.py	Good coverage of pure helpers and join logic; test fixture uses wrong field key ("verdict" vs "result") that will cause silent rendering failure in the history scorecard.
vero/src/vero/harbor/cli.py	Adds the "dashboard" CLI command with appropriate options; straightforward delegation to dashboard.py with no issues.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Browser
    participant Handler as HTTP Handler (ThreadingHTTPServer)
    participant Collect as collect()
    participant Docker as docker CLI
    participant FS as Filesystem

    Browser->>Handler: GET /api/status (every 20s)
    Handler->>Collect: collect(meta_path, jobs_dirs)
    Collect->>FS: read meta JSON (optional)
    Collect->>Docker: "docker ps --format {{.Names}}"
    Docker-->>Collect: container list
    loop for each sidecar (sequential)
        Collect->>Docker: docker exec python3 -c _TOKEN_SNIPPET
        Docker-->>Collect: admin token
        Collect->>Docker: docker exec curl /experiments (Bearer token)
        Docker-->>Collect: ledger JSON
        Collect->>Docker: docker exec curl /status
        Docker-->>Collect: budget JSON
    end
    Collect->>FS: "glob **/reward.json in jobs_dirs"
    FS-->>Collect: reward.json files
    Collect-->>Handler: merged status dict
    Handler-->>Browser: 200 application/json

    Browser->>Handler: GET /
    Handler-->>Browser: 200 text/html (inline PAGE template)

Reviews (3): Last reviewed commit: "feat(harbor): PASSED/FAILED verdict chip..."

vero harbor dashboard serves a local single-page UI (stdlib HTTP server, zero new dependencies) showing every trial on the machine: which experiments are running, each one's live ledger (commit, split, score, error rate per recorded eval, via the admin /experiments endpoint through docker exec), remaining budget (/status), pre-registered bars from an optional --meta JSON, finals scanned from harbor jobs dirs (reward.json), and a WIN/MISS verdict against the win bar. A scorecard section renders finished experiments from the meta file. Phases are derived, not guessed: building (container up, no ledger rows), running, done (reward.json exists), gone (no container and no final, so a crashed launch is conspicuous rather than silently absent). Live trials the meta file does not mention still appear. Binds 127.0.0.1 by default since ledger rows include hidden-split scores. The status-assembly layer (key derivation, phase, verdicts, meta/live/final join, reward.json scanning) is pure and unit-tested without docker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-07-04T16:25:09Z

+def verdict(final: dict | None, bars: dict | None) -> str | None:
+    """WIN / MISS against the pre-registered win bar, when both are known."""
+    if not final or not bars or "win_bar" not in bars:
+        return None
+    scores = [v for v in final.values() if isinstance(v, (int, float))]
+    if not scores:
+        return None
+    return "WIN" if scores[0] > bars["win_bar"] else "MISS"


verdict picks the first numeric value regardless of which field it is

scores = [v for v in final.values() if isinstance(v, (int, float))] then uses scores[0]. Dict iteration order is insertion order in Python 3.7+, so if reward.json ever contains multiple numeric fields — e.g. {"total_tasks": 120, "accuracy": 0.47} — scores[0] is 120 and the verdict is evaluated against the win bar with the wrong number. All tests use a single-key dict, so this case is not covered. Since the dashboard is decision-critical (WIN/MISS governs what the operator does next), a key field to compare against, or at minimum a tie-breaking policy, should be specified explicitly.
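
A sketch of the explicit-key approach the comment asks for, assuming the caller names the field to compare. The `score_key` parameter and its default of `"reward"` are hypothetical; the actual field name in reward.json is not stated in this thread.

```python
from __future__ import annotations


def verdict(final: dict | None, bars: dict | None, score_key: str = "reward") -> str | None:
    """WIN / MISS against the win bar, reading one named field only."""
    if not final or not bars or "win_bar" not in bars:
        return None
    score = final.get(score_key)
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return "WIN" if score > bars["win_bar"] else "MISS"
```

With this shape, the reviewer's example `{"total_tasks": 120, "accuracy": 0.47}` is judged on `accuracy` when `score_key="accuracy"` is passed, and yields no verdict at all when the named field is missing.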

Path: vero/src/vero/harbor/dashboard.py, lines 71-78

greptile-apps · 2026-07-04T16:25:10Z

+    finals: dict[str, dict] = {}
+    for jd in jobs_dirs:
+        for rj in Path(jd).glob("**/reward.json"):
+            key = None
+            for part in rj.parts:
+                if "__" in part:
+                    key = part.split("__")[0]
+            if key is None:
+                key = Path(jd).name
+            try:
+                finals[key] = json.loads(rj.read_text())
+            except (OSError, json.JSONDecodeError):
+                continue


Silent last-writer-wins when a task has multiple reward.json files

finals[key] = json.loads(rj.read_text()) unconditionally overwrites the dict entry on every matching file. If the same task was run more than once (two trial dirs like gaia-exp11-task__run1 and gaia-exp11-task__run2, each with a reward.json), the key resolves identically and whichever file glob yields last wins — glob order is not guaranteed to be chronological. The displayed final (and therefore the WIN/MISS verdict) can be arbitrary when re-trials exist under the same jobs dir.
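
One deterministic policy would be to keep the most recently modified reward.json per key. The sketch below assumes file modification time is an acceptable proxy for "latest trial", which may not hold if files are copied around; the key derivation mirrors the hunk above.

```python
from __future__ import annotations

import json
from pathlib import Path


def scan_finals(jobs_dirs: list[str]) -> dict[str, dict]:
    """Collect one final per task key, preferring the newest reward.json."""
    newest: dict[str, tuple[float, dict]] = {}
    for jd in jobs_dirs:
        # sorted() makes ties on mtime resolve the same way on every poll.
        for rj in sorted(Path(jd).glob("**/reward.json")):
            # Deepest path part containing "__" names the task, as in the original.
            key = next(
                (p.split("__")[0] for p in reversed(rj.parts) if "__" in p),
                Path(jd).name,
            )
            try:
                mtime = rj.stat().st_mtime
                data = json.loads(rj.read_text())
            except (OSError, json.JSONDecodeError):
                continue
            if key not in newest or mtime > newest[key][0]:
                newest[key] = (mtime, data)
    return {key: data for key, (_, data) in newest.items()}
```

An alternative would be to keep every run and surface the collision on the card, so the operator sees that re-trials exist rather than having one picked silently.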

Path: vero/src/vero/harbor/dashboard.py, lines 201-213

greptile-apps · 2026-07-04T16:25:11Z

+    if token:
+        raw = _docker(
+            "exec", container, "curl", "-s",
+            "-H", f"Authorization: Bearer {token}",
+            "http://localhost:8000/experiments",
+        )


Admin token visible in container process table

The token is passed as -H "Authorization: Bearer <token>" in the docker exec curl command line. Inside the container, the full argument list is readable via /proc/<pid>/cmdline (and by any process with access to ps aux inside the container). Curl supports reading headers from stdin with -H @-, which would avoid putting the secret on the command line entirely.
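
A sketch of the stdin approach, assuming the container's curl is 7.55.0 or newer (the release that added `-H @file`, with `@-` meaning stdin) and that `docker exec -i` is used so stdin reaches the process. `build_curl_call` and `fetch_experiments` are hypothetical names, not the PR's helpers.

```python
from __future__ import annotations

import subprocess


def build_curl_call(container: str, token: str, url: str) -> tuple[list[str], str]:
    """Return (argv, stdin_text) so the token never appears in argv."""
    # "-i" keeps stdin open for docker exec; "-H @-" makes curl read headers from it.
    argv = ["docker", "exec", "-i", container, "curl", "-s", "-H", "@-", url]
    stdin_text = f"Authorization: Bearer {token}\n"
    return argv, stdin_text


def fetch_experiments(container: str, token: str, timeout: int = 15) -> str:
    argv, stdin_text = build_curl_call(
        container, token, "http://localhost:8000/experiments"
    )
    try:
        res = subprocess.run(
            argv, input=stdin_text, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return ""
    return res.stdout if res.returncode == 0 else ""
```

Splitting argument construction from execution keeps the security property checkable in a unit test without docker.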

Path: vero/src/vero/harbor/dashboard.py, lines 177-182

greptile-apps · 2026-07-04T16:25:12Z

+  document.getElementById("grid").innerHTML = d.experiments.map(e => {
+    const rounds = e.rounds.slice(-10).map(x =>
+      `<tr><td>${x.commit}</td><td>${x.split ?? "-"}</td><td>${fmt(x.score)}</td><td>${fmt(x.error_rate)}</td></tr>`).join("");
+    const bars = e.bars ? `baseline ${fmt(e.bars.baseline)} &middot; win &gt; ${fmt(e.bars.win_bar)}` +
+      (e.bars.healthy !== undefined ? ` &middot; healthy ${fmt(e.bars.healthy)}` : "") : "";
+    const budget = (e.budget || []).map(s =>
+      s.remaining_run_budget !== undefined && s.remaining_run_budget !== null
+        ? `${s.split}: ${s.remaining_run_budget} evals left` : "").filter(Boolean).join(" &middot; ");
+    const final = e.final ? `<div class="final ${e.verdict ?? "noverdict"}">final ${
+      Object.entries(e.final).map(([k,v]) => k+" = "+fmt(v)).join(", ")}${
+      e.verdict ? " &middot; " + e.verdict : ""}</div>` : "";
+    return `<div class="card">
+      <div class="hdr"><span class="name">${e.title}</span><span class="chip ${e.phase}">${e.phase}</span></div>
+      ${e.question ? `<div class="q">${e.question}</div>` : ""}
+      ${bars ? `<div class="bars">${bars}</div>` : ""}
+      ${budget ? `<div class="bars">${budget}</div>` : ""}
+      ${e.rounds.length ? `<table><tr><th>commit</th><th>split</th><th>score</th><th>err</th></tr>${rounds}</table>` : ""}
+      ${final}
+      ${e.note ? `<div class="note">${e.note}</div>` : ""}
+    </div>`;
+  }).join("");
+  const h = d.history || [];
+  document.getElementById("hist").innerHTML = h.length ? `<h1>scorecard</h1><table>
+    <tr><th>exp</th><th>question</th><th>baseline</th><th>bar</th><th>final</th><th>verdict</th></tr>` +
+    h.map(x => `<tr><td>${x.exp}</td><td>${x.question}</td><td>${x.baseline ?? "-"}</td>
+      <td>${x.bar ?? "-"}</td><td>${x.final ?? "-"}</td><td>${x.verdict ?? "-"}</td></tr>`).join("") + "</table>" : "";


Unsanitized data injected into innerHTML

Fields from the API response — e.title, e.question, e.note, and the history columns x.exp, x.question — are interpolated directly into HTML template literals that are assigned to innerHTML. Any <script> tags or event-handler attributes in those strings will execute. The meta file is operator-controlled and the server is localhost-only, so the practical risk is low, but a maliciously crafted (or accidentally munged) meta file would have full script execution in the operator's browser. Consider running these strings through a small escaping helper (htmlEscape) before insertion.
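
The comment suggests a client-side `htmlEscape` helper. As a different technique with a similar effect, the free-text fields could be escaped on the Python side before the status dict is serialized. The sketch below assumes that approach; `escape_text_fields` and the `TEXT_FIELDS` list are illustrative, and other interpolated values such as `commit` and `split` would need the same treatment.

```python
import html

# Free-text fields named in the review comment; extend as needed.
TEXT_FIELDS = ("title", "question", "note", "exp")


def escape_text_fields(record: dict) -> dict:
    """Return a copy of record with free-text string fields HTML-escaped."""
    out = dict(record)
    for field in TEXT_FIELDS:
        if isinstance(out.get(field), str):
            out[field] = html.escape(out[field], quote=True)
    return out
```

Escaping on the server means any other consumer of `/api/status` receives escaped text too, which is the main argument for doing it in the template instead.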

Path: vero/src/vero/harbor/dashboard.py, lines 271-296

…ns, legend) Feedback from first real use: terse verdicts mean nothing without campaign context. The scorecard now carries a one-sentence plain-language summary per experiment and a legend defining baseline / bar / final.

greptile-apps · 2026-07-04T16:38:35Z

+def _docker(*args: str, timeout: int = 15) -> str:
+    res = subprocess.run(
+        ["docker", *args], capture_output=True, text=True, timeout=timeout
+    )
+    return res.stdout if res.returncode == 0 else ""


subprocess.TimeoutExpired is not caught, crashing the HTTP handler

subprocess.run(..., timeout=15) raises subprocess.TimeoutExpired if the process doesn't finish in time. _docker catches neither this nor FileNotFoundError (docker not on PATH), so both propagate through snapshot_sidecar → collect → do_GET. ThreadingHTTPServer catches it at the thread boundary and closes the connection, leaving the browser with a network error rather than partial data. Any container that is alive but has a frozen shell (precisely the "crashed launch" scenario the gone phase targets) will repeatedly time out every 20 s and render the dashboard blank until manually cleared. Adding except (subprocess.TimeoutExpired, FileNotFoundError): return "" in _docker would match the existing non-zero-exit fallback.
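
The suggested fix can be sketched as follows. The split into a generic `run_quiet` helper is an addition for testability and is not part of the PR; `_docker` keeps the signature shown in the hunk apart from accepting a float timeout.

```python
from __future__ import annotations

import subprocess


def run_quiet(argv: list[str], timeout: float = 15) -> str:
    """Run a command and return stdout, or "" on any expected failure."""
    try:
        res = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Stalled container shell or missing binary: degrade to "no data".
        return ""
    return res.stdout if res.returncode == 0 else ""


def _docker(*args: str, timeout: float = 15) -> str:
    return run_quiet(["docker", *args], timeout=timeout)
```

Because the sidecars are polled sequentially, one stalled container would still add up to the full timeout to each poll cycle; the change only ensures the remaining trials are rendered.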

Path: vero/src/vero/harbor/dashboard.py, lines 150-154

…banners Computed verdicts render as PASSED/FAILED (was WIN/MISS); scorecard rows take a result field (passed / failed / no change / cancelled) shown as a color-coded chip in a dedicated verdict column.

feat(harbor): PASSED/FAILED verdict chips on the scorecard and final …

1d4b033

shehabyasser-scale wants to merge 3 commits into harbor-2-sidecar-observability from harbor-2-exp-dashboard
