feat(harbor): operator dashboard for watching optimization trials live #29

Open
shehabyasser-scale wants to merge 3 commits into harbor-2-sidecar-observability from harbor-2-exp-dashboard

shehabyasser-scale commented Jul 4, 2026

Collaborator

What

vero harbor dashboard serves a local, self-refreshing status page for Harbor optimization trials: which experiments are running, what each is doing right now, and how finished ones scored against their pre-registered bars. Built while operating a 13-experiment campaign where "what's running and where is it?" was answered by hand-rolled shell monitors; this replaces them with one page.

How it works

  • Live trials: discovers *-eval-sidecar-1 containers via docker ps, then tails each trial's ledger through the admin /experiments endpoint (this PR stacks on the observability PR that added it), using the in-container admin token via docker exec. Remaining budget comes from /status.
  • Finals: scans harbor jobs dirs (--jobs, repeatable) for reward.json, keyed by the trial dir's task name so finished trials stay visible after teardown.
  • Context: an optional --meta JSON supplies what containers cannot know: each experiment's question, pre-registered bars (baseline / win bar / healthy ceiling), notes, and a scorecard of completed experiments. Verdicts (WIN/MISS) are computed against the win bar.
  • Phases are derived, not guessed: building (container up, no ledger rows) -> running -> done (reward.json), and gone when there is no container and no final, so a crashed launch is conspicuous instead of silently absent. Live trials missing from the meta file still appear.
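The phase rules in the last bullet can be sketched as a small pure function. This is only an illustration: `derive_phase` and its parameters are hypothetical names, and the precedence (a final wins over a still-running container) is an assumption, since the description does not say which takes priority.

```python
from __future__ import annotations


def derive_phase(has_container: bool, ledger_rows: int, has_final: bool) -> str:
    """Map observable facts to one of the four phases named in the PR.

    Hypothetical helper; the real module may order these checks differently.
    """
    if has_final:
        return "done"      # a reward.json was found for this trial
    if not has_container:
        return "gone"      # no container and no final: likely a crashed launch
    if ledger_rows == 0:
        return "building"  # container is up but no evals are recorded yet
    return "running"
```

Because the function takes only plain values, each transition can be tested without docker.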

Design constraints

  • Zero new dependencies: stdlib http.server + inline HTML/JS, subprocess for docker.
  • Binds 127.0.0.1 by default: ledger rows include hidden-split scores, so the page is for the operator, not the network.
  • Host-side admin tool (same trust posture as the /experiments endpoint it consumes); it never reads agent volumes.
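A minimal sketch of the first two constraints, using only the standard library and binding to loopback by default. `make_handler`, `make_server`, and the `get_status` callback are illustrative names, not the PR's actual structure; the `/api/status` path and port 8300 come from the sequence diagram and usage example on this page.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(get_status):
    """Build a handler class that serves a JSON status and a stub page."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/api/status":
                body = json.dumps(get_status()).encode()
                ctype = "application/json"
            elif self.path == "/":
                body = b"<!doctype html><title>dashboard</title>"
                ctype = "text/html"
            else:
                self.send_error(404)
                return
            self.send_response(200)
            self.send_header("Content-Type", ctype)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the operator's terminal quiet

    return Handler


def make_server(get_status, host="127.0.0.1", port=8300):
    """Loopback by default, so hidden-split scores stay on the operator's machine."""
    return ThreadingHTTPServer((host, port), make_handler(get_status))
```

Calling `make_server(...).serve_forever()` would start serving; exposing the page on another interface would then be an explicit choice by the operator.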

Usage

vero harbor dashboard --meta exp-meta.json --jobs /tmp/gaia-exp10-jobs --jobs /tmp/gaia-exp11-jobs
# -> http://127.0.0.1:8300

Tests

The status-assembly layer is pure and covered without docker (16 tests): container-name to experiment-key derivation, phase transitions, WIN/MISS verdicts, the meta/live/final join (including unlisted-live-trial and meta-only-gone cases), ledger normalization with null rows, and reward.json scanning.
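As an illustration of the first item, container-name to experiment-key derivation might look like the sketch below. The `-eval-sidecar-1` suffix comes from the description; `experiment_key` is a hypothetical name, and the assumption that the container prefix follows the same `<task>__<suffix>` shape as the trial directories is mine, not something the PR states.

```python
from __future__ import annotations

SIDECAR_SUFFIX = "-eval-sidecar-1"


def experiment_key(container_name: str) -> str | None:
    """Return the experiment key for a sidecar container, else None.

    Assumption: the part before the suffix looks like "<task>__<suffix>",
    and the task name is the key, mirroring how finals are keyed.
    """
    if not container_name.endswith(SIDECAR_SUFFIX):
        return None
    stem = container_name[: -len(SIDECAR_SUFFIX)]
    return stem.split("__")[0] or None
```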

🤖 Generated with Claude Code

Greptile Summary

This PR introduces vero harbor dashboard, a zero-dependency local status page for watching Harbor optimization trials live — discovering sidecar containers via docker ps, tailing ledgers through the admin /experiments endpoint, scanning jobs dirs for reward.json finals, and joining everything with an optional --meta context file. The pure status-assembly layer is well-separated and covered by 16 tests, but several issues identified in the review remain open.

  • History scorecard field name mismatch: the frontend template reads x.result for the verdict chip, but the test fixture (and therefore the effective meta-file documentation) uses "verdict" — the column will silently show "-" for every historical experiment in any meta file following the example. Values like "WIN"/"MISS" also won't match the vclass lookup.
  • Error handling gaps in _docker: subprocess.TimeoutExpired and FileNotFoundError are uncaught and propagate through collect() into do_GET, causing the browser to receive a network error for the entire poll cycle rather than partial data when a container shell stalls.
  • Admin token on command line: the Bearer token is passed as a -H argument in the docker exec curl call, making it readable via ps or /proc/<pid>/cmdline inside the container.
  • verdict() picks the first numeric value from reward.json regardless of field name; scan_finals silently overwrites on key collision when multiple reward files share the same task key.

Confidence Score: 3/5

Safe to run locally but the history scorecard will silently malfunction for any meta file written following the test example, and unhandled docker-exec timeouts can blank the dashboard entirely during active trials.

The history scorecard's verdict column will show "-" for every historical experiment in any meta file an operator writes using the "verdict" key demonstrated by the test fixture — the frontend reads x.result. Several other issues from prior review rounds (unhandled TimeoutExpired crashing the HTTP handler, admin token in the container process table, innerHTML injection, first-value verdict pick) are also still unresolved, compounding the overall risk for a decision-critical operator tool.

The history fixture in test_harbor_dashboard.py and the frontend template in dashboard.py (around the x.result / vclass lines) are the immediate focus; the _docker error-handling and verdict() key-selection issues in dashboard.py also warrant attention before this lands.

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/src/vero/harbor/dashboard.py | New dashboard module: clean pure-layer / side-effect separation, but several flagged issues remain open: unhandled TimeoutExpired/FileNotFoundError crashes the handler thread, admin token exposed on container process list, innerHTML injection from operator-controlled strings, first-numeric-value pick in verdict(), and last-writer-wins in scan_finals(). |
| vero/tests/test_harbor_dashboard.py | Good coverage of pure helpers and join logic; test fixture uses wrong field key ("verdict" vs "result") that will cause silent rendering failure in the history scorecard. |
| vero/src/vero/harbor/cli.py | Adds the "dashboard" CLI command with appropriate options; straightforward delegation to dashboard.py with no issues. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Browser
    participant Handler as HTTP Handler (ThreadingHTTPServer)
    participant Collect as collect()
    participant Docker as docker CLI
    participant FS as Filesystem

    Browser->>Handler: GET /api/status (every 20s)
    Handler->>Collect: collect(meta_path, jobs_dirs)
    Collect->>FS: read meta JSON (optional)
    Collect->>Docker: "docker ps --format {{.Names}}"
    Docker-->>Collect: container list
    loop for each sidecar (sequential)
        Collect->>Docker: docker exec python3 -c _TOKEN_SNIPPET
        Docker-->>Collect: admin token
        Collect->>Docker: docker exec curl /experiments (Bearer token)
        Docker-->>Collect: ledger JSON
        Collect->>Docker: docker exec curl /status
        Docker-->>Collect: budget JSON
    end
    Collect->>FS: "glob **/reward.json in jobs_dirs"
    FS-->>Collect: reward.json files
    Collect-->>Handler: merged status dict
    Handler-->>Browser: 200 application/json

    Browser->>Handler: GET /
    Handler-->>Browser: 200 text/html (inline PAGE template)

Reviews (3): Last reviewed commit: "feat(harbor): PASSED/FAILED verdict chip..."

vero harbor dashboard serves a local single-page UI (stdlib HTTP server, zero
new dependencies) showing every trial on the machine: which experiments are
running, each one's live ledger (commit, split, score, error rate per recorded
eval, via the admin /experiments endpoint through docker exec), remaining
budget (/status), pre-registered bars from an optional --meta JSON, finals
scanned from harbor jobs dirs (reward.json), and a WIN/MISS verdict against
the win bar. A scorecard section renders finished experiments from the meta
file.

Phases are derived, not guessed: building (container up, no ledger rows),
running, done (reward.json exists), gone (no container and no final, so a
crashed launch is conspicuous rather than silently absent). Live trials the
meta file does not mention still appear. Binds 127.0.0.1 by default since
ledger rows include hidden-split scores.

The status-assembly layer (key derivation, phase, verdicts, meta/live/final
join, reward.json scanning) is pure and unit-tested without docker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread vero/src/vero/harbor/dashboard.py Outdated
Comment on lines +71 to +78
def verdict(final: dict | None, bars: dict | None) -> str | None:
    """WIN / MISS against the pre-registered win bar, when both are known."""
    if not final or not bars or "win_bar" not in bars:
        return None
    scores = [v for v in final.values() if isinstance(v, (int, float))]
    if not scores:
        return None
    return "WIN" if scores[0] > bars["win_bar"] else "MISS"


P1 verdict picks the first numeric value regardless of which field it is

scores = [v for v in final.values() if isinstance(v, (int, float))] then uses scores[0]. Dict iteration order is insertion order in Python 3.7+, so if reward.json ever contains multiple numeric fields (e.g. {"total_tasks": 120, "accuracy": 0.47}), scores[0] is 120 and the verdict is evaluated against the win bar with the wrong number. All tests use a single-key dict, so this case is not covered. Since the dashboard is decision-critical (WIN/MISS governs what the operator does next), the key to compare against, or at minimum a tie-breaking policy, should be specified explicitly.
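One possible shape for the explicit-key variant the comment asks for, as a sketch only. The `score_key` parameter is an assumption; the thread does not say which field of reward.json holds the score, so the caller has to name it.

```python
from __future__ import annotations


def verdict(final: dict | None, bars: dict | None, score_key: str) -> str | None:
    """WIN / MISS against the win bar, reading one named field only."""
    if not final or not bars or "win_bar" not in bars:
        return None
    score = final.get(score_key)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return "WIN" if score > bars["win_bar"] else "MISS"
```

With the reviewer's example, `verdict({"total_tasks": 120, "accuracy": 0.47}, {"win_bar": 0.5}, "accuracy")` compares 0.47 against the bar rather than 120.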


Comment on lines +201 to +213
finals: dict[str, dict] = {}
for jd in jobs_dirs:
    for rj in Path(jd).glob("**/reward.json"):
        key = None
        for part in rj.parts:
            if "__" in part:
                key = part.split("__")[0]
        if key is None:
            key = Path(jd).name
        try:
            finals[key] = json.loads(rj.read_text())
        except (OSError, json.JSONDecodeError):
            continue


P1 Silent last-writer-wins when a task has multiple reward.json files

finals[key] = json.loads(rj.read_text()) unconditionally overwrites the dict entry on every matching file. If the same task was run more than once (two trial dirs like gaia-exp11-task__run1 and gaia-exp11-task__run2, each with a reward.json), the key resolves identically and whichever file glob yields last wins — glob order is not guaranteed to be chronological. The displayed final (and therefore the WIN/MISS verdict) can be arbitrary when re-trials exist under the same jobs dir.
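A sketch of one deterministic policy: keep the most recently modified reward.json per key and report collisions so the page could flag them. Choosing by mtime is an assumption (the thread does not prescribe a policy), and the two-value return is a hypothetical signature, not the PR's.

```python
from __future__ import annotations

import json
from pathlib import Path


def scan_finals(jobs_dirs: list[str]) -> tuple[dict[str, dict], dict[str, list[str]]]:
    """Return (finals, collisions); finals keeps the newest file per key."""
    newest: dict[str, tuple[float, dict]] = {}
    seen: dict[str, list[str]] = {}
    for jd in jobs_dirs:
        # sorted() makes the traversal order reproducible across runs.
        for rj in sorted(Path(jd).glob("**/reward.json")):
            key = None
            for part in rj.parts:
                if "__" in part:
                    key = part.split("__")[0]
            if key is None:
                key = Path(jd).name
            try:
                data = json.loads(rj.read_text())
                mtime = rj.stat().st_mtime
            except (OSError, json.JSONDecodeError):
                continue
            seen.setdefault(key, []).append(str(rj))
            if key not in newest or mtime > newest[key][0]:
                newest[key] = (mtime, data)
    finals = {k: v[1] for k, v in newest.items()}
    collisions = {k: paths for k, paths in seen.items() if len(paths) > 1}
    return finals, collisions
```

Returning the collisions lets the caller surface re-trials instead of silently picking one.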


Comment on lines +177 to +182
if token:
    raw = _docker(
        "exec", container, "curl", "-s",
        "-H", f"Authorization: Bearer {token}",
        "http://localhost:8000/experiments",
    )


P2 security Admin token visible in container process table

The token is passed as -H "Authorization: Bearer <token>" in the docker exec curl command line. Inside the container, the full argument list is readable via /proc/<pid>/cmdline (and by any process with access to ps aux inside the container). Curl supports reading headers from stdin with -H @-, which would avoid putting the secret on the command line entirely.
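A sketch of the stdin approach, assuming the container's curl is recent enough to support `-H @-` (header files were added in curl 7.55.0) and that `docker exec -i` is used so stdin reaches curl. `curl_argv` and `fetch_with_token` are hypothetical helper names.

```python
from __future__ import annotations

import subprocess


def curl_argv(container: str, url: str) -> list[str]:
    """Build the docker exec command; the token never appears in argv."""
    # "-i" keeps stdin attached so curl inside the container can read it.
    return ["docker", "exec", "-i", container, "curl", "-s", "-H", "@-", url]


def fetch_with_token(container: str, url: str, token: str, timeout: float = 15) -> str:
    """Send the Authorization header over stdin instead of the command line."""
    try:
        res = subprocess.run(
            curl_argv(container, url),
            input=f"Authorization: Bearer {token}\n",
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return ""
    return res.stdout if res.returncode == 0 else ""
```

The header line still exists in curl's memory, but it is no longer visible through the process table.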


Comment thread vero/src/vero/harbor/dashboard.py Outdated
Comment on lines +271 to +296
document.getElementById("grid").innerHTML = d.experiments.map(e => {
  const rounds = e.rounds.slice(-10).map(x =>
    `<tr><td>${x.commit}</td><td>${x.split ?? "-"}</td><td>${fmt(x.score)}</td><td>${fmt(x.error_rate)}</td></tr>`).join("");
  const bars = e.bars ? `baseline ${fmt(e.bars.baseline)} &middot; win &gt; ${fmt(e.bars.win_bar)}` +
    (e.bars.healthy !== undefined ? ` &middot; healthy ${fmt(e.bars.healthy)}` : "") : "";
  const budget = (e.budget || []).map(s =>
    s.remaining_run_budget !== undefined && s.remaining_run_budget !== null
      ? `${s.split}: ${s.remaining_run_budget} evals left` : "").filter(Boolean).join(" &middot; ");
  const final = e.final ? `<div class="final ${e.verdict ?? "noverdict"}">final ${
    Object.entries(e.final).map(([k,v]) => k+" = "+fmt(v)).join(", ")}${
    e.verdict ? " &middot; " + e.verdict : ""}</div>` : "";
  return `<div class="card">
    <div class="hdr"><span class="name">${e.title}</span><span class="chip ${e.phase}">${e.phase}</span></div>
    ${e.question ? `<div class="q">${e.question}</div>` : ""}
    ${bars ? `<div class="bars">${bars}</div>` : ""}
    ${budget ? `<div class="bars">${budget}</div>` : ""}
    ${e.rounds.length ? `<table><tr><th>commit</th><th>split</th><th>score</th><th>err</th></tr>${rounds}</table>` : ""}
    ${final}
    ${e.note ? `<div class="note">${e.note}</div>` : ""}
  </div>`;
}).join("");
const h = d.history || [];
document.getElementById("hist").innerHTML = h.length ? `<h1>scorecard</h1><table>
  <tr><th>exp</th><th>question</th><th>baseline</th><th>bar</th><th>final</th><th>verdict</th></tr>` +
  h.map(x => `<tr><td>${x.exp}</td><td>${x.question}</td><td>${x.baseline ?? "-"}</td>
    <td>${x.bar ?? "-"}</td><td>${x.final ?? "-"}</td><td>${x.verdict ?? "-"}</td></tr>`).join("") + "</table>" : "";


P2 Unsanitized data injected into innerHTML

Fields from the API response — e.title, e.question, e.note, and the history columns x.exp, x.question — are interpolated directly into HTML template literals that are assigned to innerHTML. Any <script> tags or event-handler attributes in those strings will execute. The meta file is operator-controlled and the server is localhost-only, so the practical risk is low, but a maliciously crafted (or accidentally munged) meta file would have full script execution in the operator's browser. Consider running these strings through a small escaping helper (htmlEscape) before insertion.
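The comment suggests an escaping helper in the page's JavaScript. As an alternative sketch, the same protection could be applied in Python before the status dict is serialized, using the standard library's `html.escape`. `escape_experiment` and `TEXT_FIELDS` are hypothetical names, and escaping server-side rather than in the browser is an assumption about where the fix would live; if both sides escaped, text would be double-encoded.

```python
import html

# Free-text fields named in the review comment; extend as needed.
TEXT_FIELDS = ("title", "question", "note")


def escape_experiment(exp: dict) -> dict:
    """Return a copy of one experiment entry with free-text fields HTML-escaped."""
    out = dict(exp)
    for field in TEXT_FIELDS:
        if isinstance(out.get(field), str):
            # quote=True also escapes " and ', which matters inside attributes.
            out[field] = html.escape(out[field], quote=True)
    return out
```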


…ns, legend)

Feedback from first real use: terse verdicts mean nothing without campaign
context. The scorecard now carries a one-sentence plain-language summary per
experiment and a legend defining baseline / bar / final.
Comment on lines +150 to +154
def _docker(*args: str, timeout: int = 15) -> str:
    res = subprocess.run(
        ["docker", *args], capture_output=True, text=True, timeout=timeout
    )
    return res.stdout if res.returncode == 0 else ""


P1 subprocess.TimeoutExpired is not caught, crashing the HTTP handler

subprocess.run(..., timeout=15) raises subprocess.TimeoutExpired if the process doesn't finish in time. _docker catches neither this nor FileNotFoundError (docker not on PATH), so both propagate through snapshot_sidecar -> collect -> do_GET. ThreadingHTTPServer catches it at the thread boundary and closes the connection, leaving the browser with a network error rather than partial data. Any container that is alive but has a frozen shell (precisely the "crashed launch" scenario the gone phase targets) will repeatedly time out every 20 s and render the dashboard blank until manually cleared. Adding except (subprocess.TimeoutExpired, FileNotFoundError): return "" in _docker would match the existing non-zero-exit fallback.


…banners

Computed verdicts render as PASSED/FAILED (was WIN/MISS); scorecard rows take a
result field (passed / failed / no change / cancelled) shown as a color-coded
chip in a dedicated verdict column.