fix(harbor): make baseline scoring at finalize durable and retried#27
Open
shehabyasser-scale wants to merge 1 commit into
Open
fix(harbor): make baseline scoring at finalize durable and retried#27shehabyasser-scale wants to merge 1 commit into
shehabyasser-scale wants to merge 1 commit into
Conversation
A live trial silently skipped baseline scoring: the nested baseline eval failed
transiently, the failure was swallowed by a warn-and-continue except, and the only
record (a log line) died with the sidecar container at teardown. The admin volume
that baseline.json was written to does not survive teardown either, so there was no
durable evidence of whether the baseline was scored, skipped, or crashed.
Two changes:
- Retry the baseline eval (default 2 attempts) so a single transient nested-run
failure does not drop the regression check.
- finalize() now returns {"rewards": ..., "baseline": ...}; the baseline outcome
(scores, a skip reason, or an error) is surfaced in the finalize response. The CLI
writes only rewards to reward.json (the outer harness consumes its keys, unchanged)
and echoes the full payload to stdout, which is captured into the trial's stdout on
the host, the one channel that survives teardown. The CLI tolerates the old
bare-rewards shape for a mixed-version sidecar.
Baseline scoring still never fails the trial.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines
+195
to
+199
| logger.exception( | ||
| "baseline scoring failed after %d attempt(s); reward.json is unaffected", | ||
| self._baseline_score_attempts, | ||
| exc_info=last_error, | ||
| ) |
There was a problem hiding this comment.
logger.exception(..., exc_info=last_error) is called outside an except block with an exception instance as exc_info. In Python 3.11 (the minimum required by pyproject.toml), the logging.LogRecord constructor treats any non-tuple exc_info value by falling through to sys.exc_info(), which returns (None, None, None) when there is no active exception context — so the traceback of last_error is silently dropped from the final log message. Exception-instance support for exc_info was only added in Python 3.12. Passing an explicit 3-tuple makes this work on all supported versions.
Suggested change
| logger.exception( | |
| "baseline scoring failed after %d attempt(s); reward.json is unaffected", | |
| self._baseline_score_attempts, | |
| exc_info=last_error, | |
| ) | |
| logger.error( | |
| "baseline scoring failed after %d attempt(s); reward.json is unaffected", | |
| self._baseline_score_attempts, | |
| exc_info=(type(last_error), last_error, last_error.__traceback__), | |
| ) |
Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/verifier.py
Line: 195-199
Comment:
`logger.exception(..., exc_info=last_error)` is called outside an `except` block with an exception instance as `exc_info`. In Python 3.11 (the minimum required by `pyproject.toml`), the `logging.LogRecord` constructor treats any non-tuple `exc_info` value by falling through to `sys.exc_info()`, which returns `(None, None, None)` when there is no active exception context — so the traceback of `last_error` is silently dropped from the final log message. Exception-instance support for `exc_info` was only added in Python 3.12. Passing an explicit 3-tuple makes this work on all supported versions.
```suggestion
logger.error(
"baseline scoring failed after %d attempt(s); reward.json is unaffected",
self._baseline_score_attempts,
exc_info=(type(last_error), last_error, last_error.__traceback__),
)
```
How can I resolve this? If you propose a fix, please make it concise.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem (observed live, then root-caused)
A trial silently skipped
score_baselineat finalize: reward.json landed ~2 minutes after the winner's validation eval with no baseline.json and no baseline ledger row, while a byte-identical config produced one in another run. A three-agent investigation traced the mechanism: the nested baseline eval failed transiently within seconds,_maybe_score_baseline's warn-and-continueexceptswallowed it, and the only record (alogger.exceptionto sidecar stderr) was destroyed with the container at teardown. The admin volume holding baseline.json does not survive teardown either. Two subsequent identical runs scored the baseline fine, confirming the failure is intermittent and environmental, not config- or state-dependent.Fix
baseline_score_attemptstotal attempts (default 2, wired throughServeConfig), so a single transient nested-run failure no longer drops the regression check.finalize()now returns{"rewards": ..., "baseline": ...}wherebaselineis{"scores": ...},{"skipped": reason}or{"error": ..., "error_type": ..., "attempts": N}. The CLI writes onlyrewardsto reward.json (the outer harness contract is unchanged) and echoes the full payload to stdout, which harbor captures into the trial'stest-stdout.txton the host: the one channel that survives teardown. The CLI tolerates the old bare-rewards response shape for a mixed-version sidecar.Baseline scoring still never fails the trial.
Tests
attempts: 2/finalizemock updated to the real wrapper shapeStacked on #25 (
harbor-2-free-baseline-eval). An adversarial review panel (4 lenses + verification) ran pre-open; all confirmed findings addressed.🤖 Generated with Claude Code
Greptile Summary
This PR makes the baseline-scoring step at finalize both retryable and durably observable: instead of swallowing a transient failure into a log line that dies with the container,
finalize()now returns{"rewards": {...}, "baseline": {...}}wherebaselineholds either the scores, a skip reason, or a structured error after all retry attempts are exhausted._maybe_score_baselineloops up tobaseline_score_attempts(default 2) times; a single transient nested-run failure no longer silently drops the regression check.rewardstoreward.json(outer harness contract unchanged) and echoes the full response — including the baseline outcome — to stdout, which Harbor captures intotest-stdout.txton the host, the one channel that survives sidecar teardown.reward.json.Confidence Score: 4/5
Safe to merge once the logging issue in verifier.py is addressed; reward.json output and the retry semantics are correct.
The final 'all attempts failed' log message calls logger.exception(..., exc_info=last_error) outside any except block, passing an exception instance rather than a 3-tuple. Python 3.11 (the project's minimum version per pyproject.toml) only accepts tuples or True for exc_info — a plain exception object causes the logging module to fall back to sys.exc_info(), which returns (None, None, None) outside an exception context, silently discarding the traceback. Improved traceback observability at failure was one of the stated goals of this PR. Everything else — the retry loop, the response wrapper, the CLI back-compat guard, and the new tests — is correct and well-tested.
vero/src/vero/harbor/verifier.py — the logger.exception call at the end of _maybe_score_baseline
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%% sequenceDiagram participant H as Harbor (host) participant CLI as vero harbor finalize (CLI) participant App as /finalize endpoint (FastAPI) participant V as Verifier participant E as EvaluationEngine H->>CLI: invoke finalize --token-file --output CLI->>App: POST /finalize (Bearer token) App->>V: finalize() V->>E: _select_commit() E-->>V: sha loop for each target V->>E: "evaluate_admin(commit=sha)" E-->>V: score end V->>V: _maybe_score_baseline(rewards) loop attempt 1..N (baseline_score_attempts) V->>E: "evaluate_admin(commit=base_commit)" alt success E-->>V: baseline score V-->>V: "return {scores, attempts}" else transient failure E-->>V: Exception V->>V: log warning, retry end end alt all attempts failed V-->>V: "return {error, error_type, attempts}" end V-->>App: "{rewards, baseline}" App-->>CLI: HTTP 200, JSON payload CLI->>CLI: extract rewards (or bare resp for back-compat) CLI->>CLI: write rewards to reward.json CLI->>H: echo full resp to stdout (captured into test-stdout.txt)%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% sequenceDiagram participant H as Harbor (host) participant CLI as vero harbor finalize (CLI) participant App as /finalize endpoint (FastAPI) participant V as Verifier participant E as EvaluationEngine H->>CLI: invoke finalize --token-file --output CLI->>App: POST /finalize (Bearer token) App->>V: finalize() V->>E: _select_commit() E-->>V: sha loop for each target V->>E: "evaluate_admin(commit=sha)" E-->>V: score end V->>V: _maybe_score_baseline(rewards) loop attempt 1..N (baseline_score_attempts) V->>E: "evaluate_admin(commit=base_commit)" alt success E-->>V: baseline score V-->>V: "return {scores, attempts}" else transient failure E-->>V: Exception V->>V: log warning, retry end end alt all attempts failed V-->>V: "return {error, error_type, attempts}" end V-->>App: "{rewards, baseline}" App-->>CLI: HTTP 200, JSON payload CLI->>CLI: extract rewards (or bare resp for back-compat) CLI->>CLI: write rewards to reward.json CLI->>H: echo full resp to stdout (captured into test-stdout.txt)Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "fix(harbor): make baseline scoring at fi..." | Re-trigger Greptile