feat(eval): added isolated/apply/merge options to agent() helper #3205

Open
roboomp wants to merge 18 commits into main from farm/d00e307f/add-isolated-mode-support-to-workflowz-e

@roboomp

@roboomp roboomp commented Jun 21, 2026

Collaborator

Repro

In a workspace with task.isolation.mode !== "none", dispatching parallel agent() calls from a workflowz eval cell — e.g.

parallel([
    lambda pkg=pkg: agent(f"Migrate {pkg}")
    for pkg in ["pi-ai", "pi-catalog", "pi-tui"]
])

all run inside the parent worktree. The task tool's isolated: true wrapper isn't reachable from the eval bridge, so overlapping edits across fan-out items race in the same working tree, which defeats the point of using workflowz for safe parallel migrations/audits.

Cause

packages/coding-agent/src/eval/agent-bridge.ts accepts only prompt, agentType, model, label, and schema, then calls taskExecutor.runSubprocess(...) directly; it never touches the isolation lifecycle that TaskTool.#runSpawn runs in packages/coding-agent/src/task/index.ts (repo baseline capture → ensureIsolation(...) → runSubprocess({ worktree: handle.mergedDir, ... }) → patch/branch merge → cleanupIsolation(...)). Neither the Python nor the JS prelude exposes or forwards an isolated option.

Fix

  • packages/coding-agent/src/task/isolation-runner.ts (new): extracted prepareIsolationContext, runIsolatedSubprocess, mergeIsolatedChanges so the eval bridge and TaskTool.#runSpawn share one lifecycle implementation. TaskTool is refactored to consume the helpers — behavior unchanged.
  • packages/coding-agent/src/eval/agent-bridge.ts: added isolated, apply, merge to the arktype schema. Default tracks task.isolation.mode (isolated when not "none", off when "none"); isolated=True while task.isolation.mode === "none" throws the same clear error as the task tool; apply=false skips the merge phase and surfaces the patch path / branch name in details; merge=false forces patch mode even when task.isolation.merge === "branch".
  • packages/coding-agent/src/eval/py/prelude.py, packages/coding-agent/src/eval/js/shared/prelude.txt: both agent() signatures accept and forward isolated, apply, merge. JS supports both options-object (agent(prompt, { isolated: true })) and positional forms.
  • packages/coding-agent/src/prompts/system/workflow-notice.md: documented isolated=True/False, apply=False, merge=False.
  • packages/coding-agent/src/eval/__tests__/agent-bridge.test.ts: 5 new tests for disabled-isolation rejection, default-from-settings, explicit isolated=false, merge: false → patch mode + worktree cwd plumbing, apply: false skipping merge and surfacing patch/branch.
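
For illustration, the option semantics listed above (as they stood in this first revision) can be modeled in a few lines of Python. This is only a sketch of the described contract, not the bridge's TypeScript code: `resolve_isolation`, `IsolationPlan`, and the error text are hypothetical names, and the real schema lives in agent-bridge.ts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IsolationPlan:
    isolated: bool    # run the subagent in its own worktree
    run_merge: bool   # run the merge/apply phase afterwards
    merge_mode: str   # "patch" or "branch"


def resolve_isolation(
    settings_mode: str,              # task.isolation.mode ("none" disables isolation)
    settings_merge: str,             # task.isolation.merge ("patch" or "branch")
    isolated: Optional[bool] = None,
    apply: bool = True,
    merge: bool = True,
) -> IsolationPlan:
    """Hypothetical model of the defaults described in the Fix list."""
    enabled = settings_mode != "none"
    if isolated is True and not enabled:
        # An explicit request cannot be honored while isolation is disabled.
        raise ValueError("isolated=True requires task.isolation.mode != 'none'")
    # Omitted `isolated` tracks the setting; an explicit value wins.
    use_isolation = enabled if isolated is None else isolated
    # merge=False forces patch mode even when settings ask for "branch".
    mode = settings_merge if merge else "patch"
    # apply=False skips the merge phase entirely.
    return IsolationPlan(use_isolation, use_isolation and apply, mode)
```

Under this model, a call with an enabled mode (the value "worktree" is used here purely as a placeholder) and no explicit `isolated` argument yields an isolated plan, while `apply=False` leaves `run_merge` off so the caller has to pick up the patch or branch from details.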

Verification

  • bun test packages/coding-agent/src/eval/__tests__/agent-bridge.test.ts — 22/22 pass (5 new isolation tests included).
  • bun test packages/coding-agent/test/task — 160/160 pass (regression coverage for the refactored TaskTool isolation lifecycle: subagent-lsp.test.ts, task-batch.test.ts, render-call.test.ts, worktree.test.ts, executor-pass-through.test.ts).
  • bun test packages/coding-agent/src/eval/__tests__/prelude-agent.test.ts packages/coding-agent/src/eval/py/__tests__/prelude.test.ts — 5/5 pass.
  • bun test packages/coding-agent/test/modes/workflow.test.ts packages/coding-agent/test/agent-session-magic-keywords.test.ts — 10/10 pass.

Fixes #3196

roboomp added 2 commits June 21, 2026 17:20
The workflowz eval path bypasses the task tool's isolation wrapper and
calls runSubprocess() directly, so parallel agent() fan-outs that edit
overlapping files all land in the parent worktree.

Extends the eval agent bridge schema with isolated/apply/merge, forwards
them through the Python and JS preludes, and adds a shared
task/isolation-runner.ts so the lifecycle (prepare context → run in
worktree → capture patch/branch → merge → cleanup) is implemented once
for both TaskTool and the bridge.

Default mirrors task.isolation.mode: isolated by default when settings
allow it, off when mode === 'none'. isolated=False explicitly disables;
isolated=True with mode === 'none' errors out to match the task tool.
apply=false keeps captured changes inside the worktree and surfaces the
patch path / branch name in details. merge=false forces patch mode even
when task.isolation.merge === 'branch'.

Fixes #3196

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a22499dcd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


 return {
-text: result.output,
+text: result.output + mergeSummary,

P1: Keep structured agent() output parseable under isolation

When a workflow calls agent(..., schema=...) while isolation is enabled, mergeIsolatedChanges() appends human-readable text like “No changes to apply.” to the subagent's JSON output here, and both JS/Python preludes parse the returned text as JSON when a schema was supplied. That makes successful structured isolated runs throw in the eval runtime instead of returning the validated object; keep merge/apply status in details or otherwise avoid appending it to schema-backed output.

// invariant). A failed apply (`changesApplied === false`) keeps the dir so
// the caller can recover from `result.patchPath` manually.
const shouldCleanupTempArtifacts =
tempArtifactsDir && (!isIsolated || changesApplied === true || changesApplied === null);

P2: Preserve temp patch artifacts when apply=false

For sessions without a session file, getArtifacts() creates a temp artifact directory; with apply=false, changesApplied intentionally stays null, so this cleanup condition deletes that temp directory before returning. In patch mode the returned details.patchPath then points at a removed file, defeating the documented/manual-apply path for agent(..., apply=False) unless the session happened to have persistent artifacts.

Kept isolation merge/apply summaries out of agent() text when a schema is supplied so Python and JS eval helpers can still parse the JSON payload. The summary now lands in details.isolationSummary for callers that need the human-readable apply state.

Added a regression test that exercises an isolated schema-backed eval agent with a merge summary.

Fixes #3196
@roboomp

roboomp commented Jun 21, 2026

Collaborator Author

Addressed the schema/isolation regression and pushed 91cf5ec97ca7.

  • Kept runEvalAgent().text unchanged for schema-backed isolated calls so JS/Python preludes can still JSON.parse it.
  • Moved the human-readable isolation apply/merge summary into details.isolationSummary.
  • Added a regression test covering an isolated agent(..., schema=...) run with a merge summary; focused bridge test passes (23/23).
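
To make the failure mode concrete, here is a small Python sketch of why appending the summary broke schema-backed calls and what the fix amounts to. It is a hedged model, not the bridge code: `build_result` is a hypothetical helper, and only the `text` / `details.isolationSummary` split is taken from the description above.

```python
import json
from typing import Any, Dict


def build_result(output: str, merge_summary: str, structured: bool) -> Dict[str, Any]:
    """Keep text parseable for schema-backed calls; the summary goes to details."""
    text = output if structured else output + merge_summary
    return {"text": text, "details": {"isolationSummary": merge_summary}}


summary = "\n\nNo changes to apply."
raw = json.dumps({"migrated": True})

# Original behavior: the summary is appended, so JSON parsing fails.
try:
    json.loads(raw + summary)
    broke = False
except json.JSONDecodeError:
    broke = True

# Fixed behavior: text stays pure JSON, summary is still available in details.
fixed = build_result(raw, summary, structured=True)
parsed = json.loads(fixed["text"])
```

Non-structured calls keep the human-readable suffix in `text`, so only schema-backed callers see a change.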

roboomp added 2 commits June 21, 2026 17:28
…pply=false

The cleanup gate was treating changesApplied===null (apply=false) the same as a clean apply, deleting the temp artifacts dir before returning details.patchPath. Sessions without a session file (which fall back to a per-call tmp dir) ended up with a patchPath pointing at a removed file, defeating the documented manual-apply path.

Tightened the cleanup condition to remove the temp dir only on a confirmed clean apply (changesApplied===true); apply=false and failed applies both keep the artifact for the caller.

Added regression tests for the apply=false preserve case and the apply-succeeds cleanup case.

Fixes #3196
@roboomp

roboomp commented Jun 21, 2026

Collaborator Author

Addressed the temp-artifact cleanup regression and pushed 9b33e0e64489.

  • Tightened the cleanup gate so the temp artifacts dir survives apply=false runs (changesApplied === null); only a clean apply (changesApplied === true) deletes it now.
  • Added regression tests: apply=false keeps details.patchPath on disk; the apply-succeeds path still deletes the temp dir.
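
The tightened gate is easiest to read as a truth table over the three states of changesApplied. The Python sketch below mirrors the condition `tempArtifactsDir && (!isIsolated || changesApplied === true)` quoted later in this thread; the function name is hypothetical and the real check lives in the TypeScript bridge.

```python
from typing import Optional


def should_cleanup_temp_artifacts(
    temp_artifacts_dir: Optional[str],
    is_isolated: bool,
    changes_applied: Optional[bool],
) -> bool:
    """Delete the temp dir only for non-isolated runs or a confirmed clean apply."""
    if not temp_artifacts_dir:
        return False  # nothing temporary was created
    # changes_applied: True = applied, False = apply failed, None = apply=false
    return (not is_isolated) or changes_applied is True
```

With this gate both `None` (apply=false) and `False` (failed apply) keep the directory, so details.patchPath still points at a file on disk.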

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9b33e0e644


 return {
-text: result.output,
+text: structured ? result.output : result.output + mergeSummary,

P2: Expose apply=false artifacts for structured calls

When schema and apply=False are combined, this returns only the structured JSON text, while the patch/branch reference is placed in details. In the shipped JS/Python preludes, agent() returns the parsed object and returnHandle only includes id/agent/handle, so details.patchPath/branchName is not reachable from the workflow. In that scenario the isolation worktree is cleaned up and the caller has no artifact reference for manually applying the captured changes, which defeats apply=false for structured workflows.

… node

When agent() ran with schema and apply=false, the bridge correctly returned the captured patch/branch in details, but the preludes only forwarded id/agent/handle/data on the returnHandle node. Structured workflows had no way to recover the artifact for a manual apply.

Both runtimes now copy isolated, patchPath/branchName, changesApplied, and isolationSummary onto the returnHandle node (snake_case in Python, camelCase in JS), keeping null changesApplied so apply=false stays distinguishable from a successful apply. Updated the workflow notice and the Python agent() docstring to point callers at return_handle as the artifact escape hatch for isolated+apply=false runs. Added prelude tests locking the new node shape in both runtimes.

Fixes #3196
@roboomp

roboomp commented Jun 21, 2026

Collaborator Author

Addressed the structured-call artifact-loss regression and pushed 54f4652573cd.

  • JS prelude now copies isolated, patchPath, branchName, changesApplied, and isolationSummary onto the returnHandle node (preserving null for apply=false).
  • Python prelude does the same, translating to patch_path / branch_name / changes_applied / isolation_summary on the node dict.
  • Documented return_handle=True as the artifact escape hatch for apply=False (+schema) workflows in both workflow-notice.md and the Python agent() docstring.
  • Added prelude regression tests in both runtimes locking the new node shape; bun test packages/coding-agent/src/eval/__tests__/prelude-agent.test.ts packages/coding-agent/src/eval/__tests__/agent-bridge.test.ts packages/coding-agent/src/eval/py/__tests__/prelude.test.ts passes 32/32.
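
As a rough picture of the Python-side node shape, the sketch below copies the isolation fields from camelCase details onto a snake_case node while keeping an explicit None for changes_applied. The key names come from the bullets above; `attach_isolation_fields` and the choice to omit absent keys are assumptions, not the prelude's actual code.

```python
from typing import Any, Dict

# camelCase detail keys named in the thread -> snake_case node keys.
_DETAIL_KEYS = {
    "isolated": "isolated",
    "patchPath": "patch_path",
    "branchName": "branch_name",
    "changesApplied": "changes_applied",
    "isolationSummary": "isolation_summary",
}


def attach_isolation_fields(node: Dict[str, Any], details: Dict[str, Any]) -> Dict[str, Any]:
    """Copy isolation metadata onto a return-handle node (hypothetical helper)."""
    out = dict(node)
    for src, dst in _DETAIL_KEYS.items():
        # Presence check rather than truthiness, so changesApplied=None survives
        # and apply=false stays distinguishable from a successful apply.
        if src in details:
            out[dst] = details[src]
    return out
```

A workflow that passes return_handle=True together with apply=False and a schema could then read `node["patch_path"]` or `node["branch_name"]` to apply the captured changes by hand.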

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54f4652573

Comment on lines +492 to +494
const shouldCleanupTempArtifacts = tempArtifactsDir && (!isIsolated || changesApplied === true);
if (shouldCleanupTempArtifacts) {
await fs.rm(artifactsDir, { recursive: true, force: true });

P2: Keep temp artifacts for returnHandle outputs

In sessions where getSessionFile() is null, this deletes the temp artifacts directory immediately after a non-isolated run or a successful isolated apply. That directory is where runSubprocess wrote <id>.md/<id>.jsonl, and the JS/Python preludes strip returnHandle before calling the bridge, so the bridge cannot know whether the caller will use the returned agent://<id> handle. In SDK/in-memory eval workflows using agent(..., returnHandle=True) or { returnHandle: true }, the handle is returned but agent:// resolution then fails because the backing artifacts were removed.

Comment on lines +200 to +206
if (!result.branchName || result.exitCode !== 0 || result.aborted) {
return {
summary: "\n\nNo changes to apply.",
changesApplied: true,
hadAnyChanges: false,
mergedBranchForNestedPatches: false,
};

P2: Apply nested-only changes in branch isolation

When branch-mode isolation changes only a nested git repo, commitToBranch() returns no branchName because there is no root patch, but it can still return nestedPatches. This early return treats that result as “No changes to apply” and leaves mergedBranchForNestedPatches false, so the callers' branch-mode gate never applies the nested patches; after the isolation worktree is cleaned up, those nested-repo edits are lost for task.isolation.merge="branch" runs that touch only nested repositories.

@can1357

can1357 commented Jun 22, 2026

Owner

@korri123 could you check this? see if it makes sense & achieves what you wanted

roboomp added 2 commits June 22, 2026 03:59
Eval preludes now forward returnHandle to the bridge so no-session eval runs can preserve the temp artifacts backing returned agent:// handles. The bridge keeps those temporary artifact directories whenever returnHandle is requested, including non-isolated runs and successful isolated applies.

Branch-mode isolation now treats nested-only changes as merge-eligible even when no root branch was produced, letting callers apply nested patches instead of dropping them when the root repo had no diff.

Added regression coverage for returnHandle artifact preservation and nested-only branch isolation.

Fixes #3196
@roboomp

roboomp commented Jun 22, 2026

Collaborator Author

Addressed the two latest Codex P2 findings and pushed 636fbb4fccf6.

  • Forwarded returnHandle from both JS/Python preludes into the eval bridge, and preserved no-session temp artifacts whenever a returned agent:// handle will depend on them.
  • Kept existing cleanup behavior for ordinary non-handle eval calls and successful isolated applies; added bridge regressions for both preserved and cleaned temp dirs.
  • Fixed branch-mode isolation so nested-only changes (no root branch) still mark nested patches as apply-eligible instead of dropping them.
  • Added mergeIsolatedChanges() unit coverage for nested-only branch patches and failed branch runs; focused eval/isolation tests pass 36/36, and packages/coding-agent/test/task passes 162/162.
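
The nested-only case can be sketched as a small predicate. This is a simplified Python model of the branch-mode path where no root branch was produced; `RunResult` and `nested_only_eligible` are hypothetical, and the real mergeIsolatedChanges() returns a richer outcome (summary, changesApplied, and so on).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RunResult:
    exit_code: int = 0
    aborted: bool = False
    branch_name: Optional[str] = None
    nested_patches: List[str] = field(default_factory=list)


def nested_only_eligible(result: RunResult) -> bool:
    """True when a clean run produced nested patches but no root branch."""
    if result.exit_code != 0 or result.aborted:
        return False  # failed or aborted runs never apply anything
    if result.branch_name:
        return False  # a root branch exists; the normal merge path decides
    return len(result.nested_patches) > 0
```

Before the fix, every result without a branch name took the "No changes to apply" return, so the equivalent of this predicate was always false and nested edits were dropped with the worktree.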

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 636fbb4fcc


 return {
-text: result.output,
+text: structured ? result.output : result.output + mergeSummary,

P2: Surface failed applies to schema callers

When an isolated schema-backed agent() run hits a patch/branch conflict (mergeIsolatedChanges() returns changesApplied === false), this returns only the raw JSON, so the JS/Python preludes parse and return the object unless returnHandle was requested; the failure warning lives only in details and is discarded by default. In that scenario the workflow observes a successful structured result while none of the edits were applied, so failed applies should throw or otherwise be surfaced to schema callers.

Comment on lines +483 to +484
} else {
mergeSummary = "\n\nIsolation: no changes captured.";

P2: Preserve nested apply=false artifacts

Fresh evidence: the new apply=false branch skips mergeIsolatedChanges() and only considers branchName/patchPath. In branch mode, a nested-repo-only change has no root branch (commitToBranch() returns branchName: undefined with nestedPatches), so this path reports no changes captured, and details expose no nested patch before the isolation worktree is removed. That makes agent(..., apply=False) lose nested repository edits instead of giving the caller an artifact to apply manually.

Branch-mode isolation can capture nested repository changes without creating a root branch. Eval agent() with apply=false previously treated that shape as no captured changes and returned no recoverable nested patch payload after the isolation worktree was removed.

Expose captured nested patches in EvalAgentResult details and copy them onto JS/Python returnHandle nodes (nestedPatches / nested_patches). Document the return_handle escape hatch and add regression coverage for branch-mode nested-only apply=false runs.

Fixes #3196
@roboomp

roboomp commented Jun 22, 2026

Collaborator Author

Addressed the nested apply=false artifact loss and pushed 3677e71e9dac.

  • Eval bridge now surfaces captured nested repo patches in details.nestedPatches for isolated runs, including branch-mode nested-only apply=false calls with no root branch/patch.
  • JS/Python returnHandle nodes now carry those nested patches (nestedPatches / nested_patches) alongside patch/branch/apply metadata.
  • Workflow notice and Python agent() docstring now document nested patches as part of the return_handle=True artifact payload.
  • Added regressions for branch-mode nested-only apply=false and runtime handle propagation; focused eval tests pass 35/35.

A failed isolated apply (changesApplied === false) previously only set details.isolationSummary and returned the subagent text. Schema-backed agent() calls then parsed the JSON and returned the object, so workflows saw a successful structured result while none of the edits had landed.

Throw a ToolError when mergeIsolatedChanges reports a failed apply, with the merge summary plus a recovery hint pointing at the preserved patch/branch/nested artifacts so the caller can apply manually.

Added regression tests for the schema and non-schema apply-failure paths.

Fixes #3196
@roboomp

roboomp commented Jun 22, 2026

Collaborator Author

Addressed the failed-apply-as-success regression and pushed c1b01f1cb922.

  • Eval bridge now throws a ToolError whenever an isolated apply fails (changesApplied === false), so schema-backed callers can no longer mistake a parsed JSON result for a successful apply.
  • The error includes the merge failure summary plus a recovery hint pointing at the preserved patch / branch / nested-patch artifacts.
  • Removed the redundant outcome.changesApplied !== false gate that the throw made dead code; tightened the cleanup comment to reflect the new flow.
  • Added regression tests covering schema and non-schema apply failures; focused eval tests pass 37/37.
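
The fail-loud behavior can be sketched as follows. This Python model only illustrates the rule "changesApplied === false raises, everything else passes through"; the `ToolError` class here is a stand-in, and `raise_on_failed_apply` plus the hint wording are hypothetical.

```python
from typing import List, Optional


class ToolError(Exception):
    """Stand-in for the bridge's ToolError (name from the thread, body assumed)."""


def raise_on_failed_apply(
    changes_applied: Optional[bool],
    summary: str,
    patch_path: Optional[str] = None,
    branch_name: Optional[str] = None,
) -> None:
    """Raise when the apply failed; None (apply=false) and True pass through."""
    if changes_applied is not False:
        return
    hints: List[str] = []
    if patch_path:
        hints.append(f"Patch preserved at {patch_path}.")
    if branch_name:
        hints.append(f"Branch preserved as {branch_name}.")
    # Merge summary first, then the recovery hint for a manual apply.
    raise ToolError(" ".join([summary.strip(), *hints]))
```

Because the error is raised before any text is returned, a schema-backed caller can no longer parse the JSON and treat the run as a success.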

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c1b01f1cb9

Comment on lines +466 to +469
if (result.nestedPatches?.length) {
recoveryParts.push(
`Captured nested repository patches (${result.nestedPatches.length}) preserved in details.nestedPatches.`,
);

P2: Persist nested patches before throwing apply failures

When an isolated run changes a nested repo and the root patch/branch apply fails, this path throws after runIsolatedSubprocess() has already cleaned the isolation worktree. The thrown ToolError does not return EvalAgentResult.details, so the promised details.nestedPatches recovery path is unreachable and the nested patch text is lost; write those patches to a recoverable artifact or include them in an error payload before throwing.

Comment on lines +516 to +519
const shouldCleanupTempArtifacts =
tempArtifactsDir && !parsed.returnHandle && (!isIsolated || changesApplied === true);
if (shouldCleanupTempArtifacts) {
await fs.rm(artifactsDir, { recursive: true, force: true });

P2: Register temp artifacts before returning handles

For SDK/in-memory eval sessions with no sessionFile, preserving this temp directory is not enough to make agent://<id> work. Fresh evidence in this revision is that the fix only skips rm, while agent:// resolution still searches artifactsDirsFromRegistry() via registered sessions' getArtifactsDir()/sessionFile; both remain null for these in-memory runs, so the returned handle cannot find the .md even though the directory still exists.

When an isolated apply fails the bridge throws a ToolError and never returns details, so the nested-patch payload that previously lived in details.nestedPatches was unrecoverable after the isolation worktree was torn down.

The bridge now writes each captured nested patch to a file under the per-call artifacts dir (e.g. <agentId>.nested-<index>-<slug>.patch) before throwing and includes the resolved paths in the error message so the caller can apply them manually.

Added a regression test verifying the persisted file exists with the original patch contents and that the path is surfaced in the thrown error.

Fixes #3196
@roboomp

roboomp commented Jun 22, 2026

Collaborator Author

Addressed the lost-nested-patches regression and pushed a48f1ea8137f.

  • Bridge now writes each captured nested patch to <artifactsDir>/<agentId>.nested-<index>-<slug>.patch before throwing on an isolated apply failure, since the isolation worktree is already gone and the throw skips details.nestedPatches.
  • The thrown ToolError now lists the persisted nested-patch file paths so the caller can recover them out of band.
  • Added a regression test that asserts the persisted file exists with the original patch contents and that the path is named in the error message.
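
A minimal Python sketch of the persistence step, assuming each nested patch is a record with a repo path and patch text. Only the file-name pattern `<agentId>.nested-<index>-<slug>.patch` comes from the description above; `persist_nested_patches`, `slugify`, the slug rule, and the `repo` / `patch` keys are assumptions for illustration.

```python
import re
from pathlib import Path
from typing import Dict, List


def slugify(name: str) -> str:
    """Reduce a nested repo path to a filename-safe slug (assumed scheme)."""
    return re.sub(r"[^A-Za-z0-9]+", "-", name).strip("-").lower() or "repo"


def persist_nested_patches(
    artifacts_dir: Path, agent_id: str, patches: List[Dict[str, str]]
) -> List[Path]:
    """Write each nested patch to <agentId>.nested-<index>-<slug>.patch."""
    artifacts_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for index, patch in enumerate(patches):
        name = f"{agent_id}.nested-{index}-{slugify(patch['repo'])}.patch"
        target = artifacts_dir / name
        target.write_text(patch["patch"], encoding="utf-8")
        written.append(target)
    # The caller lists these paths in the error message before raising.
    return written
```

Writing the files before the throw matters because the isolation worktree is already gone at that point, so the on-disk patch is the only surviving copy of the nested edits.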

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a48f1ea813

Comment on lines +513 to +516
} catch {
// Nested patch failures are non-fatal to the parent merge
mergeSummary +=
"\n\n<system-notification>Some nested repository patches failed to apply.</system-notification>";

P2: Surface nested apply failures to schema callers

When an isolated schema-backed agent() changes a nested repo and applyNestedPatches() fails (for example, a nested patch conflict), this catch only appends a warning to mergeSummary. The returned text remains the raw structured JSON, and the JS/Python helpers return the parsed object unless returnHandle was requested, so the workflow observes success even though those nested edits were not applied; this nested-repo failure path should throw or otherwise make schema callers observe the failed apply.

@korri123

Contributor

@roboomp This seems to make eval agent() default to isolated whenever task.isolation.mode !== "none", which is not what we want. It should be explicitly opt-in through parameters.

@roboomp

roboomp commented Jun 22, 2026

Collaborator Author

@korri123 the current default was implemented per @can1357's explicit answer in this earlier comment to my open question 1 ("Should isolated=False explicitly disable isolation, or should only isolated=True have meaning while omission preserves current behavior?") — answer: "explicit disable if user has it turned on in settings", which is the inherit-from-settings semantic.

Flipping to strict opt-in is a one-line change in runEvalAgent (drop the settings inheritance, require isolated === true), and the existing tests have one assertion locking the inherit path that would need to flip too. Holding the code unchanged until @can1357 confirms the new contract — I'll push the opt-in version the moment that lands.

@korri123

Contributor

@roboomp that's inconsistent with the task tool, it's opt-in there and would be confusing / annoying if it by default created clone dirs, @can1357 thoughts?

@can1357

can1357 commented Jun 22, 2026

Owner

@roboomp do not ask for authorization anymore, listen to @korri123 plz

roboomp added 2 commits June 22, 2026 18:29
Per maintainer ruling on #3196, eval agent() now defaults to non-isolated regardless of task.isolation.mode, mirroring the task tool. isolated=true is the only way to turn it on; isolated=true while task.isolation.mode === "none" still throws the same clear error.

Updated tests, workflow-notice.md, and Python agent() docstring to reflect the strict opt-in contract. Existing isolation tests now pass isolated:true explicitly; the inherit-from-settings assertion is replaced with a default-off + isolated=true opt-in regression.

Fixes #3196
@roboomp

roboomp commented Jun 22, 2026

Collaborator Author

Per @can1357's ruling, flipped eval agent() to strict opt-in and pushed 7438efc5b3f9.

  • runEvalAgent no longer inherits isolation from task.isolation.mode; only isolated=true turns it on, matching the task tool's contract.
  • isolated=true while the setting is "none" still throws the existing clear error so callers can't silently downgrade.
  • Replaced the inherit-from-settings test with an opt-in regression: default settings → non-isolated, isolated=true → isolated; added isolated: true to the other isolation tests so their subjects still exercise the isolated path.
  • Updated workflow-notice.md and the Python agent() docstring to document the strict opt-in default; focused eval tests pass 38/38.
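
The strict opt-in contract reduces to a very small decision function. The Python sketch below models it under the assumption that the setting is passed in as a plain string; `use_isolation` and the error text are hypothetical names, not the bridge's code.

```python
from typing import Optional


def use_isolation(settings_mode: str, isolated: Optional[bool] = None) -> bool:
    """Strict opt-in: only an explicit isolated=True enables isolation."""
    if isolated is not True:
        return False  # omitted or False: never inherit from settings
    if settings_mode == "none":
        # Refuse instead of silently downgrading to a non-isolated run.
        raise ValueError("isolated=True requires task.isolation.mode != 'none'")
    return True
```

In workflow code this means a fan-out such as the one in the repro stays in the parent worktree unless each call passes isolated=True.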

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7438efc5b3

outputSchema: structured ? parsed.schema : undefined,
sessionFile,
persistArtifacts: Boolean(sessionFile),
const result = await withBridgeTimeoutPause(options.emitStatus, async () => {

P2: Keep timeout paused through isolation apply

For agent(..., isolated=True), this wrapper emits timeout-resume as soon as runIsolatedSubprocess returns, but the isolation merge/apply work below (mergeIsolatedChanges and applyNestedPatches) still runs before control returns to the JS/Python runtime. When the eval idle watchdog is wired and a cherry-pick, large patch apply, or nested commit-message generation takes longer than the cell timeout, the cell can be aborted during successful post-processing; keep the pause around the whole bridge operation, not just the subagent subprocess.

Previously withBridgeTimeoutPause only wrapped the subagent subprocess; mergeIsolatedChanges, applyNestedPatches, nested commit-message generation, and artifact cleanup ran with the eval watchdog re-armed. A cherry-pick or large patch apply could trip the cell timeout and abort successful post-processing.

Moved the entire bridge work (subprocess + merge + nested apply + cleanup + usage recording) inside one withBridgeTimeoutPause block. The pause helper still resumes via its finally on success and on throw, so existing failure paths are unchanged.

Added a regression that captures the emitted op order and asserts merge fires after timeout-pause and before timeout-resume.
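
The ordering guarantee described above can be sketched as follows. This is a minimal illustration, not the PR's actual withBridgeTimeoutPause: the names `withPause`, `recordOps`, and the `StatusOp` union are hypothetical, and the subprocess and merge work are replaced by a stand-in that only emits a "merge" op.

```typescript
// Hypothetical op names, modeled on the "timeout-pause" / "timeout-resume" ops mentioned in this thread.
type StatusOp = "timeout-pause" | "merge" | "timeout-resume";
type EmitStatus = (op: StatusOp) => void;

// Pause the watchdog, run ALL bridge work, and always resume in `finally`.
async function withPause<T>(emit: EmitStatus, work: () => Promise<T>): Promise<T> {
	emit("timeout-pause");
	try {
		return await work();
	} finally {
		// Runs on both the success path and the throw path.
		emit("timeout-resume");
	}
}

// Records the emitted op order, in the spirit of the regression test described above.
async function recordOps(fail: boolean): Promise<StatusOp[]> {
	const ops: StatusOp[] = [];
	const emit: EmitStatus = (op) => {
		ops.push(op);
	};
	try {
		await withPause(emit, async () => {
			// Stand-in for subprocess + merge + nested apply + cleanup.
			emit("merge");
			if (fail) throw new Error("isolated apply failed");
		});
	} catch {
		// The failure still propagates out of withPause; it is swallowed here only to inspect the order.
	}
	return ops;
}
```

Because the merge stand-in runs inside the closure, "merge" always lands between the pause and resume ops, whether or not the work throws.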

Fixes #3196
@roboomp

roboomp commented Jun 22, 2026

Collaborator Author

Addressed the timeout-resume-mid-merge regression and pushed 15ae84f0defb.

  • Moved mergeIsolatedChanges, the nested-patch apply branch, nested-patch persistence, temp artifact cleanup, and recordEvalSubagentUsage inside the existing withBridgeTimeoutPause block so the eval watchdog stays paused for the whole bridge call, not just the subagent subprocess.
  • Failure throws (subagent failure, isolated apply failure) still propagate; withBridgeTimeoutPause's finally already emits timeout-resume on both success and throw paths.
  • Added a regression test that records emitStatus op order and asserts merge fires after timeout-pause and before timeout-resume; focused bridge tests pass 32/32.

@korri123

Contributor

@roboomp Please resolve the duplication and try to refactor it a bit so that there is less risk of implementation drift.

The clearest duplication is the nested-patch apply block.

In packages/coding-agent/src/task/index.ts, it appears as:

// Apply nested repo patches (separate from parent git)
if (isIsolated && repoRoot && (mergeMode === "branch" || changesApplied !== false)) {
	const nestedPatches = result.nestedPatches ?? [];
	const eligible =
		nestedPatches.length > 0 &&
		result.exitCode === 0 &&
		!result.aborted &&
		(mergeMode !== "branch" || mergedBranchForNestedPatches);
	if (eligible) {
		try {
			await applyNestedPatches(repoRoot, nestedPatches, buildCommitMessageFn());
		} catch {
			// Nested patch failures are non-fatal to the parent merge
			mergeSummary +=
				"\n\n<system-notification>Some nested repository patches failed to apply.</system-notification>";
		}
	}
}

And in packages/coding-agent/src/eval/agent-bridge.ts, the same policy appears as:

// Apply nested repo patches (separate from parent git). The throw
// above already exited on a failed parent merge, so we know either
// the parent succeeded (patch mode) or branch mode is in play.
const nestedPatches = result.nestedPatches ?? [];
const eligible =
	nestedPatches.length > 0 &&
	result.exitCode === 0 &&
	!result.aborted &&
	(mergeMode !== "branch" || outcome.mergedBranchForNestedPatches);
if (eligible) {
	try {
		await applyNestedPatches(isolationContext.repoRoot, nestedPatches, buildCommitMessage());
	} catch {
		// Nested patch failures are non-fatal to the parent merge
		mergeSummary +=
			"\n\n<system-notification>Some nested repository patches failed to apply.</system-notification>";
	}
}

The same core contract is duplicated in both places:

const nestedPatches = result.nestedPatches ?? [];
const eligible =
	nestedPatches.length > 0 &&
	result.exitCode === 0 &&
	!result.aborted &&
	(mergeMode !== "branch" || /* branch/root merge succeeded */);

if (eligible) {
	try {
		await applyNestedPatches(...);
	} catch {
		mergeSummary +=
			"\n\n<system-notification>Some nested repository patches failed to apply.</system-notification>";
	}
}

There is also smaller duplication in the AI commit-message callback construction.

In task/index.ts:

const buildCommitMessageFn = () =>
	commitStyle === "ai" && this.session.modelRegistry
		? async (diff: string) => {
				return generateCommitMessage(
					diff,
					this.session.modelRegistry!,
					this.session.settings,
					this.session.getSessionId?.() ?? undefined,
				);
			}
		: undefined;

In eval/agent-bridge.ts:

const buildCommitMessage = () =>
	commitStyle === "ai" && options.session.modelRegistry
		? async (diff: string) => {
				return generateCommitMessage(
					diff,
					options.session.modelRegistry!,
					options.session.settings,
					options.session.getSessionId?.() ?? undefined,
				);
			}
		: undefined;

That one is less concerning, but still a sign that isolation support is split across caller code instead of fully centralized. Look for other places relevant only to this PR's changes and try to make it nice, clean and DRY.

TaskTool and the eval agent() bridge each held a private copy of the nested-repo patch eligibility gate and the AI commit-message factory; isolation policy could drift between the two callers.

Moved both into task/isolation-runner.ts:
- applyEligibleNestedPatches(opts) — single nested-patch gate (skip on patch-mode parent failure, skip on branch-mode unmerged root, fail non-fatally with a system-notification suffix).
- makeIsolationCommitMessage(session) — single factory that yields the AI commit-message callback when task.isolation.commits === "ai" and a model registry is wired, undefined otherwise.

Both call sites now invoke the helpers; behavior is unchanged. Removed the now-dead generateCommitMessage/applyNestedPatches imports from each caller.

Added unit tests for the new helper covering the skip-on-patch-failure, skip-on-unmerged-branch, success, and failure-suffix paths.
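
A rough sketch of what a single shared gate could look like, based on the two duplicated blocks quoted earlier in this thread. The type shapes (`NestedPatch`, `SubagentResult`, the `"patch"` mode literal) and the injected `apply` callback are assumptions made so the sketch runs without git; the real applyEligibleNestedPatches in task/isolation-runner.ts may differ.

```typescript
// Hypothetical shapes; only the field names used in the quoted snippets are taken from the PR.
interface NestedPatch { relativePath: string; patch: string }
interface SubagentResult { exitCode: number; aborted?: boolean; nestedPatches?: NestedPatch[] }

interface NestedApplyOptions {
	result: SubagentResult;
	repoRoot: string;
	mergeMode: "patch" | "branch"; // "patch" literal is an assumption
	changesApplied: boolean | null;
	mergedBranchForNestedPatches: boolean;
	// Injected stand-in for applyNestedPatches so the sketch is self-contained.
	apply: (repoRoot: string, patches: NestedPatch[]) => Promise<void>;
}

const NESTED_FAILURE_NOTE =
	"\n\n<system-notification>Some nested repository patches failed to apply.</system-notification>";

// Returns the suffix to append to the merge summary ("" when nothing needs reporting).
async function applyEligibleNestedPatchesSketch(opts: NestedApplyOptions): Promise<string> {
	// Skip when the parent apply failed in patch mode.
	if (opts.mergeMode !== "branch" && opts.changesApplied === false) return "";
	const nestedPatches = opts.result.nestedPatches ?? [];
	const eligible =
		nestedPatches.length > 0 &&
		opts.result.exitCode === 0 &&
		!opts.result.aborted &&
		// In branch mode, the root branch must have merged first.
		(opts.mergeMode !== "branch" || opts.mergedBranchForNestedPatches);
	if (!eligible) return "";
	try {
		await opts.apply(opts.repoRoot, nestedPatches);
		return "";
	} catch {
		// Nested patch failures are non-fatal to the parent merge.
		return NESTED_FAILURE_NOTE;
	}
}
```

A caller would then append the result to its summary, as in `mergeSummary += await applyEligibleNestedPatchesSketch({ ... })`.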

Fixes #3196
@roboomp

roboomp commented Jun 22, 2026

Collaborator Author

Addressed the duplication and pushed 28137f46d903.

  • Moved the nested-patch apply gate into applyEligibleNestedPatches(opts) in task/isolation-runner.ts; TaskTool.#runSpawn and runEvalAgent now both call it instead of holding their own copy of the eligibility check + non-fatal warning handling.
  • Moved the AI commit-message factory into makeIsolationCommitMessage(session) in the same file; both callers now consume the shared factory and drop their inline closure over generateCommitMessage.
  • Dropped the now-dead generateCommitMessage and applyNestedPatches imports from task/index.ts and eval/agent-bridge.ts, and removed the redundant outer (mergeMode === "branch" || changesApplied !== false) gate in TaskTool since the helper handles it.
  • Added four applyEligibleNestedPatches unit tests (skip on patch-mode parent failure, skip on unmerged branch, apply on success, system-notification suffix on failure). Focused eval + task suites pass 203/203.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28137f46d9

Comment on lines +490 to +497
mergeSummary += await applyEligibleNestedPatches({
	result,
	repoRoot: isolationContext.repoRoot,
	mergeMode,
	changesApplied: outcome.changesApplied,
	mergedBranchForNestedPatches: outcome.mergedBranchForNestedPatches,
	commitMessage: buildCommitMessage(),
});

P2 Badge Avoid committing dirty nested-repo state

When the new eval agent(isolated=True) path applies nested-repo patches here, a nested repo that was already dirty before the agent ran can have unrelated user edits committed along with the agent delta: applyNestedPatches() applies the captured patch, then checks git.status() and runs git.stage.files(nestedDir), which stages every existing change in that nested repo. In that scenario an isolated eval fan-out can silently commit pre-existing nested-repo work; preserve/restore the baseline dirty state or stage only the files/hunks introduced by result.nestedPatches.

applyNestedPatches() applied the captured patch then ran git.stage.files(nestedDir), which stages every working-tree change in the nested repo. A nested repo that was already dirty before the agent ran ended up with the user's unrelated work-in-progress committed alongside the agent delta.

Stash any pre-existing dirty state (tracked + untracked) before applying the patch and pop it back in the finally block after the commit, so the agent commit contains only the captured patch and the user's in-flight work is restored on top of it. A failing stash pop logs a warning and leaves the stash entry intact for manual recovery; the broader nested-apply failure path is already non-fatal.

Added a worktree integration test that confirms a pre-existing untracked file in the nested repo is not staged into the agent commit and is still present in the working tree afterwards.
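
The stash, apply, commit, pop sequence described above can be sketched like this. The `NestedGit` interface and the in-memory fake are hypothetical stand-ins (the PR's real `git` helper is not shown in full), and file names stand in for patch contents; the point is only the ordering and the `finally` restore.

```typescript
// Hypothetical minimal git surface for one nested repo.
interface NestedGit {
	isDirty(): Promise<boolean>;
	stashPush(): Promise<void>; // would cover tracked + untracked changes
	applyPatch(patch: string): Promise<void>;
	commitAll(message: string): Promise<void>;
	stashPop(): Promise<void>;
}

async function applyWithStash(git: NestedGit, patch: string, warn: (msg: string) => void): Promise<void> {
	let stashed = false;
	if (await git.isDirty()) {
		// Set aside pre-existing work so the commit contains only the agent's patch.
		await git.stashPush();
		stashed = true;
	}
	try {
		await git.applyPatch(patch);
		await git.commitAll("agent changes");
	} finally {
		if (stashed) {
			try {
				await git.stashPop();
			} catch (err) {
				// Leave the stash entry in place for manual recovery.
				warn(err instanceof Error ? err.message : String(err));
			}
		}
	}
}

// In-memory fake used only to illustrate the ordering; not a real git binding.
function makeFakeGit(dirtyFiles: string[]) {
	let working = [...dirtyFiles];
	let stash: string[] = [];
	const commits: string[][] = [];
	const git: NestedGit = {
		isDirty: async () => working.length > 0,
		stashPush: async () => { stash = working; working = []; },
		applyPatch: async (patch) => { working.push(patch); },
		commitAll: async () => { commits.push(working); working = []; },
		stashPop: async () => { working = [...working, ...stash]; stash = []; },
	};
	return { git, commits, getWorking: () => working };
}
```

With the fake, a repo that starts with `wip.txt` dirty ends up with one commit containing only the agent's file, and `wip.txt` back in the working tree.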

Fixes #3196
@roboomp

roboomp commented Jun 22, 2026

Collaborator Author

Addressed the dirty-nested-repo regression and pushed 978d2a76d0f3.

  • applyNestedPatches() now stashes any pre-existing dirty state (tracked + untracked) in each nested repo before applying the agent patch, then git stash pops in finally after the commit, so the commit contains only the captured patch.
  • A failing pop logs a logger.warn and leaves the stash entry intact for manual recovery; the broader nested-apply failure path stays non-fatal.
  • Added a worktree integration test that creates a nested git repo with an unrelated untracked file, runs applyNestedPatches, and asserts the agent commit lists only the patched file and the untracked file is restored.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 978d2a76d0

let isolationContext: IsolationContext | null = null;
if (isIsolated) {
	try {
		isolationContext = await prepareIsolationContext(options.session.cwd);

P2 Badge Move timeout pause before isolation baseline capture

For agent(..., isolated=True), prepareIsolationContext() runs before withBridgeTimeoutPause() emits timeout-pause. captureBaseline() can walk nested repos and build diffs for untracked files, so in large or dirty repos this host-side bridge work can exceed the eval idle timeout while the Python/JS runtime is blocked waiting and the watchdog is still armed. Move the pause to cover isolation preparation as well, not just subprocess/merge.

Comment on lines +243 to +247
} catch (popErr) {
	logger.warn("Pre-existing nested-repo dirty state could not be auto-restored", {
		nestedDir,
		error: popErr instanceof Error ? popErr.message : String(popErr),
	});

P2 Badge Surface nested stash-restore failures

Fresh evidence in this revision is that this new catch swallows popErr after only logging. When git stash pop conflicts after the nested patch commit (for example, pre-existing nested-repo WIP touched the same lines the agent changed), applyNestedPatches() still resolves and applyEligibleNestedPatches() reports success, while the user's WIP is left in a stash/conflicted state with no workflow-visible warning; propagate this failure or return a summary like the root merge path does.

} finally {
	if (stashed) {
		try {
			await git.stash.pop(nestedDir);

P2 Badge Restore nested stashes with index state

Checked git stash -h: stash pop has an explicit --index mode; this call omits it. When a nested repo has pre-existing staged WIP before an isolated subagent runs, the stash is popped after the agent commit as ordinary worktree changes, losing the user's index/staging state even though the content is restored; use git.stash.pop(nestedDir, { index: true }) as the root merge path already does.

@korri123

Contributor

lgtm, built and tested this and it seems to work as intended

git stash pop without --index restores stashed staged changes as unstaged. When a nested repo had staged WIP before the isolated agent ran, the pop in applyNestedPatches() brought the content back but lost the user's index state.

Pass { index: true } so pop uses --index, matching the root merge path that already does the same thing.

Added a regression test that stages a pre-existing edit in the nested repo, runs applyNestedPatches, and asserts the file is still in the index (porcelain "M  " with the trailing space) and the cached diff still shows the staged WIP.
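
For readers unfamiliar with why the trailing space in "M  " matters: in `git status --porcelain` (v1) output, each line starts with two status columns (index, then worktree), followed by a space and the path. The sketch below shows one way a test could check that a file is staged with no unstaged edits; `parsePorcelainLine` and `isStagedOnly` are hypothetical helper names, not functions from this PR.

```typescript
interface PorcelainEntry {
	index: string; // first column: state of the index (staging area)
	worktree: string; // second column: state of the working tree
	path: string;
}

// Parses one line of porcelain v1 output. The line must NOT be trimmed,
// because a leading or trailing space in the two-column code is significant.
function parsePorcelainLine(line: string): PorcelainEntry {
	return { index: line.charAt(0), worktree: line.charAt(1), path: line.slice(3) };
}

// True when the change lives in the index and the working tree has no further edits.
function isStagedOnly(line: string): boolean {
	const entry = parsePorcelainLine(line);
	return entry.index !== " " && entry.index !== "?" && entry.worktree === " ";
}
```

Here `"M  src/a.ts"` is a staged modification, while `" M src/a.ts"` is the same file with the edit left unstaged, which is what a pop without `--index` produces.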

Fixes #3196
@roboomp

roboomp commented Jun 22, 2026

Collaborator Author

Addressed the lost-index-state regression and pushed abc9a8f29258.

  • Passed { index: true } to git.stash.pop in applyNestedPatches so the pop runs with --index and stashed staged WIP comes back to the index, matching the root merge path's behavior.
  • Added a regression test that stages a pre-existing edit in the nested repo, runs applyNestedPatches, and asserts the file is still in the index (porcelain M …) with the cached diff showing the staged WIP.

…tash-restore failures

Two Codex P2 findings landed against 978d2a7 that were not in the previously delivered review event:

1) prepareIsolationContext() (which runs captureBaseline → walks nested repos and untracked diffs) was running OUTSIDE withBridgeTimeoutPause; on dirty/large repos the baseline walk can exceed the eval idle timeout while the runtime is blocked. Moved the prep call into the pause closure so the watchdog is suspended for the whole bridge call from prep through cleanup.

2) applyNestedPatches() swallowed git stash pop failures with only a logger.warn, so a stash-pop conflict after a successful agent commit was invisible to the workflow. Changed the helper to return Promise<string[]> of warnings; applyEligibleNestedPatches now wraps them in a <system-notification> appended to the merge summary so the caller actually sees the partial-success case.

Added regression tests:
- bridge: prepare fires after timeout-pause and before timeout-resume.
- runner: applyEligibleNestedPatches surfaces stash-restore warnings as a system-notification.
- worktree (real git): a pre-existing dirty edit on the same file the agent patches causes stash pop to conflict; the helper returns a warning naming the nested repo and the stash entry is preserved for manual recovery.
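
As a rough illustration of the second change, a helper that returns warnings instead of logging them lets the caller fold them into the merge summary. The function name `formatNestedWarnings` and the exact notification layout are assumptions; only the `<system-notification>` wrapper is taken from this thread.

```typescript
// Turns stash-restore warnings into a summary suffix; returns "" when there is nothing to report.
function formatNestedWarnings(warnings: string[]): string {
	if (warnings.length === 0) return "";
	return `\n\n<system-notification>${warnings.join("\n")}</system-notification>`;
}

// Example: a partial success where the agent commit landed but the stash pop conflicted.
const exampleSummary =
	"Applied 1 nested patch." +
	formatNestedWarnings(["vendor/lib: stash pop conflicted; stash entry preserved"]);
```

Returning a string suffix keeps the partial-success case visible in the same place the caller already reports merge results.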

Fixes #3196
@roboomp

roboomp commented Jun 22, 2026

Collaborator Author

Thanks @korri123 for the LGTM. Before yielding, picked up two Codex P2 findings against 978d2a76d0 that weren't part of the earlier review event and pushed 6ee79ec3b67c:

  • Moved prepareIsolationContext (which runs captureBaseline and walks nested repos / untracked diffs) inside the withBridgeTimeoutPause closure so the eval watchdog stays suspended for the whole bridge call from prep through cleanup; added a bridge regression asserting prepare fires between timeout-pause and timeout-resume.
  • applyNestedPatches now returns Promise<string[]> of stash-restore warnings instead of swallowing them with a logger.warn; applyEligibleNestedPatches wraps them in a <system-notification> so a stash-pop conflict after a successful agent commit reaches the workflow summary. Added a real-git regression that forces the pop conflict and asserts both the returned warning and the preserved stash entry.

@roboomp

roboomp commented Jun 22, 2026

Collaborator Author

This was addressed in 6ee79ec3b67c — the review event reviewed 978d2a76d0 but I picked the finding up in the prior follow-up. On the current branch prepareIsolationContext runs inside the withBridgeTimeoutPause closure (eval/agent-bridge.ts:L413-L422), and there's a regression test ("keeps the timeout paused through isolation baseline capture") asserting prepare fires between timeout-pause and timeout-resume. No further code change needed.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ee79ec3b6

Comment on lines +147 to +153
const commitResult = await commitToBranch(
	isolationDir,
	taskBaseline,
	opts.agentId,
	opts.description,
	opts.buildCommitMessage?.(),
);

P2 Badge Serialize branch capture before parallel branch-mode runs

When eval workflows fan out multiple agent(..., isolated=True) calls in branch merge mode, each run can reach commitToBranch() concurrently. That helper creates/deletes branches and worktrees in the same parent repo before the later mergeTaskBranches() lock is acquired, so otherwise successful subagents can fail on Git lock/worktree races during branch capture rather than during the serialized merge phase. Wrap the branch-capture call in the repo write lock as well.
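
One possible shape for the suggested serialization is a per-repo promise chain, sketched below. This is an in-process illustration only: `withRepoWriteLock` is a hypothetical name, the repo's actual write lock is not shown in this thread, and a chain like this would not protect against other processes touching the same repository.

```typescript
// Tail of the pending work for each repo root.
const repoLocks = new Map<string, Promise<unknown>>();

function withRepoWriteLock<T>(repoRoot: string, work: () => Promise<T>): Promise<T> {
	const previous = repoLocks.get(repoRoot) ?? Promise.resolve();
	// Start only after the previous holder has settled.
	const run = previous.then(() => work());
	// Store a tail that never rejects, so one failure does not block later callers.
	repoLocks.set(repoRoot, run.catch(() => undefined));
	return run;
}

// Simulates three fan-out agents reaching branch capture at the same time.
async function demoSerializedCapture(): Promise<{ maxConcurrent: number; order: string[] }> {
	let active = 0;
	let maxConcurrent = 0;
	const order: string[] = [];
	const capture = (agentId: string) =>
		withRepoWriteLock("/repo", async () => {
			active++;
			maxConcurrent = Math.max(maxConcurrent, active);
			// Stand-in for the branch/worktree operations done during capture.
			await new Promise<void>((resolve) => setTimeout(resolve, 5));
			order.push(agentId);
			active--;
		});
	await Promise.all([capture("a"), capture("b"), capture("c")]);
	return { maxConcurrent, order };
}
```

Each capture waits for the one queued before it, so at most one runs at a time per repo root while different repos remain independent.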

});
mergeSummary = outcome.summary;
changesApplied = outcome.changesApplied;
if (outcome.changesApplied === false) {

P2 Badge Surface stash-restore conflicts to schema callers

In branch mode with a dirty parent worktree, mergeTaskBranches() can cherry-pick successfully but fail to pop the user's stash, returning a stashConflict warning while changesApplied remains true. This guard therefore does not throw, and schema-backed JS/Python agent() calls return only the parsed JSON while discarding details.isolationSummary, so the workflow observes success even though pre-existing edits were left in a stash/conflicted state. Treat that warning as workflow-visible for structured calls.

can1357 added a commit that referenced this pull request Jun 22, 2026
…r agent() helper

Resolves conflict in test/task/worktree.test.ts by keeping both the
getRepoRoot (main) and applyNestedPatches (PR) describe blocks.

Extends the PR's Python/JS work to the remaining workflow runtimes:
- eval/rb/prelude.rb, eval/jl/prelude.jl: agent() now accepts and
  forwards isolated/apply/merge (as booleans) plus returnHandle, and the
  return_handle node carries isolated/patch_path/branch_name/
  nested_patches/changes_applied/isolation_summary.

Post-merge fixups:
- task/index.ts: drop dead commitStyle var (the dedup refactor reads
  task.isolation.commits inside makeIsolationCommitMessage).
- CHANGELOG: move the misplaced Added entry under [Unreleased], correct
  the stale "defaults track task.isolation.mode" wording to the final
  strict opt-in behavior, and note all four runtimes.

Fixes #3196

Labels

vouched Passed the vouch gate

Development

Successfully merging this pull request may close these issues.

Add isolated-mode support to workflowz eval subagent launches.

3 participants