feat(lib): capture client-attested build provenance #454

Open
max-parke-scale wants to merge 4 commits into next from maxparke/agx1-418-build-provenance-capture

@max-parke-scale max-parke-scale commented Jun 30, 2026

Contributor

Adds agentex.lib.utils.build_provenance — the shared capture util for client-attested build provenance: git coordinates (repo/commit/ref/subpath), a deterministic working_tree_hash over the build inputs (not the tarball), a dirty flag (Go vcs.modified / Nix dirtyRev shape), and normalize_remote. Capture is best-effort and never raises into a build. Also makes the build archive’s member order deterministic via a sorted enumeration shared with the hash.

First of three surfaces for AGX1-418 (Phase 1, client-attested). Provenance is delivered via the build-record sink — source_* columns on POST /v5/builds (Surface C, scaleapi) consumed by the sgpctl + CI uploaders (Surface B, scaleapi/sgp). This PR lands the util + archive determinism where agentex.lib lives; the uploaders/columns follow.

Scope notes

  • No build-info.json / runtime sink. An earlier revision wrote build-info.json into the build context for the register_agent() registration_metadata path. Greptile (T-Rex) correctly flagged it as dead on arrival: it was written to the archive root, which the templates’ Dockerfiles don’t COPY and locate_build_info_path() doesn’t read. It’s also redundant: AgentexCloudDeploy.build_id is an FK to AgentexCloudBuild, so a deployment’s source provenance derives from the build record over that join, the same Build→Deploy edge that lineage already traverses. Dropped; can be revived (correctly placed) if a real consumer for deployment-history provenance ever appears.

Identity model

working_tree_hash is always computed (content identity); commit/ref/repo anchor it to source when in a git work tree; dirty records uncommitted changes (None outside git).
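
As a rough illustration of the identity model above, here is a minimal sketch of a provenance record. The field names follow this description, but the dataclass shape, the `source_` key prefix, and the choice to keep `author` off the wire are assumptions, not the module's actual definition.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class BuildProvenance:
    # Hypothetical shape: field names mirror the PR description only.
    working_tree_hash: Optional[str] = None  # content identity; None if hashing failed
    repo: Optional[str] = None               # normalized remote, when in a git work tree
    commit: Optional[str] = None
    ref: Optional[str] = None
    subpath: Optional[str] = None
    dirty: Optional[bool] = None             # None outside a git work tree
    author: Optional[str] = None             # captured locally, never uploaded

    def source_fields(self) -> dict:
        """Return upload fields, omitting None values and the author."""
        fields = {}
        for name, value in asdict(self).items():
            if name == "author" or value is None:
                continue
            # Assumed naming: every uploaded key gets a source_ prefix.
            fields[f"source_{name}"] = value
        return fields
```

Note that `dirty=False` survives the filter because only `None` is dropped, so a clean tree is distinguishable from a non-git context.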

Tests

20 provenance unit tests (clean/dirty/untracked/detached-HEAD/no-remote/non-git/monorepo-subpath, hash determinism + one-byte/added/exec-bit/symlink sensitivity, and a never-raises-on-hash-failure guard). ruff/pyright clean; full lib suite green.

🧑‍💻🤖 — posted via Claude Code

Greptile Summary

Adds agentex.lib.utils.build_provenance — the canonical source-identity util for agent builds. It captures git coordinates (repo/commit/ref/subpath/dirty) plus a deterministic working_tree_hash over the build inputs, with full best-effort degradation so no provenance failure can abort a build. It also makes BuildContextManager.zipped() use the same sorted iter_context_files enumeration as the hash, ensuring the archive member order is deterministic.

  • build_provenance.py: new module with capture_build_provenance, working_tree_hash, iter_context_files, normalize_remote, and _safe_working_tree_hash; every git probe wraps its own failure path; the hash is wrapped in a separate try/except so filesystem errors are logged and swallowed rather than propagated.
  • agent_manifest.py: BuildContextManager.zipped() now delegates file enumeration to iter_context_files, aligning archive contents (including symlinks) with the hash definition.
  • test_build_provenance.py: 20 tests covering clean/dirty/untracked/detached-HEAD/no-remote/non-git/monorepo scenarios, hash sensitivity properties, and the never-raises guard.
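
A minimal sketch of how a shared sorted enumeration and a content hash over it could look. The function names match the summary, but the bodies are assumptions: SHA-256, the length-prefixed record layout, and the user exec bit as the only mode input are illustrative choices, not the module's actual format.

```python
import hashlib
import os
import stat
from pathlib import Path
from typing import Iterator


def iter_context_files(root: Path) -> Iterator[Path]:
    """Yield regular files and symlinks under root in a stable, sorted order."""
    entries = sorted(root.rglob("*"), key=lambda p: p.relative_to(root).as_posix())
    for path in entries:
        # Check is_symlink first so dangling symlinks are still included.
        if path.is_symlink() or path.is_file():
            yield path


def working_tree_hash(root: Path) -> str:
    """Hash relative path, kind, and content of every enumerated entry."""
    digest = hashlib.sha256()
    for path in iter_context_files(root):
        rel = path.relative_to(root).as_posix().encode()
        if path.is_symlink():
            # Hash the link target itself rather than what it points at.
            kind, payload = b"L", os.readlink(path).encode()
        else:
            executable = bool(path.stat().st_mode & stat.S_IXUSR)
            kind = b"X" if executable else b"F"
            payload = path.read_bytes()
        # Length-prefix each part so adjacent fields cannot run together.
        for part in (kind, rel, payload):
            digest.update(len(part).to_bytes(8, "big"))
            digest.update(part)
    return digest.hexdigest()
```

Because the hash walks the same enumeration as the archive packer, any entry that ships in the archive also contributes to the hash.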

Confidence Score: 5/5

Safe to merge — the new util is additive and best-effort, and the archive change is a determinism improvement only.

All capture paths degrade gracefully to nulls; the hash is wrapped in its own try/except; git probes each handle their own failure. The only open item is a stale docstring on a method with no current callers.

No files require special attention; the build_info() docstring in build_provenance.py is worth a follow-up cleanup but does not affect runtime behavior.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/utils/build_provenance.py | New module: best-effort git provenance capture (commit/ref/repo/subpath/dirty) plus deterministic content hash. Well-structured with proper degradation at every step; _safe_working_tree_hash wraps the hash in a try/except so a filesystem error never aborts a build. iter_context_files is the shared canonical enumeration used by both the hash and the archive packer. |
| src/agentex/lib/sdk/config/agent_manifest.py | Minor refactor of BuildContextManager.zipped(): delegates file enumeration to iter_context_files instead of an inline rglob, making archive member order deterministic across machines and aligning it with the hash computation. Behavioral change: dangling symlinks and symlinks-to-directories are now included in archives (consistent with the new hash semantics). |
| tests/lib/test_build_provenance.py | 20 unit tests covering clean/dirty/untracked/detached-HEAD/no-remote/non-git/monorepo scenarios plus hash sensitivity (byte change, file add, exec-bit, symlink target) and the never-raises guard. Good coverage; the monkeypatch test for _safe_working_tree_hash correctly validates the degradation path. |
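
To make the archive-ordering change concrete, here is a hedged sketch of packing a tar archive from a sorted enumeration. `build_archive` is a hypothetical stand-in for BuildContextManager.zipped(), and this `iter_context_files` body is an assumption about the shared helper. Only member order is made deterministic here; timestamps and ownership still vary between runs.

```python
import io
import tarfile
from pathlib import Path
from typing import Iterator


def iter_context_files(root: Path) -> Iterator[Path]:
    """Yield regular files and symlinks under root in sorted order (assumed helper)."""
    entries = sorted(root.rglob("*"), key=lambda p: p.relative_to(root).as_posix())
    for path in entries:
        if path.is_symlink() or path.is_file():
            yield path


def build_archive(root: Path) -> bytes:
    """Pack the build context into an in-memory tar with a stable member order."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for path in iter_context_files(root):
            # recursive=False: each entry is added exactly once, in enumeration order.
            # tarfile stores symlinks as links by default (dereference=False).
            archive.add(path, arcname=path.relative_to(root).as_posix(), recursive=False)
    return buffer.getvalue()
```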

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["capture_build_provenance(repo_path, context_root)"] --> B["_safe_working_tree_hash(hash_root)"]
    B --> C{hash raises?}
    C -->|No| D["working_tree_hash(root)"]
    C -->|Yes – logs warning| E["tree_hash = None"]
    D --> F["iter_context_files(root)\nsorted rglob, files + symlinks"]
    F -.->|shared enumeration| G["BuildContextManager.zipped()\ntar archive – deterministic order"]
    A --> H["_git rev-parse --show-toplevel"]
    H --> I{in git repo?}
    I -->|No| J["BuildProvenance\nworking_tree_hash only\ndirty=None"]
    I -->|Yes| K["_git rev-parse HEAD\n_git symbolic-ref / describe-tags\n_git remote get-url origin\n_git log -1 author\n_git status --porcelain"]
    K --> L["normalize_remote(url)\nstrip scheme / credentials / .git\nlowercase host"]
    L --> M["dirty = status output is not None"]
    M --> N["BuildProvenance\nfull provenance"]
    N --> O["source_fields()\nomits None + author PII\nfor POST /v5/builds"]
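
The normalize_remote step in the flowchart (strip scheme, credentials, and .git; lowercase the host) could be sketched as follows. This is an illustrative implementation under assumptions: the scp-style regex, dropping the port, and returning None for local paths are choices made here, not confirmed behavior of the real function.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# scp-like syntax such as git@github.com:org/repo.git (no scheme, colon before path).
_SCP_LIKE = re.compile(r"^(?:[^@/]+@)?([^:/]+):(?!//)(.+)$")


def normalize_remote(url: Optional[str]) -> Optional[str]:
    """Reduce a git remote URL to host/path with no scheme, credentials, or .git."""
    if not url:
        return None
    url = url.strip()
    scp = _SCP_LIKE.match(url)
    if scp:
        host, path = scp.group(1), scp.group(2)
    else:
        parts = urlsplit(url)
        if not parts.hostname:
            return None  # local path or unparseable remote
        # hostname excludes userinfo and port, so credentials never leak through.
        host, path = parts.hostname, parts.path
    path = path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return f"{host.lower()}/{path}"
```

Only the host is lowercased; the path keeps its case because repository paths can be case-sensitive on some hosts.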

Comments Outside Diff (1)

  1. General comment

    P1 Clean committed git provenance still emits working_tree_hash

    • Bug
      • The stated identity rule requires clean committed repositories to key identity on the clean commit and omit/null working_tree_hash. Runtime evidence from the head checkout shows clean_git_committed_tree returns commit 9d5b3e883d0470449c554424c322277f5b0ddaf4 with dirty=false, but also returns working_tree_hash 06ae68b1d1ef47dbe60829f2fd3bff3367e51853bea8781c5fb647f222cf85a0 in BuildProvenance, source_fields, and build_info.
    • Cause
      • capture_build_provenance computes tree_hash before checking git state and always passes working_tree_hash=tree_hash into BuildProvenance for git captures. The changed return path is anchored at src/agentex/lib/utils/build_provenance.py:224-230, specifically working_tree_hash=tree_hash on line 229; the docstring/comments also say it is always computed.
    • Fix
      • Only compute/assign working_tree_hash when there is no clean commit identity: non-git, unborn/no HEAD, or dirty work tree. For clean committed git captures, return working_tree_hash=None so source_fields/build_info omit it while keeping commit/ref/repo and dirty=false.

    Ran code and verified through T-Rex

Reviews (4): Last reviewed commit: "Merge remote-tracking branch 'origin/nex..."

Add agentex.lib.utils.build_provenance — the single producer of source
identity for agent builds (git coordinates + a deterministic content hash
of the build context). prepare_cloud_build_context now writes
build-info.json into the staged context (populates runtime
registration_metadata with no server change) and exposes provenance on
CloudBuildContext so the upload can send source_* fields. Archive member
order is now deterministic via a sorted enumeration shared with the hash.

The hash is computed only when there is no clean commit to identify the
build (dirty tree or non-git context). First of three surfaces for
AGX1-418 (Phase 1, client-attested); the SGP build-record columns and the
sgpctl/Gitea uploaders follow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@max-parke-scale max-parke-scale changed the base branch from main to next June 30, 2026 01:04
Comment thread src/agentex/lib/utils/build_provenance.py Outdated
Comment thread src/agentex/lib/utils/build_provenance.py Outdated
Address Greptile review on the build-provenance capture util:

- Always compute working_tree_hash (drop the "skip on clean commit"
  path). A `git status` clean tree can still contain .gitignore'd-but-not-
  .dockerignore'd files the commit can't reproduce; an always-present
  content hash identifies the exact shipped bytes and closes that gap.
- Guard the hash (_safe_working_tree_hash) so a permission error or
  filesystem race degrades to None instead of aborting the build — the
  module contract is that capture never raises into a build.
- Record dirtiness as a first-class `dirty` flag (surfaced as `source_dirty`
  / `dirty`) rather than overloading hash-presence, matching Go's
  vcs.modified and Nix's dirtyRev. None outside a git work tree.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@max-parke-scale

Contributor Author

Addressed both Greptile findings in cf9994d:

  1. Ignored files lose hashing — fixed by removing the "skip hash on clean commit" path entirely: working_tree_hash is now always computed over the staged context, so a .gitignored-but-not-.dockerignored file is captured by the content hash regardless of git status. (Identity = the always-present hash; commit anchors it to source. Dedupe is unaffected since the hash is deterministic.)
  2. Hash failures abort builds — wrapped the computation in _safe_working_tree_hash, which degrades to None and logs on any error, honoring the “capture never raises into a build” contract.
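
For readers skimming the thread, the guard in item 2 amounts to roughly the pattern below. `safe_hash` and its `compute` parameter are hypothetical names used for illustration; the real helper is _safe_working_tree_hash and its signature may differ.

```python
import logging
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def safe_hash(root: Path, compute: Callable[[Path], str]) -> Optional[str]:
    """Run a hash function, degrading to None instead of raising into the build."""
    try:
        return compute(root)
    except Exception as exc:  # deliberately broad: capture is best-effort
        logger.warning("working tree hash failed for %s: %s", root, exc)
        return None
```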

Also, per design discussion: dirtiness is now a first-class dirty flag (surfaced as source_dirty / dirty) rather than implied by hash-presence — matching Go’s vcs.modified and Nix’s dirtyRev; None outside a git work tree.
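
A sketch of how a tri-state dirty flag can be derived from `git status --porcelain`, in case it helps reviewers. The helper names are hypothetical, and treating any non-empty porcelain output (including untracked files) as dirty is an assumption about the intended semantics.

```python
import subprocess
from pathlib import Path
from typing import Optional


def git_output(repo: Path, *args: str) -> Optional[str]:
    """Run a git command in repo, returning stdout or None on any failure."""
    try:
        result = subprocess.run(
            ["git", "-C", str(repo), *args],
            capture_output=True, text=True, timeout=10, check=False,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # git missing, timeout, or similar: degrade quietly
    return result.stdout if result.returncode == 0 else None


def dirty_from_status(status: Optional[str]) -> Optional[bool]:
    """Map porcelain output to True (dirty), False (clean), or None (unknown)."""
    if status is None:
        return None  # not a git work tree, or the probe failed
    return bool(status.strip())


def capture_dirty(repo: Path) -> Optional[bool]:
    return dirty_from_status(git_output(repo, "status", "--porcelain"))
```

Keeping None distinct from False is what lets a consumer tell "clean commit" apart from "no git information at all".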

🧑‍💻🤖 — posted via Claude Code

Comment thread src/agentex/lib/cli/handlers/agent_handlers.py Outdated
Greptile (T-Rex repro) showed build-info.json was written to the archive
root, which the templates' Dockerfiles don't COPY and the runtime
locate_build_info_path() doesn't read — so it never reached the image and
the registration_metadata sink stayed empty.

Beyond the placement bug, the sink is redundant: AgentexCloudDeploy.build_id
is an FK to AgentexCloudBuild, so a deployment's source provenance derives
from the build record (the source_* columns this work adds, Surface C) over
that join — the same Build->Deploy edge lineage already traverses. No need
to denormalize provenance onto registration_metadata/DeploymentHistory
(which has had no producer since its read path landed 2025-09, so its git
fields have never been populated).

#454 now ships only the shared capture util (agentex.lib.utils.build_provenance)
plus a deterministic build-archive ordering. Provenance is delivered via the
build-record sink; the runtime sink can be revived (correctly placed) if a
real consumer for deployment-history provenance ever appears.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@socket-security

socket-security Bot commented Jun 30, 2026


Review the following changes in direct dependencies. Learn more about Socket for GitHub.

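| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Updated | pypi/agentex-sdk@0.13.0 ⏵ 0.16.2 | 94 (+1) | 100 | 100 | 100 (+50) | 100 |
| Updated | pypi/agentex-client@0.13.0 ⏵ 0.16.2 | 99 (+1) | 100 | 100 | 100 | 100 |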
