feat(audit): integrate hash chain + Ed25519 signing (closes #212)#237

Open
naveen-kurra wants to merge 4 commits into main from feat/gov-r5-r6-integrated

@naveen-kurra naveen-kurra commented Jul 2, 2026

Collaborator

Closes #212. Supersedes #222 (closed).

Reconciles the R5 hash chain into a single tamper-evidence pipeline on top of #220 (now merged). Every event carries prev_hash (always) + kid + sig (both omitempty). The signature covers prev_hash, so tampering with the chain link breaks BOTH the chain walk AND the signature check.

What changed vs. the original #222 (addresses @initializ-mk's review)

  • 🔴 Deadlock fix, chain-lock variant. Chain order requires holding a.mu across sink writes (concurrent Emits must not race a.lastHash; disk order must equal chain order). So I can't release the lock before writes like #220 did. Instead: collect sink errors under lock, then call logSinkErrorOnce after an explicit Unlock(), which sidesteps sync.Mutex non-reentrancy. Regression test TestEmit_NoDeadlockOnSinkErrorWithOpsLog hangs pre-fix, passes now.
  • 🟠 Precision-hole fix as suggested. Producer + verifier both hash raw line bytes (writer-authored, minus trailing \n), not a re-marshaled event. Closes the >2^53 landmine where json.Marshal(json.Unmarshal(x)) isn't a fixed point.
  • 🟠 Signature covers prev_hash. Chain-mint runs before canonicalize+sign, so the signature's coverage extends to the chain link. Proved in TestIntegration_SigCoversPrevHash — tampering with just prev_hash breaks the sig even with --skip-chain.
  • 🟡 Unified VerifyAuditLog(r, opts VerifyOptions) walks chain + (when pubkeys supplied) signatures. --skip-chain on the CLI for SIEM tail-ingestion. Head-truncation is a soft warning per your intentional-limitation note.
  • 🟡 Single CLI forge audit verify <file> [--pubkey <jwks>] [--skip-chain].

Emit sequencing

Under one mutex:

  1. Stamp prev_hash = a.lastHash (or genesis on first event).
  2. If signer wired: stamp kid, sign canonical bytes (with sig blanked).
  3. Marshal.
  4. Update a.lastHash = sha256(raw line bytes).
  5. Write to sinks; collect errors into a local slice.

After Unlock(): log any sink errors via logSinkErrorOnce.
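
The sequencing above can be sketched roughly as follows. This is a minimal illustration, not the code in this branch: the `Event`, `Sink`, `Logger`, `memSink`, and `failingSink` types, the `logError` helper, and the `"genesis"` marker are all hypothetical stand-ins, and the signing step is omitted.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"sync"
)

// Event is a hypothetical, trimmed-down audit event.
type Event struct {
	Name     string `json:"name"`
	PrevHash string `json:"prev_hash"` // always written, no omitempty
}

// Sink is a hypothetical sink interface used only in this sketch.
type Sink interface{ Write(line []byte) error }

type memSink struct{ lines []string }

func (m *memSink) Write(line []byte) error { m.lines = append(m.lines, string(line)); return nil }

type failingSink struct{}

func (failingSink) Write([]byte) error { return errors.New("disk full") }

// genesis is an assumed placeholder; the real genesis value is not stated here.
const genesis = "genesis"

type Logger struct {
	mu       sync.Mutex
	lastHash string
	sinks    []Sink
	logged   []error
}

// logError takes the mutex itself, so Emit must not call it while holding a.mu.
func (a *Logger) logError(err error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.logged = append(a.logged, err)
}

func (a *Logger) Emit(evt Event) {
	a.mu.Lock()
	// 1. Stamp prev_hash (genesis on the first event).
	evt.PrevHash = a.lastHash
	if evt.PrevHash == "" {
		evt.PrevHash = genesis
	}
	// 2. Signing would happen here; omitted in this sketch.
	// 3. Marshal.
	line, err := json.Marshal(evt)
	if err != nil {
		a.mu.Unlock()
		a.logError(err)
		return
	}
	// 4. Advance the chain over the raw line bytes (no trailing newline).
	sum := sha256.Sum256(line)
	a.lastHash = hex.EncodeToString(sum[:])
	// 5. Write to sinks while still holding the lock; only collect errors.
	out := append(line, '\n')
	var sinkErrs []error
	for _, s := range a.sinks {
		if err := s.Write(out); err != nil {
			sinkErrs = append(sinkErrs, err)
		}
	}
	a.mu.Unlock()

	// Log only after the explicit Unlock, since logError re-locks a.mu.
	for _, err := range sinkErrs {
		a.logError(err)
	}
}

func main() {
	m := &memSink{}
	a := &Logger{sinks: []Sink{m, failingSink{}}}
	a.Emit(Event{Name: "one"})
	a.Emit(Event{Name: "two"})
	fmt.Println("lines written:", len(m.lines), "errors logged:", len(a.logged))
}
```

Holding the lock across the sink writes keeps disk order equal to chain order; deferring the logging is what keeps the non-reentrant mutex from blocking on itself.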

Test coverage

  • audit_chain_sig_integration_test.go — chain + sig compose; sig covers prev_hash; unsigned streams still chain; head-truncation soft-warns.
  • audit_hash_chain_test.go (from #222) — genesis progression, clean walk, tampering, deletion, 200-goroutine concurrency, malformed line, prev_hash-always-written pin.
  • audit_verify_test.go — #220's tests + cross-checks that chain catches tampering when sig-check is skipped and vice versa.
  • audit_emit_deadlock_test.go — the earlier deadlock regression continues to pass with the chain-lock-plus-post-unlock-logging pattern.

Test plan

  • go test ./forge-core/... ./forge-cli/... — full sweep green
  • gofmt -w + golangci-lint run clean

Opt-in tamper-evidence for the audit stream. When
FORGE_AUDIT_SIGNING_KEY_B64 is set, every emitted event carries a
sig (Ed25519 over canonical JSON with sig blanked) and a kid. The
runtime advertises the corresponding public key at
GET /.well-known/forge-audit-keys in RFC 8037 JWKS shape.
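
For orientation, an Ed25519 public key in RFC 8037 form is an `OKP` JWK with `crv: "Ed25519"` and the raw 32-byte key base64url-encoded (unpadded) in `x`. The sketch below builds such a JWKS document with the standard library only. The struct names, the all-zero demo seed, and the `demo-key-1` kid are assumptions for illustration; how Forge derives its `kid` is not shown here.

```go
package main

import (
	"crypto/ed25519"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// jwk holds the RFC 8037 members for an Ed25519 public key.
type jwk struct {
	Kty string `json:"kty"`
	Crv string `json:"crv"`
	X   string `json:"x"`
	Kid string `json:"kid,omitempty"`
}

type jwks struct {
	Keys []jwk `json:"keys"`
}

// jwksFor wraps one public key in a JWKS document.
func jwksFor(pub ed25519.PublicKey, kid string) jwks {
	return jwks{Keys: []jwk{{
		Kty: "OKP",
		Crv: "Ed25519",
		X:   base64.RawURLEncoding.EncodeToString(pub), // base64url, no padding
		Kid: kid,
	}}}
}

// demoJWKS uses an all-zero seed and a made-up kid; both are for demonstration only.
func demoJWKS() jwks {
	priv := ed25519.NewKeyFromSeed(make([]byte, ed25519.SeedSize))
	return jwksFor(priv.Public().(ed25519.PublicKey), "demo-key-1")
}

func main() {
	out, err := json.MarshalIndent(demoJWKS(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```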

Verification lives in two places:
- coreruntime.VerifyAuditLog walks NDJSON and checks structural
  integrity + signatures when a JWKS is supplied.
- `forge audit verify --pubkey <jwks>` is the offline CLI on top.

When no signing key is configured the wire shape is byte-identical
to the pre-#213 stream (Kid + Sig use omitempty).

Docs at docs/security/audit-signing.md.

Review from initializ-mk on PR #220 flagged a self-deadlock in
AuditLogger.Emit: the top-level a.mu.Lock() was held across the
sink-write loop, so a failing sink triggered a.logSinkErrorOnce which
tried to reacquire a.mu — sync.Mutex is not reentrant, so any sink
write error with an ops logger configured would freeze audit emission.

Emit now snapshots signer/sinks/opsLog under a short critical
section, then does canonicalize/sign/marshal/write outside the lock.
ed25519.Sign is stateless so this is safe. Also fixes the review's
"silent event drop" note by logging signing/marshal failures via the
ops logger before dropping the event — an operator running a signed
pipeline needs to see when signing breaks.

Regression tests:
- 32 concurrent Emits with an always-failing sink + ops logger; test
  fails on 5s timeout if the deadlock returns.
- signing failure produces an operator-visible Error line.
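
A timeout-guarded regression check of this kind might look like the sketch below. It is not the test in this PR: the `emitter` type and both emit variants are hypothetical stand-ins that reproduce only the locking pattern, and the timeouts are shortened for the demo.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// finishesWithin runs fn in n goroutines and reports whether all of them
// returned before the deadline d.
func finishesWithin(d time.Duration, n int, fn func()) bool {
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			fn()
		}()
	}
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

// emitter is a hypothetical stand-in for the audit logger.
type emitter struct {
	mu     sync.Mutex
	logged int
}

// logErr takes the mutex on its own, like a logger that guards shared state.
func (e *emitter) logErr() {
	e.mu.Lock()
	e.logged++
	e.mu.Unlock()
}

// emitBuggy logs while still holding mu; sync.Mutex is not reentrant, so it blocks forever.
func (e *emitter) emitBuggy() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.logErr()
}

// emitFixed records the failure under lock and logs only after unlocking.
func (e *emitter) emitFixed() {
	e.mu.Lock()
	failed := true // pretend the sink write failed
	e.mu.Unlock()
	if failed {
		e.logErr()
	}
}

func checkFixed() bool { return finishesWithin(2*time.Second, 32, (&emitter{}).emitFixed) }

// checkBuggy leaks its blocked goroutines; acceptable in a short-lived demo.
func checkBuggy() bool { return finishesWithin(200*time.Millisecond, 4, (&emitter{}).emitBuggy) }

func main() {
	fmt.Println("fixed variant finishes:", checkFixed())
	fmt.Println("buggy variant finishes:", checkBuggy())
}
```

In a real `_test.go` file the `false` branch would call `t.Fatal` so a returning deadlock fails the run instead of hanging it.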

Minor review fixes:
- TestLoadEd25519KeyFromEnv_RSARejected now uses a real RSA PKCS#8
  key instead of truncated garbage, so the type-assertion branch is
  actually exercised.
- TestVerifyAuditLog_UnknownKidFails drops the dead otherPub line.
- Documented that the "canonical JSON" is Go encoding/json output,
  not JCS; non-Go verifiers must replicate Go's marshaler quirks
  (JCS adoption tracked separately).
- Dropped forward-references to #212 PrevHash in audit_verify.go /
  audit_signing.go comments (the two features are being reviewed
  in parallel; the reference was aspirational not concrete).

Merges the R5 (hash chain) and R6 (signing) tracks into one coherent
tamper-evidence pipeline per Manoj's cross-PR review. Previously
#220 and #222 rewrote the same files with incompatible signatures.

## Wire shape

Every event now carries `prev_hash` (always, no omitempty). Signed
deployments additionally carry `kid` + `sig`. The signature covers
`prev_hash` — tampering with the chain link breaks BOTH the chain
walk AND the signature verification.

## Emit sequencing

Under a single mutex:
  1. Stamp prev_hash from a.lastHash (or genesis on first event).
  2. If signer wired: stamp kid, sign canonical bytes (Sig blanked).
  3. Marshal.
  4. Update a.lastHash = sha256(raw line bytes, no trailing newline).
  5. Write to sinks; collect errors into a local slice.

After releasing the mutex, log any sink errors via logSinkErrorOnce.

Deadlock fix: chain order requires holding the lock across sink
writes (concurrent Emit must not race a.lastHash and must not
reorder on disk). logSinkErrorOnce reacquires a.mu — non-reentrant —
so we collect errors under lock and log after Unlock().

## Precision-hole fix

Chain hashing runs over the raw line bytes (writer-authored), not
over a re-marshaled event. Closes the reviewer's flagged latent bug
where json.Marshal(json.Unmarshal(x)) is not a fixed point when
Fields carries integers > 2^53 (they round-trip through float64).
The verifier likewise hashes raw line bytes as read from the stream.
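
The round-trip problem can be reproduced with the standard library alone. The sketch below (helper names are made up for illustration) re-marshals a line through a generic `map[string]any` and compares SHA-256 digests of the raw and re-marshaled bytes.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// reMarshal parses a line into a generic value and marshals it again,
// which is what a verifier would do if it hashed a re-encoded event.
func reMarshal(line string) (string, error) {
	var v map[string]any
	if err := json.Unmarshal([]byte(line), &v); err != nil {
		return "", err
	}
	out, err := json.Marshal(v)
	return string(out), err
}

// hashStable reports whether the re-marshaled line hashes to the same
// digest as the raw line bytes.
func hashStable(line string) bool {
	again, err := reMarshal(line)
	if err != nil {
		return false
	}
	return sha256.Sum256([]byte(line)) == sha256.Sum256([]byte(again))
}

func main() {
	lines := []string{
		`{"fields":{"big":42}}`,
		`{"fields":{"big":9007199254740993}}`, // 2^53 + 1
	}
	for _, line := range lines {
		again, _ := reMarshal(line)
		fmt.Printf("%s -> %s stable=%v\n", line, again, hashStable(line))
	}
}
```

The second line decodes to a float64 and comes back as `9007199254740992`, so a digest over re-marshaled bytes would flag an untampered event. Hashing the bytes as written avoids the question entirely.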

## Unified verifier

One VerifyAuditLog(r, opts VerifyOptions) walks the stream, checking
chain + (when opts.Pubkeys is non-empty) signatures. Soft warnings
for head-of-stream truncation (first event not genesis) and for
signed streams verified without --pubkey. --skip-chain flag on the
`forge audit verify` CLI covers the SIEM tail-ingestion case.
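
The chain half of such a walk can be sketched as below. This is an illustration under stated assumptions, not VerifyAuditLog itself: the hex encoding of the hash, the 64-zero genesis marker, and the `verifyChain`, `buildChain`, and `verifyLines` helpers are all hypothetical, and signature checking is left out.

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// genesisHash is an assumed genesis marker for this sketch.
var genesisHash = strings.Repeat("0", 64)

type event struct {
	Name     string `json:"name"`
	PrevHash string `json:"prev_hash"`
}

// verifyChain walks NDJSON, hashing each raw line as read. A non-genesis first
// event is a soft warning; any later mismatch is a hard error.
func verifyChain(r io.Reader) (warnings []string, err error) {
	sc := bufio.NewScanner(r) // default 64 KiB line limit; ScanLines strips "\n" (and a trailing "\r")
	prev, n := "", 0
	for sc.Scan() {
		raw := sc.Bytes()
		n++
		var evt event
		if err := json.Unmarshal(raw, &evt); err != nil {
			return warnings, fmt.Errorf("line %d: malformed JSON: %w", n, err)
		}
		switch {
		case n == 1 && evt.PrevHash != genesisHash:
			warnings = append(warnings, "first event is not genesis (head truncated?)")
		case n > 1 && evt.PrevHash != prev:
			return warnings, fmt.Errorf("line %d: prev_hash mismatch", n)
		}
		sum := sha256.Sum256(raw)
		prev = hex.EncodeToString(sum[:])
	}
	return warnings, sc.Err()
}

// buildChain produces a valid chained stream for the demo.
func buildChain(names []string) []string {
	prev := genesisHash
	var lines []string
	for _, name := range names {
		line, _ := json.Marshal(event{Name: name, PrevHash: prev})
		sum := sha256.Sum256(line)
		prev = hex.EncodeToString(sum[:])
		lines = append(lines, string(line))
	}
	return lines
}

func verifyLines(lines []string) ([]string, error) {
	return verifyChain(strings.NewReader(strings.Join(lines, "\n") + "\n"))
}

func main() {
	lines := buildChain([]string{"a", "b", "c"})
	fmt.Println(verifyLines(lines))                             // clean walk
	fmt.Println(verifyLines(lines[1:]))                         // head truncated: warning only
	fmt.Println(verifyLines([]string{lines[0], lines[2]}))      // deletion: hard error
}
```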

## Test coverage

- audit_chain_sig_integration_test.go — proves chain + sig compose;
  proves the signature covers prev_hash (tampering with just the
  chain link breaks the sig); proves unsigned streams still chain;
  head-truncation surfaces as soft warning.
- audit_hash_chain_test.go — the r5 tests (genesis progression,
  clean walk, tampering, deletion, 200-goroutine concurrency,
  malformed-line handling, prev_hash-always-written pin) adapted to
  the unified API.
- audit_verify_test.go — the r6 tests + a new
  Sig-catches-tamper-with-SkipChain test and a
  chain-catches-tamper-without-sig test, to cross-check that each
  check catches tampering when the other is skipped.
- audit_emit_deadlock_test.go — the earlier deadlock regression
  test continues to pass with the new "hold lock across writes,
  log after unlock" pattern.

## PR reconciliation

This branch obsoletes #220 and #222 as independent PRs. Plan:
land #220 first (its self-contained review comments are addressed
on -v2), then this branch supersedes #222 as the integration PR.

naveen-kurra changed the title from "Feat/gov r5 r6 integrated" to "feat(audit): integrate hash chain + Ed25519 signing (closes #212)" on Jul 2, 2026

Per initializ-mk's follow-up on #220. Adopting RFC 8785 (JCS) as the
signature-preimage canonicalization closes two things at once:

1. Portability. Non-Go verifiers converge on the same preimage via
   any RFC 8785 impl — no need to replicate Go encoding/json's field
   order, key sort, or HTML-safe escaping quirks.
2. Latent precision bug. Verifiers re-marshal parsed events; Go's
   json.Marshal(json.Unmarshal(x)) is NOT a fixed point when Fields
   carries integers > 2^53 (they decode to float64 and re-marshal
   rounded). JCS normalizes both sides through the same ES6-double
   rule, so an untampered stream with a big int now verifies. See
   TestJCS_LargeIntegerFieldsRoundTrip.

## Wire shape

Signed events now carry `sigp: "jcs-1"` alongside `kid` + `sig`. The
scheme identifier is covered by the signature (canonicalize runs
after Sigp is stamped, with only Sig blanked), so a tamperer can't
rewrite it to force a weaker verification path — the verifier
explicitly rejects unknown sigp values. See
TestJCS_UnsupportedSigpRejected.

Unsigned events have no sigp (all three: sigp/kid/sig are omitempty).
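
A sketch of why covering `sigp` (and `prev_hash`) with the signature matters. To stay standard-library only, the preimage here is `json.Marshal` over a map with `sig` dropped, which is a loudly labeled stand-in and NOT RFC 8785. The `sign`, `verify`, `preimage`, and `demoKey` helpers, the base64 encoding of `sig`, and the error values are all assumptions for illustration.

```go
package main

import (
	"crypto/ed25519"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

const supportedSigp = "jcs-1"

var (
	errUnsupportedSigp = errors.New("unsupported sigp")
	errBadSig          = errors.New("signature mismatch")
)

// preimage is a stand-in canonicalizer (NOT JCS): every field except sig,
// marshaled from a map so keys come out sorted.
func preimage(evt map[string]any) ([]byte, error) {
	cp := make(map[string]any, len(evt))
	for k, v := range evt {
		if k != "sig" {
			cp[k] = v
		}
	}
	return json.Marshal(cp)
}

// sign stamps kid and sigp first, so both are inside the signed bytes.
func sign(priv ed25519.PrivateKey, kid string, evt map[string]any) error {
	evt["kid"], evt["sigp"] = kid, supportedSigp
	msg, err := preimage(evt)
	if err != nil {
		return err
	}
	evt["sig"] = base64.StdEncoding.EncodeToString(ed25519.Sign(priv, msg))
	return nil
}

// verify rejects unknown schemes before attempting any signature check.
func verify(pub ed25519.PublicKey, evt map[string]any) error {
	if s, _ := evt["sigp"].(string); s != supportedSigp {
		return errUnsupportedSigp
	}
	sigB64, _ := evt["sig"].(string)
	sig, err := base64.StdEncoding.DecodeString(sigB64)
	if err != nil {
		return errBadSig
	}
	msg, err := preimage(evt)
	if err != nil {
		return err
	}
	if !ed25519.Verify(pub, msg, sig) {
		return errBadSig
	}
	return nil
}

// demoKey derives a key from an all-zero seed; demonstration only.
func demoKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	priv := ed25519.NewKeyFromSeed(make([]byte, ed25519.SeedSize))
	return priv.Public().(ed25519.PublicKey), priv
}

func main() {
	pub, priv := demoKey()
	evt := map[string]any{"event": "login", "prev_hash": "abc"}
	if err := sign(priv, "k1", evt); err != nil {
		panic(err)
	}
	fmt.Println("untampered:", verify(pub, evt))
	evt["sigp"] = "raw-0"
	fmt.Println("rewritten sigp:", verify(pub, evt))
	evt["sigp"], evt["prev_hash"] = supportedSigp, "tampered"
	fmt.Println("rewritten prev_hash:", verify(pub, evt))
}
```

Checking the scheme marker first is what lets the verifier return a specific, actionable error rather than a generic signature failure.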

## Numbers-as-strings caveat

JCS numbers are IEEE-754 double per ES6 §6.1.6. Field values that
MUST preserve 64-bit-exact precision (nanosecond epochs, 64-bit IDs)
MUST be carried as JSON strings in Fields — this is a producer-side
discipline, documented in canonicalBytesForSigning and in
docs/security/audit-signing.md. Not enforced at library level.
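
As a producer-side illustration of that discipline, the sketch below compares carrying a 64-bit value as a JSON number against carrying it as a decimal string, using a generic decode like a verifier might. The helper names, the `ts` key, and the sample nanosecond timestamp are made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// survivesAsNumber reports whether n comes back exactly after being written
// as a JSON number and decoded generically (which yields float64).
func survivesAsNumber(n int64) bool {
	b, err := json.Marshal(map[string]any{"ts": n})
	if err != nil {
		return false
	}
	var v map[string]any
	if err := json.Unmarshal(b, &v); err != nil {
		return false
	}
	f, ok := v["ts"].(float64)
	return ok && int64(f) == n
}

// survivesAsString does the same with n carried as a decimal string.
func survivesAsString(n int64) bool {
	b, err := json.Marshal(map[string]any{"ts": strconv.FormatInt(n, 10)})
	if err != nil {
		return false
	}
	var v map[string]any
	if err := json.Unmarshal(b, &v); err != nil {
		return false
	}
	s, ok := v["ts"].(string)
	if !ok {
		return false
	}
	got, err := strconv.ParseInt(s, 10, 64)
	return err == nil && got == n
}

func main() {
	samples := []int64{
		42,
		9007199254740993,    // 2^53 + 1
		1751414400123456789, // an illustrative nanosecond epoch
	}
	for _, n := range samples {
		fmt.Printf("%d number=%v string=%v\n", n, survivesAsNumber(n), survivesAsString(n))
	}
}
```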

## Cross-language reference

docs/security/audit-signing.md now shows a Python verifier snippet
using the `jcs` package + `cryptography` — parse, drop sig, JCS,
Ed25519.Verify. No Go-marshaler emulation.

## Dependency

New forge-core dep: github.com/gowebpki/jcs (v1.0.1, MIT, pure Go,
no transitive deps beyond stdlib). ~150 LOC reference impl matches
Forge's "prefer narrow proven crypto libraries" posture — number
formatting is the fiddly part and the reference impl is spec-tested.
@naveen-kurra

Collaborator Author

Folded the JCS proposal (@initializ-mk's follow-up on #220) into this PR — commit e37851f. It composes cleanly with the raw-line-bytes chain hashing that's already in the branch: chain hashes over producer-authored bytes, signatures over the RFC 8785 canonical form of the parsed value.

What landed

  • canonicalBytesForSigning now returns jcs.Transform(json.Marshal(evt with Sig blanked)). Any RFC 8785 impl in any language converges on the same preimage from the parsed JSON — no Go-marshaler emulation.
  • sigp: "jcs-1" stamped on every signed event alongside kid + sig. Signature covers sigp itself (canonicalize runs after Sigp is stamped, with only Sig blanked), so a tamperer can't downgrade the scheme. Verifier explicitly rejects unknown sigp values with an actionable message rather than a generic sig-verify failure.
  • Big-int fix (TestJCS_LargeIntegerFieldsRoundTrip): emits an event with fields.big = int64(9007199254740993) (one past the fp64 mantissa boundary). Pre-JCS this signed as …993 and verified as …992; now both sides converge.
  • Numbers-as-strings caveat documented on canonicalBytesForSigning and in docs/security/audit-signing.md: any field that MUST preserve 64-bit exact precision (nanosecond epoch, 64-bit ID) must be carried as a JSON string. Not library-enforced.
  • Cross-language verifier snippet (Python) added to the docs — parse → drop sig → jcs.canonicalize → Ed25519.Verify. No Go marshaler emulation needed.

New tests

  • TestJCS_SigCanonicalizationStampedOnSignedEvents — every signed emit carries sigp="jcs-1".
  • TestJCS_UnsignedEventOmitsSigp — wire shape stays clean when signing is off.
  • TestJCS_LargeIntegerFieldsRoundTrip — big-int precision fix.
  • TestJCS_UnsupportedSigpRejected — rewritten sigp gets an actionable error.
  • TestJCS_CanonicalizationIsDeterministic — same value → same bytes regardless of Go map iteration order.
  • TestJCS_KeysSortedInOutput — probes JCS's UTF-16 key-sort requirement.
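
On the key-sort point: RFC 8785 orders property names by their UTF-16 code units, which can differ from Go's default byte-wise (UTF-8) string ordering once keys contain characters outside the Basic Multilingual Plane. A small standard-library sketch of the difference, with hypothetical helper names and no use of the jcs package:

```go
package main

import (
	"fmt"
	"sort"
	"unicode/utf16"
)

// lessUTF16 compares two strings by their UTF-16 code units.
func lessUTF16(a, b string) bool {
	ua, ub := utf16.Encode([]rune(a)), utf16.Encode([]rune(b))
	for i := 0; i < len(ua) && i < len(ub); i++ {
		if ua[i] != ub[i] {
			return ua[i] < ub[i]
		}
	}
	return len(ua) < len(ub)
}

// sortKeysUTF16 returns a copy of keys in UTF-16 code-unit order.
func sortKeysUTF16(keys []string) []string {
	out := append([]string(nil), keys...)
	sort.Slice(out, func(i, j int) bool { return lessUTF16(out[i], out[j]) })
	return out
}

// sortKeysBytes returns a copy of keys in Go's default byte-wise order.
func sortKeysBytes(keys []string) []string {
	out := append([]string(nil), keys...)
	sort.Strings(out)
	return out
}

func main() {
	// U+FB33 is one code unit (0xFB33); U+1F600 is a surrogate pair (0xD83D 0xDE00).
	keys := []string{"\uFB33", "\U0001F600"}
	fmt.Printf("utf16 order: %q\n", sortKeysUTF16(keys))
	fmt.Printf("byte order:  %q\n", sortKeysBytes(keys))
}
```

For plain ASCII keys the two orders agree, so the difference only shows up with keys like the ones above.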

Dependency

New forge-core dep: github.com/gowebpki/jcs v1.0.1 — MIT, pure Go, no transitive deps beyond stdlib, ~150 LOC. Chose to depend rather than vendor because the ES6 number-to-string rule is subtle enough that the spec-tested reference impl is the safer bet.

Full sweep + gofmt + lint clean. Docs at docs/security/audit-signing.md rewritten to feature JCS as the canonicalization.

@initializ-mk

Contributor

Reviewed for correctness and the #220 integration — this is a clean reconciliation that resolves the entire #222 review (deadlock, precision, sig-covers-prev_hash, unified verify) and adopts the JCS + sigp scheme-marker from the #220 follow-up. Verified end to end: builds on merged main, audit suite passes including -race (200-goroutine chain concurrency + the deadlock regression), the chain hashes raw line bytes while the signature uses JCS (both reproducible by the verifier), and TestJCS_LargeIntegerFieldsRoundTrip covers the 2^53+1 case. #220 dependency is clean: the branch is cut from #220's tip, and the only shared file (canonicalBytesForSigning) is a conflict-free re-marshal→JCS upgrade (git merge-tree clean, GitHub MERGEABLE). Nice work.

One fix to make before merge:

  • go.mod: github.com/gowebpki/jcs is marked // indirect but it's a direct import (forge-core/runtime/audit_signing.go). Please run go mod tidy in forge-core to promote it to the direct require block (drops the // indirect comment). Trivial, doesn't affect builds, but keeps go.mod honest / passes a go mod tidy CI check.

Non-blocking notes (fine to defer):

  • Threat-model doc line: the chain detects deletion of written events, not suppression-at-emit (a sign/marshal failure drops the event and the chain skips to the next written one — only opsLog records it). The fail-closed knob is the mitigation for a "must be signed" posture.
  • Consider a --require-genesis strict mode so operators verifying a complete stream (vs a SIEM tail) can treat missing genesis as a hard failure rather than a soft warning.

Once go mod tidy is run I'm happy to approve.

Development

Successfully merging this pull request may close these issues.

R5 (governance): add hash-chained audit for tamper evidence