LLM inference callback support by SteveSandersonMS · Pull Request #1689 · github/copilot-sdk

SteveSandersonMS · 2026-06-16T18:23:39Z

Summary

This PR adds SDK support for intercepting LLM inference requests and handling them in user code across all six SDK languages: Node.js, .NET, Python, Go, Rust, and Java.

Consumers register one client-global LlmRequestHandler (constructed once, no args). The runtime invokes it over JSON-RPC (llmInference.*) whenever it would otherwise issue a model-layer HTTP or WebSocket request — for both BYOK and CAPI — fully replacing the outbound call. A handler that overrides nothing is a transparent pass-through.

It includes the full feature work on this branch:

wire up LLM inference provider registration and generated RPC types
add the raw chunked callback protocol for outbound inference requests and responses
cover plain HTTP, streaming responses, cancellation (runtime- and consumer-initiated), error mapping, session-id threading, and WebSocket transport
port the feature to every SDK language with idiomatic HTTP types
keep the public callback surface to a single LlmRequestHandler model with forwarding helpers

What changed

Shared protocol and plumbing

add generated RPC / session event types needed for LLM inference callbacks across all SDK surfaces
add SDK-side registration for a process-global LLM inference provider (llmInference.setProvider)
route outbound inference requests through the callback bridge instead of requiring provider-specific hooks

Per-language ports

Each language exposes the same LlmRequestHandler model, mapped onto the most canonical HTTP representation available in that ecosystem:

| Language | HTTP request/response type | WebSocket type |
| --- | --- | --- |
| Node.js | `Request` / `Response` (Fetch) | per-connection handler |
| .NET | `HttpRequestMessage` / `HttpResponseMessage` | `ClientWebSocket` |
| Python | `httpx` request/response | per-connection handler |
| Go | `http.Request` / `http.Response` | per-connection handler |
| Rust | `http::Request` / `http::Response` | per-connection handler |
| Java | `java.net.http` `HttpRequest` / `HttpResponse` | `java.net.http.WebSocket` |

All ports thread cancellation and session id through the request context, and provide a ForwardingWebSocketHandler (or equivalent) for the common mutate-and-forward case.

API shape

collapse HTTP interception to a single send hook (SendRequestAsync / sendRequest / sendHttp / language equivalents)
expose WebSocket interception through OpenWebSocketAsync / openWebSocket, returning a per-connection handler object
allow consumers to mutate, drop, duplicate, or fully replace request/response messages while keeping the common forwarding case straightforward

Usage examples

C#

using GitHub.Copilot;
using System.Net.Http;

sealed class MyHandler : LlmRequestHandler
{
    protected override async Task<HttpResponseMessage> SendRequestAsync(
        HttpRequestMessage request,
        LlmRequestContext ctx)
    {
        request.Headers.Add("X-Debug-Session", ctx.SessionId ?? "none");
        return await base.SendRequestAsync(request, ctx);
    }

    protected override Task<CopilotWebSocketHandler> OpenWebSocketAsync(LlmRequestContext ctx)
        => Task.FromResult<CopilotWebSocketHandler>(new MyForwardingSocket(ctx));
}

sealed class MyForwardingSocket : ForwardingWebSocketHandler
{
    public MyForwardingSocket(LlmRequestContext ctx)
        : base(ctx)
    {
    }

    public override Task SendRequestMessageAsync(LlmWebSocketMessage message)
    {
        var text = message.GetText().Replace("model-A", "model-B");
        return base.SendRequestMessageAsync(LlmWebSocketMessage.Text(text));
    }
}

Node.js

import {
    ForwardingWebSocketHandler,
    LlmRequestContext,
    LlmRequestHandler,
} from "@github/copilot";

class MyHandler extends LlmRequestHandler {
    protected override async sendRequest(request: Request, ctx: LlmRequestContext): Promise<Response> {
        const headers = new Headers(request.headers);
        headers.set("x-debug-session", ctx.sessionId ?? "none");

        return super.sendRequest(new Request(request, { headers }), ctx);
    }

    protected override async openWebSocket(ctx: LlmRequestContext) {
        return new MyForwardingSocket(ctx);
    }
}

class MyForwardingSocket extends ForwardingWebSocketHandler {
    override sendRequestMessage(data: string | Uint8Array) {
        if (typeof data === "string") {
            return super.sendRequestMessage(data.replace("model-A", "model-B"));
        }
        return super.sendRequestMessage(data);
    }
}

Java

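// CopilotWebSocketHandler is assumed to live in the same package as the other SDK types.
import com.github.copilot.CopilotWebSocketHandler;
import com.github.copilot.ForwardingWebSocketHandler;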
import com.github.copilot.LlmRequestContext;
import com.github.copilot.LlmRequestHandler;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.io.InputStream;

final class MyHandler extends LlmRequestHandler {
    @Override
    protected HttpResponse<InputStream> sendHttp(HttpRequest request, LlmRequestContext ctx) throws Exception {
        HttpRequest mutated = HttpRequest.newBuilder(request, (n, v) -> true)
                .header("X-Debug-Session", ctx.sessionId() == null ? "none" : ctx.sessionId())
                .build();
        return super.sendHttp(mutated, ctx);
    }

    @Override
    protected CopilotWebSocketHandler openWebSocket(LlmRequestContext ctx) {
        return new ForwardingWebSocketHandler(ctx.url(), ctx.headers()) {
            @Override
            protected byte[] onSendRequestMessage(byte[] data, boolean binary) {
                String text = new String(data, java.nio.charset.StandardCharsets.UTF_8).replace("model-A", "model-B");
                return text.getBytes(java.nio.charset.StandardCharsets.UTF_8);
            }
        };
    }
}

Register the handler when constructing the client, e.g. in Java:

CopilotClient client = new CopilotClient(
    new CopilotClientOptions().setLlmInference(new LlmInferenceConfig().setHandler(new MyHandler())));

Tests

Each language adds e2e coverage (mirroring a shared reference suite) for:

callback provider registration
HTTP inference interception
streaming inference interception
error mapping
runtime-initiated and consumer-initiated cancellation
session-id threading
WebSocket callback handling
an idiomatic handler test exercising mutate-and-forward over both HTTP and WebSocket

Plus per-language unit coverage where applicable (e.g. .NET handler/adapter behavior, Node.js unit flows).

Resolves github/copilot-sdk-internal#88

Adds an opt-in llmInference config to CopilotClientOptions that lets SDK consumers register a callback the runtime invokes whenever it would otherwise issue an outbound non-streaming LLM HTTP request itself. v1 scope is TS-only/non-streaming, mirroring the runtime support added in github/copilot-agent-runtime. Streaming SSE and WebSocket transports are out of scope for v1 and continue to bypass the callback.

- New `LlmInferenceProvider` interface with a single `onLlmRequest` method.
- `createLlmInferenceAdapter` converts the provider into the wire-shape `LlmInferenceHandler` consumed by the RPC dispatcher.
- Client wiring: `llmInference.setProvider` is sent on connect; per-session adapter is attached alongside the existing sessionFs hook.
- New `llm_inference.e2e.test.ts` exercises the full RPC round-trip against the runtime.

Resolves github/copilot-sdk-internal#88

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Matches the runtime move of `llmInference.httpRequest` out of the session-scoped client API and onto a new `clientGlobal` schema root.

- Codegen emits a new `registerClientGlobalApiHandlers` alongside the existing `registerClientSessionApiHandlers`. Handlers passed to it are dispatched directly (no per-session `getHandlers` callback) and carry no implicit sessionId; sessionId, when present, is just a payload field on the call.
- `CopilotClient` now constructs the LLM inference adapter once and registers it process-wide via `registerClientGlobalApiHandlers` during connection setup. The per-session `setupLlmInference` path and the `SessionConfigBase.createLlmInferenceProvider` override are removed; there is no longer any per-session notion of which provider to use.
- `LlmInferenceConfig.createLlmInferenceProvider` is now `() => LlmInferenceProvider` (was `(session) => ...`).
- `LlmInferenceRequest` exposes the new optional `sessionId` field so consumers can correlate requests with a runtime session when one is in scope.

E2E test updated to verify the global registration works and that sessionId is populated on in-session traffic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
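To make the difference between the two registration paths concrete, here is a minimal sketch of a dispatcher that supports both. It is not the generated SDK code: `RpcDispatcher`, `registerGlobal`, and `registerSessionScoped` are hypothetical names, and the only details taken from the commit message are that client-global handlers are dispatched directly and that `sessionId` is an ordinary, optional payload field.

```typescript
// Hypothetical sketch: client-global vs. session-scoped RPC dispatch.
type Handler = (params: Record<string, unknown>) => unknown;

class RpcDispatcher {
    private readonly globalHandlers = new Map<string, Handler>();
    private readonly sessionLookups = new Map<string, (sessionId: string) => Handler | undefined>();

    registerGlobal(method: string, handler: Handler): void {
        this.globalHandlers.set(method, handler);
    }

    registerSessionScoped(method: string, getHandler: (sessionId: string) => Handler | undefined): void {
        this.sessionLookups.set(method, getHandler);
    }

    dispatch(method: string, params: Record<string, unknown>): unknown {
        const globalHandler = this.globalHandlers.get(method);
        if (globalHandler) {
            // Client-global: no session lookup; sessionId is just payload.
            return globalHandler(params);
        }
        const lookup = this.sessionLookups.get(method);
        if (!lookup) throw new Error(`No handler registered for ${method}`);
        const sessionId = params["sessionId"];
        if (typeof sessionId !== "string") throw new Error(`${method} requires a sessionId`);
        const handler = lookup(sessionId);
        if (!handler) throw new Error(`Unknown session ${sessionId}`);
        return handler(params);
    }
}

const dispatcher = new RpcDispatcher();
const seen: Array<string | undefined> = [];
dispatcher.registerGlobal("llmInference.httpRequest", (params) => {
    const id = params["sessionId"];
    seen.push(typeof id === "string" ? id : undefined);
    return { accepted: true };
});

// One call outside any session, one call carrying a session id in the payload.
dispatcher.dispatch("llmInference.httpRequest", { url: "https://example.test/models" });
dispatcher.dispatch("llmInference.httpRequest", { url: "https://example.test/chat", sessionId: "s-1" });
```

Both calls reach the same handler instance, which is the property the e2e test in this commit checks.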

With the Rust runtime intercept chokepoint in place, every model-layer HTTP request - including /models and /models/session - is now dispatched through the SDK callback. Update the e2e test to:

- Stub realistic responses for non-streaming model catalog and session endpoints (so the runtime can proceed past model resolution).
- Hard-assert the catalog request is intercepted (no more 'either-or' fallback for the pre-rust-intercept state).

Streaming inference requests still pass through to the recorded CAPI proxy; a fully-mocked end-to-end inference test will land alongside the streaming-intercept commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Extends LlmInferenceProvider with an optional onLlmStreamRequest method that returns a response head synchronously and pushes body chunks via the provided sink. The adapter implements the generated httpStreamStart RPC method and forwards chunks back to the runtime via the typed server-RPC client (llmInference.streamChunk / streamEnd).

Adds a fully-mocked e2e test (test/e2e/llm_inference_stream.e2e.test.ts) that drives a complete user->assistant turn through the callback alone: the runtime hits the callback for /models, /models/session, and the chat completion itself, and the assistant text returned to the SDK consumer is the synthetic text supplied by the stub.

- nodejs/src/llmInferenceProvider.ts: LlmInferenceStreamSink, onLlmStreamRequest, httpStreamStart adapter
- nodejs/src/client.ts: pass a lazy server-RPC accessor into the adapter
- nodejs/src/index.ts: re-export new types
- nodejs/test/e2e/llm_inference_stream.e2e.test.ts: full-mock e2e
- nodejs/src/generated/*, python/*, go/*, rust/*: codegen for new RPC methods
- dotnet/src/Generated/*: codegen for new RPC methods

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds test/e2e/llm_inference_errors.e2e.test.ts that wires a callback whose inference handler throws a synthetic transport error and verifies the failure surfaces to the SDK consumer (the call does not hang and any error caught is non-empty). Confirms the runtime's existing retry / error reporting path handles callback-side failures the same way it handles real transport failures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Mirrors the runtime-side cleanup: the callback wire no longer carries providerType / endpointKind / wireApi / transport / modelId. Adapter stops forwarding the field, e2e tests filter by URL instead of metadata, and the missing LlmInferenceStreamSink / LlmInferenceStreamStartResponse re-exports in types.ts are added so index.ts type-checks cleanly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

[Phase 3] Realign the Node SDK with the runtime's new four-method chunk protocol.

One unified provider callback:

    interface LlmInferenceProvider {
        onLlmRequest(req: LlmInferenceRequest): Promise<void>;
    }

LlmInferenceRequest exposes:

* url / method / headers / sessionId
* requestBody: AsyncIterable<Uint8Array> // body delivered as chunks
* responseBody: LlmInferenceResponseSink // start/write/end/error

The sink enforces start -> 0..N writes -> exactly one of end/error and maps each call to the corresponding httpResponseStart / httpResponseChunk RPC.

createLlmInferenceAdapter maintains a per-requestId state map; the generated httpRequestStart handler registers state synchronously and fires onLlmRequest in the background, so the runtime's RPC reply isn't gated on consumer I/O. The body queue iterator now latches a 'done' flag so a consumer that calls .next() again after end:true gets done back instead of blocking forever waiting for chunks the runtime will never send.

Removes the previous onLlmRequest + onLlmStreamRequest split and the LlmInferenceResponse / LlmInferenceStreamSink / LlmInferenceStreamStartResponse public types.

All three e2e tests rewritten against the unified callback (one of them URL-dispatches /responses -> SSE and /chat/completions -> buffered JSON; the consumer can also branch on whether the request body has stream:true).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
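The sink ordering rule (start, then any number of writes, then exactly one terminal call) can be sketched as a small state machine. This is an illustration only, not the SDK's `LlmInferenceResponseSink`: the `ResponseSink` class, the `SinkFrame` shape, and the choice to throw on misuse are assumptions, as is the rule that `error` is only legal after `start`.

```typescript
// Hypothetical sketch of a sink enforcing: start -> 0..N writes -> end | error.
type SinkState = "idle" | "started" | "finished";

interface SinkFrame {
    kind: "start" | "chunk" | "end" | "error";
    payload?: unknown;
}

class ResponseSink {
    private state: SinkState = "idle";
    private readonly emit: (frame: SinkFrame) => void;

    constructor(emit: (frame: SinkFrame) => void) {
        this.emit = emit;
    }

    start(status: number, headers: Record<string, string> = {}): void {
        if (this.state !== "idle") throw new Error("start() may only be called once, first");
        this.state = "started";
        this.emit({ kind: "start", payload: { status, headers } });
    }

    write(chunk: Uint8Array): void {
        if (this.state !== "started") throw new Error("write() is only valid between start() and end()/error()");
        this.emit({ kind: "chunk", payload: chunk });
    }

    end(): void {
        this.finish({ kind: "end" });
    }

    error(code: string, message: string): void {
        this.finish({ kind: "error", payload: { code, message } });
    }

    // Shared terminal transition: allowed exactly once, and only after start().
    private finish(frame: SinkFrame): void {
        if (this.state !== "started") throw new Error("exactly one of end()/error() is allowed, after start()");
        this.state = "finished";
        this.emit(frame);
    }
}

const emitted: SinkFrame[] = [];
const sink = new ResponseSink((frame) => emitted.push(frame));
sink.start(200, { "content-type": "text/event-stream" });
sink.write(new TextEncoder().encode("data: hello\n\n"));
sink.end();

// A write after the terminal call is rejected instead of being sent.
let rejected = false;
try {
    sink.write(new Uint8Array());
} catch {
    rejected = true;
}
```

Rejecting misuse at the sink keeps protocol errors on the consumer side rather than letting a malformed frame sequence reach the runtime.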

Phase 4.1: expose an AbortSignal on the request envelope, abort it on a cancel chunk from the runtime, and map consumer-side aborts to a 499 + error{code:cancelled} response. Adds the cancellation e2e test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
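As a rough illustration of the two halves of this change, the sketch below aborts a per-request `AbortController` when a cancel frame arrives and maps an aborted request to a 499 with `code: "cancelled"`. The `RequestState` class, the `cancel` frame field, and the 502 / `transport_error` fallback for non-cancellation failures are assumptions made for the example; only the 499 + `cancelled` mapping comes from the commit message.

```typescript
// Hypothetical sketch: runtime cancel -> AbortSignal, abort -> 499 "cancelled".
interface TerminalResponse {
    status: number;
    error?: { code: string; message: string };
}

class RequestState {
    readonly controller = new AbortController();

    // Called for each frame from the runtime; a cancel frame aborts the signal
    // that the consumer's handler is observing.
    onRuntimeFrame(frame: { cancel?: boolean }): void {
        if (frame.cancel) this.controller.abort();
    }
}

function mapFailure(err: unknown, signal: AbortSignal): TerminalResponse {
    const isAbort = signal.aborted || (err instanceof Error && err.name === "AbortError");
    if (isAbort) {
        return { status: 499, error: { code: "cancelled", message: "request cancelled" } };
    }
    // Assumed fallback for any other failure; the real mapping may differ.
    const message = err instanceof Error ? err.message : String(err);
    return { status: 502, error: { code: "transport_error", message } };
}

const state = new RequestState();
state.onRuntimeFrame({ cancel: true });

// The upstream call fails because the signal was aborted: reported as cancelled.
const cancelled = mapFailure(new Error("socket hang up"), state.controller.signal);
// A failure on a request that was never cancelled is reported as a plain error.
const failed = mapFailure(new Error("connection refused"), new AbortController().signal);
```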

Add an e2e test asserting that when the SDK consumer signals a terminal error via responseBody.error({ code: 'cancelled' }) the runtime surfaces it faithfully as a request failure rather than hanging. Completes the consumer->runtime direction of Phase 4.1. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Surface the new `transport` discriminator on `LlmInferenceRequest` so consumers can tell an `"http"` request (plain HTTP / SSE) from a `"websocket"` one (full-duplex: each request-body chunk is one inbound WS message, each response-body write one outbound message). The adapter threads `params.transport` through, defaulting to `"http"`. Regenerate rpc.ts against the runtime schema for the new field and add an e2e test exercising the full-duplex path: the fake model advertises `ws:/responses`, the runtime's WebSocket flag is enabled via env var, and the consumer pumps `/responses` events back per inbound message. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Friendly product-code starting point for SDK consumers who want to observe or mutate LLM inference requests/responses by overriding virtual methods on a base class. Implements LlmInferenceProvider, so an instance can be returned directly from createLlmInferenceProvider.

Default behaviour is a transparent pass-through: each request is forwarded to its original URL via the WHATWG fetch global (HTTP) or WebSocket global (WebSocket), and the upstream response is streamed back unchanged. The same subclass handles both transports - onLlmRequest dispatches on req.transport.

Virtual hooks:

- HTTP: transformRequest, forward, transformResponse
- WebSocket: forwardWebSocket, transformRequestMessage, transformResponseMessage

E2e test (llm_inference_handler.e2e.test.ts) demonstrates a single TestHandler subclass servicing both an HTTP turn (single-shot title generation) and a WebSocket turn (main agent turn) against a per-test in-process http+ws upstream that speaks the real CAPI shapes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Review fixes for github/copilot-sdk-internal#88 (Node SDK side).

- Honor the runtime's accepted=false ack: the response sink now aborts the provider's signal and stops emitting once the runtime drops the request (I1).
- Add a staging backstop in the adapter so a body chunk that arrives before its start frame is buffered and replayed rather than silently dropped (B1).
- Run the WebSocket request/response pumps concurrently and race their terminal states, so an upstream-closes-first (or runtime-cancels-first) case tears the other side down instead of hanging on a parked iterator (B2).
- Buffer inbound WS frames in wrapGlobalWebSocket until onMessage is registered so the first frames of a fast upstream aren't dropped.
- Collapse the dead send branch, hoist TextEncoder/TextDecoder singletons, and correct the LlmWebSocketUpstream.onClose contract doc.
- Update CopilotClientOptions.llmInference docs: streaming SSE and WebSocket are intercepted, not bypassed (I6).
- Add unit tests: chunk-before-start staging, accepted=false abort, WS upstream-close-first finalisation, and WS upstream-error propagation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
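The "buffer until onMessage is registered" fix can be illustrated with a small queue that holds early frames and replays them in arrival order. This is a sketch of the general technique rather than the code in `wrapGlobalWebSocket`; `BufferedInbound` and its method names are hypothetical.

```typescript
// Hypothetical sketch: hold inbound frames until a listener is attached.
type MessageListener = (data: string | Uint8Array) => void;

class BufferedInbound {
    private listener: MessageListener | undefined;
    private readonly pending: Array<string | Uint8Array> = [];

    // Called by the transport for every inbound frame.
    push(data: string | Uint8Array): void {
        if (this.listener) {
            this.listener(data);
        } else {
            this.pending.push(data);
        }
    }

    // Registering the listener first replays everything that arrived early.
    onMessage(listener: MessageListener): void {
        this.listener = listener;
        for (const data of this.pending.splice(0)) {
            listener(data);
        }
    }
}

const inbound = new BufferedInbound();
inbound.push("first"); // arrives before anyone is listening
inbound.push("second");

const received: string[] = [];
const decoder = new TextDecoder();
inbound.onMessage((data) => {
    received.push(typeof data === "string" ? data : decoder.decode(data));
});
inbound.push("third"); // delivered directly once the listener exists
```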

Drives a CAPI session and a BYOK (openai/responses) session entirely through the LLM inference callback — the consumer fabricates every model-layer response, so the CAPI record/replay proxy is never the inference endpoint. Asserts each in-session inference request carries req.sessionId === session.sessionId and that the two session ids differ. The mock branches /responses on the request stream flag: BYOK turns whose config-derived model does not advertise streaming issue a buffered (non-streaming) /responses request expecting a single JSON response object, whereas the CAPI turn streams via SSE. This mirrors real upstream behaviour and confirms the callback transport faithfully delivers both shapes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
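A mock that branches on the request's stream flag, as described above, might look roughly like the following. The function name, the response shape, and the event payloads are simplified placeholders invented for the example; they are not the real CAPI or OpenAI response schemas and not the test's actual code.

```typescript
// Hypothetical sketch: one mock endpoint, two response shapes.
interface MockResponse {
    status: number;
    headers: Record<string, string>;
    body: string;
}

function mockResponsesEndpoint(requestBody: string): MockResponse {
    const parsed = JSON.parse(requestBody) as { stream?: boolean };
    const text = "hello from the mock";

    if (parsed.stream === true) {
        // Streaming turn: a sequence of server-sent events (placeholder payloads).
        const events = [{ type: "delta", text }, { type: "done" }];
        const body = events.map((event) => `data: ${JSON.stringify(event)}\n\n`).join("");
        return { status: 200, headers: { "content-type": "text/event-stream" }, body };
    }

    // Buffered turn: a single JSON object (placeholder shape).
    return {
        status: 200,
        headers: { "content-type": "application/json" },
        body: JSON.stringify({ output_text: text }),
    };
}

const streamed = mockResponsesEndpoint(JSON.stringify({ model: "m", stream: true }));
const buffered = mockResponsesEndpoint(JSON.stringify({ model: "m" }));
```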

Mirrors the TypeScript LLM inference callback feature in the .NET SDK so consumers can observe/mutate the model-layer HTTP/WebSocket requests the runtime issues (CAPI and BYOK), with the runtime session id threaded into each callback.

- scripts/codegen/csharp.ts: emit the clientGlobal handler interface + registration so Rpc.cs gains the llmInference handler surface.
- LlmInferenceProvider.cs: low-level ILlmInferenceProvider API + adapter (request staging, response sink state machine) behind an internal ILlmInferenceResponseChannel seam for unit testing.
- LlmRequestHandler.cs: idiomatic pass-through base class mapping to HttpRequestMessage/HttpResponseMessage and ClientWebSocket, with virtual transform/forward hooks for both transports.
- Types.cs/Client.cs: wire LlmInferenceConfig into the client and register the provider on start.
- Tests: factored unit-test infra (recording channel/sink, inline provider, frame builders) with adapter + handler tests, plus CAPI+BYOK e2e tests asserting the session id reaches the callback. e2e provider emits raw JSON (reflection-free STJ) and serves all model-layer traffic off-network.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Hide the redundant low-level provider interface and adapter from the public surface in both SDKs; the sole public extension point is now the LlmRequestHandler base class. Replace the LlmInferenceConfig provider factory with a direct handler instance (the provider is client-global, constructed once with no args).

.NET: ILlmInferenceProvider + the LlmInferenceRequest/ResponseInit/ResponseSink DTOs become internal; LlmRequestHandler implements the interface explicitly so OnLlmRequestAsync leaves its public surface. LlmInferenceConfig.Handler replaces the Func<LlmRequestHandler> factory.

TS: stop exporting LlmInferenceProvider and createLlmInferenceAdapter from index.ts; LlmInferenceConfig.handler replaces createLlmInferenceProvider. The request/sink DTOs stay exported as onLlmRequest's contract (TS lacks explicit interface implementation).

E2E providers become LlmRequestHandler subclasses overriding onLlmRequest.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Collapse the HTTP callback seam to SendRequest/sendRequest, replace websocket hooks with per-connection handlers, and update tests to use the forwarding handler model. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…schema Regenerate generated RPC/session-event code for ts/csharp/python/go/rust from the current runtime contract schema (runtime main + llm inference callbacks). The opaque-json fields (e.g. SQLite bind params) are now typed as `unknown` since the Rust contract emitter drops the `type` annotation on `x-opaque-json` fields. Update normalizeSqliteParams to accept the opaque `unknown`-valued param map and validate each value at runtime (string/number/null), narrowing safely at the boundary instead of casting. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
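Narrowing an `unknown`-valued map at the boundary, as this commit describes, can be sketched as follows. This is not the SDK's `normalizeSqliteParams`: the function name `normalizeParams`, the `SqliteValue` alias, and the decision to throw a `TypeError` on an unsupported value are assumptions made for the example.

```typescript
// Hypothetical sketch: validate opaque bind parameters instead of casting.
type SqliteValue = string | number | null;

function normalizeParams(params: Record<string, unknown> | undefined): Record<string, SqliteValue> {
    const result: Record<string, SqliteValue> = {};
    for (const [name, value] of Object.entries(params ?? {})) {
        if (typeof value === "string" || typeof value === "number" || value === null) {
            // The type guard narrows `unknown` to SqliteValue; no cast needed.
            result[name] = value;
        } else {
            throw new TypeError(`Unsupported bind parameter "${name}" of type ${typeof value}`);
        }
    }
    return result;
}

const ok = normalizeParams({ id: 7, name: "a", note: null });

// Anything outside string/number/null is rejected at the boundary.
let threw = false;
try {
    normalizeParams({ bad: { nested: true } });
} catch (err) {
    threw = err instanceof TypeError;
}
```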

+        catch (Exception ex)
+        {
+            if (state.Cancelled || state.Abort.IsCancellationRequested)
+            {
+                // The runtime already cancelled this request; the provider's
+                // throw is just the abort propagating out of its upstream call.
+                await FinishCancelled(sink, state).ConfigureAwait(false);
+                return;
+            }
+
+            await FailViaSink(sink, state, ex.Message).ConfigureAwait(false);
+        }


+        catch
+        {
+            // Best-effort — the connection may already be dead.
+        }


+        catch
+        {
+            // Best-effort — the runtime already dropped the request on cancel.
+        }


+{
+    private readonly string _url;
+    private readonly IReadOnlyDictionary<string, IReadOnlyList<string>> _headers;
+    private WebSocket? _upstream;


+    private readonly string _url;
+    private readonly IReadOnlyDictionary<string, IReadOnlyList<string>> _headers;
+    private WebSocket? _upstream;
+    private CancellationTokenSource? _pumpCts;


+            catch
+            {
+                // Some headers are managed by the handshake; ignore rejections.
+            }


+        catch (Exception ex)
+        {
+            await CloseAsync(new LlmWebSocketCloseStatus
+            {
+                Description = ex.Message,
+                Error = ex,
+            }).ConfigureAwait(false);
+        }


+        catch
+        {
+            // Best-effort; the socket may already be closed.
+        }


+        catch
+        {
+            // Best-effort teardown only.
+        }


Port the LLM inference callback feature to the Python SDK, mirroring the existing Node.js and .NET implementations. Consumers subclass `LlmRequestHandler` and override `send_request` (idiomatic httpx) for HTTP or `open_web_socket` (websockets) for the WebSocket transport; both default to transparent pass-through. Wired through `LlmInferenceConfig` on the client, registered on the `clientGlobal.llmInference` scope. Adds the low-level provider/adapter, the httpx-based handler base class, client wiring, public exports, and httpx as a core dependency. Extends the Python codegen to emit clientGlobal handler registration and regenerates the generated RPC bindings. Includes 8 e2e test files (10 tests) mirroring the Node.js suite — round trip, session-id threading (CAPI + BYOK), streaming SSE, error mapping, runtime cancel, consumer cancel, WebSocket transport, and the idiomatic handler against a real local HTTP+WebSocket upstream. All pass off-network. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-advanced-security

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Mirror the existing Node/.NET/Python LLM inference callback support in the Go SDK. Consumers register an LlmInferenceProvider (or the idiomatic LlmRequestHandler over net/http + coder/websocket) via ClientOptions.LlmInference; the runtime routes every model-layer HTTP and WebSocket request through it for both CAPI and BYOK sessions.

- Codegen (scripts/codegen/go.ts) now emits the clientGlobal handler registration, regenerating go/rpc/zrpc.go.
- New low-level provider types + adapter (llm_inference_provider.go) and the idiomatic forwarding handler (llm_request_handler.go).
- Wire LlmInferenceConfig into ClientOptions and the connect/start paths.
- 8 off-network e2e scenarios mirroring the other SDKs (basic, session id, stream, errors, cancel, consumer cancel, websocket, handler).

Also fixes a pre-existing Go e2e compile break (AttachmentBlob.Data became *string in the Rust contract regen baseline) that blocked the e2e package.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-06-19T17:49:41Z

Cross-SDK Consistency Review

This PR adds LLM inference callback support to Go and .NET, bringing the feature to 4 of the 6 SDK implementations (Node.js and Python already had it). Overall the API shapes are well-aligned across languages. A few observations:

✅ Well-aligned across the four covered SDKs

| Concept | Node.js | Python | Go | .NET |
| --- | --- | --- | --- | --- |
| Entry class | `LlmRequestHandler` | `LlmRequestHandler` | `LlmRequestHandler` | `LlmRequestHandler` |
| HTTP hook | `sendRequest(req, ctx)` | `send_request(req, ctx)` | `Transport` field (`http.RoundTripper`) | `SendRequestAsync(req, ctx)` |
| WS hook | `openWebSocket(ctx)` | `open_web_socket(ctx)` | `OpenWebSocket` field (func) | `OpenWebSocketAsync(ctx)` |
| Context | `LlmRequestContext` | `LlmRequestContext` | `LlmRequestContext` | `LlmRequestContext` |
| WS handler | `CopilotWebSocketHandler` | `CopilotWebSocketHandler` | `CopilotWebSocketHandler` | `CopilotWebSocketHandler` |
| Forwarding WS | `ForwardingWebSocketHandler` | `ForwardingWebSocketHandler` | `ForwardingWebSocketHandler` | `ForwardingWebSocketHandler` |
| Config | `LlmInferenceConfig { handler }` | `LlmInferenceConfig { handler }` | `LlmInferenceConfig { Handler }` | `LlmInferenceConfig { Handler }` |
| Low-level interface | `LlmInferenceProvider` | `LlmInferenceProvider` | `LlmInferenceProvider` | `ILlmInferenceProvider` |

Naming and structure follow each language's conventions (camelCase/snake_case/PascalCase, AbortSignal/asyncio.Event/context.Context/CancellationToken for cancellation, etc.).

⚠️ Remaining feature gap: Java and Rust SDKs

LLM inference callback support is not yet implemented in the Java and Rust SDKs. These two SDKs currently have no LlmInferenceProvider, LlmRequestHandler, or related types. If Java and Rust parity is intended, they'd need equivalent implementations. Is there a tracking issue or follow-up planned for those two?

ℹ️ Language-idiomatic design differences (not issues)

These are expected given the languages involved:

Go's LlmRequestHandler uses struct fields for customization (Transport http.RoundTripper, OpenWebSocket func(...)) instead of method overrides. This is the idiomatic Go approach with no class inheritance.
Go's ForwardingWebSocketHandler adds OnSendRequestMessage and OnSendResponseMessage function fields for inline message transformation, avoiding the need for a full subtype. The other SDKs accomplish the same thing via subclassing.
.NET uses LlmWebSocketMessage (a value-type wrapper with IsBinary/GetText()) where Node.js and Python use string | Uint8Array / str | bytes union types. Both representations carry the same semantic content.

Generated by SDK Consistency Review Agent for issue #1689 · sonnet46 2M

Wires the per-client llmInference callback into the Rust SDK: an LlmInferenceProvider trait and the higher-level LlmRequestHandler base (idiomatic http/reqwest types, transparent pass-through default), the client-global dispatcher and router intercept, and ProviderConfig/SessionConfig plumbing. Covers both BYOK and CAPI for HTTP and WebSocket transports, with cancellation wired in both directions.

Adds eight off-network e2e tests (round-trip, streaming/SSE, session-id threading across CAPI+BYOK, handler errors, runtime- and consumer-driven cancellation, WebSocket transport, and idiomatic handler forwarding through hand-rolled local upstreams).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Port the LLM inference callback feature to the Java SDK, the final of six language bindings (Node, .NET, Python, Go, Rust, Java). Consumers register one client-global `LlmRequestHandler` via `CopilotClientOptions.setLlmInference(...)`. The runtime invokes it over JSON-RPC (`llmInference.*`) whenever it would issue a model-layer HTTP or WebSocket request, for both BYOK and CAPI, fully replacing the outbound call. The public surface uses idiomatic `java.net.http` types (`HttpRequest`/`HttpResponse`/`WebSocket`). `LlmRequestHandler` exposes overridable `sendHttp` and `openWebSocket` seams; `ForwardingWebSocketHandler` provides transparent pass-through by default. Inbound request frames are hand-parsed in `LlmInferenceAdapter`; outbound response frames go through the generated `ServerLlmInferenceApi`. Adds 8 off-network e2e tests mirroring the Go reference suite (round-trip, streaming, session-id threading, errors, runtime cancel, consumer cancel, WebSocket, and an idiomatic handler test with a hand-rolled `FakeUpstreamServer`). Regenerates the Java codegen baseline from the runtime schema. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

SteveSandersonMS · 2026-06-22T10:39:41Z

+    /// request/response.
+    /// </summary>
+    public LlmRequestHandler? Handler { get; set; }
+}


Why do we have this wrapper type and not just put LlmRequestHandler directly onto the create/resume config?

SteveSandersonMS · 2026-06-22T10:40:02Z

  </PropertyGroup>

+    <ItemGroup>
+        <InternalsVisibleTo Include="GitHub.Copilot.SDK.Test" />


Is this IVT really needed?

SteveSandersonMS · 2026-06-22T10:40:18Z

+/// model-layer request.
+/// </summary>
+[Experimental(Diagnostics.Experimental)]
+public enum LlmInferenceTransport


Why is this public when everything else in here is internal? Maybe it could move to the file with all the public API.

SteveSandersonMS · 2026-06-22T10:44:03Z

+        // Return from httpRequestStart immediately (after registering state) so
+        // the runtime's RPC reply is not gated on the consumer's I/O. The actual
+        // provider work runs asynchronously.
+        _ = RunProviderAsync(llmRequest, state, sink);


Why are you doing that? What's wrong with holding the RPC reply until the consumer's I/O completes? Isn't that necessary to achieve natural backpressure and error propagation?

Altogether I'm surprised by how much complexity there is in this file and wonder if you're trying to make it too clever. Why isn't there a significantly simpler way of adapting the RPC protocol to the public base class, for example by eliminating any separate queueing/buffering/etc?

SteveSandersonMS · 2026-06-22T10:45:03Z

+            }
+
+            return result;
+        }


Seems really odd to have an adapter like this when the source data is already in the right format. What stops us from simplifying all this?

Collapse the two-layer .NET implementation into a single file. The low-level LlmInferenceProvider abstraction existed only as a test seam and indirection layer; fold its essentials into LlmRequestHandler.

- Merge the request DTO, response sink, body channel, and response channel into one internal LlmInferenceExchange
- Have LlmInferenceAdapter talk to ServerRpc directly, removing the channel-interface indirection and the dead _staged backstop
- Flatten LlmInferenceConfig wrapper to a flat LlmInferenceHandler option property
- Move the public LlmInferenceTransport enum into the public file
- Remove InternalsVisibleTo and the 3 mock-based unit test files; the HTTP round-trip is fully covered by the e2e tests
- De-dup the two forbidden-header sets into one shared static

Net: 1375 lines across 2 production files to 1099 in 1 file, minus 513 lines of mock test scaffolding. Public API surface unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Trim three pieces of remaining machinery without changing behavior the e2e tests cover:

- Drop the accepted:false abort plumbing (RejectedByRuntime + the per-frame ack checks). Runtime cancellation already arrives as an explicit cancel frame, so the ack was a redundant second signal.
- Collapse the WebSocket response bridge: emit the start(101) frame lazily on first message/terminal under the existing lock instead of buffering messages in a queue until an explicit StartAsync. This also preserves the clean 502 on upstream-connect failure (eager start would have surfaced 101 + error).
- Fold the LlmWebSocketHelpers statics into ForwardingWebSocketHandler, their only caller.

LlmRequestHandler.cs: 1099 to 994 lines.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Rename the consumer-facing types so the public surface reads as "Copilot request" interception rather than "LLM inference":

- LlmRequestHandler -> CopilotRequestHandler
- LlmRequestContext -> CopilotRequestContext
- LlmInferenceTransport -> CopilotRequestTransport
- ForwardingWebSocketHandler -> ForwardingCopilotWebSocketHandler
- LlmWebSocketMessage -> CopilotWebSocketMessage
- LlmWebSocketCloseStatus -> CopilotWebSocketCloseStatus
- CopilotClientOptions.LlmInferenceHandler -> .RequestHandler

Properties/methods keep succinct names; only types carry the Copilot prefix. Generated RPC/wire types (ILlmInferenceHandler, LlmInferenceHttp* DTOs) are untouched - they follow the shared schema. Renames the source file and the two e2e test files to match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Mirror the .NET simplification + terminology rename in the Node SDK: consolidate the provider/handler two-layer design into a single copilotRequestHandler.ts, trim the accepted:false plumbing and the staged backstop, and rename the public Llm* types to Copilot* (types carry the prefix; properties/methods stay succinct). The session option becomes requestHandler?: CopilotRequestHandler.

The WebSocket response bridge starts the 101 upgrade head eagerly (ctx[kBridge].start()) because the runtime gates the connect on it; a lazy first-write start deadlocks. Generated RPC/wire types are left untouched.

Drop the mock unit test and the six fabrication e2e tests (covered by the handler e2e); keep and rename the handler and session-id e2e tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The WebSocket response bridge emitted the 101 upgrade head lazily (on the first upstream message), which deadlocks: the runtime gates the WebSocket connect on receiving the 101 before it sends any request chunks, but the upstream stays silent until it gets a request message, so the head never fires. Emit it eagerly via LlmWebSocketResponseBridge.StartAsync() right after OpenAsync(), mirroring the Node SDK fix; the lazy start-on-first-write path remains a harmless backstop.

Add CopilotRequestWebSocketE2ETests, a WebSocket e2e regression test that drives a full turn over the WS transport through a ForwardingCopilotWebSocketHandler against an in-process HttpListener upstream. It is the .NET counterpart to the Node handler e2e that originally caught this. Gated to net8.0+.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
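The deadlock described above can be illustrated with a small TypeScript simulation. This is a sketch under stated assumptions, not the SDK's bridge: `Frame`, `WebSocketResponseBridge`, and `runTurn` are hypothetical names, and the runtime and upstream are reduced to two rules taken from the commit message (the runtime sends its request only after it sees the 101 head; the upstream replies only after it receives a request).

```typescript
// Hypothetical frame shapes for the response side of the bridge.
type Frame =
  | { kind: "start"; status: number }
  | { kind: "message"; data: string }
  | { kind: "end" };

class WebSocketResponseBridge {
  private started = false;

  constructor(private readonly emit: (frame: Frame) => void) {}

  // Emits the 101 upgrade head exactly once; later calls are no-ops.
  start(): void {
    if (this.started) return;
    this.started = true;
    this.emit({ kind: "start", status: 101 });
  }

  // Lazy backstop: if start() was never called, the first write emits the head.
  message(data: string): void {
    this.start();
    this.emit({ kind: "message", data });
  }

  end(): void {
    this.start();
    this.emit({ kind: "end" });
  }
}

// Simulates one turn and returns the frames the runtime received.
function runTurn(eagerStart: boolean): Frame[] {
  const frames: Frame[] = [];
  const bridge = new WebSocketResponseBridge((frame) => frames.push(frame));

  if (eagerStart) bridge.start();

  // Runtime rule: it sends its request only after it has seen the 101 head.
  const runtimeSentRequest = frames.some((f) => f.kind === "start");

  // Upstream rule: it stays silent until it receives a request.
  if (runtimeSentRequest) {
    bridge.message("reply");
    bridge.end();
  }
  return frames;
}

const eagerFrames = runTurn(true);  // start, message, end
const lazyFrames = runTurn(false);  // empty: neither side ever makes the first move
```

With a lazy-only start, the head waits on an upstream message that can only arrive after the head, so no frame is ever emitted. The idempotent `start()` lets the eager call and the lazy backstop coexist without a duplicate 101.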

SteveSandersonMS changed the title ~~Simplify LLM inference callback handlers~~ LLM inference callback support Jun 16, 2026

stevesa and others added 17 commits June 19, 2026 15:39

Refine LLM inference callback handlers

92a829b

Collapse the HTTP callback seam to SendRequest/sendRequest, replace websocket hooks with per-connection handlers, and update tests to use the forwarding handler model. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
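A single `sendRequest` seam with a forwarding default could be sketched as follows in TypeScript, using the Fetch `Request` / `Response` types the PR description lists for Node.js. The class names `RequestHandlerSketch` and `AuditingHandler`, the `FetchLike` type, and the `x-audit` header are all hypothetical; the real SDK types and signatures may differ.

```typescript
// Hypothetical transport signature: takes a Fetch Request, returns a Response.
type FetchLike = (request: Request) => Promise<Response>;

// Base handler: overriding nothing gives a transparent pass-through.
class RequestHandlerSketch {
  // The forwarding function is injectable so it can be stubbed in tests;
  // by default it forwards to the global fetch.
  constructor(private readonly forward: FetchLike = (req) => fetch(req)) {}

  async sendRequest(request: Request): Promise<Response> {
    return this.forward(request);
  }
}

// Example override: record the URL, add a header, then forward as usual.
class AuditingHandler extends RequestHandlerSketch {
  readonly seen: string[] = [];

  async sendRequest(request: Request): Promise<Response> {
    this.seen.push(request.url);
    const headers = new Headers(request.headers);
    headers.set("x-audit", "1"); // illustrative header, not part of any protocol
    return super.sendRequest(new Request(request, { headers }));
  }
}
```

A consumer that only wants to observe or lightly modify traffic overrides `sendRequest` and delegates to the base implementation, while a consumer that wants to fabricate a response can return its own `Response` without forwarding.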

SteveSandersonMS force-pushed the stevesandersonms/llm-inference-callbacks branch from 815bbd0 to 7bc95c0 Compare June 19, 2026 15:00


github-advanced-security AI found potential problems Jun 19, 2026


SteveSandersonMS and others added 2 commits June 22, 2026 10:27

SteveSandersonMS commented Jun 22, 2026


SteveSandersonMS and others added 5 commits June 22, 2026 12:16

SteveSandersonMS wants to merge 26 commits into main from stevesandersonms/llm-inference-callbacks


2 participants

SteveSandersonMS commented Jun 16, 2026 (edited)

Summary

What changed

Shared protocol and plumbing

Per-language ports

API shape

Usage examples

C#

Node.js

Java

Tests


github-advanced-security AI left a comment


github-actions Bot commented Jun 19, 2026

Cross-SDK Consistency Review

✅ Well-aligned across the four covered SDKs

⚠️ Remaining feature gap: Java and Rust SDKs

ℹ️ Language-idiomatic design differences (not issues)
