docs(design-proposals): add ephemeral VM sessions (VMSession) proposal#28
Andrei Kvapil (kvaps) wants to merge 3 commits into
Propose VMSession, an on-demand short-lived isolated VM created by cloning an existing master VM as-is, starting it, and reclaiming it on teardown. Built on existing primitives (vm-disk source.disk clone, vm-instance, tenant-level Cilium isolation). Includes an alternative VMTemplate + templateRef shape and open questions on placement, A<->B isolation, and cross-tenancy. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
📝 Walkthrough
Adds a new design proposal document (
Changes: Ephemeral VM Sessions Design Proposal
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request introduces a design proposal for Ephemeral VM sessions (VMSession) in Cozystack, which provides on-demand, short-lived isolated environments by cloning existing VMs. The review feedback highlights two key areas for improvement: first, routing session traffic through a tenant-level proxy or ingress gateway instead of exposing each VM directly to prevent external IP and port exhaustion; second, ensuring the controller explicitly handles volume detachment locks and timeouts to avoid stuck ReadWriteOnce (RWO) mounts during rapid session handovers.
### Reachability

The session VM is reached through the VM's own Service: `vm-instance` exposes ports with `external: true` and `externalPorts` (e.g. `[22]` for SSH, with public keys injected via cloud-init `nocloud`). A browser web-terminal is just another exposed port.
Exposing each ephemeral session VM using external: true (which provisions a LoadBalancer or external port) can quickly exhaust the cluster's external IP pool and NodePort range, especially under high-frequency churn (e.g., AI coding agents or interactive playgrounds). Consider routing traffic through a shared ingress/gateway or a proxy pod within the tenant namespace that multiplexes connections to the active session VMs internally.
Suggested change: The session VM is reached through the VM's own Service. To prevent external IP and port exhaustion from high-frequency ephemeral sessions, instead of exposing each VM directly with `external: true`, traffic should be routed through a tenant-level proxy or ingress gateway that multiplexes connections to the active session VMs internally.
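To make the multiplexing idea in this suggestion concrete, here is a minimal Python sketch of the lookup a tenant-level proxy would need: one external entry point, with each connection resolved to a cluster-internal backend by session ID. The `SessionRouter` class, its method names, and the example address are hypothetical and do not come from the proposal or from Cozystack.

```python
from typing import Dict, Optional, Tuple

# (cluster-internal host, port) of a session VM; never exposed externally.
Backend = Tuple[str, int]


class SessionRouter:
    """Hypothetical routing table for a shared tenant-level proxy."""

    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}

    def register(self, session_id: str, host: str, port: int) -> None:
        # Called when a session VM becomes ready.
        self._backends[session_id] = (host, port)

    def unregister(self, session_id: str) -> None:
        # Called on teardown; unknown IDs are ignored so teardown is idempotent.
        self._backends.pop(session_id, None)

    def resolve(self, session_id: str) -> Optional[Backend]:
        # Returns None for unknown or already reclaimed sessions.
        return self._backends.get(session_id)


router = SessionRouter()
router.register("sess-a", "10.244.1.17", 22)
```

With this shape, session churn only changes entries in the table, so the number of external IPs or ports stays constant regardless of how many sessions are created.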
### Persistence (optional)

"Clone as-is" is stateless by default: the clone dies with the session. If a workspace must persist across sessions, its files live on a separate disk that outlives the clone and is re-attached to the next session's clone. This gives resume semantics (the project is where you left it) without a snapshot resource.
When re-attaching a persistent disk (typically ReadWriteOnce) across sessions, Kubernetes volume detachment can be slow or get stuck if the previous session VM is uncleanly terminated. This can block the spin-up of the new session. The proposal should address how the controller handles volume attachment locks and ensures safe, rapid detachment before the new session starts.
Suggested change: "Clone as-is" is stateless by default: the clone dies with the session. If a workspace must persist across sessions, its files live on a separate disk that outlives the clone and is re-attached to the next session's clone. This gives resume semantics (the project is where you left it) without a snapshot resource. Note that the controller must explicitly handle volume detachment locks and timeouts to prevent stuck ReadWriteOnce (RWO) mounts during rapid session handovers.
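One way to picture the detachment handling requested here is a bounded wait before the next session's clone attaches the workspace disk. The following is a minimal Python sketch under stated assumptions: `is_attached` is a hypothetical probe supplied by the caller (for example, a check of whether the previous session's VM still holds the volume), and the default timeout values are illustrative only.

```python
import time
from typing import Callable


class DetachTimeout(Exception):
    """Raised when the workspace disk is still attached after the deadline."""


def wait_for_detach(
    is_attached: Callable[[], bool],
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Poll until the disk is released and return the seconds waited.

    `is_attached` is a hypothetical probe; `sleep` and `clock` are
    injectable so the loop can be exercised without real delays.
    """
    start = clock()
    while is_attached():
        waited = clock() - start
        if waited >= timeout_s:
            # Surface the stuck mount instead of blocking the new session forever.
            raise DetachTimeout(f"disk still attached after {waited:.1f}s")
        sleep(poll_s)
    return clock() - start
```

Raising on timeout rather than waiting indefinitely lets the controller report a stuck RWO mount as a session failure; what it should do next (retry, force-detach, or fail the session) is left open by the review.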
🧹 Nitpick comments (4)
design-proposals/ephemeral-vm-sessions/README.md (4)
78-78: 🧹 Nitpick | 🔵 Trivial: Define "A↔B" on first use.
The notation "A↔B" appears without explicit definition. While context makes it clear, consider adding "(session-to-session)" or "(inter-session)" on first use for clarity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@design-proposals/ephemeral-vm-sessions/README.md` at line 78, The README text uses “A↔B” without defining it on first use, so update the wording in the ephemeral VM sessions proposal to spell out what it means the first time it appears, using a clarifying parenthetical like “session-to-session” or “inter-session” in the sentence about per-session isolation. Keep the existing meaning, but make the reference explicit where the A↔B notation is introduced.
91-93: 🧹 Nitpick | 🔵 Trivial: Expand testing section to cover persistence and latency benchmarks.
The testing plan is solid for the core loop. Consider adding:
- Validation of the persistent workspace disk re-attachment and state survival across sessions.
- Latency benchmarks for cold clone+boot versus warm pool (if pursued), to inform the spin-up latency open question.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@design-proposals/ephemeral-vm-sessions/README.md` around lines 91 - 93, Expand the Testing section to explicitly cover two missing areas: persistent workspace disk re-attachment/state survival and latency benchmarking for cold clone+boot versus any warm-pool path. Update the testing plan in the README’s testing section so it names the re-attach/resume flow and includes a measurable benchmark step for spin-up time, while keeping the existing VMSession clone/start/teardown/GC and policy coverage references.
74-82: 🔒 Security & Privacy | 🔵 Trivial: Add disk-inherited secrets and secure-deletion considerations to security section.
The "clone as-is" model means sessions inherit all data from the master disk, including any baked-in secrets, credentials, or sensitive configuration. The design should explicitly address:
- Master image hygiene: Recommend that masters be built without secrets, with secrets injected at runtime via cloud-init (already mentioned) rather than baked into the disk image.
- Ephemeral disk cleanup: When a session is reclaimed, are blocks securely wiped or merely dereferenced? With CSI copy-on-write clones, unreferenced blocks from the session's writes may remain on physical storage. If multi-tenant isolation is required at the storage layer, clarify whether storage encryption or secure-erasure on teardown is needed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@design-proposals/ephemeral-vm-sessions/README.md` around lines 74 - 82, Update the Security section in the README to explicitly cover disk-inherited secrets and teardown hygiene: in the existing security discussion for the clone/session model, add guidance that master images must be built without baked-in secrets and that sensitive material should be injected at runtime via the existing cloud-init flow rather than stored in the disk image. Also add a note under the ephemeral VM/session lifecycle that CSI copy-on-write clones may leave session-written blocks on storage after reclamation, so the design should state whether storage encryption, secure wipe, or other secure-erasure controls are required for the session teardown path.
101-107: 🧹 Nitpick | 🔵 Trivial: Add open questions for session lifetime, concurrency limits, and observability.
The open questions are thorough on shape and isolation. Consider adding:
- Session lifetime / TTL: Is there a built-in timeout or maximum session duration, or do sessions persist until explicitly deleted? This affects resource exhaustion and abuse prevention.
- Concurrency limits: How many concurrent sessions per tenant or per master? Should the `VMSession` controller enforce quotas?
- Observability: How are session lifecycle events (create, start, teardown, failure) exposed: Kubernetes Events, metrics, structured logs? What alerts are needed for stuck sessions or failed clones?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@design-proposals/ephemeral-vm-sessions/README.md` around lines 101 - 107, The “Open questions” section in the design proposal is missing key operational concerns around session lifetime, concurrency, and observability. Update the README’s open questions near the existing `VMSession` / `VMTemplate` discussion to add explicit questions about TTL or maximum session duration, per-tenant or per-master concurrency quotas and whether the `VMSession` controller should enforce them, and how lifecycle events are surfaced through Kubernetes Events, metrics, structured logs, and alerts for stuck or failed sessions.
…twork isolation Add a declarative spec.state (Running/Paused/Stopped) reconciled via KubeVirt runStrategy and the pause/unpause subresource, with an honest note on the warm-resume gap (no suspend-to-disk-and-free on stable KubeVirt). Add spec.networkIsolation realised through the merged SecurityGroup (sdn.cozystack.io/v1alpha1), noting that hard deny/A<->B enforcement depends on the planned default-deny tenant baseline. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
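As an illustration of how a declarative `spec.state` could be reconciled onto KubeVirt's `runStrategy` plus the pause/unpause subresource, here is a minimal Python sketch. The commit names the three states but not the exact target values, so the mapping below (`Running` and `Paused` to `Always`, `Stopped` to `Halted`) is an assumption, and `Target`, `target_for`, and `plan` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Target:
    run_strategy: str  # value for the KubeVirt VirtualMachine spec.runStrategy
    paused: bool       # whether the running instance should be paused


# Assumed mapping; not taken from the proposal text.
_TARGETS = {
    "Running": Target("Always", False),
    "Paused": Target("Always", True),
    "Stopped": Target("Halted", False),
}


def target_for(state: str) -> Target:
    try:
        return _TARGETS[state]
    except KeyError:
        raise ValueError(f"unsupported spec.state: {state!r}") from None


def plan(state: str, current_strategy: str, currently_paused: bool) -> List[str]:
    """Return the ordered actions needed to converge on the desired state."""
    want = target_for(state)
    actions: List[str] = []
    if current_strategy != want.run_strategy:
        actions.append(f"set runStrategy={want.run_strategy}")
    if want.paused and not currently_paused:
        actions.append("pause")
    elif not want.paused and currently_paused and want.run_strategy == "Always":
        # Stopping a paused VM needs no unpause in this sketch.
        actions.append("unpause")
    return actions
```

A real controller would also have to wait for the instance to be running before issuing `pause`; the sketch only orders the actions. Note that `Paused` keeps RAM held, which is the warm-resume gap the commit message acknowledges.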
…hart Drop the SecurityGroup dependency. Bake a per-session CiliumNetworkPolicy with deny rules into the VMSession chart instead: Cilium deny takes precedence over allow, so deny-private / deny-A<->B enforces regardless of the blanket-allow tenant baseline, with no dependency on a default-deny baseline. Precedent: tenant and cilium-networkpolicy charts. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
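The argument in this commit rests on one property: a Cilium deny rule wins over any allow rule, so a per-session deny still applies under a blanket-allow tenant baseline. The Python sketch below is a toy model of that precedence only, not Cilium's policy engine. Treating "deny-private" as the RFC 1918 ranges is an assumption, and `egress_allowed` is a hypothetical helper.

```python
import ipaddress
from typing import Iterable

# Assumption: "deny-private" targets the RFC 1918 ranges.
PRIVATE = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]


def _matches(dst: str, cidrs: Iterable[str]) -> bool:
    addr = ipaddress.ip_address(dst)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def egress_allowed(dst: str, allow_cidrs: Iterable[str], deny_cidrs: Iterable[str]) -> bool:
    """Toy verdict: deny is checked first, so it overrides any matching allow."""
    if _matches(dst, deny_cidrs):
        return False
    # Traffic matching no allow rule is dropped in this toy model.
    return _matches(dst, allow_cidrs)


# Blanket-allow baseline combined with a per-session deny on private ranges.
baseline_allow = ["0.0.0.0/0"]
```

Under this model the baseline does not need to become default-deny for the session-level deny to take effect, which is the dependency the commit removes.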
Great base, thanks Andrei Kvapil (@kvaps). The VM boundary is clearly the right call for untrusted code. I wanted to share feedback from a customer use case (agentic coding sandboxes, E2B/Daytona-class) and get your thoughts, because I'm not sure the full-VM approach fits it perfectly.

My main worry is that a full KubeVirt VM might be too heavy for this pattern: the agent spins up many short-lived sandboxes, so cold clone+boot in "seconds" adds up, and cloning a master VM disk means maintaining VM images rather than starting from the OCI/container images the workflow already produces. What we'd really want is container-speed startup with an OCI image as the rootfs, but without losing the VM-grade isolation you're building here.

We looked at Kata Containers, which seems to hit that middle ground: the OCI image runs directly in a lightweight microVM (still a real KVM boundary, no nested-virt needed on bare metal), so faster start and higher density than a full VM. But I might be missing tradeoffs. What do you think: does something like a Kata-backed runtime make sense next to KubeVirt for the ephemeral case, or do you see it differently? Are there other options in this space you'd lean toward?

The other thing I couldn't figure out is resume: "park a workspace and come back to it later where it left off". As far as I can tell there's no turnkey warm-resume in either KubeVirt or Kata: both give pause (RAM held) or stop (cold boot), not scale-to-zero-with-state. The one difference I noticed is that Kata with a Firecracker/Cloud-Hypervisor VMM exposes the memory-snapshot primitive (the same one E2B builds resume on), so it feels like a better foundation to get there eventually, but I'm not sure how you're thinking about that side. Out of scope for v1, or something you'd want to design for?

Happy to prototype the Kata path on our lab and share real startup/density numbers if that'd be useful.
Thanks Mattia, this is exactly the right pushback. Here's my thinking.

On adding Kata as a second runtime: I don't think it gives us anything we don't already have. Kata's value is really two things: VM-grade isolation, and transparent compatibility with Kubernetes primitives (you run an ordinary Pod / OCI image and it's silently backed by a microVM via a RuntimeClass, no rewrite). That second part is Kata's main selling point, but it's not something we need here: we're offering a purpose-built VM session, not trying to make arbitrary Pods secretly VM-isolated behind the CRI. And the first, VM-grade isolation, KubeVirt already gives at the same level, because it's the same KVM boundary (Kata's VMMs are QEMU, Cloud Hypervisor, Firecracker, Dragonball, all KVM; QEMU is the very one KubeVirt runs under the hood). So Kata would be a parallel runtime to operate on our nodes for isolation we already have plus a compatibility feature we don't need, so we'd rather solve the real pains (startup, image maintenance) inside the KVM stack we already run.

On startup speed: the legitimate concern. We'd tackle it two ways instead of switching runtime: (1) a pre-warmed pool of ready

On resume: the primitive already exists on our stack: QEMU/libvirt

Lighter options, for completeness:
- LXC inside the Cozystack VMs / tenant clusters: denser and lighter, but back to a shared guest kernel between sandboxes, so weaker A↔B isolation; only where that trade-off is fine.
- gVisor: genuinely appealing as a lightweight tier, but its syscall interception is incomplete and can break workloads (arbitrary agent-generated code is exactly what trips over missing syscalls); worth keeping as an optional tier with that caveat.

Your offer to prototype Kata on your lab is very welcome regardless. Real startup/density numbers would be genuinely useful to sanity-check the pre-warmed-pool + Alpine path against. If they show a big gap, we'll happily reconsider.
Two more things:

On "maintaining VM images vs the OCI images your workflow already produces": KubeVirt can pull disk images straight from a container registry via

On resume: you're right that there's no turnkey warm-resume in KubeVirt today, and I dug into why. The naive route (libvirt

I think we can contribute though
Design proposal (RFC, Status: Draft) for VMSession — an on-demand, short-lived isolated VM created by cloning an existing master VM as-is, starting it, and reclaiming it on teardown.
Built on existing Cozystack primitives (`vm-disk` `source.disk` clone, `vm-instance`, tenant-level Cilium isolation); the only new piece is a thin session resource. Covers the isolation boundary (VM/KVM on bare-metal), reachability, optional persistence, and security (tenant-scoped policy + the per-session A↔B caveat).

Includes an alternative `VMTemplate` + `templateRef` shape and open questions on entity shape, placement / A↔B granularity, cross-tenancy, spin-up latency, and cloning a running vs stopped source.

Feedback welcome.