docs(design-proposals): add ephemeral VM sessions (VMSession) proposal#28

Open
Andrei Kvapil (kvaps) wants to merge 3 commits into main from design/ephemeral-vm-sessions

@kvaps Andrei Kvapil (kvaps) commented Jun 30, 2026

Member

Design proposal (RFC, Status: Draft) for VMSession — an on-demand, short-lived isolated VM created by cloning an existing master VM as-is, starting it, and reclaiming it on teardown.

Built on existing Cozystack primitives (vm-disk source.disk clone, vm-instance, tenant-level Cilium isolation); the only new piece is a thin session resource. Covers the isolation boundary (VM/KVM on bare-metal), reachability, optional persistence, and security (tenant-scoped policy + the per-session A↔B caveat).

Includes an alternative VMTemplate + templateRef shape and open questions on entity shape, placement / A↔B granularity, cross-tenancy, spin-up latency, and cloning a running vs stopped source.

Feedback welcome.

Summary by CodeRabbit

  • Documentation
    • Added a new design proposal for ephemeral VM sessions, outlining the user experience, isolation model, access methods, optional persistence, rollout approach, and key edge cases.

Propose VMSession, an on-demand short-lived isolated VM created by cloning
an existing master VM as-is, starting it, and reclaiming it on teardown.
Built on existing primitives (vm-disk source.disk clone, vm-instance,
tenant-level Cilium isolation). Includes an alternative VMTemplate +
templateRef shape and open questions on placement, A<->B isolation, and
cross-tenancy.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 30, 2026


Review details
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff81e45c-f970-4df5-a649-4b0958508a23

📥 Commits

Reviewing files that changed from the base of the PR and between cbef8f8 and d78105c.

📒 Files selected for processing (1)
  • design-proposals/ephemeral-vm-sessions/README.md
📝 Walkthrough

Adds a new design proposal document (design-proposals/ephemeral-vm-sessions/README.md) defining the VMSession concept: ephemeral KVM microVM sessions mapped to existing vm-disk and vm-instance primitives, with sections on reachability, persistence, security, failure cases, testing, rollout, open questions, and alternatives.

Changes

Ephemeral VM Sessions Design Proposal

| Layer / File(s) | Summary |
| --- | --- |
| VMSession design proposal `design-proposals/ephemeral-vm-sessions/README.md` | Full proposal document covering isolation boundary (KVM microVM), primitive mapping (vm-disk fast-clone, vm-instance boot/teardown GC), reachability via VM Service/SSH/cloud-init, optional external workspace disk persistence, security model, failure/edge cases, testing approach, rollout plan, open questions, and alternatives. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the new design proposal for ephemeral VM sessions and matches the main change. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a design proposal for Ephemeral VM sessions (VMSession) in Cozystack, which provides on-demand, short-lived isolated environments by cloning existing VMs. The review feedback highlights two key areas for improvement: first, routing session traffic through a tenant-level proxy or ingress gateway instead of exposing each VM directly to prevent external IP and port exhaustion; second, ensuring the controller explicitly handles volume detachment locks and timeouts to avoid stuck ReadWriteOnce (RWO) mounts during rapid session handovers.


### Reachability

The session VM is reached through the VM's own Service: `vm-instance` exposes ports with `external: true` and `externalPorts` (e.g. `[22]` for SSH, with public keys injected via cloud-init `nocloud`). A browser web-terminal is just another exposed port.


medium

Exposing each ephemeral session VM using external: true (which provisions a LoadBalancer or external port) can quickly exhaust the cluster's external IP pool and NodePort range, especially under high-frequency churn (e.g., AI coding agents or interactive playgrounds). Consider routing traffic through a shared ingress/gateway or a proxy pod within the tenant namespace that multiplexes connections to the active session VMs internally.

Suggested change
- The session VM is reached through the VM's own Service: `vm-instance` exposes ports with `external: true` and `externalPorts` (e.g. `[22]` for SSH, with public keys injected via cloud-init `nocloud`). A browser web-terminal is just another exposed port.
+ The session VM is reached through the VM's own Service. To prevent external IP and port exhaustion from high-frequency ephemeral sessions, instead of exposing each VM directly with `external: true`, traffic should be routed through a tenant-level proxy or ingress gateway that multiplexes connections to the active session VMs internally.
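
The multiplexing idea in this suggestion can be illustrated with a minimal sketch. It assumes a single shared gateway that keeps an in-memory map from session ID to an in-cluster address; `SessionRouter`, its methods, and the addresses are hypothetical names for illustration only, and neither the proposal nor the review defines such an API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SessionRouter:
    """Hypothetical routing table for one shared, tenant-level entry point."""

    backends: dict = field(default_factory=dict)

    def register(self, session_id: str, host: str, port: int) -> None:
        # Called when a session VM becomes ready; no external IP is allocated.
        self.backends[session_id] = (host, port)

    def unregister(self, session_id: str) -> None:
        # Called on teardown so reclaimed sessions stop resolving.
        self.backends.pop(session_id, None)

    def resolve(self, session_id: str) -> Optional[tuple]:
        # The shared gateway looks up the internal address per connection.
        return self.backends.get(session_id)


router = SessionRouter()
router.register("sess-a", "10.0.0.11", 22)  # illustrative in-cluster addresses
router.register("sess-b", "10.0.0.12", 22)
router.unregister("sess-a")
```

With this shape, session churn only mutates the routing table; the number of externally exposed endpoints stays constant regardless of how many sessions exist.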


### Persistence (optional)

"Clone as-is" is stateless by default — the clone dies with the session. If a workspace must persist across sessions, its files live on a separate disk that outlives the clone and is re-attached to the next session's clone. This gives resume semantics — the project is where you left it — without a snapshot resource.


medium

When re-attaching a persistent disk (typically ReadWriteOnce) across sessions, Kubernetes volume detachment can be slow or get stuck if the previous session VM is terminated uncleanly. This can block the spin-up of the new session. The proposal should address how the controller handles volume attachment locks and ensures safe, rapid detachment before the new session starts.
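
One way to picture the handover concern is a bounded wait before the next session attaches the workspace disk. The sketch below is an assumption-laden illustration: `wait_for_detach` and the `is_attached` probe are hypothetical, and how a real controller would query attachment state is not specified in the proposal.

```python
import time
from typing import Callable


def wait_for_detach(
    is_attached: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the volume is detached, False if the timeout expires."""
    deadline = now() + timeout_s
    while is_attached():
        if now() >= deadline:
            return False  # caller decides: surface an error, retry, or escalate
        sleep(poll_s)
    return True


class FakeClock:
    """Deterministic clock so the example runs without real sleeping."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


# Case 1: the previous session releases the disk after two polls.
probe = iter([True, True, False])
clock = FakeClock()
released = wait_for_detach(lambda: next(probe), 2.0, sleep=clock.sleep, now=clock.now)

# Case 2: the disk never detaches, so the wait gives up at the deadline.
stuck_clock = FakeClock()
stuck = wait_for_detach(lambda: True, 2.0, sleep=stuck_clock.sleep, now=stuck_clock.now)
```

Returning a boolean rather than raising keeps the policy decision (fail the new session, force-detach, or keep waiting) in the caller, which is where the proposal would need to take a position.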

Suggested change
- "Clone as-is" is stateless by default — the clone dies with the session. If a workspace must persist across sessions, its files live on a separate disk that outlives the clone and is re-attached to the next session's clone. This gives resume semantics — the project is where you left it — without a snapshot resource.
+ "Clone as-is" is stateless by default — the clone dies with the session. If a workspace must persist across sessions, its files live on a separate disk that outlives the clone and is re-attached to the next session's clone. This gives resume semantics — the project is where you left it — without a snapshot resource. Note that the controller must explicitly handle volume detachment locks and timeouts to prevent stuck ReadWriteOnce (RWO) mounts during rapid session handovers.

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (4)
design-proposals/ephemeral-vm-sessions/README.md (4)

78-78: 🧹 Nitpick | 🔵 Trivial

Define "A↔B" on first use.

The notation "A↔B" appears without explicit definition. While context makes it clear, consider adding "(session-to-session)" or "(inter-session)" on first use for clarity.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@design-proposals/ephemeral-vm-sessions/README.md` at line 78, The README text
uses “A↔B” without defining it on first use, so update the wording in the
ephemeral VM sessions proposal to spell out what it means the first time it
appears, using a clarifying parenthetical like “session-to-session” or
“inter-session” in the sentence about per-session isolation. Keep the existing
meaning, but make the reference explicit where the A↔B notation is introduced.

91-93: 🧹 Nitpick | 🔵 Trivial

Expand testing section to cover persistence and latency benchmarks.

The testing plan is solid for the core loop. Consider adding:

  • Validation of the persistent workspace disk re-attachment and state survival across sessions.
  • Latency benchmarks for cold clone+boot versus warm pool (if pursued), to inform the spin-up latency open question.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@design-proposals/ephemeral-vm-sessions/README.md` around lines 91 - 93,
Expand the Testing section to explicitly cover two missing areas: persistent
workspace disk re-attachment/state survival and latency benchmarking for cold
clone+boot versus any warm-pool path. Update the testing plan in the README’s
testing section so it names the re-attach/resume flow and includes a measurable
benchmark step for spin-up time, while keeping the existing VMSession
clone/start/teardown/GC and policy coverage references.

74-82: 🔒 Security & Privacy | 🔵 Trivial

Add disk-inherited secrets and secure-deletion considerations to security section.

The "clone as-is" model means sessions inherit all data from the master disk, including any baked-in secrets, credentials, or sensitive configuration. The design should explicitly address:

  1. Master image hygiene: Recommend that masters be built without secrets, with secrets injected at runtime via cloud-init (already mentioned) rather than baked into the disk image.

  2. Ephemeral disk cleanup: When a session is reclaimed, are blocks securely wiped or merely dereferenced? With CSI copy-on-write clones, unreferenced blocks from the session's writes may remain on physical storage. If multi-tenant isolation is required at the storage layer, clarify whether storage encryption or secure-erasure on teardown is needed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@design-proposals/ephemeral-vm-sessions/README.md` around lines 74 - 82,
Update the Security section in the README to explicitly cover disk-inherited
secrets and teardown hygiene: in the existing security discussion for the
clone/session model, add guidance that master images must be built without
baked-in secrets and that sensitive material should be injected at runtime via
the existing cloud-init flow rather than stored in the disk image. Also add a
note under the ephemeral VM/session lifecycle that CSI copy-on-write clones may
leave session-written blocks on storage after reclamation, so the design should
state whether storage encryption, secure wipe, or other secure-erasure controls
are required for the session teardown path.

101-107: 🧹 Nitpick | 🔵 Trivial

Add open questions for session lifetime, concurrency limits, and observability.

The open questions are thorough on shape and isolation. Consider adding:

  1. Session lifetime / TTL: Is there a built-in timeout or maximum session duration, or do sessions persist until explicitly deleted? This affects resource exhaustion and abuse prevention.

  2. Concurrency limits: How many concurrent sessions per tenant or per master? Should the VMSession controller enforce quotas?

  3. Observability: How are session lifecycle events (create, start, teardown, failure) exposed — Kubernetes Events, metrics, structured logs? What alerts are needed for stuck sessions or failed clones?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@design-proposals/ephemeral-vm-sessions/README.md` around lines 101 - 107, The
“Open questions” section in the design proposal is missing key operational
concerns around session lifetime, concurrency, and observability. Update the
README’s open questions near the existing `VMSession` / `VMTemplate` discussion
to add explicit questions about TTL or maximum session duration, per-tenant or
per-master concurrency quotas and whether the `VMSession` controller should
enforce them, and how lifecycle events are surfaced through Kubernetes Events,
metrics, structured logs, and alerts for stuck or failed sessions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ce54e2a6-56c6-4e52-9426-353f3bb5e6db

📥 Commits

Reviewing files that changed from the base of the PR and between 62ecd0b and cbef8f8.

📒 Files selected for processing (1)
  • design-proposals/ephemeral-vm-sessions/README.md

Andrei Kvapil (kvaps) and others added 2 commits June 30, 2026 17:38
…twork isolation

Add a declarative spec.state (Running/Paused/Stopped) reconciled via
KubeVirt runStrategy and the pause/unpause subresource, with an honest
note on the warm-resume gap (no suspend-to-disk-and-free on stable
KubeVirt). Add spec.networkIsolation realised through the merged
SecurityGroup (sdn.cozystack.io/v1alpha1), noting that hard deny/A<->B
enforcement depends on the planned default-deny tenant baseline.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
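
As a rough illustration of the declarative `spec.state` described in this commit, the sketch below maps each desired state to a KubeVirt `runStrategy` value plus a paused flag and derives the steps needed to converge. The mapping itself (Running and Paused to `Always`, Stopped to `Halted`) is an assumption made for the example, and `SessionState`, `Target`, and `plan` are hypothetical names; the commit message does not spell out the actual reconciliation logic.

```python
from enum import Enum
from typing import List, NamedTuple


class SessionState(str, Enum):
    RUNNING = "Running"
    PAUSED = "Paused"
    STOPPED = "Stopped"


class Target(NamedTuple):
    run_strategy: str  # KubeVirt runStrategy value
    paused: bool       # whether the running instance should be paused


def target_for(state: SessionState) -> Target:
    # Assumed mapping; the proposal may choose different strategies.
    if state is SessionState.STOPPED:
        return Target("Halted", False)
    return Target("Always", state is SessionState.PAUSED)


def plan(state: SessionState, current: Target) -> List[str]:
    """List the steps that move the current state toward the desired one."""
    want = target_for(state)
    steps = []
    if current.run_strategy != want.run_strategy:
        steps.append(f"set runStrategy={want.run_strategy}")
    # Pause and unpause only make sense while the VM is meant to be running.
    if want.run_strategy == "Always" and current.paused != want.paused:
        steps.append("pause" if want.paused else "unpause")
    return steps


pause_steps = plan(SessionState.PAUSED, Target("Always", False))
stop_steps = plan(SessionState.STOPPED, Target("Always", True))
start_steps = plan(SessionState.RUNNING, Target("Halted", False))
```

Note that Paused still maps to a running strategy here, which matches the warm-resume gap the commit message mentions: guest RAM stays allocated on the host while paused.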
…hart

Drop the SecurityGroup dependency. Bake a per-session CiliumNetworkPolicy
with deny rules into the VMSession chart instead: Cilium deny takes
precedence over allow, so deny-private / deny-A<->B enforces regardless
of the blanket-allow tenant baseline, with no dependency on a
default-deny baseline. Precedent: tenant and cilium-networkpolicy charts.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
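
The precedence argument in this commit (a deny rule wins over a blanket allow) can be shown with a toy evaluator. This is a simplified model written for illustration, not Cilium's implementation or its policy schema; `Rule` and `verdict` are hypothetical names, and the CIDRs are examples.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List


@dataclass(frozen=True)
class Rule:
    action: str  # "allow" or "deny"
    cidr: str


def verdict(rules: List[Rule], dst: str) -> str:
    """Evaluate a destination against rules where any matching deny wins."""
    addr = ip_address(dst)
    matches = [r for r in rules if addr in ip_network(r.cidr)]
    if any(r.action == "deny" for r in matches):
        return "deny"
    if matches:
        return "allow"
    # What happens to unmatched traffic depends on the enforcement mode,
    # which this toy model does not cover.
    return "unmatched"


rules = [
    Rule("allow", "0.0.0.0/0"),  # stands in for the blanket-allow tenant baseline
    Rule("deny", "10.0.0.0/8"),  # stands in for a per-session deny-private rule
]
private = verdict(rules, "10.1.2.3")
public = verdict(rules, "8.8.8.8")
```

Because the deny is checked before any allow, the per-session rule takes effect without changing the baseline, which is the property the commit relies on.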
@mattia-eleuteri


Great base, thanks Andrei Kvapil (@kvaps) — the VM boundary is clearly the right call for untrusted code. I wanted to share feedback from a customer use case (agentic coding sandboxes, E2B/Daytona-class) and get your thoughts, because I'm not sure the full-VM approach fits it perfectly.

My main worry is that a full KubeVirt VM might be too heavy for this pattern: the agent spins up many short-lived sandboxes, so cold clone+boot in "seconds" adds up, and cloning a master VM disk means maintaining VM images rather than starting from the OCI/container images the workflow already produces. What we'd really want is container-speed startup with an OCI image as the rootfs, but without losing the VM-grade isolation you're building here.

We looked at Kata Containers, which seems to hit that middle ground — OCI image runs directly in a lightweight microVM (still a real KVM boundary, no nested-virt needed on bare metal), so faster start and higher density than a full VM. But I might be missing tradeoffs. What do you think — does something like a Kata-backed runtime make sense next to KubeVirt for the ephemeral case, or do you see it differently? Are there other options in this space you'd lean toward?

The other thing I couldn't figure out is resume — "park a workspace and come back to it later where it left off". As far as I can tell there's no turnkey warm-resume in either KubeVirt or Kata: both give pause (RAM held) or stop (cold boot), not scale-to-zero-with-state. The one difference I noticed is that Kata with a Firecracker/Cloud-Hypervisor VMM exposes the memory-snapshot primitive (the same one E2B builds resume on), so it feels like a better foundation to get there eventually — but I'm not sure how you're thinking about that side. Out of scope for v1, or something you'd want to design for?

Happy to prototype the Kata path in our lab and share real startup/density numbers if that'd be useful.

@kvaps

Member Author

Thanks Mattia, this is exactly the right pushback — here's my thinking.

On adding Kata as a second runtime: I don't think it gives us anything we don't already have. Kata's value is really two things: VM-grade isolation, and transparent compatibility with Kubernetes primitives (you run an ordinary Pod / OCI image and it's silently backed by a microVM via a RuntimeClass, no rewrite). That second part is Kata's main selling point, but it's not something we need here: we're offering a purpose-built VM session, not trying to make arbitrary Pods secretly VM-isolated behind the CRI. As for the first, KubeVirt already gives VM-grade isolation at the same level, because it's the same KVM boundary (Kata's VMMs are QEMU, Cloud Hypervisor, Firecracker, Dragonball, all KVM; QEMU is the very one KubeVirt runs under the hood). Kata would therefore be a parallel runtime to operate on our nodes, for isolation we already have plus a compatibility feature we don't need. We'd rather solve the real pains (startup, image maintenance) inside the KVM stack we already run.

On startup speed — the legitimate concern. We'd tackle it two ways instead of switching runtime: (1) a pre-warmed pool of ready VMSessions, so a request gets an already-booted VM instead of paying clone+boot; (2) a minimal base image (e.g. Alpine) so clone+boot is small to begin with. Pre-warmed + minimal image gets close to container-speed without a second runtime. On the OCI point — we can look at building the master from your existing image so you don't maintain a separate VM image; the runtime stays KubeVirt.
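
A minimal sketch of the pre-warmed pool idea, assuming a fixed target size and a `boot` callable that stands in for the whole clone+boot path. `WarmPool` and the session names are hypothetical; nothing here reflects an existing VMSession API.

```python
import itertools
from collections import deque
from typing import Callable, Tuple


class WarmPool:
    """Keeps up to `target_size` already-booted sessions ready to hand out."""

    def __init__(self, target_size: int, boot: Callable[[], str]) -> None:
        self._boot = boot
        self._target = target_size
        self._ready = deque()
        self.refill()

    def refill(self) -> None:
        # In a real controller this would run in the background.
        while len(self._ready) < self._target:
            self._ready.append(self._boot())

    def acquire(self) -> Tuple[str, bool]:
        """Return (session_name, was_warm)."""
        if self._ready:
            return self._ready.popleft(), True
        # Pool exhausted: fall back to paying the cold clone+boot cost.
        return self._boot(), False


counter = itertools.count()
pool = WarmPool(2, lambda: f"vmsession-{next(counter)}")
first = pool.acquire()
second = pool.acquire()
third = pool.acquire()  # no refill happened in between, so this one is cold
```

The trade-off is idle capacity: every warm entry holds CPU and RAM while waiting, so the pool size has to be tuned against request rate.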

On resume — the primitive already exists on our stack: QEMU/libvirt virsh save / managedsave writes guest RAM+device state to disk, frees host CPU/RAM, and start restores it where it left off (literally "migrate to disk"). That's scale-to-zero-with-state on the same qemu-KVM KubeVirt uses; the gap is only that KubeVirt doesn't surface it yet — a controller/API problem, not a runtime one. Firecracker's snapshot (what E2B builds on) is more optimized for high-frequency restore, so if resume latency becomes the bottleneck we'll revisit — but it's not a reason to adopt a second runtime up front.

Lighter options, for completeness: LXC inside the Cozystack VMs / tenant clusters — denser and lighter, but back to a shared guest kernel between sandboxes, so weaker A↔B isolation; only where that trade-off is fine. gVisor — genuinely appealing as a lightweight tier, but its syscall interception is incomplete and can break workloads (arbitrary agent-generated code is exactly what trips over missing syscalls); worth keeping as an optional tier with that caveat.

Your offer to prototype Kata in your lab is very welcome regardless: real startup/density numbers would be genuinely useful to sanity-check the pre-warmed-pool + Alpine path against. If they show a big gap, we'll happily reconsider.

@kvaps

Andrei Kvapil (kvaps) commented Jul 1, 2026

Member Author

Two more things:

On "maintaining VM images vs the OCI images your workflow already produces" — KubeVirt can pull disk images straight from a container registry via containerDisk: you package the VM disk (qcow2/raw, under /disk) inside a container image, push it to your registry, and KubeVirt serves it as an ephemeral disk (multiple VMs can share one). It's not a Docker container image — it's a VM disk wrapped in a container image — but it lives in the same registry and rides an OCI-artifact workflow, so the master/base can come from the pipeline you already have rather than a separate VM-image store.
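
For reference, the containerDisk wiring in a plain KubeVirt VM spec looks roughly like the fragment built below. The helper name and the image reference are hypothetical, and whether Cozystack's `vm-instance` chart exposes these fields is not something this thread states.

```python
def container_disk_fragment(image: str, name: str = "rootdisk") -> dict:
    """Build the disk and volume entries that reference a containerDisk image."""
    return {
        "domain": {
            "devices": {
                # The disk entry and the volume entry are linked by name.
                "disks": [{"name": name, "disk": {"bus": "virtio"}}],
            }
        },
        "volumes": [
            # The image is a VM disk wrapped in a container image,
            # pulled from an ordinary registry.
            {"name": name, "containerDisk": {"image": image}},
        ],
    }


# Hypothetical registry path, used only for illustration.
fragment = container_disk_fragment("registry.example.com/sandboxes/base:latest")
```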

On resume — you're right that there's no turnkey warm-resume in KubeVirt today, and I dug into why. The naive route (libvirt virsh snapshot-create-as --memspec) can't include memory state on raw disks, which is what KubeVirt uses for block-mode PVCs — so that path is a dead end. But the memory-to-file primitive itself (QEMU migrate-to-file / managedsave) is disk-format-agnostic and lives in the same qemu-KVM stack, and KubeVirt is actively working on surfacing native memory snapshots / hibernation — see the open tracking issue kubevirt/kubevirt#18054 (updated this month), which proposes exactly a QEMU-migration-to-PVC approach. So warm-resume is a surfacing effort upstream on the runtime we already run, rather than a reason to switch runtimes.

I think we can contribute to that upstream work, though.
