feat: kubernetes operator for agent lifecycle management #2139

Open
alexm-redhat wants to merge 1 commit into NVIDIA:main from alexm-redhat:feat/1719-kubernetes-operator
@alexm-redhat

Summary

Adds a kube-rs-based Kubernetes operator (openshell-operator crate) that provides CRD-driven declarative sandbox lifecycle management, porting the operator pattern from Kagenti.

New crate: openshell-operator (16 files)

  • AgentSandbox CRD with spec for image, resources, policy, and provider refs
  • Reconciler loop with exponential backoff and status condition reporting
  • Admission webhooks (validating + mutating) for CRD validation
  • Manifest builders for sandbox Pod, Service, and RBAC resources
  • Label conventions (openshell.nvidia.com/managed-by, openshell.nvidia.com/agent-name) for discovery and ownership
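
For reviewers, the label convention and the reconciler's retry schedule can be sketched roughly as below. This is a std-only illustration, not the code in this PR: the two label keys come from the list above, while `MANAGER_NAME`, `sandbox_labels`, `backoff_delay`, and the base/max delay parameters are hypothetical names chosen for the example.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

// Label keys as listed in the PR summary.
const MANAGED_BY_LABEL: &str = "openshell.nvidia.com/managed-by";
const AGENT_NAME_LABEL: &str = "openshell.nvidia.com/agent-name";

// Hypothetical label value: the summary does not state what the operator writes here.
const MANAGER_NAME: &str = "openshell-operator";

/// Builds the label set attached to every resource created for one agent,
/// so the resources can be discovered and attributed to their owner.
fn sandbox_labels(agent_name: &str) -> BTreeMap<String, String> {
    let mut labels = BTreeMap::new();
    labels.insert(MANAGED_BY_LABEL.to_string(), MANAGER_NAME.to_string());
    labels.insert(AGENT_NAME_LABEL.to_string(), agent_name.to_string());
    labels
}

/// Exponential backoff: the delay doubles with each failed attempt
/// (base * 2^attempt) and is capped at `max`. Overflow saturates to `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```

With a 1 s base and a 300 s cap (both assumed values), attempts 0, 1, 2, 3 would wait 1 s, 2 s, 4 s, and 8 s, and later attempts settle at the cap. In a kube-rs controller the computed delay would typically be returned from the error policy as a requeue interval.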

Gateway integration:

  • SandboxRuntimeManager gRPC service (proto/sandbox_runtime_manager.proto)
  • Multiplex listener registration in openshell-server
  • CLI --enable-operator flag and config file support

28 files changed, +2,832/-9 lines across openshell-operator, openshell-core, and openshell-server.

Test plan

  • cargo build -p openshell-operator compiles
  • cargo test -p openshell-operator — 69 passed, 0 failed
  • cargo test -p openshell-core — 297 passed, 0 failed
  • cargo test --workspace (excl. z3-dependent crates) — 2,239 passed, 0 failed
  • Manual E2E: deploy operator with AgentSandbox CR, verify pod creation and reconciliation
  • Webhook validation: submit invalid CRs, verify rejection
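
To make the webhook test item concrete, here is a minimal sketch of the kind of check a validating webhook might run before admitting a CR. Everything in it is assumed for illustration: the `AgentSandboxSpec` struct is a simplified stand-in (the real CRD also carries resources and may name its fields differently), and the rules shown (non-empty image, non-empty refs) are hypothetical rather than the rules implemented in this PR.

```rust
/// Simplified, hypothetical stand-in for the AgentSandbox spec.
#[derive(Debug, Clone, Default)]
struct AgentSandboxSpec {
    image: String,
    policy_ref: Option<String>,
    provider_refs: Vec<String>,
}

/// Collects every validation failure instead of stopping at the first,
/// so a rejected CR reports all problems in one admission response.
fn validate_spec(spec: &AgentSandboxSpec) -> Result<(), Vec<String>> {
    let mut errors = Vec::new();

    // Assumed rule: a sandbox cannot be created without an image.
    if spec.image.trim().is_empty() {
        errors.push("spec.image must not be empty".to_string());
    }

    // Assumed rule: a policy reference, when present, must name something.
    if let Some(policy) = &spec.policy_ref {
        if policy.trim().is_empty() {
            errors.push("spec.policyRef must not be empty when set".to_string());
        }
    }

    // Assumed rule: every provider reference must be a non-empty name.
    if spec.provider_refs.iter().any(|r| r.trim().is_empty()) {
        errors.push("spec.providerRefs must not contain empty names".to_string());
    }

    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

An `Err` here would map to a denied admission response carrying the collected messages; an `Ok` would let the CR through to the reconciler.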

Addresses #1719

  Add a kube-rs based Kubernetes operator (openshell-operator crate) that
  provides CRD-driven declarative sandbox lifecycle management.

  Components:
  - AgentSandbox CRD with spec for image, resources, policy, and provider refs
  - Reconciler loop with exponential backoff and status condition reporting
  - Admission webhooks (validating + mutating) for CRD validation
  - Manifest builders for sandbox Pod, Service, and RBAC resources
  - Label conventions for sandbox discovery and ownership tracking
  - SandboxRuntimeManager gRPC service for operator-gateway communication
  - Gateway integration via multiplex listener and config flags

  New crate: crates/openshell-operator (16 files)
  New proto: proto/sandbox_runtime_manager.proto
  Modified: openshell-core (config, proto), openshell-server (cli, grpc, multiplex)

  Tests: 2,239 passed, 0 failed

  Closes NVIDIA#1719
@copy-pr-bot

copy-pr-bot Bot commented Jul 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@alexm-redhat

Author

I have read the DCO document and I hereby sign the DCO.
