feat: chaos-engineering-test-generator skill and sre-incident-responder-agent agent #2207

Open
Gugaarleo wants to merge 1 commit into github:main from Gugaarleo:feat/chaos-engineering-test-generator

Conversation

@Gugaarleo

Pull Request Checklist

  • I have read and followed the CONTRIBUTING.md guidelines.
  • I have read and followed the Guidance for submissions involving paid services.
  • My contribution adds a new instruction, prompt, agent, skill, workflow, or canvas extension file in the correct directory.
  • The file follows the required naming convention.
  • The content is clearly structured and follows the example format.
  • I have tested my instructions, prompt, agent, skill, workflow, or canvas extension with GitHub Copilot.
  • I have run npm start and verified that README.md is up to date.
  • I am targeting the main branch for this pull request.

Description

Adds a Skill + Agent bundle focused on resilience engineering and incident response.

Skill: chaos-engineering-test-generator (skills/chaos-engineering-test-generator/)
Generates ready-to-run chaos engineering experiments (LitmusChaos, Chaos Mesh, Chaos Monkey-style, AWS FIS) to validate system resilience against network latency, pod/instance failure, CPU/memory stress, and dependency outages. Every generated experiment is hypothesis-driven and always includes a defined blast radius, a duration cap, and an abort condition tied to a monitored metric — never a fault injected without a rollback path. Includes three worked examples (pod-delete, network latency injection, CPU stress) and a progressive blast-radius escalation pattern for taking an experiment from a single pod in non-prod to a full production Game Day.

Agent: sre-incident-responder (agents/sre-incident-responder.agent.md)
Acts as a Site Reliability Engineer during both live production incidents and planned resilience exercises. Follows a consistent triage → mitigate → diagnose → communicate → learn flow, always checking for a fast mitigation (rollback, feature flag, failover, scale-out) before deep root-causing. Drafts factual, non-speculative status updates and blameless post-mortems (timeline, contributing factors, action items — no individual blame). When the session is a planned Game Day rather than a live incident, it hands off to the chaos-engineering-test-generator skill to scaffold the experiment (hypothesis, blast radius, abort condition) before running the same response flow against the simulated failure.

Together, the skill and agent form a bundle: the skill designs the fault-injection experiment, and the agent conducts the response to it — covering both the Cloud Monitoring/SRE and Security/Resilience areas of the course.
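
As a rough illustration of the "every experiment carries a hypothesis, blast radius, duration cap, and abort condition" rule described above, here is a minimal Python sketch. The `ChaosExperiment` class and `missing_safety_fields` function are hypothetical names invented for this example; they are not part of the skill or of any chaos tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosExperiment:
    """Hypothetical container for the four fields the description requires."""
    hypothesis: str
    blast_radius: str       # e.g. "1 pod in namespace staging"
    duration_seconds: int   # duration cap
    abort_condition: str    # metric-based stop rule


def missing_safety_fields(exp: ChaosExperiment) -> list[str]:
    """Return the names of required fields that are empty or non-positive."""
    problems = []
    if not exp.hypothesis.strip():
        problems.append("hypothesis")
    if not exp.blast_radius.strip():
        problems.append("blast_radius")
    if exp.duration_seconds <= 0:
        problems.append("duration_seconds")
    if not exp.abort_condition.strip():
        problems.append("abort_condition")
    return problems
```

An empty result means the experiment at least declares all four fields; it says nothing about whether the values are sensible.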


Type of Contribution

  • New instruction file.
  • New prompt file.
  • New agent file.
  • New plugin.
  • New skill file.
  • New agentic workflow.
  • New canvas extension.
  • Update to existing instruction, prompt, agent, plugin, skill, workflow, or canvas extension.
  • Other (please specify):

Additional Notes

  • Skill triggers automatically when a prompt mentions resilience testing, chaos experiments, Game Days, or Kubernetes/cloud fault injection.
  • Agent can be invoked directly via @sre-incident-responder in Copilot Chat.
  • Both were validated against npm run skill:validate and npm run build.

By submitting this pull request, I confirm that my contribution abides by the Code of Conduct and will be licensed under the MIT License.

Gugaarleo requested a review from aaronpowell as a code owner July 3, 2026 14:29
Copilot AI review requested due to automatic review settings July 3, 2026 14:29
github-actions Bot added the agent (PR touches agents), new-submission (PR adds at least one new contribution), and skills (PR touches skills) labels Jul 3, 2026
@github-actions

github-actions Bot commented Jul 3, 2026

Contributor

🔒 PR Risk Scan Results

Scanned 4 changed file(s).

| Severity | Count |
| --- | --- |
| 🔴 High | 0 |
| 🟠 Medium | 2 |
| ℹ️ Info | 0 |

| Severity | Rule | File | Line | Match |
| --- | --- | --- | --- | --- |
| 🟠 | package-exec-command | docs/README.skills.md | 31 | \| [acreadiness-assess](../skills/acreadiness-assess/SKILL.md)<br />`gh skills install github/awesome-copilot acreadiness-assess` \| Run the AgentRC readiness assessment on the curre |
| 🟠 | unpinned-version-indicator | skills/chaos-engineering-test-generator/SKILL.md | 122 | Abort condition: stop immediately if `search_service_p99_latency_ms > 800` for more than 30s (wire to an alert or a manual `kubectl delete stresschaos search-service-cpu-hog`). |

This is an automated soft-gate report. Findings indicate review targets and do not block merge by themselves.
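
The abort condition quoted in the second finding (p99 latency above 800 ms for more than 30s) is a sustained-threshold rule. A minimal Python sketch of how such a rule could be evaluated over timestamped samples follows; the function name `should_abort` and the `(timestamp, value)` sample format are assumptions made for illustration, not something taken from SKILL.md or from a monitoring tool.

```python
def should_abort(samples, threshold_ms=800.0, sustain_seconds=30.0):
    """Decide whether a sustained breach has occurred.

    samples: iterable of (timestamp_seconds, p99_latency_ms) in time order.
    Returns True once the metric has stayed above threshold_ms for more
    than sustain_seconds without dipping back to or below the threshold.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold_ms:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
            if ts - breach_start > sustain_seconds:
                return True
        else:
            breach_start = None  # any recovery resets the timer
    return False
```

With samples every 10 seconds, a breach has to span at least five consecutive samples before this returns True, because the rule says "more than 30s" rather than "at least 30s".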

@github-actions

github-actions Bot commented Jul 3, 2026

Contributor

🔍 Vally Lint Results

⚠️ Warnings or advisories found

| Scope | Checked |
| --- | --- |
| Skills | 1 |
| Agents | 1 |
| Total | 2 |

| Severity | Count |
| --- | --- |
| ❌ Errors | 0 |
| ⚠️ Warnings | 0 |
| ℹ️ Advisories | 1 |

Summary

| Level | Finding |
| --- | --- |
| ℹ️ | Vally currently lints SKILL.md content. Agent files were detected but skipped: agents/sre-incident-responder.agent.md |

Full linter output
### Linting skills/chaos-engineering-test-generator
npm warn EBADENGINE Unsupported engine {
npm warn EBADENGINE   package: 'commander@15.0.0',
npm warn EBADENGINE   required: { node: '>=22.12.0' },
npm warn EBADENGINE   current: { node: 'v20.20.2', npm: '10.8.2' }
npm warn EBADENGINE }
npm warn deprecated prebuild-install@7.1.3: No longer maintained. Please contact the author of the relevant native addon; alternatives are available.
✅ chaos-engineering-test-generator (2/2 checks passed)
    ✓ [spec-compliance] All 1 skill(s) are spec-compliant.
        ✓ spec-compliance: All spec checks passed.
    ✓ [valid-refs] All file references across 1 skill(s) are valid.
        ✓ valid-refs: All file references resolve to existing files within the skill directory.

1 skill(s) linted, 1 passed

### Agent files detected (not linted by vally)
ℹ️ Vally currently lints SKILL.md content. Agent files were detected but skipped:
agents/sre-incident-responder.agent.md

Copilot AI left a comment

Contributor

Pull request overview

Adds a resilience-engineering bundle to the Awesome Copilot collection: a new Skill for generating hypothesis-driven chaos experiments plus a new Agent for guiding incident response and Game Day execution. This extends the repository’s SRE/operations offerings with structured chaos-testing guidance and an incident-response playbook agent.

Changes:

  • Added chaos-engineering-test-generator skill content and examples.
  • Added sre-incident-responder agent definition for incident + Game Day workflows.
  • Updated skills/agents documentation indexes to include the new entries (and reflect generated table formatting).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| skills/chaos-engineering-test-generator/SKILL.md | New skill defining chaos experiment generation workflow, safety guidelines, and examples. |
| agents/sre-incident-responder.agent.md | New agent that guides triage/mitigation/diagnosis/communication/postmortems and can hand off to chaos skill for Game Days. |
| docs/README.skills.md | Adds the new skill to the published skills index. |
| docs/README.agents.md | Adds the new agent to the published agents index (and updates MCP column formatting for some rows). |

Comment on lines +39 to +40
### 3. Blast radius and safety controls
Every generated experiment includes: a scoped selector (namespace/label/percentage of replicas, never "all"), a `duration`, and abort conditions tied to a monitored metric (e.g., abort if error rate > X% or latency p99 > Y ms). The skill defaults to the smallest blast radius that can still test the hypothesis and flags anything wider as "requires explicit approval."
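
One way to picture the scoping rule in the quoted lines is as a small gate that rejects unscoped selectors and flags wide ones. The Python sketch below is only an illustration: `review_blast_radius` is a hypothetical helper, and the 25% approval threshold is an assumed value, since the quoted text does not say where "wider" begins.

```python
def review_blast_radius(namespace: str, labels: dict, percent: int,
                        approval_threshold: int = 25) -> str:
    """Classify a proposed selector.

    namespace / labels: the scope of the selector; both must be non-empty.
    percent: percentage of matching replicas to target (1 to 99).
    approval_threshold: ASSUMED cut-off above which approval is required.
    """
    if not namespace or not labels:
        return "rejected: selector is unscoped"
    if not 0 < percent < 100:
        return "rejected: percentage must be between 1 and 99"
    if percent > approval_threshold:
        return "requires explicit approval"
    return "ok"
```

Treating 100% as a rejection rather than an approval case mirrors the "never all" wording; a real implementation might choose differently.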
Comment on lines +78 to +79
Hypothesis: killing 25% of `checkout-service` pods for 60s should not raise the checkout error rate above 1%, because readiness probes and 3 replicas should absorb the loss.
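
A quick arithmetic check on this hypothesis: 25% of 3 replicas is not a whole number of pods. The Python sketch below shows the exact value and both rounding directions; `pods_affected` is a hypothetical helper, and which direction a given chaos tool rounds is not stated in the quoted text.

```python
import math


def pods_affected(replicas: int, percent: float) -> dict:
    """Return the exact and rounded pod counts for a percentage selector."""
    exact = replicas * percent / 100
    return {
        "exact": exact,
        "floor": math.floor(exact),  # rounding down
        "ceil": math.ceil(exact),    # rounding up
    }


print(pods_affected(3, 25))  # {'exact': 0.75, 'floor': 0, 'ceil': 1}
```

Rounding down would kill no pods and the experiment would test nothing; rounding up would kill 1 of 3 pods, which is about 33% of capacity rather than 25%.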

Comment on lines +102 to +103
Hypothesis: adding 300ms latency to 30% of `payments-api` pods should trigger the client-side circuit breaker within 10s without cascading timeouts upstream.
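
The "within 10s" part of this hypothesis can be checked mechanically once breaker state changes are recorded. The Python sketch below assumes a time-ordered list of `(timestamp_seconds, state)` events with states such as "closed" and "open"; both the event format and the `breaker_opened_within` name are assumptions made for illustration.

```python
def breaker_opened_within(events, fault_start, limit_seconds=10.0):
    """Check whether the breaker opened soon enough after fault injection.

    events: iterable of (timestamp_seconds, state) in time order.
    fault_start: timestamp at which the latency fault was injected.
    Returns True if the first "open" event at or after fault_start
    occurs no later than limit_seconds after it, otherwise False.
    """
    for ts, state in events:
        if ts >= fault_start and state == "open":
            return ts - fault_start <= limit_seconds
    return False  # the breaker never opened
```

This only covers the timing claim; verifying "without cascading timeouts upstream" would need separate upstream metrics.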

2 participants