feat: chaos-engineering-test-generator skill and sre-incident-responder-agent agent #2207
Open
Gugaarleo wants to merge 1 commit into
Contributor
🔒 PR Risk Scan Results
Scanned 4 changed file(s).
Contributor
🔍 Vally Lint Results
Summary
Full linter output
Contributor
Pull request overview
Adds a resilience-engineering bundle to the Awesome Copilot collection: a new Skill for generating hypothesis-driven chaos experiments plus a new Agent for guiding incident response and Game Day execution. This extends the repository’s SRE/operations offerings with structured chaos-testing guidance and an incident-response playbook agent.
Changes:
- Added `chaos-engineering-test-generator` skill content and examples.
- Added `sre-incident-responder` agent definition for incident + Game Day workflows.
- Updated skills/agents documentation indexes to include the new entries (and reflect generated table formatting).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| skills/chaos-engineering-test-generator/SKILL.md | New skill defining chaos experiment generation workflow, safety guidelines, and examples. |
| agents/sre-incident-responder.agent.md | New agent that guides triage/mitigation/diagnosis/communication/postmortems and can hand off to chaos skill for Game Days. |
| docs/README.skills.md | Adds the new skill to the published skills index. |
| docs/README.agents.md | Adds the new agent to the published agents index (and updates MCP column formatting for some rows). |
Comment on lines +39 to +40

> ### 3. Blast radius and safety controls
> Every generated experiment includes: a scoped selector (namespace/label/percentage of replicas, never "all"), a `duration`, and abort conditions tied to a monitored metric (e.g., abort if error rate > X% or latency p99 > Y ms). The skill defaults to the smallest blast radius that can still test the hypothesis and flags anything wider as "requires explicit approval."
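The safety rules quoted above can be sketched as a small pre-flight check. This is a minimal illustration only, not the skill's implementation: `ExperimentSpec`, `AbortCondition`, `safety_violations`, `should_abort`, the `shop` namespace, and the metric name are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AbortCondition:
    metric: str       # hypothetical metric name, e.g. "error_rate_percent"
    threshold: float  # abort when the observed value exceeds this


@dataclass
class ExperimentSpec:
    namespace: str
    label_selector: str
    target_percent: int    # share of matching replicas to affect
    duration_seconds: int
    abort_conditions: list = field(default_factory=list)


def safety_violations(spec):
    """Return the safety rules the spec breaks (empty list means it passes)."""
    problems = []
    if not spec.namespace or not spec.label_selector:
        problems.append("selector must be scoped by namespace and label")
    if not 0 < spec.target_percent < 100:
        problems.append("target must be a partial share of replicas, never all")
    if spec.duration_seconds <= 0:
        problems.append("a positive duration is required")
    if not spec.abort_conditions:
        problems.append("an abort condition tied to a metric is required")
    return problems


def should_abort(spec, observed):
    """True if any monitored metric in `observed` exceeds its threshold."""
    return any(
        observed.get(c.metric, 0.0) > c.threshold for c in spec.abort_conditions
    )


# Assumed example values: 25% of checkout-service pods for 60s, abort above 1% errors.
spec = ExperimentSpec(
    namespace="shop",
    label_selector="app=checkout-service",
    target_percent=25,
    duration_seconds=60,
    abort_conditions=[AbortCondition("error_rate_percent", 1.0)],
)
```

Running the check before rendering any manifest keeps an unscoped or open-ended experiment from reaching a cluster; the "requires explicit approval" flag for wider blast radii is left out of this sketch.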
Comment on lines +78 to +79

> Hypothesis: killing 25% of `checkout-service` pods for 60s should not raise the checkout error rate above 1%, because readiness probes and 3 replicas should absorb the loss.
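One detail worth checking in this hypothesis is the arithmetic: 25% of 3 replicas is 0.75 of a pod, so the number of pods actually killed depends on how the percentage is rounded. The sketch below only shows the two possible outcomes; how any particular chaos tool rounds is not stated in this PR and is left as an assumption. The function names are hypothetical.

```python
def pods_affected(replicas, percent, round_up):
    """Whole pods selected when `percent` of `replicas` is targeted."""
    scaled = replicas * percent  # e.g. 3 * 25 = 75, i.e. 0.75 pods
    if round_up:
        return -(-scaled // 100)  # ceiling division on integers
    return scaled // 100          # floor division


def capacity_lost_percent(replicas, killed):
    """Share of total replicas removed, as a percentage."""
    return 100 * killed / replicas


floor_case = pods_affected(3, 25, round_up=False)  # 0 pods: the fault is a no-op
ceil_case = pods_affected(3, 25, round_up=True)    # 1 pod: one third of capacity
```

With rounding down the experiment injects nothing, and with rounding up it removes about 33% of capacity rather than 25%, so the hypothesis may read more clearly if it names a pod count.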
Comment on lines +102 to +103

> Hypothesis: adding 300ms latency to 30% of `payments-api` pods should trigger the client-side circuit breaker within 10s without cascading timeouts upstream.
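A hypothesis phrased this way has two measurable parts: the breaker trips within 10s, and no timeouts cascade upstream. A minimal sketch of the verdict, assuming the observations are already collected as timestamps in seconds and a count of upstream timeouts (the function and parameter names are hypothetical):

```python
def latency_hypothesis_holds(
    fault_start_s, breaker_opened_s, upstream_timeouts, max_trip_seconds=10.0
):
    """True only if the breaker opened in time and nothing cascaded upstream."""
    if breaker_opened_s is None:
        return False  # the breaker never opened during the experiment
    elapsed = breaker_opened_s - fault_start_s
    tripped_in_time = 0 <= elapsed <= max_trip_seconds
    return tripped_in_time and upstream_timeouts == 0
```

Treating a breaker that never opened as a failed hypothesis, rather than as missing data, is a design choice of this sketch.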
Pull Request Checklist
- … `npm start` and verified that `README.md` is up to date.
- … `main` branch for this pull request.

Description
Adds a Skill + Agent bundle focused on resilience engineering and incident response.
Skill: `chaos-engineering-test-generator` (`skills/chaos-engineering-test-generator/`)

Generates ready-to-run chaos engineering experiments (LitmusChaos, Chaos Mesh, Chaos Monkey-style, AWS FIS) to validate system resilience against network latency, pod/instance failure, CPU/memory stress, and dependency outages. Every generated experiment is hypothesis-driven and always includes a defined blast radius, a duration cap, and an abort condition tied to a monitored metric; a fault is never injected without a rollback path. Includes three worked examples (pod-delete, network latency injection, CPU stress) and a progressive blast-radius escalation pattern for taking an experiment from a single pod in non-prod to a full production Game Day.
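The progressive escalation pattern can be pictured as a ladder that an experiment climbs one rung at a time. The sketch below is an assumption-laden illustration: only the first rung (single pod, non-prod) and the last (production Game Day) come from the description, the two middle rungs are hypothetical, and the rule "advance only after a clean pass" is an assumed policy rather than the skill's documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    scope: str
    environment: str


# First and last rungs follow the PR description; the middle two are hypothetical.
LADDER = [
    Stage("single pod", "non-prod"),
    Stage("subset of replicas", "non-prod"),
    Stage("single pod", "prod"),
    Stage("full Game Day", "prod"),
]


def next_stage(index, hypothesis_held, aborted):
    """Advance one rung only after a clean pass; otherwise stay where we are."""
    if aborted or not hypothesis_held:
        return index
    return min(index + 1, len(LADDER) - 1)
```

For example, `next_stage(0, hypothesis_held=True, aborted=False)` moves from the single non-prod pod to the next rung, while an aborted run keeps the experiment at its current scope.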
Agent: `sre-incident-responder` (`agents/sre-incident-responder.agent.md`)

Acts as a Site Reliability Engineer during both live production incidents and planned resilience exercises. Follows a consistent triage → mitigate → diagnose → communicate → learn flow, always checking for a fast mitigation (rollback, feature flag, failover, scale-out) before deep root-causing. Drafts factual, non-speculative status updates and blameless post-mortems (timeline, contributing factors, action items; no individual blame). When the session is a planned Game Day rather than a live incident, it hands off to the `chaos-engineering-test-generator` skill to scaffold the experiment (hypothesis, blast radius, abort condition) before running the same response flow against the simulated failure.

Together, the skill and agent form a bundle: the skill designs the fault-injection experiment, and the agent conducts the response to it, covering both the Cloud Monitoring/SRE and Security/Resilience areas of the course.
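The agent's response flow and its "fast mitigation first" rule can be sketched as a fixed phase order plus a lookup over the available mitigations. This is illustrative only: the agent is defined in prose, not code, and treating the listed mitigations as a preference order is an assumption (the description only enumerates them). All names are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    TRIAGE = "triage"
    MITIGATE = "mitigate"
    DIAGNOSE = "diagnose"
    COMMUNICATE = "communicate"
    LEARN = "learn"


FLOW = list(Phase)  # Enum iteration follows definition order

# Listed in the same order as the description; the ranking itself is assumed.
FAST_MITIGATIONS = ("rollback", "feature flag", "failover", "scale-out")


def first_fast_mitigation(available):
    """Pick the first listed mitigation that is available, or None."""
    for option in FAST_MITIGATIONS:
        if option in available:
            return option
    return None


def next_phase(phase):
    """Return the phase after `phase`, or None once the flow is complete."""
    position = FLOW.index(phase)
    return FLOW[position + 1] if position + 1 < len(FLOW) else None
```

Calling `first_fast_mitigation` during the mitigate phase, before any diagnosis, mirrors the description's point that restoring service comes ahead of deep root-causing.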
Type of Contribution
Additional Notes
- … `@sre-incident-responder` in Copilot Chat.
- … `npm run skill:validate` and `npm run build`.

By submitting this pull request, I confirm that my contribution abides by the Code of Conduct and will be licensed under the MIT License.