feat: chaos-engineering-test-generator skill and sre-incident-responder-agent agent by Gugaarleo · Pull Request #2207 · github/awesome-copilot

Gugaarleo · 2026-07-03T14:29:57Z

Pull Request Checklist

I have read and followed the CONTRIBUTING.md guidelines.
I have read and followed the Guidance for submissions involving paid services.
My contribution adds a new instruction, prompt, agent, skill, workflow, or canvas extension file in the correct directory.
The file follows the required naming convention.
The content is clearly structured and follows the example format.
I have tested my instructions, prompt, agent, skill, workflow, or canvas extension with GitHub Copilot.
I have run npm start and verified that README.md is up to date.
I am targeting the main branch for this pull request.

Description

Adds a Skill + Agent bundle focused on resilience engineering and incident response.

Skill: chaos-engineering-test-generator (skills/chaos-engineering-test-generator/)
Generates ready-to-run chaos engineering experiments (LitmusChaos, Chaos Mesh, Chaos Monkey-style, AWS FIS) to validate system resilience against network latency, pod/instance failure, CPU/memory stress, and dependency outages. Every generated experiment is hypothesis-driven and always includes a defined blast radius, a duration cap, and an abort condition tied to a monitored metric — never a fault injected without a rollback path. Includes three worked examples (pod-delete, network latency injection, CPU stress) and a progressive blast-radius escalation pattern for taking an experiment from a single pod in non-prod to a full production Game Day.

Agent: sre-incident-responder (agents/sre-incident-responder.agent.md)
Acts as a Site Reliability Engineer during both live production incidents and planned resilience exercises. Follows a consistent triage → mitigate → diagnose → communicate → learn flow, always checking for a fast mitigation (rollback, feature flag, failover, scale-out) before deep root-causing. Drafts factual, non-speculative status updates and blameless post-mortems (timeline, contributing factors, action items — no individual blame). When the session is a planned Game Day rather than a live incident, it hands off to the chaos-engineering-test-generator skill to scaffold the experiment (hypothesis, blast radius, abort condition) before running the same response flow against the simulated failure.
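
For reviewers who want the flow above in concrete form, here is a minimal Python sketch of the triage → mitigate → diagnose → communicate → learn ordering with the fast-mitigation check in front of diagnosis. It is not taken from the agent file. The names `Phase`, `FAST_MITIGATIONS`, `first_available_mitigation`, and `response_plan` are hypothetical, and checking the mitigations in the order the description lists them is an assumption.

```python
from enum import Enum
from typing import List, Optional, Set


class Phase(Enum):
    TRIAGE = "triage"
    MITIGATE = "mitigate"
    DIAGNOSE = "diagnose"
    COMMUNICATE = "communicate"
    LEARN = "learn"


# Fast mitigations named in the PR description.
# Assumption: they are tried in the order the description lists them.
FAST_MITIGATIONS = ("rollback", "feature flag", "failover", "scale-out")


def first_available_mitigation(available: Set[str]) -> Optional[str]:
    """Return the first fast mitigation that is available, or None."""
    for option in FAST_MITIGATIONS:
        if option in available:
            return option
    return None


def response_plan(available: Set[str]) -> List[str]:
    """Build the ordered steps, resolving the mitigation before diagnosis."""
    steps = []
    for phase in Phase:  # Enum iteration follows definition order
        if phase is Phase.MITIGATE:
            choice = first_available_mitigation(available)
            steps.append(f"mitigate: {choice}" if choice else "mitigate: no fast option found")
        else:
            steps.append(phase.value)
    return steps
```

Calling `response_plan({"failover", "scale-out"})` yields the five phases in order with `mitigate: failover` in second position, ahead of `diagnose`.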

Together, the skill and agent form a bundle: the skill designs the fault-injection experiment, and the agent conducts the response to it — covering both the Cloud Monitoring/SRE and Security/Resilience areas of the course.

Type of Contribution

Additional Notes

Skill triggers automatically when a prompt mentions resilience testing, chaos experiments, Game Days, or Kubernetes/cloud fault injection.
Agent can be invoked directly via @sre-incident-responder in Copilot Chat.
Both were validated with npm run skill:validate and npm run build.

By submitting this pull request, I confirm that my contribution abides by the Code of Conduct and will be licensed under the MIT License.

github-actions · 2026-07-03T14:30:20Z

🔒 PR Risk Scan Results

Scanned 4 changed file(s).

Severity	Count
🔴 High	0
🟠 Medium	2
ℹ️ Info	0

Severity	Rule	File	Line	Match
🟠	`package-exec-command`	`docs/README.skills.md`	31	\| [acreadiness-assess](../skills/acreadiness-assess/SKILL.md)<br />`gh skills install github/awesome-copilot acreadiness-assess` \| Run the AgentRC readiness assessment on the curre
🟠	`unpinned-version-indicator`	`skills/chaos-engineering-test-generator/SKILL.md`	122	Abort condition: stop immediately if `search_service_p99_latency_ms > 800` for more than 30s (wire to an alert or a manual `kubectl delete stresschaos search-service-cpu-hog`).

This is an automated soft-gate report. Findings indicate review targets and do not block merge by themselves.
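
The abort condition quoted in the second finding (`search_service_p99_latency_ms > 800` for more than 30s) is a sustained-threshold check. A rough Python sketch of that logic follows, assuming the metric arrives as `(timestamp_seconds, value)` samples in ascending time order. The function name `should_abort` and the sample format are hypothetical, and wiring the result to an alert or to the `kubectl delete` command is left out.

```python
from typing import Iterable, Tuple


def should_abort(samples: Iterable[Tuple[float, float]],
                 threshold_ms: float = 800.0,
                 sustained_s: float = 30.0) -> bool:
    """Return True once the metric has stayed above threshold_ms for more than sustained_s.

    samples: (timestamp_seconds, p99_latency_ms) pairs in ascending time order.
    """
    breach_started = None  # timestamp of the first sample in the current breach
    for ts, value in samples:
        if value > threshold_ms:
            if breach_started is None:
                breach_started = ts
            if ts - breach_started > sustained_s:
                return True
        else:
            breach_started = None  # any recovery resets the clock
    return False
```

A single dip below the threshold resets the timer in this sketch, so a flapping metric would not trigger the abort. Whether that is the desired behaviour depends on the experiment.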

github-actions · 2026-07-03T14:31:59Z

🔍 Vally Lint Results

⚠️ Warnings or advisories found

Scope	Checked
Skills	1
Agents	1
Total	2

Severity	Count
❌ Errors	0
⚠️ Warnings	0
ℹ️ Advisories	1

Summary

Level	Finding
ℹ️	Vally currently lints SKILL.md content. Agent files were detected but skipped: agents/sre-incident-responder.agent.md

Full linter output

### Linting skills/chaos-engineering-test-generator
npm warn EBADENGINE Unsupported engine {
npm warn EBADENGINE   package: 'commander@15.0.0',
npm warn EBADENGINE   required: { node: '>=22.12.0' },
npm warn EBADENGINE   current: { node: 'v20.20.2', npm: '10.8.2' }
npm warn EBADENGINE }
npm warn deprecated prebuild-install@7.1.3: No longer maintained. Please contact the author of the relevant native addon; alternatives are available.
✅ chaos-engineering-test-generator (2/2 checks passed)
    ✓ [spec-compliance] All 1 skill(s) are spec-compliant.
        ✓ spec-compliance: All spec checks passed.
    ✓ [valid-refs] All file references across 1 skill(s) are valid.
        ✓ valid-refs: All file references resolve to existing files within the skill directory.

1 skill(s) linted, 1 passed

### Agent files detected (not linted by vally)
ℹ️ Vally currently lints SKILL.md content. Agent files were detected but skipped:
agents/sre-incident-responder.agent.md

Copilot

Pull request overview

Adds a resilience-engineering bundle to the Awesome Copilot collection: a new Skill for generating hypothesis-driven chaos experiments plus a new Agent for guiding incident response and Game Day execution. This extends the repository’s SRE/operations offerings with structured chaos-testing guidance and an incident-response playbook agent.

Changes:

Added chaos-engineering-test-generator skill content and examples.
Added sre-incident-responder agent definition for incident + Game Day workflows.
Updated skills/agents documentation indexes to include the new entries (and reflect generated table formatting).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
skills/chaos-engineering-test-generator/SKILL.md	New skill defining chaos experiment generation workflow, safety guidelines, and examples.
agents/sre-incident-responder.agent.md	New agent that guides triage/mitigation/diagnosis/communication/postmortems and can hand off to chaos skill for Game Days.
docs/README.skills.md	Adds the new skill to the published skills index.
docs/README.agents.md	Adds the new agent to the published agents index (and updates MCP column formatting for some rows).

+### 3. Blast radius and safety controls
+Every generated experiment includes: a scoped selector (namespace/label/percentage of replicas, never "all"), a `duration`, and abort conditions tied to a monitored metric (e.g., abort if error rate > X% or latency p99 > Y ms). The skill defaults to the smallest blast radius that can still test the hypothesis and flags anything wider as "requires explicit approval."
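
One way to read the safety controls above as a pre-flight check is sketched below in Python. This is an illustration, not the skill's implementation. `ExperimentSpec`, `review`, and `min_percent_needed` (the smallest blast radius the hypothesis needs) are hypothetical names, and treating 100% of replicas as the forbidden "all" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExperimentSpec:
    namespace: str
    label_selector: str       # e.g. "app=checkout-service"
    percent_of_replicas: int  # blast radius as a percentage of replicas
    duration_s: int           # hard cap on how long the fault runs
    abort_metric: str         # monitored metric the abort condition watches
    abort_threshold: float    # abort when the metric exceeds this value


def review(spec: ExperimentSpec, min_percent_needed: int) -> List[str]:
    """Return findings; an empty list means the spec passes these checks."""
    findings = []
    if not spec.namespace or not spec.label_selector:
        findings.append("selector must be scoped by namespace and label")
    if not 0 < spec.percent_of_replicas < 100:
        # Assumption: 100% counts as "all" and is rejected outright.
        findings.append("percentage must be between 1 and 99, never all replicas")
    if spec.duration_s <= 0:
        findings.append("duration is required")
    if not spec.abort_metric:
        findings.append("abort condition must name a monitored metric")
    if spec.percent_of_replicas > min_percent_needed:
        findings.append("requires explicit approval")
    return findings
```

For example, a spec that targets 50% of replicas when the hypothesis only needs 25% comes back with the single finding `requires explicit approval`.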


+Hypothesis: killing 25% of `checkout-service` pods for 60s should not raise the checkout error rate above 1%, because readiness probes and 3 replicas should absorb the loss.
+
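
The numbers in this hypothesis are worth checking by hand: 25% of 3 replicas is 0.75 of a pod, so the real blast radius depends on how the chaos tool rounds. The Python sketch below only does the arithmetic. It makes no claim about how LitmusChaos, Chaos Mesh, or AWS FIS round a percentage, and the helper names `pods_affected` and `effective_percent` are hypothetical.

```python
import math


def pods_affected(replicas: int, percent: float) -> dict:
    """Show the exact pod count for a percentage selector and both rounding outcomes."""
    exact = replicas * percent / 100
    return {
        "exact": exact,
        "round_down": math.floor(exact),
        "round_up": math.ceil(exact),
    }


def effective_percent(replicas: int, pods: int) -> float:
    """Percentage of replicas actually hit when `pods` whole pods are killed."""
    return 100 * pods / replicas
```

`pods_affected(3, 25)` gives an exact value of 0.75, which rounds down to 0 pods (the experiment injects nothing) or up to 1 pod. Killing 1 of 3 pods is an effective blast radius of about 33%, not 25%, so the hypothesis may need either a replica count that 25% divides evenly (4 replicas gives exactly 1 pod) or an explicit pod count.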


+Hypothesis: adding 300ms latency to 30% of `payments-api` pods should trigger the client-side circuit breaker within 10s without cascading timeouts upstream.
+


feat: chaos-engineering-test-generator skill and sre-incident-responder-agent agent (5db713b)

Gugaarleo requested a review from aaronpowell as a code owner July 3, 2026 14:29

Copilot AI review requested due to automatic review settings July 3, 2026 14:29

github-actions Bot added the agent (PR touches agents), new-submission (PR adds at least one new contribution), and skills (PR touches skills) labels Jul 3, 2026

Copilot started reviewing on behalf of Gugaarleo July 3, 2026 14:30

Copilot AI reviewed Jul 3, 2026

Gugaarleo wants to merge 1 commit into github:main from Gugaarleo:feat/chaos-engineering-test-generator

2 participants
