Agents should be pretty good at mutation testing, right? We’ll see what they can do!
Claude Code plan output
Your vision is a mutation testing workflow where:
- Claude creates a todo list of testable units (methods/functions/classes)
- For each unit, delegates to quickfix agent to:
  a. Theorize what mutation could break tests
  b. Apply mutation
  c. Run specific test file
  d. Revert mutation
  e. Report if test caught the mutation
- If test passed (bad - mutation not caught), improve the test and retry
- If test failed (good - mutation caught), move to next unit
- Works across multiple test frameworks (rspec, mix, cargo, etc.)
- All deliverables are skills starting with mutation-testing-*
Before I explore approaches, I need to clarify one critical aspect:
What level of granularity do you want for the mutation strategies?
● User answered Claude's questions:
⎿ · How sophisticated should the mutation generation be? → Hybrid (simple mutations first, escalate to LLM if all pass)
● Excellent. Hybrid approach gives us speed + depth when needed.
My quickfix agent is basically one-shot and Haiku-based.
Claude produced a plan for the following skills:
Claude Code plan output
Skill Structure & Responsibilities
Skill 1: mutation-testing-coordinator (Entry point)
- Triggered by /mutation-test [path]
- Detects language/framework from file extension + project structure
- Parses target into list of testable units (functions/methods/classes)
- Creates TodoWrite with one todo per unit
- Iterates through todos, invoking mutation-testing-mutate-and-verify for each
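To make the detection step concrete, here's a minimal Ruby sketch of the kind of extension-to-framework table the coordinator could consult (the mapping and `detect_framework` are my own illustration, not part of Claude's plan):

```ruby
# Hypothetical mapping from source file extension to framework + test command.
FRAMEWORKS = {
  ".rb" => { framework: "rspec", test_cmd: "bundle exec rspec" },
  ".ex" => { framework: "mix",   test_cmd: "mix test" },
  ".rs" => { framework: "cargo", test_cmd: "cargo test" },
}.freeze

def detect_framework(path)
  FRAMEWORKS.fetch(File.extname(path)) do
    raise ArgumentError, "no framework mapping for #{path}"
  end
end

detect_framework("app/models/user.rb")
# => { framework: "rspec", test_cmd: "bundle exec rspec" }
```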
Skill 2: mutation-testing-test-discovery
- Spawns quickfix agent to find relevant test files for a unit
- Strategies:
  - Naming convention match (user.rb → **/user_spec.rb, **/user_test.rb)
  - Find callers with rg (imports, references)
  - Find test files for those callers
- Returns: List of test file paths
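As a rough Ruby sketch of those two strategies (the helper names are hypothetical; only the rg part comes from the plan):

```ruby
require "shellwords"

# Strategy 1: naming-convention match (user.rb -> **/user_spec.rb, **/user_test.rb).
def tests_by_convention(source_path)
  base = File.basename(source_path, ".*")
  Dir.glob(["**/#{base}_spec.rb", "**/#{base}_test.rb"])
end

# Strategy 2: find callers with rg, then run the same lookup on each caller.
def callers_of(base)
  `rg -l #{Shellwords.escape(base)} --type ruby`.split("\n")
end

tests_by_convention("app/models/user.rb")
# => ["spec/models/user_spec.rb"]
```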
Skill 3: mutation-testing-mutate-and-verify (Core loop)
- Takes: unit location, test files list, framework
- Runs simple mutations first (from mutation-testing-simple-mutations)
- For each mutation:
  - Spawns quickfix to apply mutation
  - Runs all test files (framework-specific command)
  - Reverts with jj restore -f @- <file>
  - If ANY test failed → log success, next mutation
  - If ALL tests passed → call developer agent + mutation-testing-test-improvement
- After all simple mutations caught, tries ONE LLM mutation
- Stops when: all simple mutations caught + LLM mutation caught
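Stripped of the agent plumbing, the core loop reads roughly like this in Ruby; `apply_mutation!` and `improve_test!` are stand-ins for the quickfix and developer agents, and the `jj restore` call is the revert step from the plan:

```ruby
def mutate_and_verify(unit, test_files, test_cmd)
  SIMPLE_MUTATIONS.each do |mutation|
    apply_mutation!(unit, mutation)                 # quickfix agent's job
    # system returns false when any test fails, i.e. the mutation was caught.
    survived = system("#{test_cmd} #{test_files.join(' ')}")
    system("jj", "restore", "-f", "@-", unit.file)  # revert the mutation
    if survived
      improve_test!(unit, mutation)                 # developer agent's job
      redo                                          # retry the same mutation
    end
  end
end
```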
Skill 4: mutation-testing-simple-mutations
- Reference guide for quickfix agent
- Language-agnostic patterns:
  - Operator swaps: == ↔ !=, > ↔ <, && ↔ ||
  - Boolean flips: true ↔ false
  - Boundary conditions: > ↔ >=, < ↔ <=
  - Return value mutations: return nil/null/0 instead of actual value
  - Conditional inversions: remove if guard, invert if condition
- Quickfix agent picks ONE mutation per invocation based on code structure
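To see why one tiny edit is enough, here's a boundary-condition mutation in Ruby (my example, not from the plan):

```ruby
# Original: exactly 18 counts as adult.
def adult?(user)
  user.age >= 18
end

# Mutated: >= swapped to >. A spec that checks age 18 exactly catches
# this; a spec that only checks age 30 lets the mutation survive.
def adult?(user)
  user.age > 18
end
```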
Skill 5: mutation-testing-semantic-mutations
- Guide for quickfix agent to generate LLM-based mutations
- Prompts agent to analyze code intent and create mutations that:
  - Break business logic (not just syntax)
  - Test edge cases (empty lists, nil values, boundary conditions)
  - Remove critical validations
- Example: In authenticate(user), mutate to skip password check
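In Ruby, that authenticate example might look like this (hypothetical code):

```ruby
# Original: an active user AND a valid password are both required.
def authenticate(user, password)
  user.active? && user.valid_password?(password)
end

# Semantic mutation: the password check is dropped. Any test that only
# authenticates with a correct password still passes, so it won't catch this.
def authenticate(user, _password)
  user.active?
end
```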
Skill 6: mutation-testing-test-improvement
- Guide for developer agent when mutation not caught
- Steps:
  a. Show the mutation that wasn't caught
  b. Analyze why current test missed it
  c. Add specific assertion/test case to catch this mutation
  d. Explain improvement to user
- Developer agent commits improved test immediately (separate commit)
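Continuing the boundary example from Skill 4, the improvement could be a single pinned-boundary spec (hypothetical RSpec):

```ruby
RSpec.describe "adult?" do
  it "treats exactly 18 as adult" do
    # Pins the boundary value, so the >= -> > mutation now fails this example.
    expect(adult?(User.new(age: 18))).to be true
  end
end
```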
…and to keep the agents motivated…
Claude Code plan output
● Agent Motivation & Competition System
Skill 7: mutation-testing-agent-incentives
- Defines reward structure and competitive framing for spawned agents
- Used by coordinator when spawning quickfix/developer agents
For Quickfix Agents (mutation generation):
- "You're competing against other mutation agents. Your goal: create mutations that expose test gaps."
- "Reward: +10 points if your mutation isn't caught (finds weak test)"
- "Bonus: +5 points if mutation is semantically meaningful (not just syntactic)"
- "Top performing agents get highlighted in the final report."
For Developer Agents (test improvement):
- "You're competing against other test improvement agents. Your goal: write the minimal, most precise test that catches
the mutation."
- "Reward: +10 points if improved test catches the mutation on first try"
- "Bonus: +5 points if test is concise (<10 lines added)"
- "Penalty: -5 points if test is overly broad or tests implementation details"
- "Top performing agents get highlighted in the final report."
Coordinator tracks:
- Mutation success rate per quickfix agent
- Test improvement first-try success rate per developer agent
- Displays leaderboard at end of /mutation-test run
Example output:
Mutation Testing Complete!
🏆 Agent Performance:
Quickfix Agent #3: 2 gaps found, 1 semantic bonus → 25 pts ⭐ TOP PERFORMER
Developer Agent #1: 2 improvements, both caught on retry → 15 pts
Quickfix Agent #1: 1 gap found → 10 pts
Does this competitive framing + reward system look good?
I’m curious about the result, but given the very low complexity of the changes required to test something, it looks promising.