
Agents should be pretty good at mutation testing, right? We’ll see what they can do!
Claude code plan output
  Your vision is a mutation testing workflow where:
  - Claude creates a todo list of testable units (methods/functions/classes)
  - For each unit, delegates to quickfix agent to:
    a. Theorize what mutation could break tests
    b. Apply mutation
    c. Run specific test file
    d. Revert mutation
    e. Report if test caught the mutation
  - If test passed (bad - mutation not caught), improve the test and retry
  - If test failed (good - mutation caught), move to next unit
  - Works across multiple test frameworks (rspec, mix, cargo, etc.)
  - All deliverables are skills starting with mutation-testing-*
  Before I explore approaches, I need to clarify one critical aspect:
  What level of granularity do you want for the mutation strategies?
● User answered Claude's questions:
  ⎿  · How sophisticated should the mutation generation be? → Hybrid (simple mutations first, escalate to LLM if all pass)
● Excellent. Hybrid approach gives us speed + depth when needed.
My quickfix agent is basically a one-shot, Haiku-based agent.
Claude produced a plan for the following skills:
Claude code plan output
  Skill Structure & Responsibilities
  Skill 1: mutation-testing-coordinator (Entry point)
  - Triggered by /mutation-test [path]
  - Detects language/framework from file extension + project structure
  - Parses target into list of testable units (functions/methods/classes)
  - Creates TodoWrite with one todo per unit
  - Iterates through todos, invoking mutation-testing-mutate-and-verify for each
  Skill 2: mutation-testing-test-discovery
  - Spawns quickfix agent to find relevant test files for a unit
  - Strategies:
    - Naming convention match (user.rb → **/user_spec.rb, **/user_test.rb)
    - Find callers with rg (imports, references)
    - Find test files for those callers
  - Returns: List of test file paths
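The naming-convention strategy is simple enough to sketch in plain Ruby. This is an illustrative sketch only — the helper name and glob patterns are my assumptions, not the skill's actual implementation:

```ruby
require "pathname"

# Hypothetical sketch of test discovery by naming convention:
# "app/models/user.rb" -> any **/user_spec.rb or **/user_test.rb under root.
def discover_test_files(source_path, root: ".")
  base = Pathname(source_path).basename(".*").to_s  # "user.rb" -> "user"
  patterns = ["**/#{base}_spec.rb", "**/#{base}_test.rb"]
  patterns.flat_map { |pattern| Dir.glob(pattern, base: root) }.uniq
end
```

The rg-based caller search would layer on top of this as a fallback when the convention match comes up empty.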
  Skill 3: mutation-testing-mutate-and-verify (Core loop)
  - Takes: unit location, test files list, framework
  - Runs simple mutations first (from mutation-testing-simple-mutations)
  - For each mutation:
    - Spawns quickfix to apply mutation
    - Runs all test files (framework-specific command)
    - jj restore -f @- <file>
    - If ANY test failed → log success, next mutation
    - If ALL tests passed → call developer agent + mutation-testing-test-improvement
  - After all simple mutations caught, tries ONE LLM mutation
  - Stops when: all simple mutations caught + LLM mutation caught
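The apply/run/revert loop at the heart of Skill 3 can be sketched in a few lines of Ruby. Here plain file writes stand in for the quickfix agent and for `jj restore`, and `run_tests` stands in for the framework-specific test command — all names are hypothetical:

```ruby
# Minimal sketch of the mutate-and-verify loop. A mutation is any callable
# that transforms source text; run_tests returns true if the suite passed.
def verify_mutations(file, mutations, run_tests:)
  uncaught = []
  mutations.each do |mutation|
    original = File.read(file)
    File.write(file, mutation.call(original))  # apply the mutation
    caught = !run_tests.call                   # any failing test = mutation caught
    File.write(file, original)                 # revert (stands in for jj restore)
    uncaught << mutation unless caught
  end
  uncaught  # these are the gaps to escalate to test improvement
end
```

Reverting unconditionally before inspecting the result is the important invariant: the working copy must never be left in a mutated state, even when the suite fails.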
  Skill 4: mutation-testing-simple-mutations
  - Reference guide for quickfix agent
  - Language-agnostic patterns:
    - Operator swaps: == ↔ !=, > ↔ <, && ↔ ||
    - Boolean flips: true ↔ false
    - Boundary conditions: > ↔ >=, < ↔ <=
    - Return value mutations: return nil/null/0 instead of actual value
    - Conditional inversions: remove if guard, invert if condition
  - Quickfix agent picks ONE mutation per invocation based on code structure
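These patterns are purely mechanical. As an illustrative Ruby example of a boundary-condition swap, and the kind of test gap it exposes:

```ruby
# Original predicate and its boundary-condition mutant (> swapped for >=).
def adult?(age)
  age > 18
end

def adult_mutant?(age)
  age >= 18  # mutated: boundary condition relaxed
end
```

A test that only checks a non-boundary value (say, age 30) passes against both versions and misses the mutation; only an assertion at the boundary value distinguishes them.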
  Skill 5: mutation-testing-semantic-mutations
  - Guide for quickfix agent to generate LLM-based mutations
  - Prompts agent to analyze code intent and create mutations that:
    - Break business logic (not just syntax)
    - Test edge cases (empty lists, nil values, boundary conditions)
    - Remove critical validations
  - Example: In authenticate(user), mutate to skip password check
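The `authenticate(user)` example might look like this in Ruby — the function shape and fields are illustrative assumptions, but they show why this mutation is semantic rather than syntactic:

```ruby
# Original: authentication requires both an active user and a matching password.
def authenticate(user, password)
  user[:active] && user[:password] == password
end

def authenticate_mutant(user, password)
  user[:active]  # mutated: password check removed entirely
end
```

No single operator swap produces this mutant; it takes understanding that the password comparison is the security-critical clause. A suite that never tries a wrong password passes against both versions.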
  Skill 6: mutation-testing-test-improvement
  - Guide for developer agent when mutation not caught
  - Steps:
    a. Show the mutation that wasn't caught
    b. Analyze why current test missed it
    c. Add specific assertion/test case to catch this mutation
    d. Explain improvement to user
  - Developer agent commits improved test immediately (separate commit)
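As a hedged illustration of step (c): suppose the uncaught mutant was a return-value mutation, and the existing test only checked that the call didn't raise. The improvement is the one minimal assertion that distinguishes original from mutant (all names here are hypothetical):

```ruby
# Original and its return-value mutant (nil instead of the computed value).
def total(items)
  items.sum
end

def total_mutant(items)
  nil  # return-value mutation
end

# Improved, minimal test: pins the actual return value, so the mutant fails it.
def improved_test_passes?(impl)
  impl.call([1, 2, 3]) == 6
end
```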
…and to keep the agents motivated…
Claude code plan output
● Agent Motivation & Competition System
  Skill 7: mutation-testing-agent-incentives
  - Defines reward structure and competitive framing for spawned agents
  - Used by coordinator when spawning quickfix/developer agents
  For Quickfix Agents (mutation generation):
  - "You're competing against other mutation agents. Your goal: create mutations that expose test gaps."
  - "Reward: +10 points if your mutation isn't caught (finds weak test)"
  - "Bonus: +5 points if mutation is semantically meaningful (not just syntactic)"
  - "Top performing agents get highlighted in the final report."
  For Developer Agents (test improvement):
  - "You're competing against other test improvement agents. Your goal: write the minimal, most precise test that catches the mutation."
  - "Reward: +10 points if improved test catches the mutation on first try"
  - "Bonus: +5 points if test is concise (<10 lines added)"
  - "Penalty: -5 points if test is overly broad or tests implementation details"
  - "Top performing agents get highlighted in the final report."
  Coordinator tracks:
  - Mutation success rate per quickfix agent
  - Test improvement first-try success rate per developer agent
  - Displays leaderboard at end of /mutation-test run
  Example output:
  Mutation Testing Complete!
  🏆 Agent Performance:
  Quickfix Agent #3: 2 gaps found, 1 semantic bonus → 25 pts ⭐ TOP PERFORMER
  Developer Agent #1: 2 improvements, both caught on retry → 15 pts
  Quickfix Agent #1: 1 gap found → 10 pts
  Does this competitive framing + reward system look good?
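For what it's worth, the scores in the example output are just a linear tally of the rewards described above — a toy sketch, not the coordinator's actual bookkeeping:

```ruby
# Toy tally of the quickfix reward scheme: +10 per uncaught mutation
# (gap found), +5 per semantically meaningful mutation.
def quickfix_score(gaps_found:, semantic_bonuses:)
  gaps_found * 10 + semantic_bonuses * 5
end
```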
I’m curious about the results, but given the very low complexity of the changes required to test something, it looks promising.