Claude Code Skill Creator: Test, Benchmark, and Optimize Your Skills
What This Covers
The Skill Creator is an official tool from Anthropic that lets you create, test, benchmark, and optimize skills without writing any code. Before this update, there was no systematic way to know if your skills were actually working correctly or triggering when they should. This guide covers how it works, how to install it, and how to use it.
Available on Claude.ai and Cowork, and as a plugin for Claude Code (released March 3, 2026).
Who This Is For
Anthropic designed this with a specific audience in mind: most skill authors are subject matter experts, not engineers. They know their workflows but don't have the tools to tell whether a skill still works with a new model, triggers when it should, or actually improved after an edit. The Skill Creator bridges that gap.
What Are Skills?
Skills are text prompts that tell Claude how to do a specific thing in a specific way. They live as SKILL.md files in your project or skill directories. Every skill falls into one of two categories:
Capability Uplift - A skill that helps Claude do something the base model either can't do or can't do consistently. Anthropic's document creation skills are good examples. They encode techniques and patterns that produce better output than prompting alone. Without the front-end design skill, Claude builds generic-looking websites. With it, the output quality jumps significantly. Most official skills fall here: PDF, PowerPoint, docx, xlsx, MCP builder.
Encoded Preference - A skill that documents a workflow where Claude can already do each piece, but the skill sequences the steps according to your process. Examples: a skill that walks through NDA review against set criteria, or one that drafts weekly updates pulling data from various MCPs. Think of it as chaining multiple capabilities into one repeatable pipeline.
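For context, here is a minimal SKILL.md sketch for an encoded preference skill. The skill name, steps, and MCP are invented for illustration; the YAML frontmatter with name and description fields follows the general shape of Anthropic's skill format:

```markdown
---
name: weekly-update
description: Drafts the Monday status update by pulling metrics from the team's MCPs. Use when the user asks for a weekly update or status report.
---

# Weekly Update

1. Pull last week's metrics from the analytics MCP.
2. Summarize wins, risks, and blockers in three short sections.
3. Output a draft in the team's standard template.
```

The description field matters more than it looks: as covered below, it is what Claude reads to decide whether to load the skill at all.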
This distinction matters because the two types are evaluated differently.
Why the Skill Creator Matters
Three problems existed before this tool:
- No way to test skills. You had no data on whether a skill was actually producing better output than baseline Claude.
- No way to benchmark. You couldn't compare skill vs. no skill, or version A vs. version B.
- Skills don't always trigger. Claude uses a short description of each skill to decide when to use it. Too broad and you get false triggers. Too narrow and it never fires. There was no way to tune this.
The Skill Creator brings some of the rigor of software development (testing, benchmarking, iterative improvement) to skill authoring without requiring anyone to write code.
How to Install
Claude.ai and Cowork
Already available. Ask Claude to use the skill-creator to get started.
Claude Code (Plugin)
- Open Claude Code
- Type /plugin
- Search for skill-creator
- Install it (official Claude Code plugin)
- Type /exit and restart Claude Code
Plugin source: github.com/anthropics/claude-plugins-official/tree/main/plugins/skill-creator
Skill source: github.com/anthropics/skills/tree/main/skills/skill-creator
To verify it's working, ask Claude: "What can the skill creator skill do for me?"
How to Use It
Two ways to invoke in Claude Code:
- Say: "I want to use the skill creator to [create/test/optimize] a skill"
- Type /skill-creator directly
On Claude.ai or Cowork, just ask Claude to use the skill-creator.
What the Tests Actually Do
1. Evals: Define What Good Looks Like
Evals are tests that check whether Claude does what you expect for a given prompt. If you've written software tests, the concept is familiar: define some test prompts (plus files if needed), describe what a good result looks like, and the Skill Creator tells you whether the skill holds up.
Real example from Anthropic: The PDF skill previously struggled with non-fillable forms. Claude had to place text at exact coordinates with no defined fields to guide it. Evals isolated the failure, and Anthropic shipped a fix that anchors positioning to extracted text coordinates. Without evals, that failure would have stayed hidden.
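In that spirit, an eval definition boils down to a prompt (plus any files), and a description of what success looks like. A hypothetical sketch, using an illustrative format rather than the Skill Creator's actual schema:

```yaml
# Hypothetical eval definition -- illustrative only, not the tool's real schema
skill: pdf-form-filler
evals:
  - name: non-fillable-form
    prompt: "Fill out the attached scanned W-9 with the sample data."
    files: [samples/w9-scanned.pdf]
    expected: >
      Text is placed on the correct lines of the form, anchored to
      extracted text coordinates, with no fields left blank.
```

The point is the shape, not the syntax: one concrete input, one plain-language success criterion per eval.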
2. Catching Quality Regressions
As models evolve, a skill that worked well last month might behave differently today. Running evals against a new model gives you an early signal when something shifts, before it impacts your work.
This is especially important for capability uplift skills. If the base model starts passing your evals without the skill loaded, that's a signal the skill's techniques may have been incorporated into the model's default behavior. The skill isn't broken; it's just no longer necessary. Example: a front-end design skill that was critical for Opus 4.6 might become redundant if Opus 5.0 handles design well on its own.
3. Benchmark Mode
Benchmark mode runs a standardized assessment using your evals. Run it after model updates or as you iterate on the skill. It tracks:
- Eval pass rate
- Elapsed time
- Token usage
Your evals and results stay with you. Store them locally, integrate them with a dashboard, or plug them into a CI system.
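Because the results are just data you keep, you can gate on them however you like. Here is a minimal CI-style check in Python, assuming you have exported benchmark results to JSON; the schema below is hypothetical, not a format the Skill Creator guarantees:

```python
import json

# Hypothetical exported benchmark result -- the schema is illustrative,
# not an official Skill Creator format.
results_json = """
{
  "skill": "pdf-form-filler",
  "model": "claude-opus-4-6",
  "evals": [
    {"name": "fillable-form",     "passed": true,  "tokens": 1812, "seconds": 9.4},
    {"name": "non-fillable-form", "passed": true,  "tokens": 2540, "seconds": 14.1},
    {"name": "multi-page-form",   "passed": false, "tokens": 3105, "seconds": 21.7}
  ]
}
"""

def pass_rate(results: dict) -> float:
    """Fraction of evals that passed."""
    evals = results["evals"]
    return sum(e["passed"] for e in evals) / len(evals)

def gate(results: dict, threshold: float = 0.9) -> bool:
    """True if the skill meets the chosen pass-rate bar."""
    return pass_rate(results) >= threshold

results = json.loads(results_json)
rate = pass_rate(results)
print(f"{results['skill']} on {results['model']}: {rate:.0%} pass rate")
```

A CI job would call gate() and fail the build when a model update or skill edit drops the pass rate below your threshold.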
4. Multi-Agent Support for Faster Testing
Running evals one at a time is slow, and accumulated context can bleed between test runs. The Skill Creator now spins up independent agents to run evals in parallel. Each agent runs in a clean context with its own token and timing metrics. Faster results, no cross-contamination.
It also includes comparator agents for A/B comparisons: two skill versions, or skill vs. no skill. Comparator agents judge outputs without knowing which version produced which result, so you get an unbiased assessment of whether a change actually helped.
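The Skill Creator handles this for you, but the blinding idea itself is simple to sketch: strip the version labels before judging, and only map the verdict back afterward. In this Python sketch the judge is a stub standing in for a model call:

```python
import random

def blind_compare(output_a: str, output_b: str, judge) -> str:
    """Judge two outputs without revealing which version produced which.

    `judge` sees only two anonymous outputs in shuffled order and returns
    "first" or "second"; we map that back to "A" or "B" afterward.
    """
    order = [("A", output_a), ("B", output_b)]
    random.shuffle(order)  # hide which version is shown first
    labels = {"first": order[0][0], "second": order[1][0]}
    preferred = judge(order[0][1], order[1][1])
    return labels[preferred]

# Stub judge standing in for a model call: prefers the longer output.
def longer_wins(first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"

winner = blind_compare("short draft", "a much more detailed draft", longer_wins)
```

Because the judge never sees the labels, any preference it shows reflects the outputs themselves, not which version produced them.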
5. Trigger Description Optimization
Claude doesn't preload all your skills. It reads a short description of each skill and decides which one to use based on your prompt. As your skill library grows, that description becomes critical.
The Skill Creator analyzes your current description against sample prompts and suggests edits that reduce both false positives (triggering when it shouldn't) and false negatives (not triggering when it should).
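To make that concrete, here is a hypothetical before/after for an NDA-review skill's description; the wording is invented for illustration:

```yaml
# Hypothetical before/after -- illustrative wording, not a real skill
before: "Helps with legal documents."   # too broad: triggers on everything legal
after: >
  Reviews NDAs against the team's standard checklist (term, scope,
  carve-outs). Use when the user asks to review, redline, or summarize
  an NDA or confidentiality agreement.
```

The revised description names both the task and the trigger phrases, which narrows false positives without hiding the skill from legitimate requests.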
Anthropic ran this across their document-creation skills and saw improved triggering on 5 out of 6 public skills.
Real Example: Building and Testing a Workflow Skill
Here's a walkthrough of using the Skill Creator to build an encoded preference skill (example: a YouTube research pipeline):
- Tell Claude: "Use the skill creator to create a YouTube Pipeline skill"
- Describe the workflow: search YouTube, upload to NotebookLM, analyze content, create deliverable
- The Skill Creator designs the full skill, breaking it into defined steps
- Use Plan mode (Shift+Tab twice in Claude Code) to review the design before building
- Run an eval to test fidelity
- Review results: each step shows pass/fail, plus stats and insights
Because this is an encoded preference skill, the eval focuses on fidelity. Is it actually running each step in the right order and producing the expected output? If you were testing a capability uplift skill instead, you'd see the A/B comparison with and without the skill using comparator agents.
When to Use Each Test Type
| Skill Type | What to Test | Why |
|---|---|---|
| Capability Uplift | Benchmark (skill vs. no skill) | Know if the skill still adds value as models improve |
| Capability Uplift | Version A vs. Version B | Compare optimizations using blind comparator agents |
| Encoded Preference | Workflow fidelity | Confirm every step executes in order |
| Any Skill | Trigger description | Ensure Claude actually uses the skill when it should |
Looking Ahead
Anthropic notes that as models improve, the line between "skill" and "specification" may blur. Today, a SKILL.md file is essentially an implementation plan with detailed instructions telling Claude how to do something. Over time, a natural-language description of what the skill should do may be enough, with the model figuring out the rest.
The eval framework supports this direction. Evals already describe the "what." Eventually, that description may be the skill itself.
Key Takeaways
- Skills are the easiest way to improve Claude's output quality. The Skill Creator makes the skills themselves better.
- You're no longer guessing whether a skill works. You have data.
- Capability uplift skills may become unnecessary as models improve. Benchmark mode catches that.
- Encoded preference skills are only as good as their fidelity to your workflow. Evals verify that fidelity.
- Trigger descriptions are the most overlooked optimization. If Claude isn't using your skill, the description is probably the problem.
- Evals, benchmarks, and results are yours to keep. Store locally, integrate with dashboards, or plug into CI.
Resources
- Skill Creator Plugin (Claude Code): github.com/anthropics/claude-plugins-official/tree/main/plugins/skill-creator
- Skill Creator Source: github.com/anthropics/skills/tree/main/skills/skill-creator
- Official Blog Post: claude.com/blog/improving-skill-creator-test-measure-and-refine-agent-skills
- Skills Documentation: claude.com/skills