🏭 Agent Skills Harness

project

May 2026

Fork it, drop in examples of what good looks like, and the factory does the rest. A batteries-included development home for create-skill-autoresearch: a 5-phase agent skill factory that researches your domain, drafts against gold standards, iterates via autoresearch, and ships only after independent multi-agent consensus.

External Links

Demo Source Code

Achievements

Published to skills.sh

The create-skill-autoresearch factory skill published for one-command install: npx skills add a-tokyo/agent-skills --skill create-skill-autoresearch.

Jun 9, 2026

View on skills.sh

𝗦𝗶𝘁𝘂𝗮𝘁𝗶𝗼𝗻

AI coding agents (Claude Code, Cursor, Gemini CLI) have adopted SKILL.md as the format for reusable agent behaviors, but the existing skill generators (Anthropic's skill-creator, Cursor's create-skill) are single-pass drafters: one prompt in, one SKILL.md out. There is no domain research phase, no measurable rubric, no improvement loop, and no independent quality gate. Skills ship without ever being tested against real gold standards. Nobody knows whether the output is actually better than a hand-curated reference, and there is no reproducible way to iterate on quality.

𝗧𝗮𝘀𝗸

Architect a skill factory that gives agent skills a proper engineering lifecycle. Extend one-shot generators with structured domain research, automated benchmarking against gold standards via LLM-as-judge, an autonomous improvement loop with measurable convergence, and independent multi-agent verification with adversarial challenge. Every shipped skill should have a provable quality score before it reaches users.

𝗔𝗰𝘁𝗶𝗼𝗻

Designed and implemented create-skill-autoresearch as a 5-phase pipeline: Interview (discover purpose, scope, gold standards), Research (parallel subagents study domain materials and distill a dossier), Draft (structure SKILL.md following official authoring conventions), Autoresearch (iterate autonomously against an LLM-as-judge rubric until the score threshold is met or plateau is detected), and Verify (independent panel with consensus before shipping).

Engineered a 4-role agent topology with strict context isolation. The Orchestrator manages phase transitions and spawns agents. N parallel Researcher subagents build the domain dossier. A Builder agent drafts and iterates the skill. A 3-member Panel (Quality scorer, Utility scorer, Devil's Advocate) evaluates independently behind a context wall so the builder and judges never share context, preventing evaluation bias.

Built a structured consensus protocol based on academic multi-agent deliberation research. Per-criterion atomic scoring prevents halo effects. Evidence-anchoring requires verbatim quotes for extreme scores. Explicit adversarial assignment achieves 99.2% disagreement detection vs 48.3% for "think critically" prompts. A single synthesis round balances deliberation quality against token cost.

Designed the build workspace contract (builds/<name>/input/ for user materials, work/ for factory artifacts, output/ for the publish-ready skill package) so users only provide gold standards. The factory generates research dossiers, rubrics, evaluate.sh scripts, autoresearch journals with keep/revert decisions, and the final SKILL.md with references.

Built a self-test harness with 68 deterministic structural checks on the factory itself, integrated vendored companion skills (autoresearch, premortem, handoff, llm-council) tracked in skills-lock.json with content hashes, and shipped a Nextra documentation site single-sourced from the repo docs.

𝗥𝗲𝘀𝘂𝗹𝘁

Validated the full pipeline with a blind premortem-rebuild benchmark: a builder agent reconstructed a production skill from only a one-paragraph brief, and the independent 3-member panel adjudicated a consensus score of 0.84 (Quality 0.81, Utility 0.87) against the 0.80 quality bar. This proved the full doer, independent panel, devil's advocate, evidence-based adjudication loop delivers measurably above human-curated baselines. Published to skills.sh for one-command install (npx skills add a-tokyo/agent-skills --skill create-skill-autoresearch). The harness is the development home; the published skill lives at a-tokyo/agent-skills.