What it does
Node.js package and GitHub Action that validates MCP tool implementations using LLM-based scoring. Supports both TypeScript and YAML configuration formats. Each evaluation is scored on five dimensions (accuracy, completeness, relevance, clarity, reasoning) on a 1–5 scale and returned as a structured result object. Includes built-in observability, with metrics and tracing via OpenTelemetry (OTEL).
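A minimal sketch of what a five-dimension, 1–5 score record could look like and how it might be checked and averaged. The names EvalScores, isValidScores, and averageScore are hypothetical and chosen for illustration; the package's actual result object may be shaped differently.

```typescript
// Hypothetical score shape; the package's real result type may differ.
const DIMENSIONS = ["accuracy", "completeness", "relevance", "clarity", "reasoning"] as const;
type Dimension = (typeof DIMENSIONS)[number];
type EvalScores = Record<Dimension, number>;

// True when every dimension falls within the 1-5 scale (NaN fails both comparisons).
function isValidScores(scores: EvalScores): boolean {
  return DIMENSIONS.every((d) => scores[d] >= 1 && scores[d] <= 5);
}

// Arithmetic mean across the five dimensions.
function averageScore(scores: EvalScores): number {
  const total = DIMENSIONS.reduce((sum, d) => sum + scores[d], 0);
  return total / DIMENSIONS.length;
}

const example: EvalScores = { accuracy: 5, completeness: 4, relevance: 4, clarity: 3, reasoning: 4 };
console.log(isValidScores(example), averageScore(example)); // true 4
```

Averaging is only one way to summarize; a per-dimension minimum is often a stricter signal for regressions.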
Who it's for
MCP server developers verifying tool correctness and CI/CD teams automating tool validation on pull requests. Useful for teams already running GitHub Actions who want to catch tool regressions before merge.
Common use cases
- Score tool output accuracy on PR changes using the GitHub Action
- Test error handling (e.g., invalid inputs) via YAML eval files
- Verify multi-step tool functionality, such as weather forecasts, locally with npx mcp-eval
- Monitor tool performance metrics with the OTEL-compatible observability stack
- Gate merges on eval scores by parsing GitHub Action results
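Gating a merge on scores could be sketched as below, assuming the results have already been parsed into an array of named evals with per-dimension scores. The GateResult shape, the findFailures helper, the eval names, and the threshold of 3 are all assumptions for illustration, not the Action's documented output format.

```typescript
// Hypothetical parsed-result shape: one entry per eval, scores keyed by dimension (1-5).
interface GateResult {
  name: string;
  scores: { [dimension: string]: number };
}

// Collects an "eval/dimension=score" label for every score below minScore.
function findFailures(results: GateResult[], minScore: number): string[] {
  const failures: string[] = [];
  for (const result of results) {
    for (const dimension of Object.keys(result.scores)) {
      const score = result.scores[dimension];
      if (score < minScore) {
        failures.push(`${result.name}/${dimension}=${score}`);
      }
    }
  }
  return failures;
}

// Illustrative data only; eval names are made up.
const sample: GateResult[] = [
  { name: "weather-forecast", scores: { accuracy: 5, clarity: 2 } },
  { name: "invalid-input", scores: { accuracy: 4, clarity: 4 } },
];

const failures = findFailures(sample, 3);
console.log(failures.length > 0 ? `Below threshold: ${failures.join(", ")}` : "All evals passed");
```

In a CI step, a non-empty failure list would typically be turned into a non-zero exit code so the workflow job fails and blocks the merge.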
Setup pitfalls
- Requires OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable; defaults to GPT-4, which may incur token quota or billing issues on shared keys
- The metrics and observability feature is alpha with unstable APIs; the docker-compose stack requires Docker and careful port configuration
- Reads filesystem to load eval definitions; ensure sandboxing if running untrusted eval files
- Last commit 347 days ago; package is not actively maintained, so support for new Anthropic SDK features or MCP spec updates may lag
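A preflight check for the API key requirement might look like the sketch below. The two variable names come from the pitfalls above; the presentProviderKeys helper is hypothetical, and how the package itself chooses between providers when both keys are set is not covered here.

```typescript
// Variable names are taken from the setup notes; the check itself is illustrative.
const PROVIDER_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"];

type Env = { [name: string]: string | undefined };

// Returns the names of provider keys that are set to a non-empty value.
function presentProviderKeys(env: Env): string[] {
  return PROVIDER_KEYS.filter((name) => {
    const value = env[name];
    return typeof value === "string" && value.trim().length > 0;
  });
}

// Placeholder value, not a real key.
console.log(presentProviderKeys({ OPENAI_API_KEY: "sk-placeholder" }).join(",")); // OPENAI_API_KEY
console.log(presentProviderKeys({ ANTHROPIC_API_KEY: "   " }).length); // 0
```

Passing the environment in as a parameter keeps the check testable; in a real script the argument would be process.env, and an empty result would be a reason to stop before any evals run.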