What it does

A Node.js package and GitHub Action that validate MCP tool implementations using LLM-based scoring. Evals can be configured in TypeScript or YAML. Each evaluation is scored on five dimensions (accuracy, completeness, relevance, clarity, reasoning) on a 1–5 scale and returned as a structured result object. Metrics and tracing are built in via OpenTelemetry (OTEL).

Who it's for

MCP server developers verifying tool correctness and CI/CD teams automating tool validation on pull requests. Useful for teams already running GitHub Actions who want to catch tool regressions before merge.

Common use cases

Score tool output accuracy on PR changes using the GitHub Action
Test error handling (e.g., invalid inputs) via YAML eval files
Verify multi-step tool functionality, such as weather forecasts, locally with npx mcp-eval
Monitor tool performance metrics with the OTEL-compatible observability stack
Gate merges on eval scores by parsing GitHub Action results
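
The merge-gating use case can be sketched as a small threshold check over the five 1–5 dimension scores. This is a minimal sketch under stated assumptions: the `EvalScores` field names, the helper functions, and the default thresholds are all hypothetical, since the exact result schema and the Action's output format are not given here.

```typescript
// Hypothetical result shape: the five dimension names come from the description
// above, but the field names and types are assumptions, not the package's schema.
interface EvalScores {
  accuracy: number;
  completeness: number;
  relevance: number;
  clarity: number;
  reasoning: number;
}

const DIMENSIONS = [
  "accuracy",
  "completeness",
  "relevance",
  "clarity",
  "reasoning",
] as const;

// Arithmetic mean of the five dimension scores.
function meanScore(scores: EvalScores): number {
  const total = DIMENSIONS.reduce((sum, d) => sum + scores[d], 0);
  return total / DIMENSIONS.length;
}

// Names of the dimensions that score below the per-dimension floor.
function failingDimensions(scores: EvalScores, minPerDimension: number): string[] {
  return DIMENSIONS.filter((d) => scores[d] < minPerDimension);
}

// Gate passes only if the mean meets `minMean` and no dimension is below the floor.
// Both default thresholds are illustrative choices, not documented values.
function passesGate(scores: EvalScores, minMean = 4, minPerDimension = 3): boolean {
  return (
    meanScore(scores) >= minMean &&
    failingDimensions(scores, minPerDimension).length === 0
  );
}
```

In a CI step, a script could parse the scores, call `passesGate`, and exit non-zero on failure so the pull request check fails. Using both a mean and a per-dimension floor keeps one very low dimension from being hidden by high scores elsewhere.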

Setup pitfalls

Requires the OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable. Defaults to GPT-4, which can exhaust token quotas or run up charges on shared keys
The metrics and observability feature is alpha, with unstable APIs; the docker-compose stack requires Docker and careful port configuration
Reads the filesystem to load eval definitions; sandbox the run if eval files are untrusted
Last commit 347 days ago; the package is not actively maintained, so support for newer Anthropic SDK features or MCP spec updates may lag
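
A preflight check for the API key requirement might look like the following. This is a sketch, not the package's own logic: `detectProvider` is a hypothetical helper, and the preference for the OpenAI key is an assumption chosen only to mirror the GPT-4 default mentioned above.

```typescript
type Provider = "openai" | "anthropic";

// Hypothetical helper: reports which provider key is present in the given
// environment map. Checking OpenAI first is an assumption, not documented behavior.
function detectProvider(env: Record<string, string | undefined>): Provider {
  // Treat empty or whitespace-only values as unset.
  if (env["OPENAI_API_KEY"]?.trim()) return "openai";
  if (env["ANTHROPIC_API_KEY"]?.trim()) return "anthropic";
  throw new Error("Set OPENAI_API_KEY or ANTHROPIC_API_KEY before running evals");
}
```

Calling `detectProvider(process.env)` at the start of a CI job fails fast with a clear message instead of surfacing a provider error partway through an eval run.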

Tool name | Description | Destructive?
add | | ✓ no
