Custom Evals, a new open-source evaluation framework created by Anjaiah Methuku, provides a single testing interface across 17+ agent frameworks including CrewAI, Pydantic AI, Hermes Agent, AWS Bedrock, Google ADK, and OpenClaw. The framework requires no backend, no dashboard, and no mandatory test runner, according to the project announcement on dev.to.

The Fragmentation Problem

Agent builders working across multiple frameworks face incompatible output shapes, different evaluation tools, and framework-specific testing requirements. Methuku described the core friction in the announcement: a team that builds with LangGraph on Monday, LlamaIndex RAG pipelines on Wednesday, and CrewAI by Friday ends up with a different evaluation setup for each one. Existing tools such as Phoenix Evals (Arize), DeepEval, and RAGAS each cover a specific niche, but none provides a single interface that works across all frameworks without requiring infrastructure.

Four-Layer Architecture

Custom Evals separates evaluation into four independent layers, as detailed in the technical writeup:

Layer 1: Deterministic checks. No LLM calls required. Exact match, sentiment scoring, and custom evaluators via a Python decorator. Zero API cost.

Layer 2: LLM-as-judge. Four production-ready evaluators ship out of the box: HallucinationEvaluator (reference-free, checks output against context), CorrectnessEvaluator (requires ground truth), RelevanceEvaluator (reference-free), and CoherenceEvaluator (reference-free). Two additional RAG-specific evaluators handle faithfulness and answer relevancy.

Layer 3: NLP similarity metrics. Seven standard metrics (BLEU, ROUGE, cosine similarity, Jaro-Winkler, and others) for reference-based comparison without LLM calls.

Layer 4: OCR/document metrics. Character Error Rate and Bounding Box IoU for extraction pipelines using AWS Textract or Azure Form Recognizer, as noted by earezki.com.
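The LLM-free checks in Layers 1 and 4 are simple enough to sketch in plain Python. The block below is an illustration under stated assumptions, not the Custom Evals API: the `evaluator` decorator, the `REGISTRY` dictionary, and every function name are hypothetical, and the bounding-box format `(x_min, y_min, x_max, y_max)` is assumed. It shows a decorator-registered exact-match check, Character Error Rate computed as Levenshtein edit distance divided by reference length, and Intersection over Union for two axis-aligned boxes.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry; the real framework's decorator may work differently.
REGISTRY: Dict[str, Callable[..., float]] = {}


def evaluator(name: str):
    """Register a plain function as a named deterministic evaluator."""
    def wrap(fn: Callable[..., float]) -> Callable[..., float]:
        REGISTRY[name] = fn
        return fn
    return wrap


@evaluator("exact_match")
def exact_match(output: str, expected: str) -> float:
    """Return 1.0 when the strings match after trimming whitespace, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


@evaluator("cer")
def character_error_rate(output: str, expected: str) -> float:
    """Edit distance normalized by reference length; lower is better."""
    if not expected:
        return 0.0 if not output else 1.0
    return edit_distance(output, expected) / len(expected)


Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


@evaluator("bbox_iou")
def bbox_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes; higher is better."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```

None of these functions calls a model, which is why checks of this kind add no API cost. Note that Character Error Rate improves as it falls while IoU improves as it rises, so any threshold applied to them has to account for direction.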

Universal Adapter Pattern

The framework’s cross-platform support relies on a universal adapter pattern that reduces every integration (cloud platforms and community frameworks alike) to a standardized eval_input dictionary. Each evaluator declares a DIRECTION property (“minimize” for hallucination, “maximize” for coherence), so a threshold test knows whether a lower or a higher score counts as passing, regardless of the metric’s semantics.
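A rough sketch of how such an adapter and a direction-aware threshold could fit together follows. Almost everything in it is hypothetical: the adapter function names, the shapes of the framework outputs, the keys inside `eval_input`, the two toy evaluators, and the `passes` helper are assumptions made for illustration. Only the idea of an `eval_input` dictionary and the `DIRECTION` values "minimize" and "maximize" come from the announcement. The toy evaluators use word overlap as a stand-in for an LLM judge.

```python
from typing import Any, Dict, List

EvalInput = Dict[str, Any]  # assumed keys: "input", "output", "context"


def from_chat_style(result: Dict[str, Any]) -> EvalInput:
    """Adapter for a hypothetical framework that returns a message list."""
    return {
        "input": result["messages"][0]["content"],
        "output": result["messages"][-1]["content"],
        "context": result.get("retrieved", []),
    }


def from_task_style(result: Dict[str, Any]) -> EvalInput:
    """Adapter for a hypothetical framework that returns a task record."""
    return {
        "input": result["task"],
        "output": result["final_answer"],
        "context": result.get("sources", []),
    }


def _unsupported_rate(output: str, context: List[str]) -> float:
    """Fraction of output words that never appear in the context."""
    words = output.lower().split()
    known = set(" ".join(context).lower().split())
    if not words:
        return 0.0
    return sum(w not in known for w in words) / len(words)


class UnsupportedTokenRate:
    """Toy stand-in for a hallucination score: lower is better."""
    DIRECTION = "minimize"

    def evaluate(self, eval_input: EvalInput) -> float:
        return _unsupported_rate(eval_input["output"], eval_input["context"])


class SupportedTokenRate:
    """Toy score where higher is better."""
    DIRECTION = "maximize"

    def evaluate(self, eval_input: EvalInput) -> float:
        return 1.0 - _unsupported_rate(eval_input["output"], eval_input["context"])


def passes(evaluator: Any, eval_input: EvalInput, threshold: float) -> bool:
    """Compare a score to a threshold in the direction the evaluator declares."""
    score = evaluator.evaluate(eval_input)
    if evaluator.DIRECTION == "minimize":
        return score <= threshold
    return score >= threshold
```

Because both adapters emit the same dictionary shape, the evaluators never need to know which framework produced the result, and the single `passes` helper handles both kinds of metric without a per-metric comparison operator.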

The Testing Gap in Agent Development

The release addresses a structural gap in agent development workflows. As earezki.com reported, engineering teams frequently demo agents that run smoothly, then spend weeks firefighting hallucinations in production, because existing evaluation tools are either too heavy (requiring full observability stacks), too niche (RAG-only), or too opinionated (requiring specific test runners). Custom Evals also includes Agent Shield for real-time observability and secret redaction in HTTP/WebSocket traffic. The framework installs from a local checkout of the repository via pip install -e ".[dev]", with no additional infrastructure required.