Eval · Confident AI

DeepEval

Pytest-style LLM evaluation framework. Open source.

FREEMIUM · Open core · Hybrid · CLI · API

Open-source (Apache 2.0) framework for evaluating LLM apps the way Pytest tests code — assertions backed by 50+ ready metrics spanning LLM-as-judge, RAG, agents, conversation, and safety. Plugs into LangChain, CrewAI, OpenAI Agents and more. Confident AI is the paid cloud platform that adds test management, dashboards, and observability on top.
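The "assertions backed by metrics" idea in the description above can be sketched in plain Python. This is a minimal illustration of the pattern only, not DeepEval's API: `EvalCase`, `KeywordCoverageMetric`, and `assert_case` are hypothetical names, and the keyword check stands in for an LLM-as-judge metric so the example runs without a provider key.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One evaluated interaction (hypothetical type, not a DeepEval class)."""
    input: str
    actual_output: str
    expected_keywords: list


@dataclass
class KeywordCoverageMetric:
    """Stand-in metric: fraction of expected keywords found in the output.

    A real LLM-as-judge metric would call a model here instead.
    """
    threshold: float = 0.5

    def measure(self, case: EvalCase) -> float:
        if not case.expected_keywords:
            return 1.0
        text = case.actual_output.lower()
        hits = sum(1 for kw in case.expected_keywords if kw.lower() in text)
        return hits / len(case.expected_keywords)


def assert_case(case: EvalCase, metrics: list) -> None:
    """Fail like an ordinary test assertion when any metric is below threshold."""
    for metric in metrics:
        score = metric.measure(case)
        assert score >= metric.threshold, (
            f"{type(metric).__name__} scored {score:.2f}, "
            f"below threshold {metric.threshold:.2f}"
        )


def test_refund_answer():
    # Written as a normal Pytest test function: pytest would collect and run it.
    case = EvalCase(
        input="What is the refund policy?",
        actual_output="You can request a full refund within 30 days.",
        expected_keywords=["refund", "30 days"],
    )
    assert_case(case, [KeywordCoverageMetric(threshold=1.0)])
```

Because the check is a plain `assert`, a low score surfaces as a normal test failure in any Pytest run, which is what lets this style of evaluation slot into an existing CI job.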

Model support

BYO key / model

Bring any provider/model for the LLM-as-judge metrics.

Where it runs

  • CLI
  • API

Tags

  • #eval
  • #open-source
  • #llm-as-judge
  • #rag
  • #ci
  • Braintrust
    Eval · FREEMIUM

    Hosted eval + tracing platform for LLM apps.

    Production-grade eval orchestration with a dashboard, dataset versioning, and OpenTelemetry tracing. Useful once eval volume outgrows a CI YAML file.

    AI insight: Where teams graduate when a CI eval file stops scaling — it adds dataset versioning and OpenTelemetry traces to the loop.

    • eval
    • tracing
    • datasets
    • production
  • Promptfoo
    Eval · FREE · OSS

    Open-source LLM eval CLI. Rubric scoring + golden sets.

    YAML-driven eval harness. Pair a prompt with a goldset, define rubrics, run across multiple models in CI. Strong for catching prompt regressions before they hit production.

    AI insight: Define evals in plain YAML and run one goldset across models in CI — a prompt regression fails the build like any other test.

    • eval
    • ci
    • rubric
    • open-source
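The workflow in the Promptfoo entry above (one golden set, several models, a build that fails on regression) can be sketched as follows. This is a rough Python illustration of the idea under stated assumptions, not Promptfoo's YAML schema or CLI: `GOLDSET`, `run_goldset`, `exit_code`, and the two stub models are hypothetical, and a substring check stands in for rubric scoring.

```python
from typing import Callable, Dict, List

# Hypothetical golden set: template variables plus a simple expectation per case.
GOLDSET = [
    {"vars": {"topic": "cats"}, "must_contain": "cats"},
    {"vars": {"topic": "tea"}, "must_contain": "tea"},
]
PROMPT = "Write one sentence about {topic}."


def model_stable(prompt: str) -> str:
    """Stub model that stays on topic by echoing the prompt."""
    return f"Here is a sentence: {prompt}"


def model_regressed(prompt: str) -> str:
    """Stub model simulating a prompt regression."""
    return "I cannot help with that."


def run_goldset(
    prompt_template: str,
    goldset: List[dict],
    models: Dict[str, Callable[[str], str]],
) -> Dict[str, List[bool]]:
    """Run every golden case against every model and record pass/fail."""
    results: Dict[str, List[bool]] = {}
    for name, model in models.items():
        row = []
        for case in goldset:
            output = model(prompt_template.format(**case["vars"]))
            row.append(case["must_contain"].lower() in output.lower())
        results[name] = row
    return results


def exit_code(results: Dict[str, List[bool]]) -> int:
    """Return 0 only if every case passed on every model, as a CI step would."""
    return 0 if all(all(row) for row in results.values()) else 1


results = run_goldset(
    PROMPT, GOLDSET, {"stable": model_stable, "regressed": model_regressed}
)
```

A CI job could pass `exit_code(results)` to `sys.exit`, so any failing case on any model fails the build the same way a failing unit test would.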