
github.com/confident-ai/deepeval @v4.0.7 sqlite


8,557 symbols · 40,323 edges · 1,328 files · 1,781 documented (21%)

README

    <img alt="DeepEval." src="https://github.com/confident-ai/deepeval/raw/v4.0.7/assets/hero/wordmark-light.svg" width="520">










<h1 align="center">The LLM Evaluation Framework</h1>

<a href="https://discord.gg/3SEyvpgu2f">
    <img alt="discord-invite" src="https://dcbadge.limes.pink/api/server/3SEyvpgu2f?style=flat">
</a>
<a href="https://www.reddit.com/r/deepeval/">
    <img alt="reddit-community" src="https://img.shields.io/badge/Reddit-r%2Fdeepeval-FF4500?logo=reddit&logoColor=white">
</a>

Documentation | Metrics and Features | Getting Started | Integrations | Confident AI

<a href="https://github.com/confident-ai/deepeval/releases">
    <img alt="GitHub release" src="https://img.shields.io/github/release/confident-ai/deepeval.svg?color=violet">
</a>
<a href="https://colab.research.google.com/drive/1PPxYEBa6eu__LquGoFFJZkhYgWVYE6kh?usp=sharing">
    <img alt="Try Quickstart in Colab" src="https://colab.research.google.com/assets/colab-badge.svg">
</a>
<a href="https://github.com/confident-ai/deepeval/blob/master/LICENSE.md">
    <img alt="License" src="https://img.shields.io/github/license/confident-ai/deepeval.svg?color=yellow">
</a>
<a href="https://x.com/deepeval">
    <img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/deepeval?style=social&logo=x">
</a>








<a href="https://www.readme-i18n.com/confident-ai/deepeval?lang=de">Deutsch</a> | 
<a href="https://www.readme-i18n.com/confident-ai/deepeval?lang=es">Español</a> | 
<a href="https://www.readme-i18n.com/confident-ai/deepeval?lang=fr">français</a> | 
<a href="https://www.readme-i18n.com/confident-ai/deepeval?lang=ja">日本語</a> | 
<a href="https://www.readme-i18n.com/confident-ai/deepeval?lang=ko">한국어</a> | 
<a href="https://www.readme-i18n.com/confident-ai/deepeval?lang=pt">Português</a> | 
<a href="https://www.readme-i18n.com/confident-ai/deepeval?lang=ru">Русский</a> | 
<a href="https://www.readme-i18n.com/confident-ai/deepeval?lang=zh">中文</a>

DeepEval is a simple-to-use, open-source framework for evaluating large language model (LLM) systems. It is similar to Pytest but specialized for unit testing LLM apps. DeepEval incorporates the latest research to run evals via metrics such as G-Eval, task completion, answer relevancy, and hallucination, which use LLM-as-a-judge and other NLP models that run locally on your machine.
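To make the "Pytest for LLM apps" idea concrete, here is a minimal standard-library sketch of the pattern: a test case, a metric that produces a score, and an assertion against a threshold. Everything in it (`ToyTestCase`, `KeywordCoverageMetric`, `assert_passes`) is a hypothetical stand-in written for illustration, not DeepEval's API, and the keyword-matching metric is a toy substitute for an LLM-as-a-judge metric.

```python
from dataclasses import dataclass


@dataclass
class ToyTestCase:
    """Hypothetical stand-in for an LLM test case (not a DeepEval class)."""
    input: str
    actual_output: str
    expected_keywords: list


class KeywordCoverageMetric:
    """Toy metric: fraction of expected keywords found in the output."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.score = 0.0

    def measure(self, case):
        text = case.actual_output.lower()
        hits = sum(1 for kw in case.expected_keywords if kw.lower() in text)
        total = len(case.expected_keywords)
        self.score = hits / total if total else 0.0
        return self.score

    def is_successful(self):
        return self.score >= self.threshold


def assert_passes(case, metrics):
    """Fail the test if any metric scores below its threshold."""
    for metric in metrics:
        metric.measure(case)
        assert metric.is_successful(), (
            f"{type(metric).__name__} scored {metric.score:.2f}, "
            f"below threshold {metric.threshold}"
        )


def test_refund_answer():
    # Pytest would collect this function like any other unit test.
    case = ToyTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
        expected_keywords=["refund", "30-day"],
    )
    assert_passes(case, [KeywordCoverageMetric(threshold=0.7)])
```

The design point is that each metric reduces a test case to a score with a pass/fail threshold, so an LLM evaluation can fail a test run the same way an ordinary unit test does.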

Whether you're building AI agents, RAG pipelines, or chatbots, implemented via LangChain or OpenAI, DeepEval has you covered. With it, you can easily determine the optimal models, prompts, and architecture to improve your AI quality, prevent prompt drifting, or even transition from OpenAI to Claude with confidence.

> [!IMPORTANT]
> Need a place for your DeepEval testing data to live 🏡❤️? Sign up to Confident AI to compare iterations of your LLM app, generate & share testing reports, and more.

Want to talk LLM evaluation, need help picking metrics, or just want to say hi? Come join our Discord.

🔥 Metrics and Features

📐 Large variety of ready-to-use LLM eval metrics (all with explanations) powered by ANY LLM of your choice, statistical methods, or NLP models that run locally on your machine, covering all use cases:

Custom, All-Purpose Metrics

- G-Eval — a research-backed LLM-as-a-judge metric for evaluating on any custom criteria with human-like accuracy
- DAG — DeepEval's graph-based deterministic LLM-as-a-judge metric builder

Agentic Metrics

- [Task Completion](https://deepeval.com/docs/metrics-task-completion) — evaluate whether an agent accomplished its goal
- [Tool Correctness](https://deepeval.com/docs/metrics-tool-correctness) — check if the right tools were called with the right arguments
- [Goal Accuracy](https://deepeval.com/docs/metrics-goal-accuracy) — measure how accurately the agent achieved the intended goal
- [Step Efficiency](https://deepeval.com/docs/metrics-step-efficiency) — evaluate whether the agent took unnecessary steps
- [Plan Adherence](https://deepeval.com/docs/metrics-plan-adherence) — check if the agent followed the expected plan
- [Plan Quality](https://deepeval.com/docs/metrics-plan-quality) — evaluate the quality of the agent's plan
- [Tool Use](https://deepeval.com/docs/metrics-tool-use) — measure quality of tool usage
- [Argument Correctness](https://deepeval.com/docs/metrics-argument-correctness) — validate tool call arguments
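The idea behind checking tool calls (right tools, optionally with the right arguments) can be sketched without any library. The function below is an illustrative assumption, not DeepEval's implementation: the name `tool_correctness`, the dict layout with `name` and `arguments` keys, and the scoring rule (fraction of expected calls that were matched) are all hypothetical.

```python
def tool_correctness(called, expected, check_arguments=False):
    """Return the fraction of expected tool calls found among the calls made.

    Each call is a dict with a "name" key and an optional "arguments" dict.
    Every call that was made can satisfy at most one expected call.
    """
    if not expected:
        return 1.0
    remaining = list(called)  # copy so the caller's list is left untouched
    matched = 0
    for want in expected:
        for i, got in enumerate(remaining):
            same_name = got["name"] == want["name"]
            same_args = (not check_arguments) or (
                got.get("arguments", {}) == want.get("arguments", {})
            )
            if same_name and same_args:
                matched += 1
                del remaining[i]
                break
    return matched / len(expected)


called = [
    {"name": "search_flights", "arguments": {"to": "NRT"}},
    {"name": "book_hotel", "arguments": {"city": "Tokyo"}},
]
expected = [
    {"name": "search_flights", "arguments": {"to": "HND"}},
    {"name": "book_hotel", "arguments": {"city": "Tokyo"}},
]

name_only_score = tool_correctness(called, expected)
strict_score = tool_correctness(called, expected, check_arguments=True)
```

With these inputs the name-only score is 1.0 and the strict score is 0.5, because the agent picked the right tools but passed a different destination to `search_flights`.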

RAG Metrics

- [Answer Relevancy](https://deepeval.com/docs/metrics-answer-relevancy) — measure how relevant the RAG pipeline's output is to the input
- [Faithfulness](https://deepeval.com/docs/metrics-faithfulness) — evaluate whether the RAG pipeline's output factually aligns with the retrieval context
- [Contextual Recall](https://deepeval.com/docs/metrics-contextual-recall) — measure how well the RAG pipeline's retrieval context aligns with the expected output
- [Contextual Precision](https://deepeval.com/docs/metrics-contextual-precision) — evaluate whether relevant nodes in the RAG pipeline's retrieval context are ranked higher
- [Contextual Relevancy](https://deepeval.com/docs/metrics-contextual-relevancy) — measure the overall relevance of the RAG pipeline's retrieval context to the input
- [RAGAS](https://deepeval.com/docs/metrics-ragas) — average of answer relevancy, faithfulness, contextual precision, and contextual recall
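Two of the descriptions above are simple arithmetic and can be sketched in plain Python: rewarding relevant nodes that are ranked higher, and averaging four component scores. This is an illustration under assumptions, not DeepEval's implementation. The function names are hypothetical, the relevance labels are supplied by hand rather than judged by a model, and the average-precision-style formula is one common way to score ranking quality.

```python
from statistics import mean


def rank_weighted_precision(relevance):
    """Average-precision-style score for a ranked list of relevance flags.

    `relevance` lists booleans in retrieval order (best-ranked node first).
    The score is higher when relevant nodes appear earlier in the ranking.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision at rank k
    return total / hits if hits else 0.0


def composite_average(scores):
    """Plain average of named component scores, as described for RAGAS."""
    return mean(scores.values())


good_ranking = rank_weighted_precision([True, False, True])  # relevant node first
poor_ranking = rank_weighted_precision([False, True, True])  # relevant nodes last

overall = composite_average({
    "answer_relevancy": 0.9,
    "faithfulness": 0.8,
    "contextual_precision": 0.7,
    "contextual_recall": 0.6,
})
```

The same two relevant nodes score about 0.83 when one of them is ranked first and about 0.58 when both are ranked last, which is the behaviour the Contextual Precision description asks for.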

Multi-Turn Metrics

- [Knowledge Retention](https://deepeval.com/docs/metrics-knowledge-retention) — evaluate whether the chatbot retains factual information throughout a conversation
- [Conversation Completeness](https://deepeval.com/docs/metrics-conversation-completeness) — measure whether the chatbot satisfies user needs throughout a conversation
- [Turn Relevancy](https://deepeval.com/docs/metrics-turn-relevancy) — evaluate whether the chatbot generates consistently relevant responses throughout a conversation
- [Turn Faithfulness](https://deepeval.com/docs/metrics-turn-faithfulness) — check if the chatbot's responses are factually grounded in retrieval context across turns
- [Role Adherence](https://deepeval.com/docs/metrics-role-adherence) — evaluate whether the chatbot adheres to its assigned role throughout a conversation

MCP Metrics

- [MCP Task Completion](https://deepeval.com/docs/metrics-mcp-task-completion) — evaluate how effectively an MCP-based agent accomplishes a task
- [MCP Use](https://deepeval.com/docs/metrics-mcp-use) — measure how effectively an agent uses its available MCP servers
- [Multi-Turn MCP Use](https://deepeval.com/docs/metrics-multi-turn-mcp-use) — evaluate MCP server usage across conversation turns

Multimodal Metrics

- [Text to Image](https://deepeval.com/docs/multimodal-metrics-text-to-image) — evaluate image generation quality based on semantic consistency and perceptual quality
- [Image Editing](https://deepeval.com/docs/multimodal-metrics-image-editing) — evaluate image editing quality based on semantic consistency and perceptual quality
- [Image Coherence](https://deepeval.com/docs/multimodal-metrics-image-coherence) — measure how well images align with their accompanying text
- [Image Helpfulness](https://deepeval.com/docs/multimodal-metrics-image-helpfulness) — evaluate how effectively images contribute to user comprehension of the text
- [Image Reference](https://deepeval.com/docs/multimodal-metrics-image-reference) — evaluate how accurately images are referred to or explained by accompanying text

Other Metrics

- [Hallucination](https://deepeval.com/docs/metrics-hallucination) — check whether the LLM generates factually correct information against provided context
- [Summarization](https://deepeval.com/docs/metrics-summarization) — evaluate whether summaries are factually correct and include necessary details
- [Bias](https://deepeval.com/docs/metrics-bias) — detect gender, racial, or political bias in LLM outputs
- [Toxicity](https://deepeval.com/docs/metrics-toxicity) — evaluate toxicity in LLM outputs
- [JSON Correctness](https://deepeval.com/docs/metrics-json-correctness) — check whether the output matches an expected JSON schema
- [Prompt Alignment](https://deepeval.com/docs/metrics-prompt-alignment) — measure whether the output aligns with instructions in the prompt template
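A check like JSON Correctness is deterministic and can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not DeepEval's implementation: `json_matches_schema` is a hypothetical helper, and the "schema" is reduced to a mapping from required field names to Python types.

```python
import json


def json_matches_schema(raw_output, required_fields):
    """Return True if raw_output is a JSON object with the required typed fields.

    `required_fields` maps each field name to the Python type it must have.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # the output is not valid JSON at all
    if not isinstance(data, dict):
        return False  # valid JSON, but not an object
    return all(
        key in data and isinstance(data[key], expected_type)
        for key, expected_type in required_fields.items()
    )


schema = {"order_id": int, "status": str}

valid = json_matches_schema('{"order_id": 42, "status": "shipped"}', schema)
wrong_type = json_matches_schema('{"order_id": "42", "status": "shipped"}', schema)
not_json = json_matches_schema("Your order has shipped!", schema)
```

Here `valid` is `True`, while `wrong_type` and `not_json` are both `False`. A real schema check would usually also handle nested objects and optional fields, which this sketch leaves out.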

🎯 Supports both end-to-end and component-level LLM evaluation.
🧩 Build your own custom metrics that are automatically integrated with DeepEval's ecosystem.
🔮 Generate both single and multi-turn synthetic datasets for evaluation.
🔗 Integrates seamlessly with ANY CI/CD environment.
🧬 Optimize prompts automatically based on evaluation results.
🏆 Easily benchmark ANY LLM on popular LLM benchmarks in under 10 lines of code, including MMLU, HellaSwag, DROP, BIG-Bench Hard, TruthfulQA, HumanEval, and GSM8K.

🔌 Integrations

DeepEval plugs into any LLM framework — OpenAI Agents, LangChain, CrewAI, and more. To scale evals across your team — or let anyone run them without writing code — Confident AI gives you a native platform integration.

Frameworks

- OpenAI — evaluate and trace OpenAI applications via a client wrapper
- OpenAI Agents — evaluate OpenAI Agents end-to-end in under a minute
- LangChain — evaluate LangChain applications with a callback handler
- LangGraph — evaluate LangGraph agents with a callback handler
- Pydantic AI — evaluate Pydantic AI agents with type-safe validation
- CrewAI — evaluate CrewAI multi-agent systems
- Anthropic — evaluate and trace Claude applications via a client wrapper
- AWS AgentCore — evaluate agents deployed on Amazon AgentCore
- LlamaIndex — evaluate RAG applications built with LlamaIndex

☁️ Platform + Ecosystem

Confident AI is an all-in-one platform that integrates natively with DeepEval.

Manage datasets, trace LLM applications, run evaluations, and monitor responses in production — all from one platform.
Don't need a UI? Confident AI can also be your data persistence layer: run evals, pull datasets, and inspect traces straight from Claude Code or Cursor via Confident AI's MCP server.

Confident AI MCP Architecture

🤖 Vibe-Coder QuickStart

Want your coding agent to add evals and fix failures for you? Install the DeepEval skill, point it at your agent, RAG pipeline, or chatbot, and ask it to generate a dataset, write the eval suite, run `deepeval test run`, and iterate on the failing metrics.

Start with the 5-minute vibe-coder guide.

🚀 Human QuickStart

Let's pretend your LLM application is a RAG-based customer support chatbot; here's how DeepEval can help test what you've built.

Installation

DeepEval works with Python 3.9+.

pip install -U deepeval

Create an account (highly recommended)

Using the deepeval platform will allow you to generate shareable testing reports on the cloud. It is free, takes no additional code to set up, and we highly recommend giving it a try.

To log in, run:

deepeval login

Follow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. All test cases will automatically be logged (find more information on data privacy [here](https://deepeval.com/docs/data-privacy?

Extension points: exported contracts, how you extend this code

PromptMessage (Interface)

(no doc)

typescript/test/test-core/prompt.test.ts

SimulateHttpResponse (Interface)

(no doc)

typescript/src/simulate/index.ts

LanguageOption (Interface)

(no doc)

docs/components/language-selector/language-selector.tsx

Term (Interface)

(no doc)

docs/lib/lang/terms.ts

LinkCardProps (Interface)

(no doc)

docs/src/components/LinkCards/index.tsx

APIAnnotation (Interface)

(no doc)

typescript/src/annotation/api.ts

LinkCardsProps (Interface)

(no doc)

docs/src/components/LinkCards/index.tsx

TraceEvalOptions (Interface)

(no doc)

typescript/src/evaluate/trace-eval.ts

Core symbols: most depended-on inside this repo

append

called by 1004

deepeval/dataset/types.py

get

called by 762

deepeval/tracing/context.py

make_model_data

called by 259

deepeval/models/llms/constants.py

_get_prompt

called by 240

deepeval/metrics/base_metric.py

push

called by 236

typescript/src/prompt/index.ts

get_settings

called by 200

deepeval/config/settings.py

edit

called by 193

deepeval/config/settings.py

set

called by 151

deepeval/tracing/context.py

Shape

Method 4,030

Function 3,022

Class 1,250

Interface 156

Route 84

Enum 15

Languages

Python 82%

TypeScript 18%

Modules by API surface

tests/test_core/stubs.py: 98 symbols

tests/test_core/test_prompts/test_interpolation.py: 91 symbols

typescript/src/tracing/tracing.ts: 80 symbols

deepeval/utils.py: 75 symbols

tests/test_confident/test_prompt.py: 73 symbols

tests/test_integrations/test_pydanticai/test_span_interceptor.py: 70 symbols

tests/test_core/test_models/test_openai_model.py: 64 symbols

tests/test_core/test_retry_policy.py: 59 symbols

tests/test_core/test_tracing/test_generators/test_generator_context_safety.py: 55 symbols

deepeval/tracing/tracing.py: 55 symbols

deepeval/synthesizer/synthesizer.py: 54 symbols

tests/test_core/test_test_case/test_multi_turn/test_turn.py: 53 symbols

Used by 1 indexed graph (manifest dependencies, hub-wide)

github.com/topoteretes/cognee

Dependencies: from manifests, versioned

@ai-sdk/openai 3.0.30 · 1×

@eslint/js 9.38.0 · 1×

@google/genai 2.8.0 · 1×

@jest/globals 29.5.0 · 1×

@langchain/core 1.1.31 · 1×

@langchain/openai 1.2.9 · 1×

@openai/agents 0.5.4 · 1×

@opentelemetry/api 1.9.0 · 1×

@opentelemetry/core 2.6.0 · 1×

@opentelemetry/exporter-trace-otlp-grpc 0.207.0 · 1×

@opentelemetry/exporter-trace-otlp-proto 0.211.0 · 1×

@opentelemetry/resources 2.2.0 · 1×

For agents

$ claude mcp add deepeval \
  -- python -m otcore.mcp_server <graph>
