Use DeepEval and Traditional Metrics to Assess RAG Responses

Beginner · Guided Project

Explore Large Language Model (LLM) evaluation techniques in this hands-on project that compares LLaMA and Granite for Retrieval-Augmented Generation (RAG) and textual analysis. You'll use Hugging Face's Evaluate library to compute traditional metrics, and DeepEval, a modern LLM-based framework, to evaluate more complex metrics. Through step-by-step guidance, you'll set up RAG and metric-evaluation pipelines, interpret the results, and discover how modular metrics adapt to any LLM use case. Enroll now to gain essential data science expertise and confidently deploy robust RAG applications.

Language

  • English

Topic

  • Artificial Intelligence

Skills You Will Learn

  • LLM Evaluation, DeepEval, RAG, ROUGE, BERTScore, Generative AI

Offered By

  • IBMSkillsNetwork

Estimated Effort

  • 20 mins

Platform

  • SkillsNetwork

Last Update

  • August 29, 2025

About this Guided Project

Evaluating large language models (LLMs) is crucial for ensuring they deliver accurate, reliable, and contextually appropriate outputs. By comparing models (LLaMA and Granite) in a Retrieval-Augmented Generation (RAG) setup, you'll gain hands-on experience with the end-to-end evaluation process. You'll also learn to leverage DeepEval, an LLM-based framework, and traditional metrics (ROUGE, BERTScore) via Hugging Face's Evaluate library. These skills will empower you to rigorously assess any LLM pipeline, identify strengths and weaknesses, and iterate toward better model and overall application performance.

A Look at the Project Ahead

In this guided project, you will:

  • Set Up a RAG Pipeline: Integrate LLaMA and Granite with vector stores to retrieve relevant context for narrative QA (a minimal retrieval sketch follows this list).
  • Compute and Compare Metrics: Apply ROUGE and BERTScore to quantify each model's response quality, then interpret the results (see the second sketch below).
  • Implement Evaluation Workflows: Use DeepEval to orchestrate human-like judgments alongside automatic metrics.
  • Explore Modularity: See how easily you can swap in new models, datasets, or metrics for future experiments.
  • Visualize and Interpret Results: Plot the computed scores to compare model performance across metrics.
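
As a preview of the first step, here is a minimal retrieval sketch. It assumes FAISS as the vector store and a sentence-transformers embedding model, which may differ from the guided project's exact stack, and it stops at building the grounded prompt that would be sent to LLaMA or Granite.

```python
# Minimal RAG retrieval sketch (illustrative only; the project's stack may differ).
# Assumes: pip install faiss-cpu sentence-transformers
import faiss
from sentence_transformers import SentenceTransformer

passages = [
    "Granite is a family of foundation models developed by IBM.",
    "LLaMA is a family of open-weight language models from Meta.",
]

# Embed the passages and normalize so inner product equals cosine similarity.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(passages, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = "Which company develops the Granite models?"
query_emb = embedder.encode([query], normalize_embeddings=True).astype("float32")
_, ids = index.search(query_emb, 1)  # retrieve the top-1 passage

# Build a grounded prompt from the retrieved context; this string would then be
# sent to LLaMA or Granite through whichever inference client you are using.
context = "\n".join(passages[i] for i in ids[0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```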
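
The metric computation itself follows a standard pattern with Hugging Face's Evaluate library; the strings below are placeholders rather than data from the project.

```python
# Traditional metric computation with Hugging Face's Evaluate library.
# Assumes: pip install evaluate rouge_score bert_score
import evaluate

predictions = ["The castle was abandoned after the fire."]    # model outputs (placeholder)
references = ["The castle was deserted following the fire."]  # reference answers (placeholder)

# ROUGE: n-gram and longest-common-subsequence overlap with the reference.
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=predictions, references=references)
print(rouge_scores)  # rouge1, rouge2, rougeL, rougeLsum

# BERTScore: token-level semantic similarity based on contextual embeddings.
bertscore = evaluate.load("bertscore")
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")
print(bert_scores["precision"], bert_scores["recall"], bert_scores["f1"])
```
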
By the end of this project, you will be able to:

  • Design and deploy a retrieval-augmented generation pipeline using popular open-source LLMs.
  • Build a flexible evaluation framework that combines automatic scoring with LLM-driven judgment, and analyze metric outputs to guide model selection (a short DeepEval sketch follows).
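
On the LLM-driven side, DeepEval's basic pattern is a test case plus one or more judgment metrics, as in the minimal sketch below. The strings are placeholders, and by default DeepEval's metrics call an OpenAI model as the judge, so an API key (or a custom evaluation model passed via the metric's model argument) is required.

```python
# LLM-as-judge evaluation with DeepEval (minimal sketch; placeholder data).
# Assumes: pip install deepeval, plus an evaluation model (OpenAI key by default,
# or a custom model passed to the metric via its `model` argument).
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Why was the castle abandoned?",
    actual_output="It was abandoned after a fire destroyed the keep.",
    retrieval_context=["The castle was deserted following the great fire of 1542."],
)

# Each metric scores the test case and explains its judgment.
for metric in (AnswerRelevancyMetric(threshold=0.7), FaithfulnessMetric(threshold=0.7)):
    metric.measure(test_case)
    print(type(metric).__name__, metric.score, metric.reason)
```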

What You'll Need

  • Basic Python proficiency: Comfortable with common data structures and writing simple scripts.
  • Modern web browser: The latest version of Chrome, Edge, Firefox, or Safari for an optimal notebook experience.
  • (Optional) Library knowledge: Basic familiarity with the Pandas DataFrame structure and Matplotlib plotting is helpful but not required.

Instructors

Joshua Zhou

Data Scientist

I like building fun and practical things.

Contributors

Joseph Santarcangelo

Senior Data Scientist at IBM

Joseph has a Ph.D. in Electrical Engineering; his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. He has been working at IBM since completing his Ph.D.

Wojciech "Victor" Fulmyk

Data Scientist at IBM

As a data scientist at the Ecosystems Skills Network at IBM and a Ph.D. candidate in Economics at the University of Calgary, I bring a wealth of experience in unraveling complex problems through the lens of data. What sets me apart is my ability to seamlessly merge technical expertise with effective communication, translating intricate data findings into actionable insights for stakeholders at all levels. Follow my projects to learn data science principles, machine learning algorithms, and artificial intelligence agent implementations.
