How to evaluate the effectiveness of RAG?
- Evaluation of RAG (Retrieval-Augmented Generation) performance (Part 5 of RAG Series) (Friend Link) (2024.2)
- Where the ground truth is provided by the evaluator (a metric sketch follows this sub-list)
- Character-based evaluation algorithms
- Edit distance
- Common use case: Language translation
- Word-based evaluation algorithms
- METEOR, Word Error Rate (WER)
- Common use case: Language translation
- BLEU Score (Bilingual Evaluation Understudy Score)
- Use cases: Translation, Text summarization
- ROUGE score (Recall-Oriented Understudy for Gisting Evaluation Score)
- Use cases: Summarization and translation; ROUGE works well for short sentences.
- Embedding-based evaluation algorithms
- BERT Score
- Use cases: BERT score works well for small text output (e.g. chats)
- Mover Score
- Use cases: Mover score works well for text generation tasks, e.g., machine translation, text summarization, image captioning, question answering, etc.
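To make the metric families above concrete, here is a minimal sketch, assuming the `nltk`, `rouge-score`, and `bert-score` packages and a made-up candidate/reference pair; it only illustrates how the scores are obtained, not the exact setup used in the linked article.

```python
# Reference-based metrics sketch (assumed packages: nltk, rouge-score, bert-score).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# Character-level: Levenshtein edit distance, implemented inline since the
# dynamic-programming recurrence is only a few lines.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print("Edit distance:", edit_distance(candidate, reference))

# Word-level: BLEU (n-gram precision with brevity penalty).
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
print("BLEU:", round(bleu, 3))

# Word-level: ROUGE (recall-oriented n-gram / longest-common-subsequence overlap).
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print("ROUGE:", rouge.score(reference, candidate))

# Embedding-based: BERTScore (token-level cosine similarity via BERT embeddings).
P, R, F1 = bert_score([candidate], [reference], lang="en")
print("BERTScore F1:", round(F1.item(), 3))
```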
- Where the ground truth is also generated by an LLM (LLM-assisted evaluation)
- Mathematical Framework — RAGAS Score
- From the retrieval perspective, it measures Context Precision and Context Recall.
- From the generation perspective, it measures Faithfulness and Answer Relevancy (see the sketch below).
- Experimental Based Framework — GPT score
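Once an LLM judge has labeled the retrieved chunks and answer claims, the four RAGAS-style components reduce to simple ratios. The sketch below is a simplified, hand-rolled version with made-up labels; the actual `ragas` library automates the judging prompts and uses somewhat more elaborate formulas (e.g., rank-aware context precision, embedding-based answer relevancy), and a single RAGAS score is sometimes reported as the harmonic mean of the components.

```python
# Hand-rolled RAGAS-style components; boolean labels are assumed to come
# from an LLM judge (the ragas library automates that judging step).
from statistics import harmonic_mean

def ratio(labels: list[bool]) -> float:
    return sum(labels) / len(labels) if labels else 0.0

# Retrieval side
context_precision = ratio([True, True, False])        # retrieved chunks judged relevant to the question
context_recall    = ratio([True, True, True, False])  # ground-truth claims attributable to the retrieved context
# Generation side
faithfulness      = ratio([True, True, True])         # answer claims supported by the retrieved context
answer_relevancy  = 0.89                               # mean similarity between the question and questions regenerated from the answer

ragas_score = harmonic_mean([context_precision, context_recall,
                             faithfulness, answer_relevancy])
print(context_precision, context_recall, faithfulness, answer_relevancy, ragas_score)
```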
- Stop Guessing and Measure Your RAG System to Drive Real Improvements (Friend Link)
- The Triple Crown of RAG Evaluation
- Context Relevance
- Answer Faithfulness
- Answer Relevance
- Spotlight on RAGAs (Retrieval Augmented Generation Assessment)
- Benchmarks — How Does Your RAG System Stack Up?
- Beyond Functionality: Navigating NFRs in a RAG System
- Overall System NFRs: performance, reliability, scalability, security, and usability
- Language Model NFRs: the groundedness, accuracy, and relevance of the responses generated by the language model
- How to Evaluate RAG If You Don’t Have Ground Truth Data
- Vector Similarity Search Threshold (see the sketch below)
- Using Multiple LLMs to Judge Responses
- Human-in-the-Loop Feedback: Involving the Experts
- Existing Frameworks to Simplify Your Evaluation Process
- RAGAS (Retrieval-Augmented Generation Assessment)
- ARES: Open-Source Framework Using Synthetic Data and LLM Judge
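For the no-ground-truth setting, the vector-similarity-threshold idea can be sketched in a few lines: embed the answer and the retrieved chunks, and flag answers that are not close to any chunk for LLM-judge or human review. `sentence-transformers` and the 0.75 threshold are stand-in choices for illustration, not recommendations from the article.

```python
# Similarity-threshold gate for RAG answers when no ground truth exists.
# sentence-transformers is only a stand-in embedding model; tune the
# threshold on your own data.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_similarity_gate(answer: str, contexts: list[str],
                           threshold: float = 0.75) -> bool:
    """True if the answer is close to at least one retrieved chunk;
    otherwise route it to an LLM judge or a human reviewer."""
    a_vec = model.encode(answer)
    best = max(cosine(a_vec, model.encode(c)) for c in contexts)
    return best >= threshold

print(passes_similarity_gate(
    "Paris is the capital of France.",
    ["France's capital city is Paris.", "The Eiffel Tower opened in 1889."]))
```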
- Evaluation (Part 8) (Friend Link) (2024.4)
- Appears to consolidate many articles, so the content is rather miscellaneous
- Benchmarks
- Language understanding & QA
- Common Sense & Reasoning
- Coding
- Conversation & Chatbot
- Different Ways to Compute Metric Scores
- Purely statistical scorers are reliable but inaccurate, since they struggle to take semantics into account. Scorers that rely purely on NLP models are the opposite: comparatively more accurate, but less reliable because of their probabilistic nature.
- Combining both (statistical and model-based)
- Embedding Model
- Language Model
- QAG Score
- GPTScore
- SelfCheckGPT
- Evaluation Methods
- Character level
- Word level
- Embedding-based
- BERTScore (combines both)
- MoverScore (combines both)
- Language-model-based (see the LLM-as-judge sketch below)
- G-Eval
- Prometheus
- GPTScore (combines both)
- SelfCheckGPT (combines both)
- QAG Score
- Frameworks (covering various metrics)
- RAGEval: framework to automatically generate RAG Evaluation Datasets for different domains
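The language-model-based scorers above all follow the same LLM-as-judge pattern; here is a rough G-Eval-flavored sketch with a hypothetical `call_llm()` placeholder for whichever chat-completion client you use. The actual G-Eval additionally weights the score by token probabilities, and SelfCheckGPT instead samples several answers and checks their mutual consistency; neither refinement is shown here.

```python
# G-Eval-style LLM-as-judge sketch: criterion + evaluation steps + text to
# grade, returning a 1-5 score. call_llm() is a hypothetical placeholder.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

GEVAL_PROMPT = """You are grading the coherence of a summary on a 1-5 scale.
Evaluation steps:
1. Read the source document and the summary.
2. Check whether the summary's sentences follow a logical order.
3. Penalize contradictions or abrupt topic jumps.

Source document:
{document}

Summary:
{summary}

Reply with a single integer from 1 to 5."""

def geval_coherence(document: str, summary: str) -> int:
    reply = call_llm(GEVAL_PROMPT.format(document=document, summary=summary))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"judge did not return a score: {reply!r}")
    return int(match.group())
```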
- The Challenges of Retrieving and Evaluating Relevant Context for RAG (Friend Link)
- Ragas, TruLens, and DeepEval
- RAG: Key Aspects of Performance: Metrics and Measurement (Friend Link)
- Evaluating The Quality Of RAG & Long-Context LLM Output
- Evaluate RAG Pipeline Response Using Python
- BLEU (Bilingual Evaluation Understudy) Score
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score
- BERTScore
- Perplexity (see the sketch below)
- Diversity
- Racial Bias
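Perplexity and diversity are the two metrics in this list that are not reference-based. A minimal sketch follows, using GPT-2 via `transformers` as a stand-in scoring model and distinct-n as one common diversity proxy; the linked article may use different choices.

```python
# Perplexity via a small causal LM, and distinct-n diversity over outputs.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(average negative log-likelihood per token)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

def distinct_n(texts: list[str], n: int = 2) -> float:
    """Distinct-n diversity: unique n-grams / total n-grams across outputs."""
    ngrams = []
    for t in texts:
        tokens = t.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

print("Perplexity:", round(perplexity("The quick brown fox jumps over the lazy dog."), 2))
print("Distinct-2:", distinct_n(["the cat sat on the mat", "the dog sat on the rug"]))
```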
- How to measure the Bias and Fairness of LLM? (Friend Link)
- RAG Evaluation: a Visual Approach
- RAG techniques evaluation
- Hallucination Detection Methods
- How To Choose an Embedding Model for my RAG Pipeline (Friend Link)
- Bitext Mining: F1, Accuracy, Precision, Recall
- Classification: Accuracy, Average Precision, F1
- Clustering: V-measure
- Pair Classification: Accuracy, Average Precision, F1, Precision, Recall
- Reranking: Mean MRR@k, MAP
- Retrieval: nDCG@k, MRR@k, MAP@k, Precision@k, Recall@k
- Semantic Textual Similarity (STS): Pearson and Spearman correlations (see the sketch below)
- Summarization: Pearson and Spearman correlations
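As an example of how the STS task scores an embedding model: embed each sentence pair, take the cosine similarity, and correlate it with human ratings. The model, sentence pairs, and ratings below are made-up stand-ins.

```python
# STS evaluation sketch: model similarity vs. human ratings, correlated
# with Pearson and Spearman. sentence-transformers is only a stand-in model.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [("A man is playing a guitar.", "A person plays guitar."),
         ("A man is playing a guitar.", "A chef is cooking pasta."),
         ("Kids are playing soccer.", "Children play football.")]
human_scores = [4.8, 0.5, 4.2]   # e.g. 0-5 similarity ratings

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

model_scores = [cosine(model.encode(s1), model.encode(s2)) for s1, s2 in pairs]
print("Pearson:", pearsonr(human_scores, model_scores)[0])
print("Spearman:", spearmanr(human_scores, model_scores)[0])
```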
- Unsupervised LLM Evaluations
- We demonstrate that the quality of self-evaluations can be improved with iterative self-reflection
- How to Create a RAG Evaluation Dataset From Documents (Friend Link)
- Choosing the Best Embedding Model For Your RAG Pipeline
- NDCG (Normalized Discounted Cumulative Gain; see the ranking-metric sketch at the end of this list)
- MRR (Mean Reciprocal Rank)
- MAP (Mean Average Precision)
- Recall
- Precision
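The ranking metrics in this list are easy to compute by hand for a single query with binary relevance (MRR and MAP are then averaged over all queries). The sketch below uses made-up document IDs.

```python
# Retrieval ranking metrics for one query with binary relevance labels.
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    for rank, d in enumerate(ranked_ids, 1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0

def average_precision_at_k(ranked_ids, relevant_ids, k):
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked_ids[:k], 1):
        if d in relevant_ids:
            hits += 1
            total += hits / rank
    return total / min(len(relevant_ids), k) if relevant_ids else 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranked_ids[:k], 1) if d in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7", "d2", "d9"]   # retriever output order
relevant = {"d1", "d2"}                   # known-relevant documents
print("P@5:", precision_at_k(ranked, relevant, 5))
print("R@5:", recall_at_k(ranked, relevant, 5))
print("MRR (one query):", reciprocal_rank(ranked, relevant))
print("MAP@5 (one query):", average_precision_at_k(ranked, relevant, 5))
print("nDCG@5:", round(ndcg_at_k(ranked, relevant, 5), 3))
```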