How to evaluate the effectiveness of RAG?
- Evaluation of RAG (Retrieval-Augmented Generation) performance (Part 5 of RAG Series) (Friend Link) (2024.2)
- Where the ground truth is provided by the evaluator (a metric sketch follows this sub-list)
- Character-based evaluation algorithms
- Edit distance
- Common use case: Language translation
- Word-based evaluation algorithms
- METEOR, Word Error Rate (WER)
- Common use case: Language translation
- BLEU Score (Bilingual Evaluation Understudy Score)
- Use cases: Translation, Text summarization
- ROUGE score (Recall-Oriented Understudy for Gisting Evaluation Score)
- Use cases: Summarization and translation; ROUGE works well for short sentences.
- Embedding-based evaluation algorithms
- BERT Score
- Use cases: BERT score works well for small text output (e.g. chats)
- Mover Score
- Use cases: Mover score works well for text generation tasks, e.g., machine translation, text summarization, image captioning, question answering, etc.
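To make the metric families above concrete, here is a minimal sketch, assuming the `nltk`, `rouge-score`, and `bert-score` packages and a made-up candidate/reference pair; it only illustrates how the scores are obtained, not the exact setup used in the linked article.

```python
# Reference-based metrics sketch (assumed packages: nltk, rouge-score, bert-score).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# Character-level: Levenshtein edit distance, implemented inline since the
# dynamic-programming recurrence is only a few lines.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print("Edit distance:", edit_distance(candidate, reference))

# Word-level: BLEU (n-gram precision with brevity penalty).
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
print("BLEU:", round(bleu, 3))

# Word-level: ROUGE (recall-oriented n-gram / longest-common-subsequence overlap).
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print("ROUGE:", rouge.score(reference, candidate))

# Embedding-based: BERTScore (token-level cosine similarity via BERT embeddings).
P, R, F1 = bert_score([candidate], [reference], lang="en")
print("BERTScore F1:", round(F1.item(), 3))
```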
- Where the ground truth is also generated by an LLM (LLM-assisted evaluation)
- Mathematical Framework — RAGAS Score
- From the retrieval perspective, it measures Context Precision and Context Recall.
- From the generation perspective, it measures Faithfulness and Answer Relevancy (see the sketch below).
- Experimental Based Framework — GPT score
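Once an LLM judge has labeled the retrieved chunks and answer claims, the four RAGAS-style components reduce to simple ratios. The sketch below is a simplified, hand-rolled version with made-up labels; the actual `ragas` library automates the judging prompts and uses somewhat more elaborate formulas (e.g., rank-aware context precision, embedding-based answer relevancy), and a single RAGAS score is sometimes reported as the harmonic mean of the components.

```python
# Hand-rolled RAGAS-style components; boolean labels are assumed to come
# from an LLM judge (the ragas library automates that judging step).
from statistics import harmonic_mean

def ratio(labels: list[bool]) -> float:
    return sum(labels) / len(labels) if labels else 0.0

# Retrieval side
context_precision = ratio([True, True, False])        # retrieved chunks judged relevant to the question
context_recall    = ratio([True, True, True, False])  # ground-truth claims attributable to the retrieved context
# Generation side
faithfulness      = ratio([True, True, True])         # answer claims supported by the retrieved context
answer_relevancy  = 0.89                               # mean similarity between the question and questions regenerated from the answer

ragas_score = harmonic_mean([context_precision, context_recall,
                             faithfulness, answer_relevancy])
print(context_precision, context_recall, faithfulness, answer_relevancy, ragas_score)
```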
- Stop Guessing and Measure Your RAG System to Drive Real Improvements (Friend Link)
- The Triple Crown of RAG Evaluation
- Context Relevance
- Answer Faithfulness
- Answer Relevance
- Spotlight on RAGAs (Retrieval Augmented Generation Assessment)
- Benchmarks — How Does Your RAG System Stack Up?
- Beyond Functionality: Navigating NFRs in a RAG System
- Overall System NFRs: performance, reliability, scalability, security, and usability
- Language Model NFRs: the groundedness, accuracy, and relevance of the responses generated by the language model
- How to Evaluate RAG If You Don’t Have Ground Truth Data
- Vector Similarity Search Threshold (see the sketch below)
- Using Multiple LLMs to Judge Responses
- Human-in-the-Loop Feedback: Involving the Experts
- Existing Frameworks to Simplify Your Evaluation Process
- RAGAS (Retrieval-Augmented Generation Assessment)
- ARES: Open-Source Framework Using Synthetic Data and LLM Judge
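For the no-ground-truth setting, the vector-similarity-threshold idea can be sketched in a few lines: embed the answer and the retrieved chunks, and flag answers that are not close to any chunk for LLM-judge or human review. `sentence-transformers` and the 0.75 threshold are stand-in choices for illustration, not recommendations from the article.

```python
# Similarity-threshold gate for RAG answers when no ground truth exists.
# sentence-transformers is only a stand-in embedding model; tune the
# threshold on your own data.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_similarity_gate(answer: str, contexts: list[str],
                           threshold: float = 0.75) -> bool:
    """True if the answer is close to at least one retrieved chunk;
    otherwise route it to an LLM judge or a human reviewer."""
    a_vec = model.encode(answer)
    best = max(cosine(a_vec, model.encode(c)) for c in contexts)
    return best >= threshold

print(passes_similarity_gate(
    "Paris is the capital of France.",
    ["France's capital city is Paris.", "The Eiffel Tower opened in 1889."]))
```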
- Evaluation (Part 8) (Friend Link) (2024.4)
- Appears to consolidate many articles, so the content is rather miscellaneous
- Benchmarks
- Language understanding & QA
- Common Sense & Reasoning
- Coding
- Conversation & Chatbot
- Different Ways to Compute Metric Scores
- Purely statistical scorers are reliable but inaccurate, since they struggle to take semantics into account. Scorers that rely purely on NLP models are the opposite: comparatively more accurate, but less reliable because of their probabilistic nature.
- Combining both (statistical and model-based)
- Embedding Model
- Language Model
- QAG Score
- GPTScore
- SelfCheckGPT
- Evaluation Methods
- Character level
- Word level
- Embedding-based
- BERTScore (combines both)
- MoverScore (combines both)
- Language-model-based (see the LLM-as-judge sketch below)
- G-Eval
- Prometheus
- GPTScore (combines both)
- SelfCheckGPT (combines both)
- QAG Score
- Frameworks (covering various metrics)
- RAGEval: framework to automatically generate RAG Evaluation Datasets for different domains
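The language-model-based scorers above all follow the same LLM-as-judge pattern; here is a rough G-Eval-flavored sketch with a hypothetical `call_llm()` placeholder for whichever chat-completion client you use. The actual G-Eval additionally weights the score by token probabilities, and SelfCheckGPT instead samples several answers and checks their mutual consistency; neither refinement is shown here.

```python
# G-Eval-style LLM-as-judge sketch: criterion + evaluation steps + text to
# grade, returning a 1-5 score. call_llm() is a hypothetical placeholder.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

GEVAL_PROMPT = """You are grading the coherence of a summary on a 1-5 scale.
Evaluation steps:
1. Read the source document and the summary.
2. Check whether the summary's sentences follow a logical order.
3. Penalize contradictions or abrupt topic jumps.

Source document:
{document}

Summary:
{summary}

Reply with a single integer from 1 to 5."""

def geval_coherence(document: str, summary: str) -> int:
    reply = call_llm(GEVAL_PROMPT.format(document=document, summary=summary))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"judge did not return a score: {reply!r}")
    return int(match.group())
```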
- The Challenges of Retrieving and Evaluating Relevant Context for RAG (Friend Link)
- Ragas, TruLens, and DeepEval
- RAG: Key Aspects of Performance: Metrics and Measurement (Friend Link)
- Evaluating The Quality Of RAG & Long-Context LLM Output
- Evaluate RAG Pipeline Response Using Python
- BLEU (Bilingual Evaluation Understudy) Score
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score
- BERTScore
- Perplexity (see the sketch below)
- Diversity
- Racial Bias
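Perplexity and diversity are the two metrics in this list that are not reference-based. A minimal sketch follows, using GPT-2 via `transformers` as a stand-in scoring model and distinct-n as one common diversity proxy; the linked article may use different choices.

```python
# Perplexity via a small causal LM, and distinct-n diversity over outputs.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(average negative log-likelihood per token)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

def distinct_n(texts: list[str], n: int = 2) -> float:
    """Distinct-n diversity: unique n-grams / total n-grams across outputs."""
    ngrams = []
    for t in texts:
        tokens = t.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

print("Perplexity:", round(perplexity("The quick brown fox jumps over the lazy dog."), 2))
print("Distinct-2:", distinct_n(["the cat sat on the mat", "the dog sat on the rug"]))
```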
- How to measure the Bias and Fairness of LLM? (Friend Link)
- RAG Evaluation: a Visual Approach
- RAG techniques evaluation
- Hallucination Detection Methods
- How To Choose an Embedding Model for my RAG Pipeline (Friend Link)
- Bitext Mining: F1, Accuracy, Precision, Recall
- Classification: Accuracy, Average Precision, F1
- Clustering: V-measure
- Pair Classification: Accuracy, Average Precision, F1, Precision, Recall
- Reranking: Mean MRR@k, MAP
- Retrieval: nDCG@k, MRR@k, MAP@k, Precision@k, Recall@k
- Semantic Textual Similarity (STS): Pearson and Spearman correlations (see the sketch below)
- Summarization: Pearson and Spearman correlations
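As an example of how the STS task scores an embedding model: embed each sentence pair, take the cosine similarity, and correlate it with human ratings. The model, sentence pairs, and ratings below are made-up stand-ins.

```python
# STS evaluation sketch: model similarity vs. human ratings, correlated
# with Pearson and Spearman. sentence-transformers is only a stand-in model.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [("A man is playing a guitar.", "A person plays guitar."),
         ("A man is playing a guitar.", "A chef is cooking pasta."),
         ("Kids are playing soccer.", "Children play football.")]
human_scores = [4.8, 0.5, 4.2]   # e.g. 0-5 similarity ratings

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

model_scores = [cosine(model.encode(s1), model.encode(s2)) for s1, s2 in pairs]
print("Pearson:", pearsonr(human_scores, model_scores)[0])
print("Spearman:", spearmanr(human_scores, model_scores)[0])
```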
- Unsupervised LLM Evaluations
- We demonstrate that the quality of self-evaluations can be improved with iterative self-reflection
- How to Create a RAG Evaluation Dataset From Documents (Friend Link)
- Choosing the Best Embedding Model For Your RAG Pipeline
- NDCG (Normalized Discounted Cumulative Gain; see the ranking-metric sketch at the end of this list)
- MRR (Mean Reciprocal Rank)
- MAP (Mean Average Precision)
- Recall
- Precision
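The ranking metrics in this list are easy to compute by hand for a single query with binary relevance (MRR and MAP are then averaged over all queries). The sketch below uses made-up document IDs.

```python
# Retrieval ranking metrics for one query with binary relevance labels.
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    for rank, d in enumerate(ranked_ids, 1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0

def average_precision_at_k(ranked_ids, relevant_ids, k):
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked_ids[:k], 1):
        if d in relevant_ids:
            hits += 1
            total += hits / rank
    return total / min(len(relevant_ids), k) if relevant_ids else 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranked_ids[:k], 1) if d in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7", "d2", "d9"]   # retriever output order
relevant = {"d1", "d2"}                   # known-relevant documents
print("P@5:", precision_at_k(ranked, relevant, 5))
print("R@5:", recall_at_k(ranked, relevant, 5))
print("MRR (one query):", reciprocal_rank(ranked, relevant))
print("MAP@5 (one query):", average_precision_at_k(ranked, relevant, 5))
print("nDCG@5:", round(ndcg_at_k(ranked, relevant, 5), 3))
```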