Text generation evaluation measures how closely a model’s output matches one or more reference texts. Use these metrics to evaluate summarization, machine translation, or answers generated by a RAG pipeline.
reval provides two complementary approaches:
- ROUGE — lexical overlap between candidate and reference tokens
- BERTScore — semantic similarity using dense vector embeddings
Tokenization
All reval text functions operate on []string token slices, not raw strings. You are responsible for tokenizing text before passing it in.
strings.Fields splits on whitespace and is fine for quick experiments. For production evaluation, use a proper tokenizer that handles punctuation, casing, and stemming consistently between candidates and references.
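A minimal sketch of whitespace tokenization with the standard library (the lowercasing step is an assumption for illustration, not something reval requires):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	candidate := "The cat sat on the mat."
	reference := "A cat was sitting on the mat."

	// strings.Fields splits on runs of whitespace; note it keeps punctuation
	// attached to tokens ("mat."), which is one reason to prefer a real tokenizer.
	candTokens := strings.Fields(strings.ToLower(candidate))
	refTokens := strings.Fields(strings.ToLower(reference))

	fmt.Println(candTokens) // [the cat sat on the mat.]
	fmt.Println(refTokens)  // [a cat was sitting on the mat.]
}
```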
ROUGE-1
ROUGE-1 measures unigram (single token) overlap between candidate and reference.
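As an illustration of what the metric computes (a standalone sketch, not the reval API), unigram precision, recall, and F1 can be derived from clipped token counts:

```go
// rouge1 is an illustrative unigram-overlap implementation, not the reval API.
// Each candidate token can match at most as many times as it appears in the reference.
func rouge1(candidate, reference []string) (precision, recall, f1 float64) {
	refCounts := map[string]int{}
	for _, tok := range reference {
		refCounts[tok]++
	}
	matches := 0
	for _, tok := range candidate {
		if refCounts[tok] > 0 {
			matches++
			refCounts[tok]--
		}
	}
	if len(candidate) > 0 {
		precision = float64(matches) / float64(len(candidate))
	}
	if len(reference) > 0 {
		recall = float64(matches) / float64(len(reference))
	}
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}
```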
ROUGE-L
ROUGE-L measures the Longest Common Subsequence (LCS) between candidate and reference, capturing in-order word matches without requiring them to be contiguous.
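The core of ROUGE-L is an LCS length computation; a textbook dynamic-programming version over token slices looks roughly like this (an illustrative helper, not the reval API):

```go
// lcsLength returns the length of the longest common subsequence of two token slices.
func lcsLength(a, b []string) int {
	// dp[i][j] holds the LCS length of a[:i] and b[:j].
	dp := make([][]int, len(a)+1)
	for i := range dp {
		dp[i] = make([]int, len(b)+1)
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				dp[i][j] = dp[i-1][j-1] + 1
			case dp[i-1][j] >= dp[i][j-1]:
				dp[i][j] = dp[i-1][j]
			default:
				dp[i][j] = dp[i][j-1]
			}
		}
	}
	return dp[len(a)][len(b)]
}
```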
ROUGE-Lsum
ROUGE-Lsum evaluates multi-sentence candidates against multiple reference sentences. It finds the best-matching reference for each candidate sentence and accumulates LCS scores across all sentences.
Use ROUGE-Lsum when your candidate is a multi-sentence document (e.g., an extractive summary) and you have multiple reference summaries to compare against.
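Following the description above, a rough sketch of the recall side might look like this (it reuses lcsLength from the ROUGE-L sketch, is not the reval API, and the real metric also reports precision and F1):

```go
// rougeLsumRecall is an illustrative sketch: each candidate sentence is scored against
// its best-matching reference sentence by LCS length, and the totals are accumulated.
func rougeLsumRecall(candidateSents, referenceSents [][]string) float64 {
	totalRefTokens := 0
	for _, ref := range referenceSents {
		totalRefTokens += len(ref)
	}
	if totalRefTokens == 0 {
		return 0
	}
	totalLCS := 0
	for _, cand := range candidateSents {
		best := 0
		for _, ref := range referenceSents {
			if l := lcsLength(cand, ref); l > best {
				best = l
			}
		}
		totalLCS += best
	}
	return float64(totalLCS) / float64(totalRefTokens)
}
```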
BERTScore
BERTScore computes semantic similarity using pre-computed token embeddings. It greedily matches each candidate embedding to the most similar reference embedding via dot product, then returns precision, recall, and F1.
BERTScore expects embeddings you compute yourself — for example, from BERT, Sentence-BERT, or any other embedding model. The function does not perform tokenization or encoding; it only handles the matching and scoring step.
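To make the matching step concrete, here is a standalone sketch over unit-norm float64 embeddings (an illustration of the greedy matching described above, not reval's function signature):

```go
// bertScore greedily matches each embedding on one side to its most similar embedding
// on the other side via dot product, then averages; illustrative only, not the reval API.
func bertScore(candidate, reference [][]float64) (precision, recall, f1 float64) {
	avgBestMatch := func(from, to [][]float64) float64 {
		if len(from) == 0 || len(to) == 0 {
			return 0
		}
		total := 0.0
		for _, f := range from {
			best := 0.0
			for ti, t := range to {
				sim := 0.0
				for i := range f {
					sim += f[i] * t[i] // dot product; equals cosine similarity for unit-norm vectors
				}
				if ti == 0 || sim > best {
					best = sim
				}
			}
			total += best
		}
		return total / float64(len(from))
	}
	precision = avgBestMatch(candidate, reference) // each candidate token against its best reference match
	recall = avgBestMatch(reference, candidate)    // each reference token against its best candidate match
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}
```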
L2-normalizing embeddings
If your embedding model does not produce unit-norm vectors, normalize them before scoring to ensure dot product equals cosine similarity:
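A minimal sketch of that normalization step, assuming float64 embeddings (reval's own types may differ):

```go
import "math"

// normalizeL2 scales each embedding vector to unit length in place, so that a
// plain dot product between two vectors equals their cosine similarity.
func normalizeL2(embeddings [][]float64) {
	for _, vec := range embeddings {
		var sumSq float64
		for _, v := range vec {
			sumSq += v * v
		}
		norm := math.Sqrt(sumSq)
		if norm == 0 {
			continue // leave zero vectors untouched
		}
		for i := range vec {
			vec[i] /= norm
		}
	}
}
```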
ROUGE vs BERTScore
| | ROUGE | BERTScore |
|---|---|---|
| Measures | Lexical token overlap | Semantic vector similarity |
| Requires embeddings | No | Yes |
| Sensitive to synonyms | No | Yes |
| Fast to compute | Yes | Depends on embedding model |
| Best for | Quick offline eval, shared tasks | Semantic quality, paraphrase tolerance |