
Text generation evaluation measures how closely a model’s output matches one or more reference texts. Use these metrics to evaluate summarization, machine translation, or answers generated by a RAG pipeline. reval provides two complementary approaches:
  • ROUGE — lexical overlap between candidate and reference tokens
  • BERTScore — semantic similarity using dense vector embeddings

Tokenization

All reval text functions operate on []string token slices, not raw strings. You are responsible for tokenizing text before passing it in.
import "strings"

candidate := "the cat is sitting on the mat"
tokens := strings.Fields(candidate) // ["the", "cat", "is", "sitting", "on", "the", "mat"]

strings.Fields splits on whitespace and is fine for quick experiments. For production evaluation, use a proper tokenizer that handles punctuation, casing, and stemming consistently between candidates and references.
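
As a rough sketch of that kind of normalization (plain standard-library string handling, not part of reval), you could lowercase the text and strip punctuation before splitting:
import "strings"

// normalizeTokens is a minimal sketch: lowercase the text, replace common
// punctuation with spaces, then split on whitespace. Whatever tokenizer you
// choose, apply it identically to candidates and references.
func normalizeTokens(text string) []string {
    replacer := strings.NewReplacer(".", " ", ",", " ", "!", " ", "?", " ", ";", " ", ":", " ")
    return strings.Fields(replacer.Replace(strings.ToLower(text)))
}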

ROUGE-1

ROUGE-1 measures unigram (single token) overlap between candidate and reference.
package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
    refs := []string{"the", "cat", "sat", "on", "the", "mat"}

    precision, recall, f1 := reval.ROUGE1(candidates, refs)
    fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
    // Output: 0.7143, 0.8333, 0.7692
}

All three values are returned: precision (the fraction of candidate tokens that appear in the reference), recall (the fraction of reference tokens that appear in the candidate), and F1 (the harmonic mean of the two).
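
In the example above, five unigrams overlap after clipping (the counted twice, plus cat, on, and mat), so precision is 5/7 ≈ 0.7143, recall is 5/6 ≈ 0.8333, and F1 = 2 · 0.7143 · 0.8333 / (0.7143 + 0.8333) ≈ 0.7692.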

ROUGE-L

ROUGE-L measures the Longest Common Subsequence (LCS) between candidate and reference, capturing in-order word matches without requiring them to be contiguous.
package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
    refs := []string{"the", "cat", "sat", "on", "the", "mat"}

    precision, recall, f1 := reval.ROUGEL(candidates, refs)
    fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
    // Output: 0.7143, 0.8333, 0.7692
}

ROUGE-L is more sensitive to word order than ROUGE-1. Use ROUGE-L when the ordering of key phrases matters (e.g., translation quality). Use ROUGE-1 for bag-of-words overlap tasks like keyword coverage in summaries.
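
To see the difference, the sketch below scores a candidate that contains exactly the reference's words but in a different order; ROUGE-1 should still see a full unigram match, while ROUGE-L only credits the longest in-order subsequence (the exact numbers are left to the library rather than hard-coded here):
package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    // Same multiset of tokens as the reference, but "on the mat" moved to the front.
    shuffled := []string{"on", "the", "mat", "the", "cat", "sat"}
    refs := []string{"the", "cat", "sat", "on", "the", "mat"}

    p1, r1, f1 := reval.ROUGE1(shuffled, refs)
    pl, rl, fl := reval.ROUGEL(shuffled, refs)

    fmt.Printf("ROUGE-1: %.4f, %.4f, %.4f\n", p1, r1, f1)
    fmt.Printf("ROUGE-L: %.4f, %.4f, %.4f\n", pl, rl, fl)
}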

ROUGE-Lsum

ROUGE-Lsum evaluates a multi-sentence candidate against multiple reference sentences. It finds the best-matching reference sentence for each candidate sentence and accumulates the LCS scores across all candidate sentences.
package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    candidates := [][]string{
        {"the", "cat", "is", "on", "the", "mat"},
        {"it", "is", "cute"},
    }
    refs := [][]string{
        {"the", "dog", "is", "on", "the", "mat"},
        {"the", "animal", "is", "cute"},
        {"the", "pet", "sleeps", "well"},
    }

    precision, recall, f1 := reval.ROUGELsum(candidates, refs)
    fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
    // Output: 0.7778, 0.5000, 0.6087
}

Use ROUGELsum when your candidate is a multi-sentence document (e.g., an extractive summary) and you have multiple reference sentences to compare against.
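
In practice the [][]string inputs usually come from splitting raw text into sentences and tokenizing each one. A minimal sketch of that step, assuming simple period-based splitting and whitespace tokenization (use a real sentence splitter and tokenizer for anything serious):
package main

import (
    "fmt"
    "strings"
    "github.com/itsubaki/reval"
)

// sentences is a rough helper for this sketch: split on periods, then
// whitespace-tokenize each non-empty sentence.
func sentences(text string) [][]string {
    var out [][]string
    for _, s := range strings.Split(text, ".") {
        if fields := strings.Fields(s); len(fields) > 0 {
            out = append(out, fields)
        }
    }
    return out
}

func main() {
    candidate := "the cat is on the mat. it is cute."
    reference := "the dog is on the mat. the animal is cute. the pet sleeps well."

    precision, recall, f1 := reval.ROUGELsum(sentences(candidate), sentences(reference))
    fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
}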

BERTScore

BERTScore computes semantic similarity using pre-computed token embeddings. It greedily matches each candidate embedding to the most similar reference embedding via dot product, then returns precision, recall, and F1.
BERTScore expects embeddings that you compute yourself, for example with BERT, Sentence-BERT, or any other embedding model. The function does not perform tokenization or encoding; it only handles the matching and scoring step.
package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    // Each inner slice is one token's embedding vector.
    // In practice, generate these with a real embedding model.
    candidates := [][]float64{
        {0.1, 0.2, 0.3},
        {0.4, 0.5, 0.6},
    }
    refs := [][]float64{
        {0.1, 0.2, 0.3},
        {0.7, 0.8, 0.9},
    }

    precision, recall, f1 := reval.BERTScore(candidates, refs)
    fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
    // Output: 0.8600, 0.7700, 0.8125
}
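
For the toy vectors above, each candidate row is matched to the reference row with the highest dot product: both candidates match the second reference (0.50 and 1.22), so precision is (0.50 + 1.22) / 2 = 0.86. Matching in the other direction, the references' best candidate scores are 0.32 and 1.22, so recall is (0.32 + 1.22) / 2 = 0.77, and F1 is their harmonic mean, 0.8125. Note that these toy vectors are not unit-norm, which is why the normalization step below matters for real embeddings.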

L2-normalizing embeddings

If your embedding model does not produce unit-norm vectors, normalize them before scoring to ensure dot product equals cosine similarity:
for i, emb := range candidates {
    candidates[i] = reval.Normalize(emb)
}
for i, emb := range refs {
    refs[i] = reval.Normalize(emb)
}
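
For reference, L2 normalization just scales each vector to unit Euclidean length. A standalone sketch of the operation (reval.Normalize is assumed to do the equivalent):
import "math"

// l2normalize returns v scaled to unit Euclidean length, so that dot
// products between normalized vectors equal cosine similarities.
func l2normalize(v []float64) []float64 {
    var sum float64
    for _, x := range v {
        sum += x * x
    }
    norm := math.Sqrt(sum)
    if norm == 0 {
        return v // leave zero vectors untouched
    }
    out := make([]float64, len(v))
    for i, x := range v {
        out[i] = x / norm
    }
    return out
}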

ROUGE vs BERTScore

|                       | ROUGE                            | BERTScore                              |
| --------------------- | -------------------------------- | -------------------------------------- |
| Measures              | Lexical token overlap            | Semantic vector similarity             |
| Requires embeddings   | No                               | Yes                                    |
| Sensitive to synonyms | No                               | Yes                                    |
| Fast to compute       | Yes                              | Depends on embedding model             |
| Best for              | Quick offline eval, shared tasks | Semantic quality, paraphrase tolerance |