# Text Generation Evaluation

> Evaluate summarization and generation quality with ROUGE and BERTScore

Text generation evaluation measures how closely a model's output matches one or more reference texts. Use these metrics to evaluate summarization, machine translation, or answers generated by a RAG pipeline.

`reval` provides two complementary approaches:

* **ROUGE** — lexical overlap between candidate and reference tokens
* **BERTScore** — semantic similarity using dense vector embeddings

## Tokenization

All `reval` text functions operate on `[]string` token slices, not raw strings. You are responsible for tokenizing text before passing it in.

```go
import "strings"

candidate := "the cat is sitting on the mat"
tokens := strings.Fields(candidate) // ["the", "cat", "is", "sitting", "on", "the", "mat"]
```

<Note>
  `strings.Fields` splits on whitespace and is fine for quick experiments. For production evaluation, use a proper tokenizer that handles punctuation, casing, and stemming consistently between candidates and references.
</Note>
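
As a rough illustration of the consistency the note asks for, the sketch below lowercases text and splits on anything that is not a letter or digit. The `tokenize` helper is hypothetical and not part of `reval`; it uses only the Go standard library, does no stemming, and will split contractions such as "don't" into two tokens.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize is a hypothetical helper (not part of reval).
// It lowercases the input and splits on every rune that is
// neither a letter nor a digit, which drops punctuation.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	fmt.Println(tokenize("The cat, surprisingly, sat on the mat."))
	// Output: [the cat surprisingly sat on the mat]
}
```

Whatever tokenizer you choose, apply the same one to both candidates and references so that differences in casing or punctuation do not show up as missing overlap.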

## ROUGE-1

ROUGE-1 measures unigram (single token) overlap between candidate and reference.

```go
package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
    refs := []string{"the", "cat", "sat", "on", "the", "mat"}

    precision, recall, f1 := reval.ROUGE1(candidates, refs)
    fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
    // Output: 0.7143, 0.8333, 0.7692
}
```

`ROUGE1` returns three values: **precision** (the fraction of candidate tokens that appear in the reference), **recall** (the fraction of reference tokens that appear in the candidate), and **F1** (their harmonic mean). In the example above, 5 tokens overlap, so precision is 5/7 and recall is 5/6.
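
To make the arithmetic concrete, here is a minimal standalone sketch of unigram overlap scoring. It is not `reval`'s source: the `rouge1` function is a hypothetical re-implementation that assumes clipped counts (a reference token can be matched at most as many times as it occurs), which reproduces the numbers above.

```go
package main

import "fmt"

// rouge1 is an illustrative re-implementation, not reval's code.
// It counts clipped unigram matches between candidate and reference.
func rouge1(candidate, reference []string) (precision, recall, f1 float64) {
	if len(candidate) == 0 || len(reference) == 0 {
		return 0, 0, 0
	}

	// Count how often each token occurs in the reference.
	refCounts := make(map[string]int)
	for _, t := range reference {
		refCounts[t]++
	}

	// Each candidate token consumes one matching reference occurrence.
	overlap := 0
	for _, t := range candidate {
		if refCounts[t] > 0 {
			refCounts[t]--
			overlap++
		}
	}

	precision = float64(overlap) / float64(len(candidate))
	recall = float64(overlap) / float64(len(reference))
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}

func main() {
	candidate := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	reference := []string{"the", "cat", "sat", "on", "the", "mat"}

	p, r, f := rouge1(candidate, reference)
	fmt.Printf("%.4f, %.4f, %.4f\n", p, r, f)
	// Output: 0.7143, 0.8333, 0.7692
}
```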

## ROUGE-L

ROUGE-L measures the Longest Common Subsequence (LCS) between candidate and reference, capturing in-order word matches without requiring them to be contiguous.

```go
package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
    refs := []string{"the", "cat", "sat", "on", "the", "mat"}

    precision, recall, f1 := reval.ROUGEL(candidates, refs)
    fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
    // Output: 0.7143, 0.8333, 0.7692
}
```

<Tip>
  ROUGE-1 ignores word order entirely, while ROUGE-L only credits tokens that appear in the same relative order as in the reference. Use ROUGE-L when the ordering of key phrases matters (e.g., translation quality). Use ROUGE-1 for bag-of-words overlap tasks like keyword coverage in summaries.
</Tip>
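
The order sensitivity described above comes from the LCS itself. The sketch below is a textbook dynamic-programming LCS length, written as a hypothetical `lcsLen` helper rather than taken from `reval`. Shuffling the candidate leaves its bag of tokens (and therefore its unigram overlap) unchanged, but shortens the LCS.

```go
package main

import "fmt"

// lcsLen is an illustrative helper (not part of reval). It returns the
// length of the longest common subsequence of a and b.
func lcsLen(a, b []string) int {
	// dp[i][j] holds the LCS length of a[:i] and b[:j].
	dp := make([][]int, len(a)+1)
	for i := range dp {
		dp[i] = make([]int, len(b)+1)
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				dp[i][j] = dp[i-1][j-1] + 1
			case dp[i-1][j] >= dp[i][j-1]:
				dp[i][j] = dp[i-1][j]
			default:
				dp[i][j] = dp[i][j-1]
			}
		}
	}
	return dp[len(a)][len(b)]
}

func main() {
	reference := []string{"the", "cat", "sat", "on", "the", "mat"}
	inOrder := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	// Same tokens as inOrder, different order.
	shuffled := []string{"mat", "the", "on", "sitting", "is", "cat", "the"}

	fmt.Println(lcsLen(inOrder, reference))  // 5
	fmt.Println(lcsLen(shuffled, reference)) // 3
}
```

In the in-order case the LCS length equals the unigram overlap, which is why the ROUGE-1 and ROUGE-L examples on this page print identical scores.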

## ROUGE-Lsum

`ROUGELsum` evaluates multi-sentence candidates against multiple reference sentences. It finds the best-matching reference for each candidate sentence and accumulates LCS scores across all sentences.

```go
package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    candidates := [][]string{
        {"the", "cat", "is", "on", "the", "mat"},
        {"it", "is", "cute"},
    }
    refs := [][]string{
        {"the", "dog", "is", "on", "the", "mat"},
        {"the", "animal", "is", "cute"},
        {"the", "pet", "sleeps", "well"},
    }

    precision, recall, f1 := reval.ROUGELsum(candidates, refs)
    fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
    // Output: 0.7778, 0.5000, 0.6087
}
```

Use `ROUGELsum` when your candidate is a multi-sentence document (e.g., an extractive summary) and you have multiple reference summaries to compare against.
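
One possible reading of the description above is sketched below. This is an assumption, not `reval`'s confirmed algorithm: for each candidate sentence it takes the largest LCS against any reference sentence, sums those values, and divides by the total candidate and reference token counts. The helper names `lcsLen` and `rougeLsumSketch` are hypothetical. Under these assumptions the sketch happens to reproduce the output shown in the example.

```go
package main

import "fmt"

// lcsLen returns the longest-common-subsequence length of a and b.
func lcsLen(a, b []string) int {
	dp := make([][]int, len(a)+1)
	for i := range dp {
		dp[i] = make([]int, len(b)+1)
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				dp[i][j] = dp[i-1][j-1] + 1
			case dp[i-1][j] >= dp[i][j-1]:
				dp[i][j] = dp[i-1][j]
			default:
				dp[i][j] = dp[i][j-1]
			}
		}
	}
	return dp[len(a)][len(b)]
}

// rougeLsumSketch is an assumed interpretation, not reval's code:
// best LCS per candidate sentence, summed, then normalized by the
// total number of candidate tokens (precision) and reference tokens (recall).
func rougeLsumSketch(candidates, refs [][]string) (precision, recall, f1 float64) {
	var lcsSum, candLen, refLen int
	for _, c := range candidates {
		candLen += len(c)
		best := 0
		for _, r := range refs {
			if n := lcsLen(c, r); n > best {
				best = n
			}
		}
		lcsSum += best
	}
	for _, r := range refs {
		refLen += len(r)
	}
	if candLen == 0 || refLen == 0 {
		return 0, 0, 0
	}
	precision = float64(lcsSum) / float64(candLen)
	recall = float64(lcsSum) / float64(refLen)
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}

func main() {
	candidates := [][]string{
		{"the", "cat", "is", "on", "the", "mat"},
		{"it", "is", "cute"},
	}
	refs := [][]string{
		{"the", "dog", "is", "on", "the", "mat"},
		{"the", "animal", "is", "cute"},
		{"the", "pet", "sleeps", "well"},
	}
	p, r, f := rougeLsumSketch(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", p, r, f)
	// Output: 0.7778, 0.5000, 0.6087
}
```

Here the best LCS values are 5 and 2, giving 7/9 for precision and 7/14 for recall. Note that under this reading the third reference sentence matches nothing yet still counts toward the recall denominator.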

## BERTScore

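BERTScore computes semantic similarity from pre-computed token embeddings. Each candidate embedding is greedily matched to its most similar reference embedding by dot product, and the mean of those best matches gives precision. Recall is computed the same way in the opposite direction (each reference embedding matched to its best candidate), and F1 is their harmonic mean.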

<Note>
  `BERTScore` expects embeddings you compute yourself — for example, using a BERT, Sentence-BERT, or any other embedding model. The function does not perform tokenization or encoding; it only handles the matching and scoring step.
</Note>

```go
package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    // Each inner slice is one token's embedding vector.
    // In practice, generate these with a real embedding model.
    candidates := [][]float64{
        {0.1, 0.2, 0.3},
        {0.4, 0.5, 0.6},
    }
    refs := [][]float64{
        {0.1, 0.2, 0.3},
        {0.7, 0.8, 0.9},
    }

    precision, recall, f1 := reval.BERTScore(candidates, refs)
    fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
    // Output: 0.8600, 0.7700, 0.8125
}
```
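
The matching step can be sketched without the library. The code below is a hypothetical re-implementation (the names `dot`, `meanBestMatch`, and `bertScoreSketch` are not part of `reval`), assuming plain dot-product similarity and greedy best-match averaging in both directions. With the same toy vectors it arrives at the same three numbers.

```go
package main

import "fmt"

// dot returns the dot product of two equal-length vectors.
func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// meanBestMatch averages, over every vector in from, the highest
// dot product it achieves against any vector in to.
func meanBestMatch(from, to [][]float64) float64 {
	if len(from) == 0 || len(to) == 0 {
		return 0
	}
	var sum float64
	for _, v := range from {
		best := dot(v, to[0])
		for _, w := range to[1:] {
			if d := dot(v, w); d > best {
				best = d
			}
		}
		sum += best
	}
	return sum / float64(len(from))
}

// bertScoreSketch is an illustrative sketch, not reval's implementation.
func bertScoreSketch(candidates, refs [][]float64) (precision, recall, f1 float64) {
	precision = meanBestMatch(candidates, refs) // candidate -> best reference
	recall = meanBestMatch(refs, candidates)    // reference -> best candidate
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}

func main() {
	candidates := [][]float64{{0.1, 0.2, 0.3}, {0.4, 0.5, 0.6}}
	refs := [][]float64{{0.1, 0.2, 0.3}, {0.7, 0.8, 0.9}}

	p, r, f := bertScoreSketch(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", p, r, f)
	// Output: 0.8600, 0.7700, 0.8125
}
```

Because these toy vectors are not unit-norm, the best match for `{0.1, 0.2, 0.3}` is `{0.7, 0.8, 0.9}` (dot product 0.50) rather than its identical copy (0.14). Larger vectors win regardless of direction.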

### L2-normalizing embeddings

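If your embedding model does not produce unit-norm vectors, normalize them before scoring so that the dot product equals cosine similarity. Without normalization, scores depend on vector magnitude and are not bounded to the range [-1, 1]: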

```go
for i, emb := range candidates {
    candidates[i] = reval.Normalize(emb)
}
for i, emb := range refs {
    refs[i] = reval.Normalize(emb)
}
```
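
For intuition about what normalization changes, the sketch below implements L2 normalization by hand. The `l2Normalize` and `dot` helpers are hypothetical stand-ins written for this illustration. They are not `reval.Normalize`, and the zero-vector handling shown is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// l2Normalize is an illustrative helper (not reval.Normalize).
// It returns a copy of v scaled to unit Euclidean length.
// A zero vector has no direction, so it is returned as all zeros.
func l2Normalize(v []float64) []float64 {
	var sumSq float64
	for _, x := range v {
		sumSq += x * x
	}
	out := make([]float64, len(v))
	if sumSq == 0 {
		return out
	}
	norm := math.Sqrt(sumSq)
	for i, x := range v {
		out[i] = x / norm
	}
	return out
}

// dot returns the dot product of two equal-length vectors.
func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func main() {
	fmt.Println(l2Normalize([]float64{3, 4})) // [0.6 0.8]

	a := []float64{0.1, 0.2, 0.3}
	b := []float64{0.7, 0.8, 0.9}
	fmt.Printf("%.4f\n", dot(a, b))                           // 0.5000 (raw dot product)
	fmt.Printf("%.4f\n", dot(l2Normalize(a), l2Normalize(b))) // 0.9594 (cosine similarity)
}
```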

## ROUGE vs BERTScore

|                       | ROUGE                            | BERTScore                              |
| --------------------- | -------------------------------- | -------------------------------------- |
| Measures              | Lexical token overlap            | Semantic vector similarity             |
| Requires embeddings   | No                               | Yes                                    |
| Sensitive to synonyms | No                               | Yes                                    |
| Fast to compute       | Yes                              | Depends on embedding model             |
| Best for              | Quick offline eval, shared tasks | Semantic quality, paraphrase tolerance |
