Ranking Evaluation

Ranking evaluation tells you how well a system orders results relative to what users actually find relevant. Use these metrics when evaluating search engines, recommendation feeds, or retrieval stages in RAG pipelines.

Relevance judgments

All ranking metrics in reval accept a map[string]int relevance map that associates each item ID with an integer relevance score:

0 — not relevant
1 — relevant (binary) or minimally relevant (graded)
2, 3, … — increasingly relevant (graded judgments)

relevance := map[string]int{
    "doc-A": 3,  // highly relevant
    "doc-B": 2,  // relevant
    "doc-C": 1,  // marginally relevant
    "doc-D": 0,  // not relevant
    "doc-E": 3,  // highly relevant (not retrieved)
}

You only need to include items you have judgments for. Items in predicted that are absent from the map are treated as relevance 0.
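The "absent means 0" rule matches how Go maps behave: indexing a missing key returns the zero value. The sketch below illustrates that with a hypothetical helper, `relevanceOf`, which is not part of reval. Whether reval relies on this exact mechanism is an assumption.

```go
package main

import "fmt"

// relevanceOf is a hypothetical helper (not part of reval). Go maps return
// the zero value for missing keys, so unjudged items come back as 0 without
// any special handling.
func relevanceOf(relevance map[string]int, id string) int {
    return relevance[id]
}

func main() {
    relevance := map[string]int{"doc-A": 3, "doc-D": 0}

    fmt.Println(relevanceOf(relevance, "doc-A")) // judged item keeps its grade
    fmt.Println(relevanceOf(relevance, "doc-Z")) // unjudged item falls back to 0

    // The comma-ok form distinguishes "judged as 0" from "never judged",
    // which can help when auditing judgment coverage.
    _, judged := relevance["doc-Z"]
    fmt.Println(judged)
}
```

One consequence is that an unjudged item and an item explicitly judged as 0 score identically, so gaps in your judgments lower the reported scores.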

Precision@K

Precision@K measures what fraction of the top-K retrieved items are relevant.

package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    predicted := []string{"A", "B", "C", "D"}
    relevance := map[string]int{
        "A": 3,
        "B": 2,
        "C": 0,
        "D": 0,
        "E": 3,
    }

    s := reval.Precision(predicted, relevance, 3)
    fmt.Println("Precision@3:", s)
    // Output: Precision@3: 0.6666666666666666
}

2 of the top-3 items (A and B) have relevance ≥ 1, so Precision@3 = 2/3.
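To make the arithmetic concrete, here is a minimal standalone sketch of the calculation. The function `precisionAtK` is hypothetical and is not reval's implementation. It assumes that relevance ≥ 1 counts as relevant and that the denominator stays K even when fewer than K items are retrieved.

```go
package main

import "fmt"

// precisionAtK is an illustrative sketch, not reval's implementation.
// It counts items in the top k with relevance >= 1 and divides by k.
// Assumption: the denominator is k even if fewer than k items were retrieved.
func precisionAtK(predicted []string, relevance map[string]int, k int) float64 {
    if k <= 0 {
        return 0
    }
    n := k
    if len(predicted) < n {
        n = len(predicted)
    }
    hits := 0
    for _, id := range predicted[:n] {
        if relevance[id] >= 1 { // missing IDs read as 0, so they never count
            hits++
        }
    }
    return float64(hits) / float64(k)
}

func main() {
    predicted := []string{"A", "B", "C", "D"}
    relevance := map[string]int{"A": 3, "B": 2, "C": 0, "D": 0, "E": 3}

    fmt.Printf("Precision@3: %.4f\n", precisionAtK(predicted, relevance, 3))
    // Prints: Precision@3: 0.6667
}
```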

Recall@K

Recall@K measures what fraction of all relevant items appear in the top-K results.

package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    predicted := []string{"A", "B", "C", "D"}
    relevance := map[string]int{
        "A": 3,
        "B": 2,
        "C": 1,
        "D": 0,
        "E": 3,
    }

    s := reval.Recall(predicted, relevance, 3)
    fmt.Println("Recall@3:", s)
    // Output: Recall@3: 0.75
}

Four relevant items exist (A, B, C, and E). The top 3 retrieves A, B, and C, so Recall@3 = 3/4 (E is relevant but not retrieved).
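The same calculation can be sketched in standalone Go. The function `recallAtK` is hypothetical and is not reval's implementation. It assumes relevance ≥ 1 counts as relevant, that the denominator is every relevant item in the map, and that `predicted` contains no duplicate IDs.

```go
package main

import "fmt"

// recallAtK is an illustrative sketch, not reval's implementation.
// It divides the relevant items found in the top k by all relevant items
// in the relevance map. Assumption: predicted has no duplicate IDs.
func recallAtK(predicted []string, relevance map[string]int, k int) float64 {
    total := 0
    for _, rel := range relevance {
        if rel >= 1 {
            total++
        }
    }
    if total == 0 || k <= 0 {
        return 0
    }
    n := k
    if len(predicted) < n {
        n = len(predicted)
    }
    hits := 0
    for _, id := range predicted[:n] {
        if relevance[id] >= 1 {
            hits++
        }
    }
    return float64(hits) / float64(total)
}

func main() {
    predicted := []string{"A", "B", "C", "D"}
    relevance := map[string]int{"A": 3, "B": 2, "C": 1, "D": 0, "E": 3}

    fmt.Println("Recall@3:", recallAtK(predicted, relevance, 3))
    // Prints: Recall@3: 0.75
}
```

Because E never appears in `predicted`, no value of K can push recall above 0.75 for this query.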

Precision and recall trade off against each other. A high-precision system returns few results but most are relevant; a high-recall system returns many results to avoid missing anything. Choose K to match your product’s page size or cutoff.

NDCG@K

NDCG (Normalized Discounted Cumulative Gain) accounts for both relevance grade and rank position. Each item's gain is discounted by its rank, so a highly relevant item contributes less the lower it appears in the ranking.

package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    predicted := []string{"A", "B", "C", "D"}
    relevance := map[string]int{
        "A": 3,
        "B": 2,
        "C": 1,
        "D": 0,
        "E": 3,
    }

    s := reval.NDCG(predicted, relevance, 3)
    fmt.Println("NDCG@3:", s)
    // Output: NDCG@3: 0.7271926019583822
}

The score is normalized against the ideal ranking (items sorted by relevance descending), so a perfect ranking scores 1.0. Use NDCG when relevance is graded and position quality matters.
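The following standalone sketch works through the example by hand. It is not reval's implementation: `ndcgAtK`, `dcg`, and the two gain functions are hypothetical names. It assumes the common log2(rank + 1) discount and compares two widely used gain definitions. With exponential gain (2^rel − 1) the result rounds to the 0.7272 shown above, which suggests, but does not confirm, that reval uses that form. Linear gain gives a different score for the same ranking.

```go
package main

import (
    "fmt"
    "math"
    "sort"
)

// exponentialGain emphasizes high grades: 2^rel - 1.
func exponentialGain(rel int) float64 { return math.Exp2(float64(rel)) - 1 }

// linearGain uses the grade itself as the gain.
func linearGain(rel int) float64 { return float64(rel) }

// dcg sums gain(grade) / log2(rank + 1) over grades, with rank starting at 1.
func dcg(grades []int, gain func(int) float64) float64 {
    sum := 0.0
    for i, g := range grades {
        sum += gain(g) / math.Log2(float64(i+2))
    }
    return sum
}

// ndcgAtK is an illustrative sketch, not reval's implementation.
func ndcgAtK(predicted []string, relevance map[string]int, k int, gain func(int) float64) float64 {
    if k <= 0 {
        return 0
    }
    // Grades of the top-k retrieved items, in ranked order.
    actual := make([]int, 0, k)
    for i, id := range predicted {
        if i >= k {
            break
        }
        actual = append(actual, relevance[id])
    }
    // Ideal ordering: every judged grade sorted descending, cut at k.
    ideal := make([]int, 0, len(relevance))
    for _, g := range relevance {
        ideal = append(ideal, g)
    }
    sort.Sort(sort.Reverse(sort.IntSlice(ideal)))
    if len(ideal) > k {
        ideal = ideal[:k]
    }
    idcg := dcg(ideal, gain)
    if idcg == 0 {
        return 0
    }
    return dcg(actual, gain) / idcg
}

func main() {
    predicted := []string{"A", "B", "C", "D"}
    relevance := map[string]int{"A": 3, "B": 2, "C": 1, "D": 0, "E": 3}

    fmt.Printf("exponential gain: %.4f\n", ndcgAtK(predicted, relevance, 3, exponentialGain))
    fmt.Printf("linear gain:      %.4f\n", ndcgAtK(predicted, relevance, 3, linearGain))
    // Prints:
    // exponential gain: 0.7272
    // linear gain:      0.8081
}
```

The ideal top 3 here is A, E, B (grades 3, 3, 2). The score falls below 1.0 because E is never retrieved, not because the retrieved items are badly ordered.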

Multi-query evaluation with MAP

To evaluate a system across multiple queries, use Mean Average Precision (MAP), which aggregates the per-query scores into a single number.

package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    results := []reval.QueryResult{
        {
            Predicted: []string{"C", "A", "B", "D"},
            Relevance: map[string]int{
                "A": 1, "B": 1, "C": 0, "D": 0, "E": 1,
            },
        },
        {
            Predicted: []string{"A", "B", "C", "D"},
            Relevance: map[string]int{
                "A": 1, "B": 0, "C": 1, "D": 0, "E": 1,
            },
        },
    }

    s := reval.MeanAveragePrecision(results, 4)
    fmt.Printf("MAP@4: %.4f\n", s)
    // Output: MAP@4: 0.7083
}

QueryResult pairs a ranked list of retrieved IDs with the relevance map for that query. MAP averages the Average Precision score across all queries.
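Average Precision has more than one common normalization, and the choice changes the number. The standalone sketch below is not reval's implementation: `queryResult`, `averagePrecision`, and `meanAveragePrecision` are hypothetical stand-ins. Dividing by the relevant items actually retrieved in the top K reproduces the 0.7083 shown above, while dividing by all relevant items in the map gives 0.4722. That the first convention is what reval uses is an inference from the example output, not a confirmed fact.

```go
package main

import "fmt"

// queryResult is a local stand-in that mirrors the Predicted and Relevance
// fields used in the example above. It is not reval's type.
type queryResult struct {
    Predicted []string
    Relevance map[string]int
}

// averagePrecision sums precision at every rank in the top k that holds a
// relevant item (relevance >= 1), then normalizes. With byAllRelevant false
// the denominator is the relevant items found in the top k; with true it is
// every relevant item in the map.
func averagePrecision(q queryResult, k int, byAllRelevant bool) float64 {
    hits, sum := 0, 0.0
    for i, id := range q.Predicted {
        if i >= k {
            break
        }
        if q.Relevance[id] >= 1 {
            hits++
            sum += float64(hits) / float64(i+1)
        }
    }
    denom := hits
    if byAllRelevant {
        denom = 0
        for _, rel := range q.Relevance {
            if rel >= 1 {
                denom++
            }
        }
    }
    if denom == 0 {
        return 0
    }
    return sum / float64(denom)
}

// meanAveragePrecision averages the per-query AP scores.
func meanAveragePrecision(results []queryResult, k int, byAllRelevant bool) float64 {
    if len(results) == 0 {
        return 0
    }
    total := 0.0
    for _, q := range results {
        total += averagePrecision(q, k, byAllRelevant)
    }
    return total / float64(len(results))
}

func main() {
    results := []queryResult{
        {
            Predicted: []string{"C", "A", "B", "D"},
            Relevance: map[string]int{"A": 1, "B": 1, "C": 0, "D": 0, "E": 1},
        },
        {
            Predicted: []string{"A", "B", "C", "D"},
            Relevance: map[string]int{"A": 1, "B": 0, "C": 1, "D": 0, "E": 1},
        },
    }

    fmt.Printf("by retrieved relevant: %.4f\n", meanAveragePrecision(results, 4, false))
    fmt.Printf("by all relevant:       %.4f\n", meanAveragePrecision(results, 4, true))
    // Prints:
    // by retrieved relevant: 0.7083
    // by all relevant:       0.4722
}
```

For the first query the relevant hits sit at ranks 2 and 3, so the precision values are 1/2 and 2/3. Dividing their sum by 2 gives 0.5833, and dividing by 3 (to account for the missed item E) gives 0.3889.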

Choosing the right metric

When to use Precision@K

Use when users look at only the top K results and you care about result quality more than coverage. Common for web search and featured recommendations.

When to use Recall@K

Use when missing relevant items is costly — for example, legal document retrieval or medical record search where completeness matters.

When to use MAP

Use when evaluating across many queries with binary relevance labels. MAP is a widely used offline benchmark metric in information retrieval research.

When to use NDCG

Use when relevance is graded (not just relevant/not-relevant) or when the ranking position of highly relevant items matters to your product. NDCG is a common choice for e-commerce and recommendation systems.
