Relevance (relevance)

Overview

Relevance measures whether an answer stays on-topic with the user’s input, at the claim level.

The idea is aligned with recent evaluation work:

  • RAGAS arXiv:2309.15217 emphasizes reference-free, component-level evaluation for RAG systems, including answer quality dimensions beyond final exact-match style scoring.
  • FActScore arXiv:2305.14251 shows why claim-level decomposition is important: one answer can contain a mix of good and bad statements, so per-claim judgment is more informative than one coarse label.
  • Judging LLM-as-a-Judge arXiv:2306.05685 supports using strong LLM judges for scalable automated evaluation, while highlighting bias risks and careful prompt/interpretation design.

In nexa-gauge, relevance follows this pattern by checking each extracted claim from the output against the input and returning boolean verdicts (relevant / not relevant). The final score is the fraction of claims judged relevant.

This metric answers: “Did the model answer the input asked?” It does not measure factual support against evidence (that is grounding) and does not compare against a reference answer (that is refmatch/refalign metrics).

Use Case

Use relevance when you need to detect off-topic or partially on-topic responses:

  • QA systems where drift/off-topic content hurts UX
  • Agent outputs that tend to add unrelated details
  • Regression checks after prompt/model updates
  • Evaluation of concise answering behavior
  • Triage of answer quality before deeper factual checks

Node Overview

In nexa-gauge, relevance is an answer-category metric node.

What it does:

  • Uses claims extracted from the claims_extraction.
  • Uses the input as relevance target.
  • Calls the judge model with numbered claims and input.
  • Expects structured output that varies by scoring mode (see below).
  • Maps each raw judge value into a normalized per-claim score, then aggregates:
    • per-claim verdict = "ACCEPTED" when normalized score ≥ 0.6, else "REJECTED"
    • overall score = mean(per_claim_scores)

Per-node scoring controls (relevance block)

Add an optional relevance block to your record to tune the judge's output:

json
"relevance": { "scoring_mode": "scale_1_5", "include_reasoning": true }
  • scoring_mode: binary_yes_no (default) or scale_1_5
    • binary_yes_no: judge returns {"verdicts": [true, false, ...]}. Per-claim score is 0 or 1.
    • scale_1_5: judge returns {"verdicts": [5, 2, 4, ...]} (integers 1-5). Each is normalized via (raw-1)/4, then averaged.
  • include_reasoning: false (default) or true
    • When true, the judge also returns a single batch-level reasoning string appended to MetricResult.result after the per-claim verdict entries.

Omitting the relevance block (or omitting either knob) falls back to the conservative defaults.

Skip behavior:

  • If claims are missing, relevance is disabled, or input is empty, returns empty metrics and zero cost.

Execution Flow

Graph
Rendering diagram...

Input

Using your sample input:

json
{
  "case_id": "eiffel-tower-basic",
  "input": "What is the Eiffel Tower and where is it located?",
  "output": "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. ......."
}

Fields used by the relevance branch:

  • output: used upstream to produce claim_extraction
  • input: used directly by the relevance judge
  • case_id: used for case/report identity, not score computation

Fields not required by this node:

  • context is not needed for relevance scoring
  • reference is not needed for relevance scoring

Output

Primary output type:

  • RelevanceMetrics
    • metrics: list[MetricResult]
    • cost: CostEstimate

Example output with scoring_mode: "scale_1_5" and include_reasoning: true:

json
{
  "metrics": [
    {
      "name": "answer_relevancy",
      "category": "output|generation|answer",
      "score": 0.625,
      "result": [
        {
          "item": {
            "id": "11aa22bb33cc44dd",
            "text": "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
            "tokens": 12.0,
            "confidence": 1.0,
            "cached": false
          },
          "source_chunk_index": 0,
          "confidence": 0.92,
          "extraction_failed": false,
          "verdict": "ACCEPTED",
          "raw_score": 5
        },
        {
          "item": {
            "id": "55ee66ff77gg88hh",
            "text": "Transformers use self-attention in deep learning.",
            "tokens": 9.0,
            "confidence": 1.0,
            "cached": false
          },
          "source_chunk_index": 1,
          "confidence": 0.85,
          "extraction_failed": false,
          "verdict": "REJECTED",
          "raw_score": 1
        },
        { "reasoning": "Most claims address the Eiffel Tower question; the transformers claim is unrelated." }
      ],
      "error": null
    }
  ],
  "cost": {
    "cost": 0.00039,
    "input_tokens": 188.0,
    "output_tokens": 24.0
  }
}

In the default binary + no-reasoning configuration, per-claim raw_score is 0 or 1, and the trailing {"reasoning": ...} entry is omitted.

Attribute meaning:

  • metrics: list of metric results for this node (empty when skipped)
  • name: metric identifier (answer_relevancy in current implementation)
  • category: answer
  • score: mean of per-claim normalized scores in [0, 1]
  • result: per-claim relevance judgments (RelevancyClaim), plus an optional trailing {"reasoning": "..."} dict when include_reasoning: true
  • result[].item: claim text and token metadata
  • result[].source_chunk_index: source output chunk index
  • result[].confidence: claim extractor confidence
  • result[].extraction_failed: extraction failure flag
  • result[].verdict: ACCEPTED (per-claim score ≥ 0.6) or REJECTED
  • result[].raw_score: the raw integer the judge emitted (1-5 for scale_1_5, 0/1 for binary)
  • error: populated if judge output has no usable verdicts
  • cost.cost: USD cost for relevance evaluation
  • cost.input_tokens, cost.output_tokens: token usage for the judge call

Usage

bash
OUTPUT_DIR=./out/relevance
mkdir -p "$OUTPUT_DIR"

Estimate Cost

bash
nexagauge estimate relevance \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/relevance-estimate.txt"

Note: estimate supports --input and --limit; it does not expose a native --output-dir flag, so redirect/tee is used with OUTPUT_DIR.

Run Evaluation

bash
nexagauge run relevance \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"

For full aggregation/report files including all metrics:

bash
nexagauge run eval \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"