Relevance (`relevance`)

Overview

Relevance measures whether an answer stays on-topic with the userâs input, at the claim level.

The idea is aligned with recent evaluation work:

RAGAS arXiv:2309.15217 emphasizes reference-free, component-level evaluation for RAG systems, including answer quality dimensions beyond final exact-match style scoring.
FActScore arXiv:2305.14251 shows why claim-level decomposition is important: one answer can contain a mix of good and bad statements, so per-claim judgment is more informative than one coarse label.
Judging LLM-as-a-Judge arXiv:2306.05685 supports using strong LLM judges for scalable automated evaluation, while highlighting bias risks and careful prompt/interpretation design.

In nexa-gauge, relevance follows this pattern by checking each extracted claim from the output against the input and returning boolean verdicts (relevant / not relevant). The final score is the fraction of claims judged relevant.

This metric answers: “Did the model answer the input asked?” It does not measure factual support against evidence (that is grounding) and does not compare against a reference answer (that is refmatch/refalign metrics).

Use Case

Use relevance when you need to detect off-topic or partially on-topic responses:

QA systems where drift/off-topic content hurts UX
Agent outputs that tend to add unrelated details
Regression checks after prompt/model updates
Evaluation of concise answering behavior
Triage of answer quality before deeper factual checks

Node Overview

In nexa-gauge, relevance is an answer-category metric node.

What it does:

Uses claims extracted from the claims_extraction.
Uses the input as relevance target.
Calls the judge model with numbered claims and input.
Expects structured output that varies by scoring mode (see below).
Maps each raw judge value into a normalized per-claim score, then aggregates:
- per-claim verdict = "ACCEPTED" when normalized score ≥ 0.6, else "REJECTED"
- overall score = mean(per_claim_scores)

Per-node scoring controls (`relevance` block)

Add an optional relevance block to your record to tune the judge's output:

json

"relevance": { "scoring_mode": "scale_1_5", "include_reasoning": true }

scoring_mode: binary_yes_no (default) or scale_1_5
- binary_yes_no: judge returns {"verdicts": [true, false, ...]}. Per-claim score is 0 or 1.
- scale_1_5: judge returns {"verdicts": [5, 2, 4, ...]} (integers 1-5). Each is normalized via (raw-1)/4, then averaged.
include_reasoning: false (default) or true
- When true, the judge also returns a single batch-level reasoning string appended to MetricResult.result after the per-claim verdict entries.

Omitting the relevance block (or omitting either knob) falls back to the conservative defaults.

Skip behavior:

If claims are missing, relevance is disabled, or input is empty, returns empty metrics and zero cost.

Execution Flow

Graph

Rendering diagram...

Input

Using your sample input:

json

{
  "case_id": "eiffel-tower-basic",
  "input": "What is the Eiffel Tower and where is it located?",
  "output": "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. ......."
}

Fields used by the relevance branch:

output: used upstream to produce claim_extraction
input: used directly by the relevance judge
case_id: used for case/report identity, not score computation

Fields not required by this node:

context is not needed for relevance scoring
reference is not needed for relevance scoring

Output

Primary output type:

RelevanceMetrics
- metrics: list[MetricResult]
- cost: CostEstimate

Example output with scoring_mode: "scale_1_5" and include_reasoning: true:

json

{
  "metrics": [
    {
      "name": "answer_relevancy",
      "category": "output|generation|answer",
      "score": 0.625,
      "result": [
        {
          "item": {
            "id": "11aa22bb33cc44dd",
            "text": "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
            "tokens": 12.0,
            "confidence": 1.0,
            "cached": false
          },
          "source_chunk_index": 0,
          "confidence": 0.92,
          "extraction_failed": false,
          "verdict": "ACCEPTED",
          "raw_score": 5
        },
        {
          "item": {
            "id": "55ee66ff77gg88hh",
            "text": "Transformers use self-attention in deep learning.",
            "tokens": 9.0,
            "confidence": 1.0,
            "cached": false
          },
          "source_chunk_index": 1,
          "confidence": 0.85,
          "extraction_failed": false,
          "verdict": "REJECTED",
          "raw_score": 1
        },
        { "reasoning": "Most claims address the Eiffel Tower question; the transformers claim is unrelated." }
      ],
      "error": null
    }
  ],
  "cost": {
    "cost": 0.00039,
    "input_tokens": 188.0,
    "output_tokens": 24.0
  }
}

In the default binary + no-reasoning configuration, per-claim raw_score is 0 or 1, and the trailing {"reasoning": ...} entry is omitted.

Attribute meaning:

metrics: list of metric results for this node (empty when skipped)
name: metric identifier (answer_relevancy in current implementation)
category: answer
score: mean of per-claim normalized scores in [0, 1]
result: per-claim relevance judgments (RelevancyClaim), plus an optional trailing {"reasoning": "..."} dict when include_reasoning: true
result[].item: claim text and token metadata
result[].source_chunk_index: source output chunk index
result[].confidence: claim extractor confidence
result[].extraction_failed: extraction failure flag
result[].verdict: ACCEPTED (per-claim score ≥ 0.6) or REJECTED
result[].raw_score: the raw integer the judge emitted (1-5 for scale_1_5, 0/1 for binary)
error: populated if judge output has no usable verdicts
cost.cost: USD cost for relevance evaluation
cost.input_tokens, cost.output_tokens: token usage for the judge call

Usage

bash

OUTPUT_DIR=./out/relevance
mkdir -p "$OUTPUT_DIR"

Estimate Cost

bash

nexagauge estimate relevance \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/relevance-estimate.txt"

Note: estimate supports --input and --limit; it does not expose a native --output-dir flag, so redirect/tee is used with OUTPUT_DIR.

Run Evaluation

bash

nexagauge run relevance \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"

For full aggregation/report files including all metrics:

bash

nexagauge run eval \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"

Relevance (relevance)

Overview

Use Case

Node Overview

Per-node scoring controls (relevance block)

Execution Flow

Input

Output

Usage

Estimate Cost

Run Evaluation

Relevance (`relevance`)

Per-node scoring controls (`relevance` block)