Relevance (relevance)
Overview
Relevance measures whether an answer stays on-topic with the user's input, at the claim level.
The idea is aligned with recent evaluation work:
- RAGAS arXiv:2309.15217 emphasizes reference-free, component-level evaluation for RAG systems, including answer quality dimensions beyond final exact-match style scoring.
- FActScore arXiv:2305.14251 shows why claim-level decomposition is important: one answer can contain a mix of good and bad statements, so per-claim judgment is more informative than one coarse label.
- Judging LLM-as-a-Judge arXiv:2306.05685 supports using strong LLM judges for scalable automated evaluation, while highlighting bias risks and the need for careful prompt and interpretation design.
In nexa-gauge, relevance follows this pattern by checking each claim extracted from the output against the input. By default the judge returns boolean verdicts (relevant / not relevant), and the final score is the fraction of claims judged relevant.
This metric answers: “Did the model answer what was asked?” It does not measure factual support against evidence (that is grounding) and does not compare against a reference answer (that is refmatch/refalign metrics).
Use Case
Use relevance when you need to detect off-topic or partially on-topic responses:
- QA systems where drift/off-topic content hurts UX
- Agent outputs that tend to add unrelated details
- Regression checks after prompt/model updates
- Evaluation of concise answering behavior
- Triage of answer quality before deeper factual checks
Node Overview
In nexa-gauge, relevance is an answer-category metric node.
What it does:
- Uses claims extracted from the claims_extraction node.
- Uses the input as the relevance target.
- Calls the judge model with numbered claims and input.
- Expects structured output that varies by scoring mode (see below).
- Maps each raw judge value into a normalized per-claim score, then aggregates:
  - per-claim verdict = "ACCEPTED" when normalized score ≥ 0.6, else "REJECTED"
  - overall score = mean(per_claim_scores)
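The aggregation step can be sketched in Python as follows. This is a minimal illustration and not nexa-gauge's actual code: the function name `aggregate_relevance` and the constant `ACCEPT_THRESHOLD` are hypothetical, and only the 0.6 cutoff and the mean are taken from the description above.

```python
from statistics import fmean

ACCEPT_THRESHOLD = 0.6  # cutoff stated in this document; the constant name is hypothetical


def aggregate_relevance(per_claim_scores: list[float]) -> tuple[list[str], float]:
    """Return (per-claim verdicts, overall score) for normalized scores in [0, 1]."""
    verdicts = [
        "ACCEPTED" if score >= ACCEPT_THRESHOLD else "REJECTED"
        for score in per_claim_scores
    ]
    # Assumption: an empty list yields 0.0 here; the real node skips evaluation instead.
    overall = fmean(per_claim_scores) if per_claim_scores else 0.0
    return verdicts, overall


verdicts, score = aggregate_relevance([1.0, 0.25, 0.75])
print(verdicts, round(score, 4))
```

Because the verdict is derived from the normalized score, a claim can be REJECTED and still contribute a non-zero amount to the overall mean.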
Per-node scoring controls (relevance block)
Add an optional relevance block to your record to tune the judge's output:
"relevance": { "scoring_mode": "scale_1_5", "include_reasoning": true }
- scoring_mode: binary_yes_no (default) or scale_1_5
  - binary_yes_no: judge returns {"verdicts": [true, false, ...]}. Per-claim score is 0 or 1.
  - scale_1_5: judge returns {"verdicts": [5, 2, 4, ...]} (integers 1-5). Each is normalized via (raw-1)/4, then averaged.
- include_reasoning: false (default) or true
  - When true, the judge also returns a single batch-level reasoning string appended to MetricResult.result after the per-claim verdict entries.
Omitting the relevance block (or omitting either knob) falls back to the conservative defaults.
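As a rough sketch of how the two knobs and the per-mode normalization fit together, assuming a record is a plain dict: the helper names `resolve_relevance_settings` and `normalize` are hypothetical, and rejecting out-of-range values with an exception is an assumption, not documented behavior.

```python
DEFAULTS = {"scoring_mode": "binary_yes_no", "include_reasoning": False}


def resolve_relevance_settings(record: dict) -> dict:
    """Overlay the optional "relevance" block on the documented defaults."""
    block = record.get("relevance") or {}
    # Only the two documented knobs are read; anything else is ignored (assumption).
    return {**DEFAULTS, **{k: v for k, v in block.items() if k in DEFAULTS}}


def normalize(raw, scoring_mode: str) -> float:
    """Map one raw judge value to a per-claim score in [0, 1]."""
    if scoring_mode == "binary_yes_no":
        return 1.0 if raw else 0.0
    if scoring_mode == "scale_1_5":
        if not 1 <= raw <= 5:
            # Assumption: invalid ratings raise; the real node may handle them differently.
            raise ValueError(f"rating out of range: {raw!r}")
        return (raw - 1) / 4
    raise ValueError(f"unknown scoring_mode: {scoring_mode!r}")


settings = resolve_relevance_settings({"relevance": {"scoring_mode": "scale_1_5"}})
scores = [normalize(v, settings["scoring_mode"]) for v in [5, 2, 4]]
print(settings, scores)
```

With the (raw-1)/4 mapping and the 0.6 acceptance cutoff, ratings of 4 and 5 (0.75 and 1.0) are accepted, while a rating of 3 (0.5) is rejected.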
Skip behavior:
- If claims are missing, relevance is disabled, or input is empty, returns empty metrics and zero cost.
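The skip conditions could be expressed as a small guard like the one below. This is only a sketch: `should_skip` and `empty_relevance_result` are hypothetical names, treating whitespace-only input as empty is an assumption, and the cost field names are copied from the cost object documented for this node.

```python
def should_skip(claims, user_input, enabled=True) -> bool:
    """True when any of the documented skip conditions holds."""
    # Assumption: whitespace-only input counts as empty.
    return not enabled or not claims or not (user_input or "").strip()


def empty_relevance_result() -> dict:
    """Empty metrics and zero cost, as returned when the node is skipped."""
    return {
        "metrics": [],
        "cost": {"cost": 0.0, "input_tokens": 0.0, "output_tokens": 0.0},
    }


if should_skip(claims=[], user_input="What is the Eiffel Tower?"):
    result = empty_relevance_result()
    print(result)
```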
Execution Flow
Input
Using your sample input:
{
"case_id": "eiffel-tower-basic",
"input": "What is the Eiffel Tower and where is it located?",
"output": "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. ......."
}
Fields used by the relevance branch:
- output: used upstream to produce claim_extraction
- input: used directly by the relevance judge
- case_id: used for case/report identity, not score computation
Fields not required by this node:
- context is not needed for relevance scoring
- reference is not needed for relevance scoring
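To make the roles of input and the extracted claims concrete, here is one way the judge payload might be assembled. The prompt wording used by nexa-gauge is not shown in this document, so the layout below and the name `build_judge_payload` are purely illustrative assumptions; only the idea of sending the input together with numbered claims comes from the text.

```python
def build_judge_payload(user_input: str, claims: list[str]) -> str:
    """Combine the input with a 1-based numbered list of claims (hypothetical layout)."""
    numbered = "\n".join(
        f"{index}. {claim}" for index, claim in enumerate(claims, start=1)
    )
    return f"Input:\n{user_input}\n\nClaims:\n{numbered}"


payload = build_judge_payload(
    "What is the Eiffel Tower and where is it located?",
    [
        "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
        "Transformers use self-attention in deep learning.",
    ],
)
print(payload)
```

Numbering the claims lets the judge's verdict list be matched back to claims by position.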
Output
Primary output type:
RelevanceMetrics
- metrics: list[MetricResult]
- cost: CostEstimate
Example output with scoring_mode: "scale_1_5" and include_reasoning: true:
{
"metrics": [
{
"name": "answer_relevancy",
"category": "output|generation|answer",
"score": 0.5,
"result": [
{
"item": {
"id": "11aa22bb33cc44dd",
"text": "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
"tokens": 12.0,
"confidence": 1.0,
"cached": false
},
"source_chunk_index": 0,
"confidence": 0.92,
"extraction_failed": false,
"verdict": "ACCEPTED",
"raw_score": 5
},
{
"item": {
"id": "55ee66ff77gg88hh",
"text": "Transformers use self-attention in deep learning.",
"tokens": 9.0,
"confidence": 1.0,
"cached": false
},
"source_chunk_index": 1,
"confidence": 0.85,
"extraction_failed": false,
"verdict": "REJECTED",
"raw_score": 1
},
{ "reasoning": "Most claims address the Eiffel Tower question; the transformers claim is unrelated." }
],
"error": null
}
],
"cost": {
"cost": 0.00039,
"input_tokens": 188.0,
"output_tokens": 24.0
}
}
In the default binary + no-reasoning configuration, per-claim raw_score is 0 or 1, and the trailing {"reasoning": ...} entry is omitted.
Attribute meaning:
- metrics: list of metric results for this node (empty when skipped)
- name: metric identifier (answer_relevancy in current implementation)
- category: answer
- score: mean of per-claim normalized scores in [0, 1]
- result: per-claim relevance judgments (RelevancyClaim), plus an optional trailing {"reasoning": "..."} dict when include_reasoning: true
- result[].item: claim text and token metadata
- result[].source_chunk_index: source output chunk index
- result[].confidence: claim extractor confidence
- result[].extraction_failed: extraction failure flag
- result[].verdict: ACCEPTED (per-claim score ≥ 0.6) or REJECTED
- result[].raw_score: the raw integer the judge emitted (1-5 for scale_1_5, 0/1 for binary)
- error: populated if judge output has no usable verdicts
- cost.cost: USD cost for relevance evaluation
- cost.input_tokens, cost.output_tokens: token usage for the judge call
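Because result mixes per-claim entries with an optional trailing reasoning dict, consumers may want to separate the two before iterating. The sketch below assumes the metric has already been loaded into a Python dict shaped like the example output (trimmed here to the fields it uses); `split_result` is a hypothetical helper, not part of nexa-gauge.

```python
from typing import Optional


def split_result(result: list[dict]) -> tuple[list[dict], Optional[str]]:
    """Separate per-claim entries from the optional trailing {"reasoning": ...} dict."""
    if result and set(result[-1]) == {"reasoning"}:
        return result[:-1], result[-1]["reasoning"]
    return list(result), None


# Trimmed-down metric using only fields described in the attribute list.
metric = {
    "name": "answer_relevancy",
    "result": [
        {"item": {"text": "The Eiffel Tower is a wrought-iron lattice tower in Paris."},
         "verdict": "ACCEPTED", "raw_score": 5},
        {"item": {"text": "Transformers use self-attention in deep learning."},
         "verdict": "REJECTED", "raw_score": 1},
        {"reasoning": "The transformers claim is unrelated."},
    ],
}

claims, reasoning = split_result(metric["result"])
rejected = [c["item"]["text"] for c in claims if c["verdict"] == "REJECTED"]
print(rejected, reasoning)
```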
Usage
OUTPUT_DIR=./out/relevance
mkdir -p "$OUTPUT_DIR"
Estimate Cost
nexagauge estimate relevance \
--input ./sample.json \
--limit 5 \
| tee "$OUTPUT_DIR/relevance-estimate.txt"
Note: estimate supports --input and --limit; it does not expose a native --output-dir flag, so redirect/tee is used with OUTPUT_DIR.
Run Evaluation
nexagauge run relevance \
--input ./sample.json \
--limit 5 \
--output-dir "$OUTPUT_DIR"
For full aggregation/report files including all metrics:
nexagauge run eval \
--input ./sample.json \
--limit 5 \
--output-dir "$OUTPUT_DIR"