Grounding (grounding)

Overview

grounding measures factual faithfulness: are the claims in a model answer actually supported by the provided context?

The metric design aligns with two core papers:

  • RAGAS (arXiv:2309.15217) frames faithfulness for RAG as a claim-level support check against retrieved passages, without needing a gold reference answer.
  • FActScore (arXiv:2305.14251) argues that factuality should be evaluated over atomic facts rather than as a single binary judgment, because long-form outputs often mix correct and incorrect statements.

In practice, this means:

  1. Break answer content into verifiable claims.
  2. Check each claim against context evidence.
  3. Aggregate claim verdicts into a faithfulness score.

nexa-gauge’s grounding node operationalizes this pattern using claims from the output and an LLM judge that returns boolean support verdicts per claim. The final score is the fraction of supported claims.
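The aggregation step above can be sketched in a few lines. This is a minimal illustration of "fraction of supported claims", not nexa-gauge's actual code; the function name `faithfulness_score` is hypothetical, and the sketch assumes the judge verdicts are already available as booleans.

```python
from typing import Sequence


def faithfulness_score(verdicts: Sequence[bool]) -> float:
    """Return the fraction of claims the judge marked as supported.

    Hypothetical helper: `verdicts` holds one boolean per claim.
    """
    if not verdicts:
        # This sketch leaves the "no claims" case to the caller.
        raise ValueError("at least one claim verdict is required")
    # True counts as 1 and False as 0, so the sum is the supported count.
    return sum(verdicts) / len(verdicts)


# Three of four claims are supported by the context.
print(faithfulness_score([True, True, False, True]))  # 0.75
```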

This metric is especially useful when you care about hallucination control and evidence-backed answering. It evaluates factual support, not style or completeness, so it should be combined with other metrics (for example relevance) for broader quality coverage.

Use Case

Use grounding when you need confidence that outputs stay tied to supplied evidence:

  • RAG QA systems (docs, knowledge bases, support bots)
  • Compliance/policy workflows where unsupported claims are risky
  • Regression testing after retrieval, prompt, or model changes
  • Benchmarking hallucination rate across model versions
  • Validating claim-level trustworthiness in generated summaries

Node Overview (nexa-gauge)

In nexa-gauge, grounding is an answer-category metric node.

What it does:

  • Receives a list of Claim objects from the upstream Claims Node.
  • Receives the context from the normalized scanner inputs.
  • Sends one judge prompt with:
    • full context text
    • numbered claims
  • Expects structured output that varies by scoring mode (see below).
  • Maps each raw judge value into a normalized per-claim score, then aggregates:
    • per-claim verdict = "ACCEPTED" when normalized score ≥ 0.6 (the pass threshold), else "REJECTED"
    • overall score = mean(per_claim_scores)
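To make the "one judge prompt with full context text and numbered claims" step concrete, here is a rough sketch of how such a prompt might be assembled. The real prompt template is not shown in this document, so the wording, the layout, and the function name `build_judge_prompt` are all assumptions made for illustration.

```python
from typing import Sequence


def build_judge_prompt(context: str, claims: Sequence[str]) -> str:
    """Assemble a single prompt holding the context and numbered claims.

    Hypothetical layout; the actual nexa-gauge template may differ.
    """
    # Number claims from 1 so the judge's verdict list can be matched
    # back to claims by position.
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(claims, start=1))
    return (
        "Context:\n"
        f"{context}\n\n"
        "Claims:\n"
        f"{numbered}\n\n"
        "For each claim, decide whether the context supports it."
    )


prompt = build_judge_prompt(
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    ["The Eiffel Tower is in Paris, France.", "The Eiffel Tower is located in Berlin."],
)
print(prompt)
```

Sending all claims in one call keeps the context in the prompt only once, which is presumably why the node batches claims rather than issuing one request per claim.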

Per-node scoring controls (grounding block)

Add an optional grounding block to your record to tune the judge's output:

json
"grounding": { "scoring_mode": "scale_1_5", "include_reasoning": true }

  • scoring_mode: binary_yes_no (default) or scale_1_5
    • binary_yes_no: judge returns {"verdicts": [true, false, ...]}. Per-claim score is 0 or 1.
    • scale_1_5: judge returns {"verdicts": [4, 1, 5, ...]} (integers 1-5). Each is normalized via (raw-1)/4, then averaged across claims.
  • include_reasoning: false (default) or true
    • When true, the judge also returns a single batch-level reasoning string that summarizes the decision. It is appended to MetricResult.result after the per-claim verdict entries.

Omitting the grounding block (or omitting either knob) falls back to the conservative defaults: binary verdicts and no reasoning.
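The two scoring modes and their defaults can be sketched as follows. This is an illustrative reading of the rules above (normalization via (raw-1)/4, pass threshold 0.6, mean over claims), not the node's implementation; `normalize`, `score_claims`, and `PASS_THRESHOLD` are hypothetical names.

```python
from statistics import fmean
from typing import List, Mapping, Optional, Sequence, Tuple, Union

PASS_THRESHOLD = 0.6  # pass threshold stated in the node overview


def normalize(raw: Union[bool, int], scoring_mode: str) -> float:
    """Map one raw judge value to a score in [0, 1]."""
    if scoring_mode == "binary_yes_no":
        return 1.0 if raw else 0.0
    if scoring_mode == "scale_1_5":
        if not 1 <= raw <= 5:
            raise ValueError(f"expected an integer in 1-5, got {raw!r}")
        return (raw - 1) / 4
    raise ValueError(f"unknown scoring_mode: {scoring_mode!r}")


def score_claims(
    verdicts: Sequence[Union[bool, int]],
    grounding: Optional[Mapping[str, object]] = None,
) -> Tuple[float, List[str]]:
    """Return (overall score, per-claim labels) for a list of raw verdicts."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    # A missing block or a missing key falls back to the binary default.
    mode = str((grounding or {}).get("scoring_mode", "binary_yes_no"))
    per_claim = [normalize(v, mode) for v in verdicts]
    labels = ["ACCEPTED" if s >= PASS_THRESHOLD else "REJECTED" for s in per_claim]
    return fmean(per_claim), labels


score, labels = score_claims([4, 1, 5], {"scoring_mode": "scale_1_5"})
print(round(score, 3), labels)  # 0.583 ['ACCEPTED', 'REJECTED', 'ACCEPTED']
```

Under these rules a raw 3 normalizes to 0.5 and is REJECTED, while a raw 4 normalizes to 0.75 and is ACCEPTED, so in scale_1_5 mode only ratings of 4 or 5 clear the 0.6 threshold.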

Skip behavior:

  • If there are no claims, there is no context, or grounding is disabled, the node returns empty metrics and zero cost.
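The skip rule can be pictured as a guard that runs before any judge call. The sketch below is an assumption about how such a guard might look; `should_skip` and `empty_metrics` are hypothetical helpers, and the result shape simply mirrors the GroundingMetrics fields described later in this page.

```python
from typing import Any, Dict, Optional, Sequence


def should_skip(
    claims: Sequence[Any], context: Optional[str], enabled: bool = True
) -> bool:
    """True when any of the three skip conditions holds."""
    return (not enabled) or (not claims) or (not context)


def empty_metrics() -> Dict[str, Any]:
    """Empty metrics with zero cost; token counts are null for skips."""
    return {
        "metrics": [],
        "cost": {"cost": 0.0, "input_tokens": None, "output_tokens": None},
    }


if should_skip(claims=[], context="The Eiffel Tower is in Paris."):
    print(empty_metrics())
```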

Execution Flow

Input

Using the sample input:

json
{
  "case_id": "eiffel-tower-basic",
  "input": "What is the Eiffel Tower and where is it located?",
  "output": "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. .......",
  "context": "The Eiffel Tower (/ˈaɪfəl/ EYE-fəl; French: Tour Eiffel) is a wrought-iron lattice tower on the Champ de Mars in Paris, France. ......."
}

Fields used by the grounding branch:

  • output: used (upstream) to create claims
  • context: used (directly) as evidence text for support verification
  • case_id: used for case identity/reporting, not for scoring logic

Fields not used by grounding:

  • input: not used by grounding (used by relevance)
  • reference: not used by grounding (used by refmatch and refalign)

Output

Primary output type:

  • GroundingMetrics
    • metrics: list[MetricResult]
    • cost: CostEstimate

Example output with scoring_mode: "scale_1_5" and include_reasoning: true:

json
{
  "metrics": [
    {
      "name": "grounding",
      "category": "output|generation|answer",
      "score": 0.5,
      "result": [
        {
          "item": {
            "id": "a1b2c3d4e5f6a7b8",
            "text": "The Eiffel Tower is in Paris, France.",
            "tokens": 10.0,
            "confidence": 1.0,
            "cached": false
          },
          "source_chunk_index": 0,
          "confidence": 0.93,
          "extraction_failed": false,
          "verdict": "ACCEPTED",
          "raw_score": 5
        },
        {
          "item": {
            "id": "b2c3d4e5f6a7b8c9",
            "text": "The Eiffel Tower is located in Berlin.",
            "tokens": 10.0,
            "confidence": 1.0,
            "cached": false
          },
          "source_chunk_index": 0,
          "confidence": 0.88,
          "extraction_failed": false,
          "verdict": "REJECTED",
          "raw_score": 1
        },
        { "reasoning": "Most claims are directly supported by the context; the Berlin claim contradicts it." }
      ],
      "error": null
    }
  ],
  "cost": {
    "cost": 0.00042,
    "input_tokens": 215.0,
    "output_tokens": 28.0
  }
}

In the default binary + no-reasoning configuration, the per-claim raw_score is 0 or 1, and the trailing {"reasoning": ...} entry is omitted entirely.

Attribute meaning:

  • metrics: one entry for this node (name="grounding"), or empty when skipped
  • name: metric/node identifier
  • category: output|generation|answer (from MetricCategory.ANSWER)
  • score: mean of per-claim normalized scores in [0,1]
  • result: per-claim faithfulness records, plus an optional trailing {"reasoning": "..."} dict when include_reasoning: true
  • result[].item: claim text and token metadata
  • result[].source_chunk_index: output chunk where claim came from
  • result[].confidence: extractor confidence for the claim
  • result[].extraction_failed: extraction failure marker
  • result[].verdict: ACCEPTED (per-claim score ≥ 0.6) or REJECTED
  • result[].raw_score: the raw integer the judge emitted (1-5 for scale_1_5, 0/1 for binary)
  • error: populated when verdict parsing fails (for example "No verdicts returned")
  • cost.cost: estimated or actual cost in USD for this node call
  • cost.input_tokens, cost.output_tokens: model token usage (or null for zero-cost skips)
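Because result mixes per-claim records with an optional trailing reasoning dict, consumers need to separate the two before iterating over claims. The following is a minimal sketch of one way to do that on an already-parsed metric entry; `split_result` and `rejected_claims` are hypothetical helpers, and the sample dict is a trimmed version of the example output above.

```python
from typing import Any, Dict, List, Optional, Tuple


def split_result(
    metric: Dict[str, Any],
) -> Tuple[List[Dict[str, Any]], Optional[str]]:
    """Separate per-claim records from the optional reasoning entry."""
    entries = metric.get("result") or []
    # Per-claim records carry a "verdict" key; the reasoning entry does not.
    claims = [e for e in entries if "verdict" in e]
    reasoning = next((e["reasoning"] for e in entries if "reasoning" in e), None)
    return claims, reasoning


def rejected_claims(metric: Dict[str, Any]) -> List[str]:
    """Return the text of every claim the judge did not accept."""
    claims, _ = split_result(metric)
    return [c["item"]["text"] for c in claims if c["verdict"] == "REJECTED"]


metric = {
    "name": "grounding",
    "result": [
        {"item": {"text": "The Eiffel Tower is in Paris, France."},
         "verdict": "ACCEPTED", "raw_score": 5},
        {"item": {"text": "The Eiffel Tower is located in Berlin."},
         "verdict": "REJECTED", "raw_score": 1},
        {"reasoning": "The Berlin claim contradicts the context."},
    ],
    "error": None,
}

print(rejected_claims(metric))  # ['The Eiffel Tower is located in Berlin.']
```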

Usage

bash
OUTPUT_DIR=./out/grounding
mkdir -p "$OUTPUT_DIR"

Estimate Cost

bash
nexagauge estimate grounding \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/estimate.txt"

Note: estimate currently supports --input and --limit, but not --output-dir; use tee to save estimate output.

Run Evaluation

bash
nexagauge run grounding \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"

For full per-case report files that include grounding plus other metrics:

bash
nexagauge run eval \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"