RefAlign (refalign)

Overview

refalign measures semantic similarity between model output and a reference answer using sentence embeddings. Where refmatch computes lexical overlap (ROUGE/BLEU/METEOR), refalign captures meaning-level agreement — paraphrases and synonyms that carry the same information but share few exact tokens.

The node operates at the segment level: both output and reference go through the same chunk → refine pipeline before embedding, so alignment is computed on coherent semantic units rather than raw token sequences. An optional LLM-assisted atomic-extraction step can split those segments further into minimal factual units before scoring.

refalign emits five complementary metrics:

  • refalign_precision — fraction of output segments that are semantically covered by the reference
  • refalign_recall — fraction of reference segments that are semantically covered by the output
  • refalign_f1 — harmonic mean of precision and recall
  • refalign_global_similarity — mean cosine similarity across all output segments
  • refalign_score — maximum pairwise cosine similarity between any output segment and any reference segment

Use Case

Use refalign when lexical overlap is insufficient to judge quality:

  • Evaluating paraphrase-heavy model outputs against reference answers
  • Detecting under-generation (low recall) vs. hallucination risk (low precision) separately
  • Measuring semantic fidelity of summaries or translations
  • Benchmarking outputs where gold answers use different vocabulary than model outputs
  • Complementing refmatch with a meaning-level signal in the same run

Node Overview

In nexa-gauge, refalign is an answer-category metric node.

What the node does:

  • Receives refined output chunks from upstream refiner
  • Receives refined reference chunks from upstream refine_reference
  • Optionally sends both chunk lists to the judge LLM for atomic decomposition (atomic_chunks: true)
  • Computes a cosine similarity matrix between all output-segment and reference-segment embeddings
  • Derives precision (best reference match for each output segment), recall (best output match for each reference segment), F1, global mean similarity, and the overall maximum pairwise similarity
  • Returns five MetricResult rows
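
The scoring steps above can be sketched in plain Python. This is a minimal illustration, not the node's actual code. The function name refalign_sketch is hypothetical, and embeddings are passed in as plain lists of floats. Three choices are assumptions that this page does not confirm: precision and recall are taken as the mean best-match similarity per segment, global similarity is taken as the mean over all output/reference pairs, and cosine values are clipped to [0, 1].

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def refalign_sketch(out_emb, ref_emb):
    """Illustrative scoring only; the aggregation choices are assumptions."""
    if not out_emb or not ref_emb:
        return {}  # no metrics when either side is absent

    # sim[i][j]: output segment i vs. reference segment j,
    # clipped to [0, 1] (assumption, to match the documented score range).
    sim = [[min(1.0, max(0.0, cosine(o, r))) for r in ref_emb] for o in out_emb]

    best_per_output = [max(row) for row in sim]           # one value per output segment
    best_per_reference = [max(col) for col in zip(*sim)]  # one value per reference segment

    precision = sum(best_per_output) / len(best_per_output)
    recall = sum(best_per_reference) / len(best_per_reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    all_pairs = [s for row in sim for s in row]
    return {
        "refalign_precision": precision,
        "refalign_recall": recall,
        "refalign_f1": f1,
        "refalign_global_similarity": sum(all_pairs) / len(all_pairs),
        "refalign_score": max(all_pairs),
    }
```

With two orthogonal output vectors and a single reference vector equal to the first one, this sketch yields precision 0.5 and recall 1.0: the unmatched output segment lowers precision, while the only reference segment is fully covered.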

Per-case config (refalign block)

Add an optional refalign block to tune the node's behavior:

json
"refalign": {
  "atomic_chunks": false,
  "similarity_threshold": 0.6,
  "refine_top_k": null
}
| Knob | Default | Description |
| --- | --- | --- |
| atomic_chunks | false | When true, calls the judge LLM to split each output/reference chunk into atomic factual units before embedding. More accurate for long compound sentences; adds LLM cost. |
| similarity_threshold | 0.6 | Cosine similarity at or above which a segment pair is considered matched. Affects per-segment verdict; does not directly change the aggregate score. |
| refine_top_k | null | Override the MMR top_k for the reference refinement step. null uses the global refiner setting. |

Omitting the refalign block runs with all defaults.
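
As a rough illustration of how the defaults combine with a per-case block, the sketch below overlays a record's optional refalign block on the documented default values. The names REFALIGN_DEFAULTS and resolve_refalign_config are hypothetical, and rejecting unknown knobs is a choice made for this sketch, not documented node behavior.

```python
# Documented defaults for the refalign block.
REFALIGN_DEFAULTS = {
    "atomic_chunks": False,
    "similarity_threshold": 0.6,
    "refine_top_k": None,  # None stands for JSON null: use the global refiner setting
}


def resolve_refalign_config(record: dict) -> dict:
    """Overlay a record's optional refalign block on the documented defaults."""
    overrides = record.get("refalign") or {}
    # Assumption: unknown knobs are treated as errors in this sketch.
    unknown = set(overrides) - set(REFALIGN_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown refalign knobs: {sorted(unknown)}")
    return {**REFALIGN_DEFAULTS, **overrides}
```

A record without a refalign block resolves to the defaults unchanged, and a record that sets only atomic_chunks keeps similarity_threshold at 0.6 and refine_top_k at null.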

Execution Flow

Both the output and reference go through independent chunk → refine branches so the comparison units are consistently sized on both sides.

Input

json
{
  "case_id": "bitcoin-economics-medium",
  "output": "Bitcoin is a decentralised digital currency created in 2009. It operates on a peer-to-peer network without a central authority. Transactions are recorded on a public blockchain.",
  "reference": "Bitcoin is a decentralised digital currency launched in 2009, using blockchain technology and proof-of-work mining to verify transactions without a central authority. Its supply is capped at 21 million coins."
}

Fields used by refalign:

  • output: refined chunks used as the "candidate" side of the similarity matrix
  • reference: refined chunks used as the "reference" side
  • refalign (optional): config block with atomic_chunks, similarity_threshold, refine_top_k

Fields not used for scoring:

  • input, context — not needed by this node
  • case_id — report identity only

Output

The primary output type is RefalignMetrics (ng_core/types.py), with two fields:

  • metrics: list[MetricResult]
  • cost: CostEstimate

Example output (default config, atomic_chunks: false):

json
{
  "metrics": [
    {
      "name": "refalign_precision",
      "category": "output|generation|answer",
      "score": 0.82,
      "verdict": "PASSED",
      "result": [{"passed": true}],
      "error": null
    },
    {
      "name": "refalign_recall",
      "category": "output|generation|answer",
      "score": 0.74,
      "verdict": "PASSED",
      "result": [{"passed": true}],
      "error": null
    },
    {
      "name": "refalign_f1",
      "category": "output|generation|answer",
      "score": 0.78,
      "verdict": "PASSED",
      "result": [{"passed": true}],
      "error": null
    },
    {
      "name": "refalign_global_similarity",
      "category": "output|generation|answer",
      "score": 0.79,
      "verdict": "PASSED",
      "result": [{"passed": true}],
      "error": null
    },
    {
      "name": "refalign_score",
      "category": "output|generation|answer",
      "score": 0.88,
      "verdict": "PASSED",
      "result": [{"passed": true}],
      "error": null
    }
  ],
  "cost": {
    "cost": 0.0,
    "input_tokens": null,
    "output_tokens": null
  }
}

When atomic_chunks: true, the node makes LLM calls for decomposition and cost reflects that usage.

Attribute meaning:

  • metrics: five results when both output and reference are present; empty when either is absent
  • name: refalign_precision, refalign_recall, refalign_f1, refalign_global_similarity, or refalign_score
  • category: output|generation|answer
  • score: value in [0,1] (higher is better semantic alignment)
  • verdict: PASSED or FAILED against per-metric pass thresholds (precision 0.6, recall 0.6, F1 0.7, global similarity 0.6, refalign_score 0.8)
  • result[0].passed: boolean pass/fail for that metric
  • error: populated when embedding or LLM decomposition fails; null on success
  • cost.cost: 0.0 when atomic_chunks: false; non-zero when LLM decomposition is enabled
  • cost.input_tokens, cost.output_tokens: LLM token usage from atomic extraction, or null when no LLM calls are made
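
The relationship between score, verdict, and result[0].passed can be sketched as follows, using the per-metric pass thresholds listed above. The helper name to_metric_result is hypothetical. Treating a score exactly equal to its threshold as a pass is an assumption, since this page does not say whether the comparison is inclusive.

```python
# Per-metric pass thresholds as listed in the attribute description.
PASS_THRESHOLDS = {
    "refalign_precision": 0.6,
    "refalign_recall": 0.6,
    "refalign_f1": 0.7,
    "refalign_global_similarity": 0.6,
    "refalign_score": 0.8,
}


def to_metric_result(name: str, score: float) -> dict:
    """Build a MetricResult-shaped dict for one refalign metric."""
    # Assumption: the comparison is inclusive (score == threshold passes).
    passed = score >= PASS_THRESHOLDS[name]
    return {
        "name": name,
        "category": "output|generation|answer",
        "score": score,
        "verdict": "PASSED" if passed else "FAILED",
        "result": [{"passed": passed}],
        "error": None,
    }
```

Under this reading, verdict and result[0].passed always agree: a refalign_global_similarity of 0.79 clears its 0.6 threshold, while a refalign_f1 of 0.65 falls short of 0.7 and is reported as FAILED.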

Usage

bash
OUTPUT_DIR=./out/refalign
mkdir -p "$OUTPUT_DIR"

CLI: Estimate Cost

bash
nexagauge estimate refalign \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/refalign_estimate.txt"

The estimate command supports --input and --limit. It has no --output-dir flag, so redirect its output or pipe it through tee to save it.

CLI: Run Evaluation

bash
nexagauge run refalign \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5

To run refalign alongside all other metrics in a single pass:

bash
nexagauge run eval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5

Enable Atomic Extraction

Add a refalign block to your records to enable LLM-assisted atomic decomposition:

json
{
  "case_id": "example",
  "output": "...",
  "reference": "...",
  "refalign": { "atomic_chunks": true }
}

This makes the node split coarse sentences into minimal factual units before embedding, improving alignment accuracy for long compound statements at the cost of additional LLM calls per case.
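
To switch atomic extraction on for records that already exist, a small helper along these lines could be used. It is a sketch under assumptions: enable_atomic_chunks is a hypothetical name, and records are assumed to be plain dicts loaded from JSON. It leaves any other knobs in an existing refalign block untouched.

```python
import copy


def enable_atomic_chunks(record: dict) -> dict:
    """Return a copy of the record with refalign.atomic_chunks set to true."""
    updated = copy.deepcopy(record)        # never mutate the caller's record
    block = updated.get("refalign") or {}  # tolerate a missing or null block
    block["atomic_chunks"] = True
    updated["refalign"] = block
    return updated
```

Because the helper works on a deep copy, the original record can still be evaluated with the default configuration for a side-by-side comparison.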