RefAlign (`refalign`)

Overview

refalign measures semantic similarity between model output and a reference answer using sentence embeddings. Where refmatch computes lexical overlap (ROUGE/BLEU/METEOR), refalign captures meaning-level agreement — paraphrases and synonyms that carry the same information but share few exact tokens.

The node operates at the segment level: both output and reference go through the same chunk → refine pipeline before embedding, so alignment is computed on coherent semantic units rather than raw token sequences. An optional LLM-assisted atomic-extraction step can split those segments further into minimal factual units before scoring.

refalign emits five complementary metrics:

refalign_precision — fraction of output segments that are semantically covered by the reference
refalign_recall — fraction of reference segments that are semantically covered by the output
refalign_f1 — harmonic mean of precision and recall
refalign_global_similarity — mean cosine similarity across all output segments
refalign_score — maximum pairwise cosine similarity between any output segment and any reference segment

Use Case

Use refalign when lexical overlap is insufficient to judge quality:

Evaluating paraphrase-heavy model outputs against reference answers
Detecting under-generation (low recall) vs. hallucination risk (low precision) separately
Measuring semantic fidelity of summaries or translations
Benchmarking outputs where gold answers use different vocabulary than model outputs
Complementing refmatch with a meaning-level signal in the same run

Node Overview

In nexa-gauge, refalign is an answer-category metric node.

What the node does:

Receives refined output chunks from upstream refiner
Receives refined reference chunks from upstream refine_reference
Optionally sends both chunk lists to the judge LLM for atomic decomposition (atomic_chunks: true)
Computes a cosine similarity matrix between all output-segment and reference-segment embeddings
Derives precision (max-column matching), recall (max-row matching), F1, global mean similarity, and the overall max pairwise similarity
Returns five MetricResult rows
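
The matrix and aggregation steps above can be sketched in plain Python. This is a minimal illustration rather than the node's actual implementation. The function names are hypothetical, "covered" is assumed to mean a best-match similarity at or above similarity_threshold, and refalign_global_similarity is assumed to be the mean best-match similarity per output segment.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(out_emb, ref_emb):
    """Rows index output segments, columns index reference segments."""
    return [[cosine(o, r) for r in ref_emb] for o in out_emb]


def refalign_metrics(sim, threshold=0.6):
    """Derive the five scores from a similarity matrix (assumed definitions)."""
    if not sim or not sim[0]:
        return {}  # nothing to score when either side has no segments

    best_for_output = [max(row) for row in sim]            # best reference per output segment
    best_for_reference = [max(col) for col in zip(*sim)]   # best output per reference segment

    # Assumption: a segment counts as covered when its best match meets the threshold.
    precision = sum(s >= threshold for s in best_for_output) / len(best_for_output)
    recall = sum(s >= threshold for s in best_for_reference) / len(best_for_reference)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "refalign_precision": precision,
        "refalign_recall": recall,
        "refalign_f1": f1,
        # Assumption: mean of each output segment's best-match similarity.
        "refalign_global_similarity": sum(best_for_output) / len(best_for_output),
        "refalign_score": max(best_for_output),  # overall max pairwise similarity
    }
```

In practice the embeddings would come from the sentence-embedding model; the sketch takes them as plain lists of floats so it stays self-contained.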

Per-case config (`refalign` block)

Add an optional refalign block to tune the node's behavior:

```json
"refalign": {
  "atomic_chunks": false,
  "similarity_threshold": 0.6,
  "refine_top_k": null
}
```

| Knob | Default | Description |
| --- | --- | --- |
| `atomic_chunks` | `false` | When `true`, calls the judge LLM to split each output/reference chunk into atomic factual units before embedding. More accurate for long compound sentences; adds LLM cost. |
| `similarity_threshold` | `0.6` | Cosine similarity at or above which a segment pair is considered matched. Affects per-segment verdict; does not directly change the aggregate score. |
| `refine_top_k` | `null` | Override the MMR `top_k` for the reference refinement step. `null` uses the global refiner setting. |

Omitting the refalign block runs with all defaults.
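
One way to picture how defaults and per-case overrides combine is a simple dictionary merge. The sketch below is illustrative only: the names REFALIGN_DEFAULTS and resolve_refalign_config are hypothetical, and the real node may resolve its config differently.

```python
# Defaults taken from the knob table above.
REFALIGN_DEFAULTS = {
    "atomic_chunks": False,
    "similarity_threshold": 0.6,
    "refine_top_k": None,
}


def resolve_refalign_config(case: dict) -> dict:
    """Return the effective config: defaults overlaid with the case's refalign block."""
    overrides = case.get("refalign") or {}  # a missing block means all defaults
    return {**REFALIGN_DEFAULTS, **overrides}
```

A record with `"refalign": {"atomic_chunks": true}` would then keep the default threshold of 0.6 while enabling atomic extraction.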

Execution Flow

Graph

Both the output and reference go through independent chunk → refine branches so the comparison units are consistently sized on both sides.

Input

```json
{
  "case_id": "bitcoin-economics-medium",
  "output": "Bitcoin is a decentralised digital currency created in 2009. It operates on a peer-to-peer network without a central authority. Transactions are recorded on a public blockchain.",
  "reference": "Bitcoin is a decentralised digital currency launched in 2009, using blockchain technology and proof-of-work mining to verify transactions without a central authority. Its supply is capped at 21 million coins."
}
```

Fields used by refalign:

output: refined chunks used as the "candidate" side of the similarity matrix
reference: refined chunks used as the "reference" side
refalign (optional): config block with atomic_chunks, similarity_threshold, refine_top_k

Fields not used for scoring:

input, context — not needed by this node
case_id — report identity only
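
Because the node needs both sides of the comparison, it can be handy to pre-check records before a run. The helper below is a hypothetical convenience, not part of the CLI, and it assumes that an empty or whitespace-only string should be treated the same as a missing field.

```python
def has_refalign_inputs(record: dict) -> bool:
    """True when the record carries non-empty output and reference strings."""
    return all(
        isinstance(record.get(key), str) and record[key].strip() != ""
        for key in ("output", "reference")
    )
```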

Output

Primary output type is RefalignMetrics (ng_core/types.py).

metrics: list[MetricResult]
cost: CostEstimate

Example output (default config, atomic_chunks: false):

```json
{
  "metrics": [
    {
      "name": "refalign_precision",
      "category": "output|generation|answer",
      "score": 0.82,
      "verdict": "PASSED",
      "result": [{"passed": true}],
      "error": null
    },
    {
      "name": "refalign_recall",
      "category": "output|generation|answer",
      "score": 0.74,
      "verdict": "PASSED",
      "result": [{"passed": true}],
      "error": null
    },
    {
      "name": "refalign_f1",
      "category": "output|generation|answer",
      "score": 0.78,
      "verdict": "PASSED",
      "result": [{"passed": true}],
      "error": null
    },
    {
      "name": "refalign_global_similarity",
      "category": "output|generation|answer",
      "score": 0.79,
      "verdict": "PASSED",
      "result": [{"passed": true}],
      "error": null
    },
    {
      "name": "refalign_score",
      "category": "output|generation|answer",
      "score": 0.88,
      "verdict": "PASSED",
      "result": [{"passed": true}],
      "error": null
    }
  ],
  "cost": {
    "cost": 0.0,
    "input_tokens": null,
    "output_tokens": null
  }
}
```

When atomic_chunks: true, the node makes LLM calls for decomposition and cost reflects that usage.
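
As a quick consistency check on the example scores above, the F1 value follows from the reported precision (0.82) and recall (0.74) via the harmonic mean:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.82 \times 0.74}{0.82 + 0.74}
    = \frac{1.2136}{1.56}
    \approx 0.778
```

Rounded to two decimals this gives the 0.78 shown for refalign_f1.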

Attribute meaning:

metrics: five results when both output and reference are present; empty when either is absent
name: refalign_precision, refalign_recall, refalign_f1, refalign_global_similarity, or refalign_score
category: output|generation|answer
score: value in [0,1] (higher is better semantic alignment)
verdict: PASSED or FAILED against per-metric pass thresholds (precision 0.6, recall 0.6, F1 0.7, global similarity 0.6, refalign_score 0.8)
result[0].passed: boolean pass/fail for that metric
error: populated when embedding or LLM decomposition fails; null on success
cost.cost: 0.0 when atomic_chunks: false; non-zero when LLM decomposition is enabled
cost.input_tokens, cost.output_tokens: LLM token usage from atomic extraction, or null when no LLM calls are made
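
The relationship between score, verdict, and result[0].passed can be sketched as a threshold lookup. This is an assumption-laden illustration: the names PASS_THRESHOLDS and verdict_for are hypothetical, and the comparison is assumed to be inclusive (a score equal to the threshold passes).

```python
# Per-metric pass thresholds as listed in the attribute description above.
PASS_THRESHOLDS = {
    "refalign_precision": 0.6,
    "refalign_recall": 0.6,
    "refalign_f1": 0.7,
    "refalign_global_similarity": 0.6,
    "refalign_score": 0.8,
}


def verdict_for(name: str, score: float) -> dict:
    """Build the verdict fields for one metric (inclusive comparison assumed)."""
    passed = score >= PASS_THRESHOLDS[name]
    return {
        "name": name,
        "score": score,
        "verdict": "PASSED" if passed else "FAILED",
        "result": [{"passed": passed}],
    }
```

Applied to the example output, all five metrics clear their thresholds, so each verdict is PASSED.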

Usage

```bash
OUTPUT_DIR=./out/refalign
mkdir -p "$OUTPUT_DIR"
```

CLI: Estimate Cost

```bash
nexagauge estimate refalign \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/refalign_estimate.txt"
```

estimate supports --input and --limit. It has no --output-dir flag, so redirect the output or pipe it through tee to save it.

CLI: Run Evaluation

```bash
nexagauge run refalign \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5
```

To run refalign alongside all other metrics in a single pass:

```bash
nexagauge run eval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5
```

Enable Atomic Extraction

Add a refalign block to your records to enable LLM-assisted atomic decomposition:

```json
{
  "case_id": "example",
  "output": "...",
  "reference": "...",
  "refalign": { "atomic_chunks": true }
}
```

This makes the node split coarse sentences into minimal factual units before embedding, improving alignment accuracy for long compound statements at the cost of additional LLM calls per case.
