RefAlign (refalign)
Overview
refalign measures semantic similarity between model output and a reference answer using sentence embeddings. Where refmatch computes lexical overlap (ROUGE/BLEU/METEOR), refalign captures meaning-level agreement — paraphrases and synonyms that carry the same information but share few exact tokens.
The node operates at the segment level: both output and reference go through the same chunk → refine pipeline before embedding, so alignment is computed on coherent semantic units rather than raw token sequences. An optional LLM-assisted atomic-extraction step can split those segments further into minimal factual units before scoring.
refalign emits five complementary metrics:
- refalign_precision: fraction of output segments that are semantically covered by the reference
- refalign_recall: fraction of reference segments that are semantically covered by the output
- refalign_f1: harmonic mean of precision and recall
- refalign_global_similarity: mean cosine similarity across all output segments
- refalign_score: maximum pairwise cosine similarity between any output segment and any reference segment
Use Case
Use refalign when lexical overlap is insufficient to judge quality:
- Evaluating paraphrase-heavy model outputs against reference answers
- Detecting under-generation (low recall) vs. hallucination risk (low precision) separately
- Measuring semantic fidelity of summaries or translations
- Benchmarking outputs where gold answers use different vocabulary than model outputs
- Complementing refmatch with a meaning-level signal in the same run
Node Overview
In nexa-gauge, refalign is an answer-category metric node.
What the node does:
- Receives refined output chunks from upstream refiner
- Receives refined reference chunks from upstream refine_reference
- Optionally sends both chunk lists to the judge LLM for atomic decomposition (atomic_chunks: true)
- Computes a cosine similarity matrix between all output-segment and reference-segment embeddings
- Derives precision (max-column matching), recall (max-row matching), F1, global mean similarity, and the overall max pairwise similarity
- Returns five MetricResult rows
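The scoring steps above can be sketched in a few lines of pure Python. This is an illustrative sketch, not the node's implementation: `cosine` and `refalign_metrics` are hypothetical names, the 2-D vectors stand in for real sentence embeddings, and three aggregation choices are assumptions (precision and recall average each segment's best match, and global similarity averages every pair).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def refalign_metrics(output_vecs, reference_vecs):
    """Derive five scores from a similarity matrix (rows: output, columns: reference)."""
    if not output_vecs or not reference_vecs:
        return {}  # nothing to align when either side is missing
    sim = [[cosine(o, r) for r in reference_vecs] for o in output_vecs]
    # Assumption: precision averages each output segment's best reference match.
    precision = sum(max(row) for row in sim) / len(output_vecs)
    # Assumption: recall averages each reference segment's best output match.
    recall = sum(max(col) for col in zip(*sim)) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    # Assumption: global similarity is the mean over every output/reference pair.
    flat = [s for row in sim for s in row]
    return {
        "refalign_precision": precision,
        "refalign_recall": recall,
        "refalign_f1": f1,
        "refalign_global_similarity": sum(flat) / len(flat),
        "refalign_score": max(flat),
    }

# Toy 2-D vectors standing in for sentence embeddings.
output_vecs = [[1.0, 0.0], [0.0, 1.0]]
reference_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
scores = refalign_metrics(output_vecs, reference_vecs)
```

In this toy case every output segment has an exact counterpart in the reference, so precision is 1.0, while the middle reference segment is only partially matched and pulls recall below 1.0.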
Per-case config (refalign block)
Add an optional refalign block to tune the node's behavior:
"refalign": {
"atomic_chunks": false,
"similarity_threshold": 0.6,
"refine_top_k": null
}
| Knob | Default | Description |
|---|---|---|
| atomic_chunks | false | When true, calls the judge LLM to split each output/reference chunk into atomic factual units before embedding. More accurate for long compound sentences; adds LLM cost. |
| similarity_threshold | 0.6 | Cosine similarity at or above which a segment pair is considered matched. Affects per-segment verdict; does not directly change the aggregate score. |
| refine_top_k | null | Override the MMR top_k for the reference refinement step. null uses the global refiner setting. |
Omitting the refalign block runs with all defaults.
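One way to picture how the defaults apply is a plain dictionary merge. The sketch below is hypothetical (`resolve_refalign_config` is not a documented function); it uses only the three knobs and defaults listed in the table, and rejecting unknown keys is a choice made for this example rather than documented behavior.

```python
# Defaults taken from the knob table; JSON null/false map to Python None/False.
DEFAULTS = {"atomic_chunks": False, "similarity_threshold": 0.6, "refine_top_k": None}

def resolve_refalign_config(record):
    """Merge a record's optional refalign block over the documented defaults."""
    block = record.get("refalign") or {}
    unknown = set(block) - set(DEFAULTS)
    if unknown:
        # Assumption for this sketch: surface typos instead of ignoring them.
        raise ValueError(f"unknown refalign knobs: {sorted(unknown)}")
    return {**DEFAULTS, **block}

config = resolve_refalign_config(
    {"case_id": "example", "refalign": {"atomic_chunks": True}}
)
```

A record without a refalign block resolves to the defaults unchanged.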
Execution Flow
Both the output and reference go through independent chunk → refine branches so the comparison units are consistently sized on both sides.
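The symmetry of the two branches can be illustrated with a toy pipeline. Everything here is a stand-in: `chunk`, `refine`, and `prepare` are hypothetical names, the chunker simply splits on sentence punctuation, and the refiner only drops duplicates, whereas the real refiner is described elsewhere in this page as MMR-based.

```python
import re

def chunk(text):
    """Toy chunker: split after sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def refine(chunks):
    """Toy refiner: drop case-insensitive duplicates, keeping first-seen order."""
    seen, kept = set(), []
    for c in chunks:
        key = c.lower()
        if key not in seen:
            seen.add(key)
            kept.append(c)
    return kept

def prepare(text):
    """Run one side (output or reference) through the same chunk -> refine steps."""
    return refine(chunk(text))

record = {
    "output": "Bitcoin launched in 2009. It has no central authority. "
              "It has no central authority.",
    "reference": "Bitcoin launched in 2009. Its supply is capped at 21 million coins.",
}
output_segments = prepare(record["output"])
reference_segments = prepare(record["reference"])
```

Because both sides pass through the same `prepare` function, the segments compared later have the same granularity.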
Input
{
"case_id": "bitcoin-economics-medium",
"output": "Bitcoin is a decentralised digital currency created in 2009. It operates on a peer-to-peer network without a central authority. Transactions are recorded on a public blockchain.",
"reference": "Bitcoin is a decentralised digital currency launched in 2009, using blockchain technology and proof-of-work mining to verify transactions without a central authority. Its supply is capped at 21 million coins."
}
Fields used by refalign:
- output: refined chunks used as the "candidate" side of the similarity matrix
- reference: refined chunks used as the "reference" side
- refalign (optional): config block with atomic_chunks, similarity_threshold, refine_top_k
Fields not used for scoring:
- input, context: not needed by this node
- case_id: report identity only
Output
Primary output type is RefalignMetrics (ng_core/types.py).
- metrics: list[MetricResult]
- cost: CostEstimate
Example output (default config, atomic_chunks: false):
{
"metrics": [
{
"name": "refalign_precision",
"category": "output|generation|answer",
"score": 0.82,
"verdict": "PASSED",
"result": [{"passed": true}],
"error": null
},
{
"name": "refalign_recall",
"category": "output|generation|answer",
"score": 0.74,
"verdict": "PASSED",
"result": [{"passed": true}],
"error": null
},
{
"name": "refalign_f1",
"category": "output|generation|answer",
"score": 0.78,
"verdict": "PASSED",
"result": [{"passed": true}],
"error": null
},
{
"name": "refalign_global_similarity",
"category": "output|generation|answer",
"score": 0.79,
"verdict": "PASSED",
"result": [{"passed": true}],
"error": null
},
{
"name": "refalign_score",
"category": "output|generation|answer",
"score": 0.88,
"verdict": "PASSED",
"result": [{"passed": true}],
"error": null
}
],
"cost": {
"cost": 0.0,
"input_tokens": null,
"output_tokens": null
}
}
When atomic_chunks: true, the node makes LLM calls for decomposition and cost reflects that usage.
Attribute meaning:
- metrics: five results when both output and reference are present; empty when either is absent
- name: refalign_precision, refalign_recall, refalign_f1, refalign_global_similarity, or refalign_score
- category: output|generation|answer
- score: value in [0,1] (higher is better semantic alignment)
- verdict: PASSED or FAILED against per-metric pass thresholds (precision 0.6, recall 0.6, F1 0.7, global similarity 0.6, refalign_score 0.8)
- result[0].passed: boolean pass/fail for that metric
- error: populated when embedding or LLM decomposition fails; null on success
- cost.cost: 0.0 when atomic_chunks: false; non-zero when LLM decomposition is enabled
- cost.input_tokens, cost.output_tokens: LLM token usage from atomic extraction, or null when no LLM calls are made
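To make the relationship between score, verdict, and result[0].passed concrete, here is a small sketch that turns raw scores into rows shaped like the example output. `to_metric_rows` is a hypothetical helper, the rows are plain dictionaries rather than the real MetricResult type, and treating a score equal to the threshold as a pass is an assumption.

```python
# Per-metric pass thresholds as listed in the attribute descriptions.
PASS_THRESHOLDS = {
    "refalign_precision": 0.6,
    "refalign_recall": 0.6,
    "refalign_f1": 0.7,
    "refalign_global_similarity": 0.6,
    "refalign_score": 0.8,
}

def to_metric_rows(scores):
    """Build one result row per metric from a name -> score mapping."""
    rows = []
    for name, threshold in PASS_THRESHOLDS.items():
        # Assumption: the comparison is inclusive (score >= threshold passes).
        passed = scores[name] >= threshold
        rows.append({
            "name": name,
            "category": "output|generation|answer",
            "score": scores[name],
            "verdict": "PASSED" if passed else "FAILED",
            "result": [{"passed": passed}],
            "error": None,
        })
    return rows

rows = to_metric_rows({
    "refalign_precision": 0.82,
    "refalign_recall": 0.74,
    "refalign_f1": 0.78,
    "refalign_global_similarity": 0.79,
    "refalign_score": 0.88,
})
```

With the scores from the example output, all five rows come out as PASSED, and verdict and result[0].passed always agree because both are derived from the same comparison.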
Usage
OUTPUT_DIR=./out/refalign
mkdir -p "$OUTPUT_DIR"
CLI: Estimate Cost
nexagauge estimate refalign \
--input ./sample.json \
--limit 5 \
| tee "$OUTPUT_DIR/refalign_estimate.txt"
estimate supports --input and --limit. It has no --output-dir flag; redirect or tee the output to save it.
CLI: Run Evaluation
nexagauge run refalign \
--input ./sample.json \
--output-dir "$OUTPUT_DIR" \
--limit 5
To run refalign alongside all other metrics in a single pass:
nexagauge run eval \
--input ./sample.json \
--output-dir "$OUTPUT_DIR" \
--limit 5
Enable Atomic Extraction
Add a refalign block to your records to enable LLM-assisted atomic decomposition:
{
"case_id": "example",
"output": "...",
"reference": "...",
"refalign": { "atomic_chunks": true }
}
This makes the node split coarse sentences into minimal factual units before embedding, improving alignment accuracy for long compound statements at the cost of additional LLM calls per case.
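When many records need the flag, it may be easier to set it programmatically. The helper below is a hypothetical sketch that works on an in-memory list of record dictionaries; how records are laid out inside the input file is not specified here, so loading and saving are left out.

```python
import copy

def enable_atomic_chunks(records):
    """Return copies of the records with refalign.atomic_chunks set to true."""
    updated = []
    for record in records:
        new_record = copy.deepcopy(record)  # leave the caller's data untouched
        # Keep any knobs already present in the refalign block.
        new_record.setdefault("refalign", {})["atomic_chunks"] = True
        updated.append(new_record)
    return updated

records = [
    {"case_id": "a", "output": "...", "reference": "..."},
    {"case_id": "b", "output": "...", "reference": "...",
     "refalign": {"similarity_threshold": 0.7}},
]
patched = enable_atomic_chunks(records)
```

Existing knobs such as similarity_threshold are preserved, and the original records are not modified.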