RefMatch (refmatch)

Overview

refmatch is a lexical overlap evaluation node that compares a model output to a gold reference answer using ROUGE, BLEU, and METEOR metrics.

The metric family comes from established summarization and machine translation (MT) evaluation work: BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE (Lin, 2004). In practice, these metrics provide fast, deterministic signals of word- and phrase-level overlap between candidate and reference text.

In nexa-gauge, refmatch computes five scores in [0,1]:

  • rouge1 (unigram overlap)
  • rouge2 (bigram overlap)
  • rougeL (overlap based on the longest common subsequence)
  • bleu
  • meteor
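
To make the unigram and bigram scores concrete, here is a minimal pure-Python sketch of a ROUGE-N style F1 built from clipped n-gram counts. It assumes lowercased whitespace tokenization, which may differ from the tokenization and normalization refmatch actually applies, and the helper names `ngrams` and `ngram_f1` are illustrative rather than part of nexa-gauge.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_f1(candidate, reference, n=1):
    """ROUGE-N style F1 from clipped n-gram counts.

    Assumption: lowercased whitespace tokens; a real scorer may tokenize differently.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


candidate = "bitcoin is a digital currency"
reference = "bitcoin is a decentralised digital currency"
print(round(ngram_f1(candidate, reference, n=1), 4))  # 0.9091
print(round(ngram_f1(candidate, reference, n=2), 4))  # 0.6667
```

The bigram score drops faster than the unigram score because a single inserted word breaks two adjacent bigrams, which is why rouge2 is usually the stricter of the two.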

Unlike judge-model metrics, this node does not call an LLM and always reports zero cost. It is most useful as a fast baseline similarity signal, typically combined with the semantic refalign node or LLM-judge metrics for fuller quality assessment.

Note: The shared scoring_mode and include_reasoning knobs available on the LLM-judge nodes (geval, grounding, relevance, redteam) do not apply to refmatch. ROUGE, BLEU, and METEOR are deterministic lexical metrics, so there is no judge to configure.

Use Case

Use refmatch when you have trusted reference answers and want fast, deterministic overlap-based quality checks.

  • Regression checks for answer fidelity against a gold target
  • Benchmark scoring where deterministic, low-latency metrics are needed
  • Sanity checking summarization or QA outputs before deeper judge-based evaluation
  • Comparing model variants with a consistent lexical baseline
  • Cost-sensitive pipelines that need non-LLM metrics

For paraphrase-sensitive comparison (same meaning, different wording), combine refmatch with refalign, which scores semantic similarity via embeddings.

Node Overview

In nexa-gauge, refmatch is an answer metric node.

What the node does:

  • Reads normalized output and reference text
  • Skips when reference is missing or blank (returns empty metrics, zero cost)
  • Computes ROUGE-1/2/L (F1), BLEU (smoothed sentence BLEU), and METEOR
  • Returns one MetricResult per metric
  • Returns zero-cost CostEstimate because no model calls are made
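
The behaviour listed above can be sketched roughly as follows. This is not the nexa-gauge implementation: the `MetricResult` and `CostEstimate` dataclasses are hypothetical stand-ins whose fields are copied from the example output on this page, `run_refmatch_sketch` is an invented name, and only rougeL is computed (as an LCS-based F1 over lowercased whitespace tokens) to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricResult:  # hypothetical stand-in for the real type
    name: str
    score: Optional[float]
    category: str = "output|generation|answer"
    result: Optional[str] = None
    error: Optional[str] = None


@dataclass
class CostEstimate:  # hypothetical stand-in for the real type
    cost: float = 0.0
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None


def lcs_length(a, b):
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F1; assumes lowercased whitespace tokenization."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def run_refmatch_sketch(output, reference):
    """Return (metrics, cost); metrics is empty when the reference is missing or blank."""
    cost = CostEstimate()  # no model calls, so cost stays at zero
    if reference is None or not reference.strip():
        return [], cost
    return [MetricResult(name="rougeL", score=rouge_l_f1(output, reference))], cost


metrics, cost = run_refmatch_sketch("the cat sat on the mat", "the cat lay on the mat")
print(metrics[0].name, round(metrics[0].score, 4), cost.cost)  # rougeL 0.8333 0.0
print(run_refmatch_sketch("anything", "   ")[0])  # []
```

Returning an empty list rather than zero-valued scores keeps skipped cases distinguishable from cases that genuinely scored 0.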

Execution Flow

[Diagram: refmatch execution graph]

Input

Sample input:

json
{
  "case_id": "bitcoin-economics-medium",
  "input": "What is Bitcoin and how does it work as a currency?",
  "output": "Bitcoin is a decentralised digital currency created in 2009 by the pseudonymous Satoshi Nakamoto. Unlike traditional currencies issued by central banks, Bitcoin operates on a peer-to-peer network with no central authority. ....",
  "reference": "Bitcoin is a decentralised digital currency launched in 2009, using blockchain technology and proof-of-work mining to verify transactions without a central authority. Its supply is capped at 21 million coins."
}

Fields used by the refmatch node:

  • output: candidate text to score
  • reference: target text to compare against

Fields not used for scoring in this node:

  • input
  • case_id (used for report identity, not metric computation)
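
As an illustration of which fields matter, the sketch below loads a case record and keeps only the two scoring fields. The record text is shortened for the example and is not the full sample above; `scoring_fields` is a hypothetical helper, and stripping whitespace is an assumption about what "normalized" means here.

```python
import json

# Shortened illustrative record, not the full sample from this page.
record_json = """
{
  "case_id": "bitcoin-economics-medium",
  "input": "What is Bitcoin and how does it work as a currency?",
  "output": "Bitcoin is a decentralised digital currency created in 2009.",
  "reference": "Bitcoin is a decentralised digital currency launched in 2009."
}
"""


def scoring_fields(record):
    """Return the fields that feed scoring, or None when the case would be skipped."""
    output = (record.get("output") or "").strip()
    reference = (record.get("reference") or "").strip()
    if not reference:
        # Missing or blank reference: nothing to compare against.
        return None
    return {"output": output, "reference": reference}


record = json.loads(record_json)
print(scoring_fields(record))
print(scoring_fields({"case_id": "no-ref", "output": "some answer"}))  # None
```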

Output

Primary output type is RefmatchMetrics (ng_core/types.py).

  • metrics: list[MetricResult]
  • cost: CostEstimate

Example output:

json
{
  "metrics": [
    {
      "name": "rouge1",
      "category": "output|generation|answer",
      "score": 0.7063,
      "result": null,
      "error": null
    },
    {
      "name": "rouge2",
      "category": "output|generation|answer",
      "score": 0.4921,
      "result": null,
      "error": null
    },
    {
      "name": "rougeL",
      "category": "output|generation|answer",
      "score": 0.6554,
      "result": null,
      "error": null
    },
    {
      "name": "bleu",
      "category": "output|generation|answer",
      "score": 0.3712,
      "result": null,
      "error": null
    },
    {
      "name": "meteor",
      "category": "output|generation|answer",
      "score": 0.5987,
      "result": null,
      "error": null
    }
  ],
  "cost": {
    "cost": 0.0,
    "input_tokens": null,
    "output_tokens": null
  }
}

Attribute meaning:

  • metrics: five results when reference is present, empty list when skipped
  • name: metric identifier (rouge1, rouge2, rougeL, bleu, meteor)
  • category: output|generation|answer
  • score: metric value in [0,1] (higher means more overlap with the reference)
  • result: unused for lexical metrics (null)
  • error: null on success; populated only if a metric-level failure occurs
  • cost.cost: always 0.0
  • cost.input_tokens, cost.output_tokens: always null (no LLM usage)
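
A downstream consumer might read this structure along the following lines. The sketch assumes the JSON shape shown in the example output above; `summarize` is a hypothetical helper, and the embedded payload is abbreviated to the fields the helper touches.

```python
import json

# Abbreviated copy of the example output above (category and result omitted).
result_json = """
{
  "metrics": [
    {"name": "rouge1", "score": 0.7063, "error": null},
    {"name": "rouge2", "score": 0.4921, "error": null},
    {"name": "rougeL", "score": 0.6554, "error": null},
    {"name": "bleu", "score": 0.3712, "error": null},
    {"name": "meteor", "score": 0.5987, "error": null}
  ],
  "cost": {"cost": 0.0, "input_tokens": null, "output_tokens": null}
}
"""


def summarize(result):
    """Split metrics into a name-to-score map and a list of failed metric names."""
    scores = {m["name"]: m["score"] for m in result["metrics"] if m["error"] is None}
    failed = [m["name"] for m in result["metrics"] if m["error"] is not None]
    return scores, failed


result = json.loads(result_json)
scores, failed = summarize(result)
skipped = not result["metrics"]  # an empty list means the node skipped the case

print(scores["rouge1"], scores["bleu"])  # 0.7063 0.3712
print(failed, skipped, result["cost"]["cost"])  # [] False 0.0
```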

Usage

bash
OUTPUT_DIR=./out/refmatch
mkdir -p "$OUTPUT_DIR"

CLI: Estimate Cost

bash
nexagauge estimate refmatch \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/refmatch_estimate.txt"

The estimate command supports --input and --limit. It does not expose a native --output-dir option, so the example above pipes the output through tee to save it under OUTPUT_DIR.

CLI: Run Evaluation

bash
nexagauge run refmatch \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5

For full per-case report JSON across all branches:

bash
nexagauge run eval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5