GEval Score (`geval`)

Overview

geval is the scoring stage of nexa-gauge’s GEval branch. It applies the “LLM-as-a-judge with explicit evaluation steps” pattern from G-Eval (arXiv:2303.16634): evaluate output quality against structured, metric-specific steps rather than relying only on lexical overlap metrics.

In the paper, the key idea is to improve human alignment by using evaluation criteria plus concrete intermediate steps. nexa-gauge operationalizes this in two phases: geval_steps resolves steps (from provided evaluation_steps or generated from criteria), then geval scores each metric using those resolved steps with the judge model.

This node is useful when you need rubric-driven answer quality scoring across custom dimensions like concept coverage, procedural correctness, reference alignment, and other task-specific checks. Each metric is scored independently and returns a machine-readable pass/fail result, with the judge's reasoning attached when include_reasoning is enabled.

geval is an answer-quality metric. It does not perform claim extraction, grounding support checks, or reference n-gram similarity. It consumes already-resolved GEval metric definitions and produces normalized metric outputs and token/cost accounting.

Use Case

Use geval when you want customizable, rubric-based evaluation of generated answers.

Evaluate domain-specific criteria not covered by generic metrics
Mix explicit evaluation_steps with criteria-generated steps
Run consistent grading for QA, RAG, assistant responses, and summarization
Add interpretable per-metric reasoning and pass/fail signals
Track score plus token/cost usage for evaluation governance

Node Overview

In nexa-gauge, geval is the scoring node after geval_steps.

Branch: scan -> geval_steps -> geval
scan normalizes input record fields into typed Inputs
geval_steps builds resolved_steps:
- pass-through for metrics with provided evaluation_steps
- generated/cache-loaded steps for metrics with only criteria
geval scores each resolved metric using native GEval scoring (no DeepEval dependency)

Per-node scoring controls (set once on the geval block, applied uniformly to every metric in geval.metrics):

scoring_mode: binary_yes_no (default) or scale_1_5
include_reasoning: false (default) or true

These knobs live at the node-config level, not per metric, so all metrics in a single geval block share the same scoring scale and reasoning behavior. If you need different modes for different metrics, run them in separate cases.

Scoring behavior by mode:

binary_yes_no (default): the judge returns a numeric score in {0, 1} (1 = yes, 0 = no); the logprob-weighted expected score is normalized to [0, 1]. This mode gives the cheapest output and a strict pass/fail rubric.
scale_1_5: the judge returns a score from 1 to 5; the same logprob-weighted aggregation path is applied and the result is normalized to [0, 1]. Use it when you want finer differentiation than pass/fail.
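
The logprob-weighted aggregation described above can be sketched as follows. This is a minimal illustration, not the nexa-gauge implementation: the function name, the shape of the token-to-logprob mapping, and the min-max normalization of the 1 to 5 scale are all assumptions made for the example.

```python
import math

# Hypothetical table of valid score values per mode (names taken from this page).
VALID_SCORES = {
    "binary_yes_no": (0, 1),
    "scale_1_5": (1, 2, 3, 4, 5),
}


def expected_normalized_score(token_logprobs, scoring_mode="binary_yes_no"):
    """Probability-weighted expected score, rescaled to [0, 1].

    token_logprobs maps candidate score tokens (e.g. "1", "4") to their
    log-probabilities. Tokens that are not valid scores are ignored.
    """
    valid = VALID_SCORES[scoring_mode]
    low, high = valid[0], valid[-1]
    by_text = {str(score): score for score in valid}

    # Accumulate probability mass per valid score value.
    probs = {}
    for token, logprob in token_logprobs.items():
        score = by_text.get(token.strip())
        if score is not None:
            probs[score] = probs.get(score, 0.0) + math.exp(logprob)

    total = sum(probs.values())
    if total == 0:
        raise ValueError("no valid score tokens found in logprobs")

    # Renormalize over valid tokens, then take the expectation.
    expected = sum(score * p for score, p in probs.items()) / total
    # Assumed min-max normalization: maps [low, high] onto [0, 1].
    return (expected - low) / (high - low)
```

Under these assumptions, a judge that puts equal mass on 4 and 5 in scale_1_5 mode has an expected raw score of 4.5, which normalizes to 0.875.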

Reasoning behavior:

include_reasoning=false (default): score-only schema, smallest output. Result payload omits reasoning and tokens keys entirely.
include_reasoning=true: judge also returns a short rationale. Result includes reasoning and tokens fields.

Validation and skip rules:

Validates required item_fields (input, output, reference, context)
Skips metric with error if any required field is missing
Skips metric with error if resolved steps are empty
Otherwise returns MetricResult with:
- score
- result[0].passed (score >= 0.6)
- result[0].raw_score: int(avg_normalized_step_score), the truncated average of per-step normalized scores (each already in [0,1]). In effect it is 1 only when every step scored a perfect 1.0 and 0 otherwise, regardless of scoring_mode
- result[0].reasoning and result[0].tokens (only when include_reasoning=true)
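
The result fields listed above can be illustrated with a short sketch. It assumes the metric score equals the average of the per-step normalized scores, which this page implies but does not state outright; build_result and its parameters are hypothetical names, not the nexa-gauge API.

```python
METRIC_PASS_THRESHOLD = 0.6  # threshold value stated on this page


def build_result(step_scores, include_reasoning=False, reasoning=None, reasoning_tokens=None):
    """Assemble score and result[0] from per-step normalized scores in [0, 1]."""
    if not step_scores:
        raise ValueError("resolved steps are empty")

    avg = sum(step_scores) / len(step_scores)
    entry = {
        "passed": avg >= METRIC_PASS_THRESHOLD,
        # int() truncates toward zero, so this is 1 only when avg is exactly 1.0.
        "raw_score": int(avg),
    }
    # reasoning and tokens are added only when reasoning is enabled.
    if include_reasoning:
        entry["reasoning"] = reasoning
        entry["tokens"] = reasoning_tokens
    return {"score": avg, "result": [entry]}
```

A score of 0.83 therefore passes the threshold while still reporting raw_score 0, because truncation drops everything below 1.0.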

Execution Flow

Graph

[Diagram: scan -> geval_steps -> geval]

Input

Using your sample input, the geval scoring node ultimately uses:

output text
input text
reference text (only if a metric’s item_fields includes reference)
context text (only if a metric’s item_fields includes context)
geval.metrics[*] indirectly, via geval_steps.resolved_steps

How your sample maps at runtime:

rag_concept_coverage (item_fields: [input, output]): scored
retrieval_pipeline_steps (item_fields: [input, output]): scored
reference_alignment (item_fields: [output, reference]): skipped with error, because sample input does not include reference
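
The mapping above follows from a required-field check per metric. The sketch below reproduces it under stated assumptions: the helper names and the placeholder record text are hypothetical, empty strings are treated as missing, and multiple missing fields are joined with commas (the page only shows the single-field message).

```python
def missing_fields(item_fields, record):
    """Return the required fields that are absent or empty in the record."""
    return [name for name in item_fields if not record.get(name)]


def plan_metrics(metrics, record):
    """Map each metric name to None (will be scored) or a skip error message."""
    plan = {}
    for metric in metrics:
        missing = missing_fields(metric["item_fields"], record)
        if missing:
            plan[metric["name"]] = (
                "Skipped GEval metric due to missing required record fields: "
                + ", ".join(missing)
                + "."
            )
        else:
            plan[metric["name"]] = None
    return plan


# Metric names and item_fields as listed in the sample mapping.
sample_metrics = [
    {"name": "rag_concept_coverage", "item_fields": ["input", "output"]},
    {"name": "retrieval_pipeline_steps", "item_fields": ["input", "output"]},
    {"name": "reference_alignment", "item_fields": ["output", "reference"]},
]
# Placeholder record text; the sample has no reference field.
sample_record = {"input": "example question", "output": "example answer"}

plan = plan_metrics(sample_metrics, sample_record)
```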

Direct node signature in code:

run(resolved_artifacts, output, input, reference, context, scoring_mode, include_reasoning)

scoring_mode and include_reasoning are sourced from the parent Geval config (the graph wiring reads inputs.geval.scoring_mode and inputs.geval.include_reasoning before invoking run). Omitting the geval block from a record falls back to the defaults (binary_yes_no, reasoning off).
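
As a rough illustration of that fallback, the sketch below resolves the two controls from a record represented as a plain dict. The function name, the dict shape, and the ValueError on an unknown mode are assumptions for the example; the actual graph wiring may differ.

```python
ALLOWED_MODES = ("binary_yes_no", "scale_1_5")
DEFAULT_SCORING_MODE = "binary_yes_no"
DEFAULT_INCLUDE_REASONING = False


def resolve_scoring_controls(record):
    """Return (scoring_mode, include_reasoning), applying documented defaults."""
    # A missing or empty geval block falls back to the defaults.
    geval = record.get("geval") or {}
    scoring_mode = geval.get("scoring_mode", DEFAULT_SCORING_MODE)
    include_reasoning = bool(geval.get("include_reasoning", DEFAULT_INCLUDE_REASONING))

    # Rejecting unknown modes is an assumption of this sketch.
    if scoring_mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported scoring_mode: {scoring_mode!r}")
    return scoring_mode, include_reasoning
```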

So geval does not read raw criteria directly. It reads resolved metric artifacts output by geval_steps.

Output

For geval/score.py, the concrete output type is GevalMetrics in nexa_gauge_core/types.py.

metrics: list[MetricResult]
cost: CostEstimate | None

Note: RelevanceMetrics has a similar top-level shape (metrics + cost), but this node returns GevalMetrics.
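
For orientation, the output shape can be approximated with plain dataclasses. This is a sketch inferred from the fields documented on this page, not a copy of nexa_gauge_core/types.py; the real definitions may use different base classes, field types, or defaults.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class CostEstimate:
    cost: float           # summed evaluation cost in USD
    input_tokens: float   # aggregate input tokens from scoring calls
    output_tokens: float  # aggregate output tokens from scoring calls


@dataclass
class MetricResult:
    name: str
    category: str
    score: Optional[float] = None                  # None when skipped or errored
    result: Optional[List[Dict[str, Any]]] = None  # payload for successful evaluations
    error: Optional[str] = None                    # skip/failure reason, if any


@dataclass
class GevalMetrics:
    metrics: List[MetricResult] = field(default_factory=list)
    cost: Optional[CostEstimate] = None
```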

Example output for the sample input when geval is configured with scoring_mode: "scale_1_5" and include_reasoning: true:

json

{
  "metrics": [
    {
      "name": "rag_concept_coverage",
      "category": "output|generation|answer",
      "score": 0.83,
      "result": [
        {
          "passed": true,
          "raw_score": 0,
          "reasoning": "The response explains RAG and contrasts it with fine-tuning, including update cadence and cost tradeoffs.",
          "tokens": 19
        }
      ],
      "error": null
    },
    {
      "name": "retrieval_pipeline_steps",
      "category": "output|generation|answer",
      "score": 0.66,
      "result": [
        {
          "passed": true,
          "raw_score": 0,
          "reasoning": "It covers retrieval and context injection, but caveats about hallucinations are only partially explicit.",
          "tokens": 18
        }
      ],
      "error": null
    },
    {
      "name": "reference_alignment",
      "category": "output|generation|answer",
      "score": null,
      "result": null,
      "error": "Skipped GEval metric due to missing required record fields: reference."
    }
  ],
  "cost": {
    "cost": 0.00102,
    "input_tokens": 312.0,
    "output_tokens": 74.0
  }
}

With the defaults (binary_yes_no + include_reasoning: false) the same metric would emit:

json

{
  "name": "rag_concept_coverage",
  "category": "output|generation|answer",
  "score": 1.0,
  "result": [{"passed": true, "raw_score": 1}],
  "error": null
}

Note the absence of reasoning and tokens in the result payload. Here raw_score is 1 because the (single) step's normalized score was exactly 1.0.

Attribute meanings:

metrics: one MetricResult per resolved GEval metric
name: metric name from GEval config
category: always output|generation|answer for this node
score: normalized metric score in [0, 1] when evaluated; null when skipped/error
result: list payload for successful evaluations
result[].passed: boolean thresholded by METRIC_PASS_THRESHOLD (0.6)
result[].raw_score: int(avg_normalized_step_score), the truncated average of per-step normalized scores; 1 only when every step scored 1.0, otherwise 0, regardless of scoring_mode
result[].reasoning: judge explanation text (present only when include_reasoning: true)
result[].tokens: token count of the reasoning text (present only when include_reasoning: true)
error: skip/failure reason for that metric
cost.cost: summed evaluation USD cost
cost.input_tokens / cost.output_tokens: aggregate usage from scoring calls
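
A consumer of this output might group metrics by outcome using the attributes above. The sketch below works on the output parsed into a plain dict (for example via json.load); summarize_geval is a hypothetical helper, and treating a null result as skipped is an assumption.

```python
def summarize_geval(geval_output):
    """Group metric names into passed, failed, and skipped."""
    summary = {"passed": [], "failed": [], "skipped": []}
    for metric in geval_output.get("metrics", []):
        # Skipped metrics carry an error and have no result payload.
        if metric.get("error") is not None or not metric.get("result"):
            summary["skipped"].append(metric["name"])
        elif metric["result"][0]["passed"]:
            summary["passed"].append(metric["name"])
        else:
            summary["failed"].append(metric["name"])
    return summary
```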

Usage

bash

OUTPUT_DIR=./out/geval-score
mkdir -p "$OUTPUT_DIR"

CLI: Estimate Cost

bash

nexagauge estimate geval \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/geval_estimate.txt"

estimate supports --input and --limit; it does not expose --output-dir, so save output into your output directory with tee.

CLI: Run Evaluation

bash

nexagauge run geval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5

If you want per-case report JSON files, run through eval:

bash

nexagauge run eval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5
