GEval Score (geval)
Overview
geval is the scoring stage of nexa-gauge’s GEval branch. It applies the “LLM-as-a-judge with explicit evaluation steps” pattern from G-Eval (arXiv:2303.16634): evaluate output quality against structured, metric-specific steps rather than relying only on lexical overlap metrics.
In the paper, the key idea is to improve human alignment by using evaluation criteria plus concrete intermediate steps. nexa-gauge operationalizes this in two phases: geval_steps resolves steps (from provided evaluation_steps or generated from criteria), then geval scores each metric using those resolved steps with the judge model.
This node is useful when you need rubric-driven answer quality scoring across custom dimensions like concept coverage, procedural correctness, reference alignment, and other task-specific checks. Each metric is scored independently, with machine-readable pass/fail reasoning attached per metric result.
geval is an answer-quality metric. It does not perform claim extraction, grounding support checks, or reference n-gram similarity. It consumes already-resolved GEval metric definitions and produces normalized metric outputs and token/cost accounting.
Use Case
Use geval when you want customizable, rubric-based evaluation of generated answers.
- Evaluate domain-specific criteria not covered by generic metrics
- Mix explicit evaluation_steps with criteria-generated steps
- Run consistent grading for QA, RAG, assistant responses, and summarization
- Add interpretable per-metric reasoning and pass/fail signals
- Track score plus token/cost usage for evaluation governance
Node Overview
In nexa-gauge, geval is the scoring node after geval_steps.
- Branch: scan -> geval_steps -> geval
- scan normalizes input record fields into typed Inputs
- geval_steps builds resolved_steps:
  - pass-through for metrics with provided evaluation_steps
  - generated/cache-loaded steps for metrics with only criteria
- geval scores each resolved metric using native GEval scoring (no DeepEval dependency)
Per-node scoring controls (set once on the geval block, applied uniformly to every metric in geval.metrics):
- scoring_mode: binary_yes_no (default) or scale_1_5
- include_reasoning: false (default) or true
These knobs live at the node-config level, not per-metric — so all metrics in a single geval block share the same scoring scale and reasoning behavior. If you need different modes for different metrics, run them in separate cases.
Scoring behavior by mode:
- binary_yes_no (default): judge returns numeric score in {0,1} (1=yes, 0=no); logprob-weighted expected score is normalized to [0,1]. Cheaper output and a strict pass/fail rubric.
- scale_1_5: judge returns score in [1..5]; the same logprob-weighted aggregation path is applied and normalized to [0,1]. Use when you want sharper differentiation than pass/fail.
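The logprob-weighted aggregation described above can be sketched as follows. This is a minimal illustration, not nexa-gauge's actual code: the helper name `expected_normalized_score`, the shape of the `token_logprobs` input, and the min-max normalization `(expected - lo) / (hi - lo)` are all assumptions made for the example.

```python
import math

# Assumed score ranges per mode (taken from the mode descriptions above).
SCORE_RANGES = {"binary_yes_no": (0, 1), "scale_1_5": (1, 5)}


def expected_normalized_score(
    token_logprobs: dict[str, float],
    scoring_mode: str = "binary_yes_no",
) -> float:
    """Hypothetical helper: probability-weighted score mapped to [0, 1]."""
    lo, hi = SCORE_RANGES[scoring_mode]
    valid = {str(s): s for s in range(lo, hi + 1)}

    # Keep only candidate tokens that are valid scores for this mode.
    weights: dict[int, float] = {}
    for token, logprob in token_logprobs.items():
        text = token.strip()
        if text in valid:
            score = valid[text]
            weights[score] = weights.get(score, 0.0) + math.exp(logprob)

    total = sum(weights.values())
    if total == 0:
        raise ValueError("no valid score tokens found in logprobs")

    # Renormalize over valid tokens, then take the expectation.
    expected = sum(score * p for score, p in weights.items()) / total
    # Assumed normalization: min-max scaling of the expected score.
    return (expected - lo) / (hi - lo)
```

Under these assumptions, a judge that puts 80% of its probability on `1` in binary mode yields a score near 0.8, while equal mass on `4` and `5` in scale_1_5 mode yields 0.875.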
Reasoning behavior:
- include_reasoning=false (default): score-only schema, smallest output. Result payload omits reasoning and tokens keys entirely.
- include_reasoning=true: judge also returns a short rationale. Result includes reasoning and tokens fields.
Validation and skip rules:
- Validates required item_fields (input, output, reference, context)
- Skips metric with error if any required field is missing
- Skips metric with error if resolved steps are empty
- Otherwise returns MetricResult with:
  - score
  - result[0].passed (score >= 0.6)
  - result[0].raw_score: int(avg_normalized_step_score), the truncated average of per-step normalized scores (each already in [0,1]); effectively 1 only when every step scored a perfect 1.0, otherwise 0, regardless of scoring_mode
  - result[0].reasoning and result[0].tokens (only when include_reasoning=true)
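The thresholding and truncation rules above are easy to misread, so here is a small sketch of how they interact. It assumes the metric score is the plain average of the per-step normalized scores; the function name `summarize_steps` and the empty-steps error text are hypothetical, and only the 0.6 threshold and the `int(...)` truncation come from this document.

```python
METRIC_PASS_THRESHOLD = 0.6  # threshold value stated in this document


def summarize_steps(step_scores: list[float]) -> dict:
    """Hypothetical helper mirroring the rules listed above."""
    if not step_scores:
        # Empty resolved steps: the metric is skipped with an error.
        return {"score": None, "result": None,
                "error": "resolved steps are empty"}  # illustrative wording

    # Assumption: the metric score is the mean of per-step scores in [0, 1].
    avg = sum(step_scores) / len(step_scores)
    return {
        "score": avg,
        "result": [{
            "passed": avg >= METRIC_PASS_THRESHOLD,
            # int() truncates toward zero, so this is 1 only when avg == 1.0.
            "raw_score": int(avg),
        }],
        "error": None,
    }
```

With step scores of 1.0 and 0.66, for instance, the metric passes with a score near 0.83 while raw_score is still 0, which matches the example output shown later on this page.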
Execution Flow
Input
Using your sample input, the geval scoring node ultimately uses:
- output text
- input text
- reference text (only if a metric’s item_fields includes reference)
- context text (only if a metric’s item_fields includes context)
- geval.metrics[*] indirectly, via geval_steps.resolved_steps
How your sample maps at runtime:
- rag_concept_coverage (item_fields: [input, output]): scored
- retrieval_pipeline_steps (item_fields: [input, output]): scored
- reference_alignment (item_fields: [output, reference]): skipped with error, because sample input does not include reference
Direct node signature in code:
run(resolved_artifacts, output, input, reference, context, scoring_mode, include_reasoning)
scoring_mode and include_reasoning are sourced from the parent Geval config (the graph wiring reads inputs.geval.scoring_mode and inputs.geval.include_reasoning before invoking run). Omitting the geval block from a record falls back to the defaults (binary_yes_no, reasoning off).
So geval does not read raw criteria directly. It reads resolved metric artifacts output by geval_steps.
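The default fallback for the two scoring controls can be pictured with a short sketch. It assumes the record is a plain dict with an optional `geval` key; the `GevalControls` class and `controls_from_record` function are hypothetical names for illustration, not part of nexa-gauge.

```python
from dataclasses import dataclass

VALID_MODES = {"binary_yes_no", "scale_1_5"}


@dataclass(frozen=True)
class GevalControls:
    """Hypothetical container for the node-level scoring controls."""
    scoring_mode: str = "binary_yes_no"
    include_reasoning: bool = False


def controls_from_record(record: dict) -> GevalControls:
    """Read the controls from a record, falling back to the defaults."""
    geval = record.get("geval") or {}  # a missing block means all defaults
    mode = geval.get("scoring_mode", "binary_yes_no")
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported scoring_mode: {mode!r}")
    return GevalControls(
        scoring_mode=mode,
        include_reasoning=bool(geval.get("include_reasoning", False)),
    )
```

Because both values are resolved once per geval block, every metric in that block would see the same `GevalControls` instance in this sketch.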
Output
For geval/score.py, the concrete output type is GevalMetrics in nexa_gauge_core/types.py.
- metrics: list[MetricResult]
- cost: CostEstimate | None
Note: RelevanceMetrics has a similar top-level shape (metrics + cost), but this node returns GevalMetrics.
Example output for the sample input when geval is configured with scoring_mode: "scale_1_5" and include_reasoning: true:
{
  "metrics": [
    {
      "name": "rag_concept_coverage",
      "category": "output|generation|answer",
      "score": 0.83,
      "result": [
        {
          "passed": true,
          "raw_score": 0,
          "reasoning": "The response explains RAG and contrasts it with fine-tuning, including update cadence and cost tradeoffs.",
          "tokens": 19
        }
      ],
      "error": null
    },
    {
      "name": "retrieval_pipeline_steps",
      "category": "output|generation|answer",
      "score": 0.66,
      "result": [
        {
          "passed": true,
          "raw_score": 0,
          "reasoning": "It covers retrieval and context injection, but caveats about hallucinations are only partially explicit.",
          "tokens": 18
        }
      ],
      "error": null
    },
    {
      "name": "reference_alignment",
      "category": "output|generation|answer",
      "score": null,
      "result": null,
      "error": "Skipped GEval metric due to missing required record fields: reference."
    }
  ],
  "cost": {
    "cost": 0.00102,
    "input_tokens": 312.0,
    "output_tokens": 74.0
  }
}
With the defaults (binary_yes_no + include_reasoning: false) the same metric would emit:
{
  "name": "rag_concept_coverage",
  "category": "output|generation|answer",
  "score": 1.0,
  "result": [{"passed": true, "raw_score": 1}],
  "error": null
}
Note the absence of reasoning and tokens in the result payload. Here raw_score: 1 because the (single) step's normalized score was exactly 1.0.
Attribute meanings:
- metrics: one MetricResult per resolved GEval metric
- name: metric name from GEval config
- category: always output|generation|answer for this node
- score: normalized metric score in [0, 1] when evaluated; null when skipped/error
- result: list payload for successful evaluations
- result[].passed: boolean thresholded by METRIC_PASS_THRESHOLD (0.6)
- result[].raw_score: int(avg_normalized_step_score), the truncated average of per-step normalized scores; 1 only when every step scored 1.0, otherwise 0, regardless of scoring_mode
- result[].reasoning: judge explanation text, present only when include_reasoning: true
- result[].tokens: token count of the reasoning text, present only when include_reasoning: true
- error: skip/failure reason for that metric
- cost.cost: summed evaluation USD cost
- cost.input_tokens / cost.output_tokens: aggregate usage from scoring calls
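For downstream consumers, the attributes above are enough to bucket metrics into passed, failed, and skipped. The sketch below works on the JSON shape shown in the example output; the `summarize` function is a hypothetical helper, and how you obtain the payload (file path, API call) is left open because this page does not specify it.

```python
def summarize(payload: dict) -> dict:
    """Hypothetical helper: bucket metric names by outcome."""
    passed, failed, skipped = [], [], []
    for metric in payload.get("metrics", []):
        result = metric.get("result")
        if metric.get("error") or not result:
            # Skipped metrics carry an error and a null score/result.
            skipped.append(metric["name"])
        elif result[0]["passed"]:
            passed.append(metric["name"])
        else:
            failed.append(metric["name"])

    # cost may be null, so guard before reading the USD total.
    cost = (payload.get("cost") or {}).get("cost")
    return {"passed": passed, "failed": failed,
            "skipped": skipped, "cost_usd": cost}
```

Note that the check relies on `passed` rather than `raw_score`, since `raw_score` is 0 for any average below 1.0 even when the metric passes.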
Usage
OUTPUT_DIR=./out/geval-score
mkdir -p "$OUTPUT_DIR"
CLI: Estimate Cost
nexagauge estimate geval \
--input ./sample.json \
--limit 5 \
| tee "$OUTPUT_DIR/geval_estimate.txt"
estimate supports --input and --limit; it does not expose --output-dir, so save output into your output directory with tee.
CLI: Run Evaluation
nexagauge run geval \
--input ./sample.json \
--output-dir "$OUTPUT_DIR" \
--limit 5
If you want per-case report JSON files, run through eval:
nexagauge run eval \
--input ./sample.json \
--output-dir "$OUTPUT_DIR" \
--limit 5