Hugging Face Data
Overview
nexa-gauge can read datasets from Hugging Face with hf://<dataset-id> sources. Rows from the selected split are treated like local records and normalized with the same field aliases.
Install the optional dependency first:

    pip install "nexa-gauge[huggingface]"

Basic Usage

    nexagauge estimate eval \
      --input hf://<dataset_id> \
      --limit 10

    nexagauge run eval \
      --input hf://<dataset_id> \
      --limit 10 \
      --output-dir ./report

In auto adapter mode, the Hugging Face adapter is selected whenever the input starts with hf://.
Adapter Options
| Option | Purpose |
|---|---|
| --input hf://<dataset-id> | Hugging Face dataset source. |
| --adapter huggingface | Force the Hugging Face adapter instead of auto-detecting. |
| --hf-config <name> | Optional dataset config name. |
| --hf-revision <rev> | Optional revision, tag, branch, or commit. |
| --split <name> | Dataset split for estimate. Default is train. |
| --limit <n> | Maximum number of rows to process. |
| --start <n> / --end <n> | Process a deterministic row slice. |
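The options in the table line up with parameters of the Hugging Face `datasets.load_dataset` function (`path`, `name`, `revision`, `split`). The sketch below shows one plausible translation. It is an assumption about how such an adapter could work, not nexa-gauge's actual code, and the helper name `hf_load_kwargs` is hypothetical.

```python
from typing import Optional

HF_PREFIX = "hf://"


def hf_load_kwargs(
    source: str,
    config: Optional[str] = None,
    revision: Optional[str] = None,
    split: str = "train",
) -> dict:
    """Translate an hf:// source and CLI-style options into load_dataset kwargs.

    Hypothetical helper: the mapping to `datasets.load_dataset` is an assumption.
    """
    if not source.startswith(HF_PREFIX):
        raise ValueError(f"not a Hugging Face source: {source!r}")
    dataset_id = source[len(HF_PREFIX):]
    if not dataset_id:
        raise ValueError("missing dataset id after hf://")

    kwargs = {"path": dataset_id, "split": split}
    if config is not None:
        kwargs["name"] = config  # --hf-config
    if revision is not None:
        kwargs["revision"] = revision  # --hf-revision
    return kwargs


kwargs = hf_load_kwargs("hf://hotpotqa/hotpot_qa", config="distractor", revision="main")
print(kwargs)

# Fetching rows needs the `datasets` package and network access:
# from datasets import load_dataset
# rows = load_dataset(**kwargs)
```

Building the keyword arguments separately from the download keeps the translation easy to inspect without touching the network.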
Example with a config and revision:

    nexagauge estimate eval \
      --input hf://<dataset_id> \
      --adapter huggingface \
      --hf-config default \
      --hf-revision main \
      --limit 25

Row Schema
Hugging Face rows must expose the same fields or aliases as local data.
| Purpose | Accepted field names |
|---|---|
| Case ID | case_id, id |
| Generation | output, generation, response, answer, completion |
| Question | input, query, prompt |
| Context | context, contexts, documents |
| Reference | reference, ground_truth, gold_answer, label |
| GEval config | geval |
| Redteam config | redteam |
Note: Aliases are normalised to the canonical field name in the output. If your input row uses answer, the metrics output will refer to it as output; query or prompt becomes input; ground_truth / gold_answer / label becomes reference; contexts / documents becomes context; id becomes case_id. Don't be surprised when the input key you supplied isn't the key you see in the output JSON: the first name listed in each row of the table is the one nexa-gauge reports.
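The normalisation described in the note can be pictured as a lookup from each canonical name to its accepted aliases. This is a minimal sketch built from the Row Schema table only; `normalize_row` is a hypothetical name, and the order in which aliases are tried when a row carries several of them is an assumption.

```python
# Alias table transcribed from the Row Schema table; the first name is canonical.
ALIASES = {
    "case_id": ["case_id", "id"],
    "output": ["output", "generation", "response", "answer", "completion"],
    "input": ["input", "query", "prompt"],
    "context": ["context", "contexts", "documents"],
    "reference": ["reference", "ground_truth", "gold_answer", "label"],
}


def normalize_row(row: dict) -> dict:
    """Return a copy of the data fields keyed by canonical name (illustrative only)."""
    normalized = {}
    for canonical, names in ALIASES.items():
        for name in names:
            if name in row:
                # Assumption: the first matching alias in table order is used.
                normalized[canonical] = row[name]
                break
    return normalized


row = {
    "id": "q-1",
    "query": "Who wrote Dune?",
    "answer": "Frank Herbert",
    "ground_truth": "Frank Herbert",
}
print(normalize_row(row))
```

Here the row's answer column surfaces as output and ground_truth as reference, which matches the renaming the note warns about.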
Custom column mappings with --field
When a Hugging Face dataset uses column names that aren't in the table above, point nexa-gauge at them with the --field LOGICAL=COLUMN flag instead of preprocessing the dataset. The flag is repeatable, so map as many fields as you need in a single invocation:

    nexagauge run relevance \
      --input hf://<dataset_id> \
      --field output=text \
      --field input=q

In this example, the row column text is treated as the output, and q is treated as the input. Everything downstream, including chunking, claim extraction, refinement, metric scoring, the cache fingerprint, and the JSON output, uses the canonical names (output, input, …), so two runs of the same content produce the same cache key whether the dataset uses text, answer, or output.
Allowed logical keys: case_id, output, input, reference, context, geval, redteam, refalign. The first five cover the row data fields shown above; the last three (geval, redteam, refalign) map a column to the corresponding metric config block. Anything else fails fast with a list of valid options.
Precedence: if a row carries both the canonical name and your user-mapped column (e.g. an empty output field plus a populated text), the explicit --field mapping wins. This is intentional — you asked for it.
Mirrored on nexagauge estimate: the same --field option works for cost estimation, so the mapping doesn't need to change between estimate and run.
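The precedence rule can be illustrated with a small function that applies a mapping to one row. This is a sketch under assumptions, not nexa-gauge's implementation: `apply_field_mapping` is a hypothetical name, and whether the mapped source column is kept or dropped afterwards is not specified above (this sketch keeps it).

```python
def apply_field_mapping(row: dict, mapping: dict) -> dict:
    """Apply {logical_key: column_name} to a row, e.g. {"output": "text"}."""
    result = dict(row)
    for logical, column in mapping.items():
        if column in row:
            # The explicit mapping wins, even over an existing canonical field.
            result[logical] = row[column]
    return result


# A row with an empty canonical output and a populated user column.
row = {"output": "", "text": "Paris is the capital of France.", "q": "Capital of France?"}
mapped = apply_field_mapping(row, {"output": "text", "input": "q"})
print(mapped["output"])
print(mapped["input"])
```

The empty output value is overwritten by the contents of text, which is the behaviour the precedence rule describes.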
Validation errors you might see:
- Invalid --field value 'foo'. Expected 'LOGICAL=COLUMN'. (The = is missing.)
- Unknown logical key 'gen' in field mapping. Allowed: case_id, context, geval, input, output, redteam, refalign, reference. (Typo in the canonical key: use output, not gen.)
- --field: duplicate mapping for 'output', last value 'X' wins. (Warning only; the last --field for a logical key takes effect.)
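To make the three cases concrete, here is a hedged sketch of a parser that would produce messages of this shape. The function name `parse_field_args` and the use of `ValueError` and `warnings.warn` are assumptions for illustration; only the message texts and the allowed keys come from this page.

```python
import warnings

ALLOWED_KEYS = {
    "case_id", "context", "geval", "input",
    "output", "redteam", "refalign", "reference",
}


def parse_field_args(values: list[str]) -> dict[str, str]:
    """Parse repeated LOGICAL=COLUMN strings into a mapping (illustrative only)."""
    mapping: dict[str, str] = {}
    for value in values:
        logical, sep, column = value.partition("=")
        if not sep or not logical or not column:
            raise ValueError(
                f"Invalid --field value '{value}'. Expected 'LOGICAL=COLUMN'."
            )
        if logical not in ALLOWED_KEYS:
            allowed = ", ".join(sorted(ALLOWED_KEYS))
            raise ValueError(
                f"Unknown logical key '{logical}' in field mapping. Allowed: {allowed}."
            )
        if logical in mapping:
            # Warning only: the later value replaces the earlier one.
            warnings.warn(
                f"--field: duplicate mapping for '{logical}', last value '{column}' wins."
            )
        mapping[logical] = column
    return mapping


print(parse_field_args(["output=text", "input=q"]))
```

Failing fast on an unknown key, rather than ignoring it, means a typo such as gen surfaces before any rows are processed.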
If a dataset does not already include generated outputs, precompute model responses into an output-like field before running nexa-gauge.
Reshape nested structures with @register_transform
--field handles flat column-to-column renames. Some datasets have nested structures that no single column maps to: the context field of hotpotqa/hotpot_qa, for example, is {title: list[str], sentences: list[list[str]]}, not a string or list of strings. For these, decorate a small Python function with @register_transform("name") and point the CLI at it:

    # my_transforms.py
    from ng_core import register_transform


    @register_transform("hotpot_qa")
    def hotpot_qa(record: dict) -> dict:
        # HotpotQA stores context as parallel lists of titles and sentence lists.
        ctx = record.get("context") or {}
        titles = ctx.get("title") or []
        sentences = ctx.get("sentences") or []
        # Join each title with its sentences into one paragraph string.
        paragraphs = [
            f"{title}\n{' '.join(sents)}"
            for title, sents in zip(titles, sentences)
        ]
        return {
            "case_id": record.get("id"),
            # HotpotQA names this column "question", which is not a built-in alias.
            "input": record.get("question", ""),
            "output": record.get("answer", ""),
            "context": paragraphs,
            "reference": record.get("answer", ""),
        }

Then run it:

    nexagauge run eval \
      --input hf://hotpotqa/hotpot_qa \
      --hf-config distractor \
      --extension-file ./my_transforms.py \
      --transform hotpot_qa \
      --limit 10 \
      --output-dir ./report

The transform runs once per record, before the scanner, and produces a dict in nexa-gauge's canonical shape. Allowed output keys: case_id, input, output, context, reference. The same flags work with nexagauge estimate.
Note: geval and redteam are nexa-gauge metric configs, not dataset data, so don't construct them in a transform. Configure them on the record directly.
--extension-file is repeatable, so you can load several files of registered functions in one invocation; --transform then picks which one to apply. You can also compose with --field — the transform reshapes structure first, then --field renames columns on the result.
See Extensions for the full reference (contract, error model, composition rules — and the home for future extension types like prompts).
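Before pointing the CLI at a transform, it can help to exercise the reshaping logic on a hand-built record. The sketch below copies only the paragraph-building step from the hotpot_qa example into a standalone function, so it runs without ng_core. The sample record is made up for illustration and `reshape_hotpot_context` is a hypothetical name.

```python
def reshape_hotpot_context(context: dict) -> list[str]:
    """Turn {title: [...], sentences: [[...], ...]} into one string per paragraph."""
    titles = context.get("title") or []
    sentences = context.get("sentences") or []
    return [
        f"{title}\n{' '.join(sents)}"
        for title, sents in zip(titles, sentences)
    ]


# Made-up record in the nested shape described above.
sample_context = {
    "title": ["Paris", "Lyon"],
    "sentences": [
        ["Paris is the capital of France.", "It lies on the Seine."],
        ["Lyon is a city in France."],
    ],
}

paragraphs = reshape_hotpot_context(sample_context)
for paragraph in paragraphs:
    print(paragraph)
    print("---")
```

Each paragraph is a plain string, so the result fits the list-of-strings shape that the nested structure did not have.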
Metric Activation
The same activation rules apply to Hugging Face rows:
- output is required for chunking, refinement, claims, redteam, and most metrics.
- input activates relevance.
- context activates grounding.
- reference activates refmatch (lexical overlap) and refalign (semantic similarity).
- geval activates geval_steps and geval.
- redteam adds or overrides custom redteam rubrics.
For the complete table, see Data Schema.
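As a rough mental model, the activation rules can be read as a function from the fields present on a record to the names that become active. This is a simplified sketch, not nexa-gauge's logic: `active_metrics` is a hypothetical name, it treats output as a precondition for everything (the rules above only say "most metrics"), and it assumes redteam runs whenever output is present.

```python
def active_metrics(record: dict) -> list[str]:
    """Return the names activated by a record's fields (simplified reading)."""
    if not record.get("output"):
        # Simplification: without output, nothing is scored in this sketch.
        return []

    active = ["redteam"]  # assumption: runs on output, with custom rubrics optional
    if record.get("input"):
        active.append("relevance")
    if record.get("context"):
        active.append("grounding")
    if record.get("reference"):
        active += ["refmatch", "refalign"]
    if record.get("geval"):
        active += ["geval_steps", "geval"]
    return active


record = {
    "output": "Frank Herbert wrote Dune.",
    "input": "Who wrote Dune?",
    "reference": "Frank Herbert",
}
print(active_metrics(record))
```

A record without context simply skips grounding here; the Data Schema page remains the authoritative source for the full rules.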
Common Runs
Run relevance on a small slice:

    nexagauge run relevance \
      --input hf://sentence-transformers/natural-questions \
      --limit 2 \
      --output-dir ./data/hg_exp_relevance

Run grounding on rows that include context:

    nexagauge run grounding \
      --input hf://wandb/RAGTruth-processed \
      --limit 3 \
      --output-dir ./data/hg_exp_grounding

Run redteam on rows whose text lives in a text column, mapped to output with --field:

    nexagauge run redteam \
      --input hf://mteb/toxic_conversations_50k \
      --field output=text \
      --limit 3 \
      --output-dir ./data/hg_exp_toxicity