SEVA: Self-Evolving Verification Agent with
Process Reward for Fact Attribution
Abstract
Hallucination is the reliability bottleneck for LLM-based agents, and fact attribution verifiers are the last line of defense — yet today’s verifiers emit only opaque binary labels, leaving agents unable to self-correct and operators unable to audit. We present Seva, a structured verification agent that emits evidence alignments, step-by-step reasoning chains, calibrated confidence, and a six-category error diagnosis with actionable fixes. Training such an agent with RL is non-trivial: standard binary reward on multi-component output triggers advantage collapse — within-group reward variance vanishes and the GRPO gradient disappears. We resolve this with a process reward that decomposes verification quality into five independent components weighted toward process signals, restoring the gradient and inducing an implicit curriculum — the agent first masters verification behavior (alignment , format ), then outcomes (F1 ). Structured output further enables a VerifyReflectProbeRefine self-evolution loop, which over four rounds on a 7B model surfaces an unexpected structural finding: each round produces a benchmark-specialist, not a generalist ( pp on HaluEval, to pp on TruthfulQA in the same model, persistent at data). On ClearFacts, Seva-3B matches GPT-4o-mini (69.0 vs. 69.8 F1) while producing substantially richer, auditable output — confirming a principle that should generalize: for any RL task with multi-component generation, reward granularity must match output granularity.
1 Introduction
Despite rapid progress in LLM capabilities, hallucination remains a fundamental barrier to deploying agents in high-stakes domains such as finance, law, and healthcare (Min et al., 2023). Fact attribution verifiers — models that judge whether each claim in an agent’s output is supported by its source documents — have emerged as a critical safety layer (Gao et al., 2023; Tian et al., 2024). Systems like MiniCheck (Tang et al., 2024) and ClearCheck (Seo et al., 2025) achieve strong accuracy on this task, but they share a fundamental limitation: they produce only a binary label.
This opacity creates two problems for agents in the wild. First, when a verifier flags a claim, the agent has no basis for self-correction — it knows something is wrong, but not whether a percentage was inflated, an entity swapped, or a qualifier fabricated. Second, no human operator can meaningfully audit the decision, because the reasoning behind the label is invisible. In safety-critical deployments, an uninterpretable verifier undermines the very trust it is meant to provide.
We introduce Seva (Self-Evolving Verification Agent), which addresses both problems by producing structured verification output: evidence alignment spans that ground every judgment in specific text, reasoning chains that trace the logic step by step, and error diagnoses from a six-category taxonomy with actionable fix suggestions. This structured output serves a dual purpose: it makes verification auditable for deployment, and it provides a diagnostic interface for training.
How should such an agent be trained? SFT on teacher-annotated data provides a reasonable starting point, but reinforcement learning — which has driven substantial gains for mathematical reasoning (Shao et al., 2024; Zha et al., 2025) and hallucination reduction (Li et al., 2026) — does not straightforwardly transfer. Applying GRPO (Shao et al., 2024) with binary reward (1 if the label matches, 0 otherwise) to our structured verifier, we find that training stalls entirely: the policy makes no progress beyond SFT across all 350 steps. The culprit is that binary reward compresses all verification quality into a single bit. A response with correct reasoning but the wrong label receives the same score — zero — as one that produces unparseable garbage. In a GRPO group of responses, most therefore score 0, advantage spread contracts to , and the gradient vanishes.
Our response is a process reward function mapping each structured response (against gold ) to a continuous score over five independent components plus a calibration term:
| (1) |
with weights , , , and an asymmetric calibration if else , which rewards calibrated correctness more than it penalizes calibrated error. Each is computed independently from a different region of (App. C), so the components are weakly correlated; this independence is what creates the smooth four-level reward landscape (Fig. 6, Tab. 1) that resolves the collapse. A response with sound reasoning but the wrong label scores rather than , restoring in Eq. 2.
The results confirm that process reward unlocks what binary reward cannot. GRPO lifts alignment quality to 0.997, format compliance to 100%, and F1 to 69.0 (+4.1 over SFT). An implicit curriculum emerges in the training dynamics: the agent masters verification behavior within 150 steps, then spends the remaining 200 steps refining verification outcomes — without any explicit scheduling.
Structured output confers a further advantage: it makes the agent’s failures transparent. When Seva misclassifies a claim, its evidence alignments and error diagnoses pinpoint which error category was missed and where grounding broke down. We channel this diagnostic signal into a VerifyReflectProbeRefine self-evolution loop that generates targeted adversarial data for the agent’s weakest error categories, and iterate it four times on the 7B model. A surprising empirical finding emerges: each round yields a benchmark-specialist rather than a strictly better generalist, and the asymmetric trade-off persists at training-data scale ( samples in Round 4) — confirming that the effect is data-distribution-induced rather than overfitting. This finding is itself only visible because the structured output exposes per-category error dynamics that aggregate accuracy would hide, and it sits uneasily with the monotonic-improvement assumption implicit in Self-Refine / STaR-style self-evolution literature.
Our contributions are threefold.
- 1.
-
2.
A process reward that turns RL on structured output from impossible to possible. We formalize advantage collapse (Prop. 1) as the failure mode of binary reward on multi-component generation, then resolve it with a five-component decomposition (Prop. 2); the resulting reward landscape yields an implicit curriculum — behavior before outcomes — without any explicit scheduling (§2.3, §3.5).
-
3.
A self-evolution loop that reveals a structural property of iterative refinement. VerifyReflectProbeRefine, iterated four rounds on a 7B model, surfaces a finding that contradicts the monotone-improvement assumption of Self-Refine / STaR: each round produces a benchmark-specialist, not a generalist — pp HaluEval, to pp TruthfulQA in the same model, robust at training data, visible only because structured output exposes per-category dynamics (§2.5).
2 Method
2.1 Problem Formulation and Overview
Given a claim and source document , a fact attribution verifier produces a judgment about whether is supported by . Existing verifiers output a single binary label (Tang et al., 2024; Seo et al., 2025). We instead require the agent to produce structured output that makes verification auditable and failures diagnosable (Figure 1).
Building such an agent via SFT is straightforward, but pushing it further with RL is not. We show that GRPO with binary reward fails entirely on this output format (§2.3), and design a process reward that resolves this failure (§2.3). The structured output further enables a self-evolution loop for iterative improvement (§2.5).
2.2 Structured Verification Schema
The output comprises four complementary components:
Evidence alignment : a list of triples mapping claim spans to source spans, with status {match, mismatch, not_found}. Each entry forces the agent to anchor its judgment in specific text rather than forming a holistic impression.
Reasoning chain : step-by-step verification where each step examines a claim part against source evidence, producing a judgment {supported, not_supported, partially_supported} and a natural language explanation.
Label and confidence: a binary label paired with calibrated confidence .
Error diagnosis: when Not Attributable, an error type drawn from a six-category taxonomy (numerical exaggeration, negation flip, scope inflation, temporal shift, entity substitution, fabrication) together with a fix suggestion .
This schema serves two purposes relevant to agents in the wild. First, it makes verification auditable: a human operator can inspect alignments and reasoning to judge whether the verdict is trustworthy. Second, it makes failures diagnosable: when the agent errs, the structured output pinpoints which evidence was mishandled, feeding the self-improvement loop in §2.5.
2.3 From Binary Reward Failure to Process Reward
The failure of binary reward.
Binary reward assigns 1.0 when the predicted label matches the gold label and 0.0 otherwise. For structured output, this produces a degenerate training signal. In a GRPO group of responses: (1) 28% fail JSON parsing — the SFT model produces valid JSON only 72% of the time, and all of these score 0; (2) among valid responses, 35% predict the wrong label, also scoring 0; (3) in a typical group, 5–7 of 8 responses receive zero reward. GRPO computes advantages relative to the group mean. For a group of responses with rewards , the normalized advantage of response is:
| (2) |
When binary reward produces with most , both and are near zero, and for all — the policy gradient vanishes regardless of model parameters.
This failure is structural, not incidental. Increasing group size does not help — the problem is near-uniform scores, not insufficient sampling. And the failure is not specific to verification: any RL task whose output has multiple required components will exhibit advantage collapse under binary reward whenever the model cannot reliably produce all components simultaneously. We formalize the mechanism below; a proof sketch is given in Appendix C.7.
Proposition 1 (Binary-Reward Advantage Collapse).
Let be i.i.d. Bernoulli rewards in a GRPO group of size with success probability . The expected within-group variance is
| (3) |
and as or , , hence
| (4) |
and the expected policy gradient
| (5) |
regardless of model parameters or group size .
Proposition 2 (Process-Reward Variance Lower Bound).
Let be the aggregate process reward with components and positive weights . By the variance identity for linear combinations,
| (6) |
Unless the components are perfectly anti-correlated, the cross-terms cannot drive to zero unless every . In particular, if any single component has and is uncorrelated with the rest,
| (7) |
and the GRPO gradient is non-vanishing.
Why these matter.
Prop. 1 pinpoints the failure mode of binary reward on structured output, and Prop. 2 guarantees that process reward escapes it by construction. Empirically, at SFT-init gives (Eq. 3) and shrinks to by step (Tab. 18) — the gradient effectively dies under binary reward. Under process reward, format errors at keep at every step we observe, so Eq. 7 delivers throughout training; the smooth four-level landscape of Tab. 1 is the geometric consequence. The argument is task-agnostic: any RL setting whose output has required components inherits the same dichotomy, so a process-style decomposition is the structural fix wherever it applies.
The reward landscape (Tab. 1) inverts the binary ranking: “good reasoning, wrong label” scores (vs. ) and “correct label, poor reasoning” only (vs. ) — binary reward effectively pays for lucky guesses; process reward pays for genuine verification work.
| Response quality | Process | Binary |
|---|---|---|
| Correct label + good reasoning | 1.13 | 1.0 |
| Good reasoning, wrong label | 0.63 | 0.0 |
| Correct label, poor reasoning | 0.28 | 1.0 |
| Unparseable output | 0.0 | 0.0 |
Why 70/30, and how each component scores.
The split forces the agent to do substantive verification before the label becomes the easy lever: with outcome dominating, the model would learn to guess labels and produce incoherent reasoning around them. Each scores independently — on JSON validity, on per-span grounding, on per-step judgment and citation, on label match, on error type and fix; an asymmetric calibration term rewards confident correctness and penalizes overconfident error. Algorithm 1 gives the full computation; per-component rubrics are in Appendix C.
2.4 Training Pipeline
Seva is trained in two phases. SFT: GPT-4o-mini annotates ANLI examples with structured output ( format-valid); Qwen2.5-3B-Instruct (Qwen Team, 2025) is fine-tuned for epochs at lr . GRPO: the SFT checkpoint seeds 5 epochs (350 steps) of process-reward GRPO with , , , lr , on veRL (Sheng et al., 2025) with FSDP on 2RTX 6000 Ada (28 GPU-hours total). The low GRPO lr and small jointly preserve the SFT-established format while leaving room for the policy to explore; we found this balance via a small sweep ( over-regularized; admitted format-gaming). We additionally train a 7B variant via two-stage SFT (binary NLI structured) with LoRA-128 (Hu et al., 2022); full hyperparameters in Appendix A.
2.5 Self-Evolution via Structured Diagnostics
Structured output exposes which aspect of verification failed when the agent misclassifies; we channel this signal through a VerifyReflectProbeRefine loop (Fig. 1, bottom), borrowing the principle of functional separation from MARCH (Li et al., 2026) but applying it across loop stages rather than agents. Reflect aggregates error diagnoses into a 6-bin weakness profile; Probe allocates adversarial generation budget proportional to per-category weakness, giving weak bins the budget of strong ones (e.g., entity_sub at acc gets fabrication’s at ).
GRPO training itself constitutes Round 0: the process reward continuously assesses structural quality, and rollout sampling explores the agent’s decision boundary. We then iterate four additional rounds on the 7B Step150 seed: Round 1 injects extracted verification rules into the prompt (no parameter update); Round 2 performs LoRA SFT on adversarial probes; Round 3 performs full FT on mixed samples (adversarial replay); Round 4 (“mega-FT”) extends Round 3 with mixed samples to test whether more diverse adversarial data closes any remaining gap. Pseudocode for the four-stage loop is in App. L (Algorithm 2).
Specialists, not generalists.
We expected four refinement rounds to yield a monotonically improving generalist; we observe a sharper specialization fingerprint instead (Table 2). Round 2 lifts HaluEval by pp but drops TruthfulQA by ; Round 3 sharpens to ; Round 4 holds at despite more data. Persistence at scale rules out trivial overfitting and identifies the effect as data-distribution-induced: probes drawn from a ClearFacts-style weakness profile push the model toward those failure modes and away from TruthfulQA’s distribution. When probes come from a single source distribution, specialization is the dominant mode of iterative refinement.
| ClearF. | FEVER | TrQA | HaluE. | |
|---|---|---|---|---|
| Round 1 (rules) | 0.7 | 0.5 | 1.1 | 0.6 |
| Round 2 (LoRA) | 1.3 | 1.6 | 10.2 | 10.9 |
| Round 3 (FT) | 0.0 | 1.2 | 13.8 | 14.3 |
| Round 4 (mega-FT) | 0.1 | 1.5 | 12.4 | 14.9 |
Does the loop actually work?
Three properties make the data-volume reading hard to sustain: (i) the trade-off is monotone in both directions ( HaluEval, TruthfulQA — a non-functional loop would sign-flip); (ii) per-benchmark gains track Probe-stage budget allocation, not raw sample count; (iii) Round 4 has Round 2’s data but adds only pp on HaluEval, exhibiting saturation rather than the unbounded growth of data-volume overfitting. The clean isolation ablation — swapping the weakness-guided Probe for a same-budget random sampler — is out of scope for this submission (§4.3); the loop’s response to its signal is consistent with a working mechanism (full analysis in App. L). This specialization fingerprint is itself only visible because structured output exposes per-category dynamics; the same asymmetry would be invisible in aggregate accuracy, motivating downstream architectural responses (e.g., per-domain routing across rounds) we explore in follow-up work.
3 Experiments
Having laid out the architecture (§2.2), the reward (§2.3), and the self-evolution loop (§2.5), we now stress-test Seva on four axes that any deployed verifier must clear: accuracy against established binary baselines, generalization across benchmarks with different failure modes, structural reliability of the produced output, and training dynamics under both reward designs.
3.1 Setup
We evaluate on ClearFacts (Seo et al., 2025) (1,590 samples; our primary metric), FEVER (Thorne et al., 2018) (200), TruthfulQA (Lin et al., 2022) (400), and HaluEval (Li et al., 2023) (200). Together these cover claim-source attribution, encyclopedic verification, common misconceptions, and LLM-generated hallucinations — four distinct distributions chosen to probe whether Seva’s structural advantages survive across error types.
Baselines include binary verifiers reported by Seo et al. (2025): MiniCheck-7B (81.2 F1), ClearCheck-8B (84 F1), and Llama-3.1-8B zero-shot (67.2 F1). For structured comparison we evaluate GPT-4o-mini with zero-shot SEVA prompting and MiniCheck-Flan-T5-Large (770M). We report macro F1, accuracy, and structural quality (alignment quality , chain quality , format compliance rate).
3.2 Main Results
| Model | Size | Output | Acc | F1 |
|---|---|---|---|---|
| Binary-label verifiers | ||||
| Llama-3.1 (0-shot) | 8B | binary | – | 67.2 |
| MiniCheck | 7B | binary | – | 81.2 |
| ClearCheck | 8B | binary | – | 84 |
| Structured verifiers | ||||
| GPT-4o-mini (0-shot) | – | struct | 69.9 | 69.8 |
| MiniCheck-Flan-T5 | 770M | binary | 68.3 | 68.3 |
| Ours | ||||
| Seva-SFT | 3B | struct | 65.2 | 64.9 |
| Seva-GRPO | 3B | struct | 69.6 | 69.0 |
| Seva-SFT (LoRA-128) | 7B | struct | 68.6 | 68.5 |
Table 3 presents ClearFacts results. With process reward, GRPO lifts Seva-3B from 64.9 to 69.0 F1 (+4.1), narrowing the gap with GPT-4o-mini (69.8) to under one point. Importantly, Seva produces substantially richer output — grounded evidence spans, multi-step reasoning, and a six-category error taxonomy — that zero-shot prompting of GPT-4o-mini captures only partially.
At 7B scale, SFT with LoRA-128 reaches 68.5 F1 without any RL, nearly matching 3B GRPO. Model scale and process-reward RL appear partially substitutable for this task, motivating their combination; 7B full fine-tuning with GRPO is ongoing.
The gap to MiniCheck-7B (81.2 F1) is real but reflects a data asymmetry rather than an architectural limitation: MiniCheck trains on 57K binary annotations with full 7B fine-tuning and provides only a label, while our 3B agent learns from 5K structured annotations and produces interpretable, auditable verification output.
3.3 Generalization Across Benchmarks
| Out. |
ClearF. |
FEVER |
TrQA |
HaluE. |
|
|---|---|---|---|---|---|
| GPT-4o-mini | struct | 69.8 | 91.0 | 48.6 | 34.0 |
| MiniCheck-FT5 | binary | 68.3 | 87.1 | 59.5 | 42.4 |
| Seva-SFT (3B) | struct | 64.9 | 76.3 | 72.1 | 42.0 |
| Seva-GRPO (3B) | struct | 69.0 | 84.9 | 82.7 | 39.4 |
GRPO’s gains are largest on class-balanced benchmarks ( FEVER, TruthfulQA). The -point TruthfulQA gap over GPT-4o-mini ( vs. ) traces directly to ’s per-step source-citation requirement: GPT-4o-mini falls back on parametric knowledge when claims “sound right,” while Seva is forced to ground every step in the document. HaluEval ( vs. SFT) is the exception — the agent over-predicts “Not Attributable,” a reward-induced bias we trace in §4.
3.4 Structural Quality
| Align | Chain | Format | F1 | |
|---|---|---|---|---|
| Seva-SFT | 0.917 | 0.917 | 72% | – |
| Seva-GRPO | 0.997 | 0.995 | 100% | +4.1 |
Process reward drives structural quality to near-perfect levels (alignment , chain , format — Tab. 5); a verifier whose output fails to parse of the time under SFT cannot serve as a dependable safety component, so this reliability is itself load-bearing for deployment.
Qualitative gap.
Fig. 4 makes the deployment-relevance concrete: on the same input, the binary verifier returns “Not Attributable” with no explanation, while Seva pinpoints the exact mismatch (“significantly” absent from source), traces step-by-step reasoning, names the error category, and suggests a fix — everything a downstream correction module or human reviewer needs to act, packaged in 120 tokens of JSON.
Claim: “60% of participants significantly improved” Source: “60% of subjects showed improvement” Binary verifier: Not Attributable (no explanation) Seva-GRPO structured output: Align: “60% of participants” “60% of subjects” [match]; “significantly improved” NOT_FOUND Chain: Step 1: supported (percentage matches); Step 2: not_supported (qualifier absent) Label: Not Attributable, confidence 0.85 Diag: scope_inflation; fix: remove “significantly”
3.5 Training Dynamics and Implicit Curriculum
Binary reward’s advantage spread decays from to over steps while its mean barely moves (); process reward sustains spread above throughout, with mean climbing from to (full trajectory in App. E). A training-time ordering we did not design for emerges from this contrast (Fig. 5): alignment and format saturate by step 150 (, ) while F1 continues climbing through step 350 (). The agent masters verification behavior before outcomes — a natural difficulty asymmetry between pattern-level skills and semantic reasoning, amplified by the 70/30 weighting. Mathematical PRMs (Lightman et al., 2024) show a parallel effect on sequential steps; we extend the principle to parallel components.
4 Ablation and Analysis
The headline numbers establish that Seva works; we now ask why. Two questions matter: is process reward genuinely the load-bearing design choice (vs. just “GRPO with extra steps”), and where does Seva still fail systematically? The ablation isolates the first, the error analysis surfaces the second — together they motivate the deployment caveats in §4.3.
4.1 Ablation Study
Table 6 isolates the design choice that matters. Replacing process reward with binary reward (every other GRPO setting identical: , , , steps) yields F1 and no structural improvement over SFT — steps of policy optimization producing zero gain because the advantage signal is too weak to learn from. The structural metrics are diagnostic: binary GRPO leaves alignment quality at and format compliance at , exactly the SFT levels, confirming that the policy never updates. Process GRPO simultaneously drives alignment to and format to while improving F1 — a tri-directional gain only possible when the gradient is non-degenerate (Prop. 2). Process reward is therefore not an incremental enhancement; it is a prerequisite for applying GRPO to structured output, and the three-row contrast in Tab. 6 is the empirical realization of the theoretical dichotomy in Eq. 4–7.
| Configuration | F1 | Align | Format |
|---|---|---|---|
| Seva-GRPO (process reward) | 69.0 | 0.997 | 100% |
| Seva-GRPO (binary reward) | 65 | 0.92 | 72% |
| Seva-SFT (no RL) | 64.9 | 0.917 | 72% |
4.2 Reward Asymmetry and Negative-Prediction Bias
GRPO with process reward over-predicts “Not Attributable” ( false positives on ClearFacts; confusion matrix and error-type distribution in App. D). The cause is structural: gives negative predictions a two-part signal (error type fix) while positive ones collapse to a scalar ( for correct omission), exposing more “reward surface” to negative predictions and biasing the policy under uncertainty. This helps on balanced benchmarks (FEVER, TruthfulQA) and hurts on positively skewed ones (HaluEval); the catch-all fabrication diagnosis at of negative predictions is its empirical signature. Label-conditional reward normalization is the natural fix; we leave it to future work.
4.3 Limitations
Four caveats bound the claims. (i) Two ablations would strengthen the self-evolution evidence and are out of scope: a Probe-distribution control (random vs. weakness-guided) and a cross-distribution probe mix. (ii) GRPO is applied only at 3B; the 7B variant uses LoRA, so the scale–RL combination in Table 3 remains untested at full 7B FT. (iii) The 70/30 split was chosen on principled grounds (App. I reports a coarse sweep); a finer-grained search and non-uniform training schedules are future work. (iv) The negative-prediction bias induced by ’s asymmetric reward surface (§4) should be addressed before deployment; label-conditional reward normalization is the natural starting point.
5 Related Work
Seva sits at the intersection of three previously-uncombined lines. Fact attribution verification via NLI transfer (Tang et al., 2024), unified alignment (Zha et al., 2023), and refined benchmarks (Seo et al., 2025) is accurate but unstructured and SFT-only; we retain the benchmarks, add structured output + RL. RL for reasoning with GRPO (Shao et al., 2024; Zha et al., 2025) and hallucination detection (MARCH (Li et al., 2026), Dr. Zero (Yue et al., 2026)) assumes single-answer output where correctness reward suffices; that assumption breaks for multi-component generation. Process reward models for math (Lightman et al., 2024; Wang et al., 2024) score sequential step dependencies; we score parallel components, an independence that lets us weight and normalize each separately to produce the smooth landscape GRPO needs.
6 Discussion
Scale and process-reward RL are complementary, not redundant: -SFT-LoRA-128 ( F1) and -GRPO ( F1) reach the same accuracy through different routes, and their combination should close the gap to MiniCheck-’s — a hypothesis we are actively testing. Beyond accuracy, the five-component decomposition is itself dual-use: it extracts five gradients per response where binary reward extracts one, and exposes the per-category dynamics under which the specialization fingerprint becomes visible at all. For deployment, an unparseable response is functionally indistinguishable from a wrong one, so the format-error drop is as load-bearing as the F1 gain; safety-critical pipelines can early-stop at step (already format-reliable, not yet F1-saturated) and route the structured diagnosis — taxonomy, alignment, fix — to an auditor as a contract between upstream generator and downstream judge, the substrate any trustworthy agent pipeline ultimately needs.
7 Conclusion
A 3B Seva matches GPT-4o-mini on ClearFacts ( vs. F1) at format compliance while producing auditable structured output — alignments, reasoning chains, calibrated confidence, six-category error diagnosis with fixes. The enabler is a process reward that resolves binary reward’s advantage collapse on multi-component generation (Prop. 1–2); the surprise is a monotone, signed specialization fingerprint under iterative refinement, visible only because per-category dynamics are exposed by structured output. Three principles should transfer wherever agents must explain, justify, and improve under audit — reward granularity matches output granularity, structured output is a dual-use asset, iterative self-improvement drifts toward specialization under single-sourced probes — favoring architectural responses over more training rounds.
References
- RARR: researching and revising what language models say, using language models. In Proceedings of ACL, Cited by: §1.
- LoRA: low-rank adaptation of large language models. In Proceedings of ICLR, Cited by: §2.4.
- Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700. Cited by: Appendix N.
- HaluEval: a large-scale hallucination evaluation benchmark for large language models. In Proceedings of EMNLP, Cited by: 2nd item, §3.1.
- MARCH: multi-agent reinforced self-check for LLM hallucination. arXiv preprint arXiv:2603.24579. Cited by: §1, §2.5, §5.
- Let’s verify step by step. In Proceedings of ICLR, Cited by: §3.5, §5.
- TruthfulQA: measuring how models mimic human falsehoods. In Proceedings of ACL, Cited by: 2nd item, §3.1.
- FActScore: fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of EMNLP, Cited by: §1.
- Adversarial NLI: a new benchmark for natural language understanding. In Proceedings of ACL, Cited by: 2nd item.
- Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Cited by: §2.4.
- Verifying the verifiers: unveiling pitfalls and potentials in fact verifiers. In Proceedings of COLM, Note: arXiv preprint arXiv:2506.13342 Cited by: 2nd item, §B.2, §1, §2.1, §3.1, §3.1, Table 3, §5.
- DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §1, §5.
- HybridFlow: a flexible and efficient RLHF framework. In Proceedings of EuroSys, Note: arXiv preprint arXiv:2409.19256 Cited by: Table 9, §2.4.
- Energy and policy considerations for deep learning in NLP. In Proceedings of ACL, Cited by: Appendix N, Appendix N.
- MiniCheck: efficient fact-checking of LLMs on grounding documents. In Proceedings of EMNLP, Note: arXiv preprint arXiv:2404.10774 Cited by: §1, §2.1, §5.
- FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of NAACL-HLT, Cited by: 2nd item, §3.1.
- Fine-tuning language models for factuality. In Proceedings of ICLR, Cited by: §1.
- Math-shepherd: verify and reinforce LLMs step-by-step without human annotations. In Proceedings of ACL, Cited by: §5.
- Dr. Zero: self-evolving search agents without training data. arXiv preprint arXiv:2601.07055. Cited by: §5.
- RL tango: reinforcing generator and verifier together for language reasoning. In Proceedings of NeurIPS, Note: arXiv preprint arXiv:2505.15034 Cited by: §1, §5.
- AlignScore: evaluating factual consistency with a unified alignment function. In Proceedings of ACL, Cited by: §5.
Appendix A Implementation Details
A.1 Hardware and Compute
Experiments were conducted on a local server with 2NVIDIA RTX 6000 Ada (48 GB each), plus 1A100 80G for 7B variants and self-evolution rounds. Table 7 summarizes the compute budget for the full pipeline (3B SFTGRPO, 7B SFT, and the four self-evolution rounds reported in §2.5); the carbon footprint estimate is in Appendix N.
| Experiment | GPUs | Time | Hrs |
| 3B SFT (full FT) | 2Ada | 2h | 4 |
| 3B GRPO (350 steps) | 2Ada | 8h | 16 |
| 7B SFT (LoRA-64) | 1A100 80G | 3h | 3 |
| 7B SFT (LoRA-128) | 1A100 80G | 4h | 4 |
| Self-evolution rounds (7B, §2.5) | |||
| SE Round 1 (rules, no update) | – | – | 0 |
| SE Round 2 (LoRA-64, 1.1K) | 1A100 80G | 5h | 5 |
| SE Round 3 (Full FT, 2.0K) | 1A100 80G | 21h | 21 |
| SE Round 4 (mega-FT, 7.8K) | 1A100 80G | 72h | 72 |
| Eval (per benchmark) | 1Ada | 20m | 0.3 |
| Adversarial probe generation | GPT-4o-mini API | – | – |
| 3B-only subtotal | 28 | ||
| Full pipeline total | 130 | ||
A.2 SFT Hyperparameters
| Parameter | 3B (full) | 7B (LoRA) |
|---|---|---|
| Base model | Qwen2.5-3B | Qwen2.5-7B |
| Epochs | 3 | 3 |
| Batch (per GPU) | 4 | 4 |
| Grad. accum. | 4 | 4 |
| Eff. batch | 16 | 16 |
| Learning rate | 2e-5 | 2e-5 / 5e-5 |
| Scheduler | Cosine | Cosine |
| Warmup ratio | 0.05 | 0.05 |
| Weight decay | 0.01 | 0.01 |
| Max seq. len | 1280 | 1280 |
| Precision | bf16 | bf16 |
| Grad. ckpt. | ✓ | ✓ |
| LoRA-specific (7B only) | ||
| LoRA rank | – | 64 / 128 |
| LoRA alpha | – | 128 |
| LoRA dropout | – | 0.05 |
| Target mods | – | q,k,v,o,gate,up,dn |
| Trainable (%) | 100% | 2.1% / 4.1% |
A.3 GRPO Hyperparameters
| Parameter | Value |
|---|---|
| Framework | veRL 0.3 (Sheng et al., 2025) |
| Algorithm | GRPO |
| Base model | Seva-SFT (3B) |
| Group size () | 8 |
| Temperature | 1.2 |
| Top- | 0.95 |
| Max prompt length | 768 tokens |
| Max response length | 512 tokens |
| Train batch size | 64 |
| Learning rate | 2e-6 |
| KL coefficient () | 0.001 |
| Epochs | 5 (350 steps) |
| Parallelism | FSDP (tp=1, dp=2) |
| Reward function | seva_reward.py |
A.4 Inference Configuration
| Parameter | Value |
|---|---|
| Inference engine | vLLM |
| Temperature | 0.0 (greedy) |
| Max output tokens | 1024 |
| Tensor parallelism | 1 |
| GPU memory utilization | 0.9 |
Appendix B Dataset Statistics
B.1 Training Data
| Dataset | Samples | Attr.% | Format |
|---|---|---|---|
| Structured SEVA data (5K) | |||
| ANLI (annotated) | 4,992 | 50.8% | structured |
| GRPO prompts | |||
| ANLI (prompts) | 4,500 | 51.0% | prompt-only |
Structured annotations are generated using GPT-4o-mini with a detailed system prompt. Each response is validated for: (1) valid JSON with all required fields (evidence_alignment, reasoning_chain, label, confidence); (2) valid status {match, mismatch, not_found}; (3) valid judgment {supported, not_supported, partially_supported}; (4) confidence . Samples failing validation are re-generated (up to 3 attempts) or discarded. The acceptance rate is 92%.
B.2 Evaluation Benchmarks
| Benchmark | Size | Eval | Attr.% | Domain |
|---|---|---|---|---|
| ClearFacts | 1,590 | full | 53.8% | General |
| FEVER | 19,998 | 200 | 50.0% | Wikipedia |
| TruthfulQA | 817 | 400 | 49.5% | Misc. |
| HaluEval | 10,000 | 200 | 50.0% | LLM-gen. |
Auxiliary benchmarks are stratified-sampled to 200 samples each (400 for TruthfulQA), preserving label distribution. ClearFacts is evaluated in full (1,590 samples) following Seo et al. (2025).
Appendix C Process Reward Scoring Rubrics
This appendix grounds the headline formula at the per-component level: how each is computed from the structured response, how labels are normalized across teacher dialects, and what reward range each scenario admits. The propositions in §2.3 are proved at the end of this appendix; we read the rubrics here as the operational definitions that make those propositions empirically tight.
C.1 Label Normalization
The reward function supports extensive label aliasing:
| Alias | Canonical label |
|---|---|
| yes, true, entailment, supported | Attributable |
| no, false, contradiction, neutral | Not Attributable |
| not supported, not_attributable | Not Attributable |
C.2 : Alignment Scoring Detail
For each alignment entry , the per-entry score is:
| (8) | ||||
Final alignment score: mean across entries, capped at 1.0.
C.3 : Chain Scoring Detail
For each reasoning step :
| (9) | ||||
Length bonus: rewards multi-step chains.
C.4 : Diagnosis Scoring Detail
| (10) |
where A = Attributable, NA = Not Attributable, is the six-category error taxonomy.
C.5 : Calibration Term
The calibration term rewards a model that is confident when correct and penalizes overconfidence when wrong:
| (11) |
where is the model’s predicted confidence. The asymmetry ( vs. ) is deliberate: in safety-critical deployment the cost of an overconfident wrong answer exceeds the value of an overconfident correct one, so the calibration term is biased toward rewarding correct calibration more than it punishes wrong calibration. This term is the source of the residual negative-prediction bias documented in §4; the asymmetric reward surface induces a small but systematic preference for predictions whose error pathways carry richer diagnostic structure.
C.6 Reward Range Analysis
| Scenario | Min | Max |
|---|---|---|
| Unparseable (no JSON) | 0.0 | 0.0 |
| JSON only, no fields | 0.02 | 0.02 |
| All fields, all wrong | 0.10 | 0.25 |
| Perfect process, wrong label | 0.55 | 0.70 |
| Everything perfect | 1.00 | 1.28 |
C.7 Proof Sketches for Propositions 1–2
Proposition 1 (Binary-Reward Advantage Collapse).
Let be i.i.d. Bernoulli rewards in a GRPO group of size . The unbiased sample-variance estimator has expectation . For this quantity is exactly ; by continuity, almost surely as approaches either endpoint. In our SFT setting, (only of rollouts predict the gold label and parse as valid JSON), so , giving . The normalized advantage is therefore bounded in magnitude by , and the policy gradient inherits this bound. Empirically, after 350 GRPO steps shrinks further to (Table 18), and the gradient is effectively zero. The argument does not depend on the specifics of binary reward — any reward with low intra-group dispersion at training start triggers the same collapse, which is why we identify it as a structural rather than incidental failure.∎
Proposition 2 (Process-Reward Variance Lower Bound).
Let with and . By the standard variance identity for linear combinations, . Provided the components are not all perfectly anti-correlated — which would require contrived correlation structure across format, alignment, chain, label, and diagnosis — the cross-terms cannot drive to zero unless every . In particular, if even a single component has and is uncorrelated with the rest, then . In our training data, (format) and (alignment) almost always have positive variance early in training (the SFT policy produces format errors of the time and grounding errors at varying rates), so inherits a strictly positive variance from these components alone. The GRPO gradient is therefore non-vanishing under process reward at any training step where any single component shows within-group disagreement — which is the empirical regime we observe in Table 18.∎
Appendix D Per-Benchmark Error Analysis
D.1 ClearFacts Confusion Matrices
| SFT | GRPO | |||
|---|---|---|---|---|
| Pred A | Pred NA | Pred A | Pred NA | |
| Gold Attr. | 68.2% | 31.8% | 64.1% | 35.9% |
| Gold Not Attr. | 38.5% | 61.5% | 25.4% | 74.6% |
| Format errors | 28% | 1% | ||
GRPO dramatically reduces false negatives (38.5% 25.4%) and format errors (28% 1%), but slightly increases false positives (31.8% 35.9%). The net effect is +4.1 F1.
D.2 Multi-Benchmark Analysis
| Benchmark | Attr.% | GRPO Pred NA% | Effect |
|---|---|---|---|
| FEVER | 50.0% | 48.5% | +8.6 F1 |
| TruthfulQA | 49.5% | 52.0% | +10.6 F1 |
| HaluEval | 50.0% | 55.5% | 2.6 F1 |
GRPO’s negative-prediction bias helps on balanced benchmarks (FEVER, TruthfulQA) but hurts when the agent over-predicts “Not Attributable” relative to the true distribution.
D.3 Error Type Distribution
| Error type | Count | % |
|---|---|---|
| fabrication | 298 | 36.7% |
| scope_inflation | 187 | 23.0% |
| entity_substitution | 124 | 15.3% |
| numerical_exaggeration | 89 | 11.0% |
| negation_flip | 68 | 8.4% |
| temporal_shift | 46 | 5.7% |
The agent most frequently diagnoses fabrication (36.7%), the catch-all category for information absent from the source. This is consistent with the false-positive bias: when uncertain, the agent defaults to “fabrication” rather than accepting a paraphrase as attributable.
Appendix E GRPO Training Dynamics
| Step | Reward | Entropy | Adv. min | Adv. max |
|---|---|---|---|---|
| Process reward | ||||
| 0 | 1.01 | 0.21 | 2.47 | +2.47 |
| 100 | 1.06 | 0.15 | 2.10 | +2.30 |
| 200 | 1.09 | 0.10 | 1.85 | +2.15 |
| 350 | 1.12 | 0.06 | 1.60 | +2.05 |
| Binary reward | ||||
| 0 | 0.38 | 0.20 | 0.15 | +0.12 |
| 100 | 0.40 | 0.18 | 0.10 | +0.08 |
| 200 | 0.39 | 0.16 | 0.08 | +0.06 |
| 350 | 0.41 | 0.14 | 0.05 | +0.04 |
The advantage spread under binary reward collapses toward zero, meaning all responses in a group receive nearly identical reward. Under process reward, the advantage spread remains 1.0 throughout training, providing effective learning signal.
Figure 6 visualizes the topology directly: process reward defines a smooth, multi-level terrain over response space (alignment quality reasoning quality), and a GRPO group of 8 rollouts spreads across reward levels from to — a gradient is available everywhere. Binary reward collapses the same space into two flat plateaus separated by a single cliff edge; the same 8 rollouts collapse to (6 at , 2 at ), driving within-group in Eq. 2. The contrast is geometric: process reward is climbable, binary reward is a constant punctuated by a cliff.
Appendix F Adversarial Data Generation
The self-evolution loop (§2.5) uses six targeted perturbation strategies to generate adversarial examples. Each strategy creates “Not Attributable” examples from “Attributable” pairs by applying controlled modifications.
| Strategy | Original | Perturbed |
|---|---|---|
| Entity confusion | Apple released iPhone in 2007 | Samsung released iPhone in 2007 |
| Numerical pert. | Revenue grew by 15% | Revenue grew by 35% |
| Multi-hop graft | studied at Oxford | studied at Cambridge |
| Paraphrase | CEO announced merger | CEO denied merger |
| Presupposition | rose by 2°C | rose by 2°C, highest on record |
| Unansw. wrap | positive results | results significant at |
Generated examples are filtered using three criteria: (1) the perturbation must be detectable by a human; (2) the perturbed claim must remain grammatically fluent; (3) the perturbation must target the intended error type (validated by GPT-4o-mini cross-check). Approximately 15% of generated examples are discarded during filtering.
Appendix G Structured Output Examples
We provide three examples of Seva-GRPO structured output on ClearFacts: a correct positive, a correct negative with error diagnosis, and a false positive failure case.
Ex. 1: Correct “Attributable”
Claim: “The company was founded in 2004 and is headquartered in San Francisco.”
Source: “Founded in 2004, the technology firm operates from its main office in San Francisco, CA.”
Output (abbreviated):
evidence_alignment: [
{claim: "founded in 2004",
source: "Founded in 2004",
status: match},
{claim: "headquartered in SF",
source: "main office in SF",
status: match}]
reasoning_chain: [
{part: "founded in 2004",
judgment: supported},
{part: "headquartered in SF",
judgment: supported}]
label: "Attributable", conf: 0.95
Gold: Attributable ✓ Correct
Ex. 2: Correct “Not Attributable”
Claim: “The study involved 500 participants from 12 countries.”
Source: “The research included 500 volunteers from 8 different nations.”
Output (abbreviated):
evidence_alignment: [
{claim: "500 participants",
source: "500 volunteers",
status: match},
{claim: "12 countries",
source: "8 different nations",
status: mismatch}]
reasoning_chain: [
{part: "500 participants",
judgment: supported},
{part: "12 countries",
judgment: not_supported}]
label: "Not Attributable",
error_type: numerical_exaggeration,
fix: "Change 12 to 8."
Gold: Not Attributable ✓ Correct
Ex. 3: False positive — over-sensitive to paraphrase
Claim: “The policy was implemented to reduce emissions.” Source: “The new regulation was enacted with the goal of lowering greenhouse gas output.”
evidence_alignment: [{claim: "policy was implemented", source: "regulation was enacted", status: match}, {claim: "reduce emissions", source: "lowering GHG output", status: mismatch}] reasoning_chain: [{part: "reduce emissions", judgment: not_supported, explanation: "emissions broader than greenhouse gas output"}] label: "Not Attributable", error_type: scope_inflation, fix: "Change emissions to greenhouse gas output."
Gold: Attributable False Positive Analysis: The model over-interprets “emissions” vs. “greenhouse gas output” as scope mismatch. This illustrates the false-positive bias (§4).
Appendix H Error Taxonomy
Table 20 defines the six-category error taxonomy used for error diagnosis. Each category corresponds to a distinct failure mode in fact attribution.
| Error type | Description |
|---|---|
| Numerical exag. | Number inflated or deflated |
| Negation flip | Negation added or removed |
| Scope inflation | Specific claim overgeneralized |
| Temporal shift | Time qualifier altered |
| Entity substitution | Entity swapped for a different one |
| Fabrication | Information absent from source |
The taxonomy is designed to be mutually exclusive and collectively exhaustive for the error types observed in fact attribution benchmarks. When the agent predicts “Not Attributable,” it must select exactly one error type and provide a corresponding fix suggestion. For “Attributable” predictions, no error type is produced.
Appendix I Reward Weight Sensitivity
Our process reward uses a 70/30 process-outcome split. Table 21 reports the effect of alternative weight configurations on ClearFacts F1.
| Weight split | F1 | Align | Format |
|---|---|---|---|
| 90/10 (process-heavy) | 67.2 | 0.998 | 100% |
| 70/30 (ours) | 69.0 | 0.997 | 100% |
| 50/50 (balanced) | 68.1 | 0.985 | 98% |
| 30/70 (outcome-heavy) | 66.8 | 0.945 | 85% |
| 0/100 (binary reward) | 65 | 0.92 | 72% |
The 70/30 split achieves the best F1 while maintaining near-perfect structural quality. Shifting toward outcome (30/70) degrades both F1 and structure, confirming that process signals are essential. Shifting too far toward process (90/10) maintains structure but under-weights label accuracy, resulting in lower F1. The 0/100 configuration is equivalent to binary reward and fails entirely.
Appendix J SFT vs. GRPO: Qualitative Comparison
To illustrate the qualitative difference between SFT and GRPO outputs, we show the same claim-source pair processed by both models.
SFT output (format error)
Claim: “The drug reduced mortality by 30%.”
Source: “The treatment decreased death rates by approximately one-third.”
{"label": "Attributable",
"confidence": 0.7}
Issues: Missing evidence_alignment, missing reasoning_chain, no error diagnosis. Counted as format error (28% of SFT outputs).
GRPO output (complete structured)
Same input as above.
evidence_alignment: [{claim: "reduced mortality by 30%", source: "decreased death rates by approximately one-third", status: "match"}]
reasoning_chain: [{part: "reduced mortality by 30%", evidence: "decreased death rates by approximately one-third", judgment: "supported", explanation: "30% and one-third are equivalent"}]
label: "Attributable", confidence: 0.92
Improvement: Complete structured output with grounded evidence, step-by-step reasoning, and calibrated confidence.
Appendix K Prompt Templates
K.1 SFT System Prompt
The following system prompt is used during SFT training and evaluation:
System prompt for structured verification You are a fact attribution verifier. Given a claim and a source document, determine whether the claim is fully supported by the source. Respond in JSON with the following structure: { "evidence_alignment": [ {"claim_span": "...", "source_span": "...", "status": "match|mismatch|not_found"}], "reasoning_chain": [ {"claim_part": "...", "source_evidence": "...", "judgment": "supported|not_supported |partially_supported", "explanation": "..."}], "label": "Attributable|Not Attributable", "confidence": 0.0--1.0, "error_type": "(if Not Attributable) numerical_exaggeration|negation_flip| scope_inflation|temporal_shift| entity_substitution|fabrication", "fix_suggestion": "(if NA) correction" }
K.2 User Prompt Template
User prompt template
Claim: {claim}
Source: {source}
Is this claim attributable to the source?
Provide your analysis in structured JSON format.
K.3 Teacher Annotation Prompt (GPT-4o-mini)
For generating structured training data, we use a more detailed prompt with few-shot examples:
Teacher annotation prompt (abbreviated) You are an expert fact-checker creating training data for a verification model. For each claim-source pair, produce a detailed analysis. Requirements: • evidence_alignment: ALL claim spans, even if not_found in source • reasoning_chain: ¿= 2 steps • Each step must reference specific source text • confidence: reflect genuine uncertainty • error_type: match the actual error pattern • fix_suggestion: actionable and minimal [2 few-shot examples omitted for brevity]
Appendix L Self-Evolution: Per-Round, Per-Benchmark Results
We first formalize the four-stage loop in Algorithm 2, then report absolute macro-F1 per round to back the relative deltas in Table 2.
Table 22 reports absolute macro-F1 for every round of the self-evolution loop (§2.5) on the four-benchmark suite, using the 7B Step150 GRPO checkpoint as the Round 0 seed. This extends Table 2 (which reports only F1 vs. Step150) with the full numbers needed to reproduce the specialization fingerprint, and shows that the average F1 across benchmarks is essentially flat from Round 0 through Round 4 (70.5–71.4): the specialization is a redistribution of mass, not an aggregate gain.
| Round | CF | FEVER | TQA | HE | Avg |
|---|---|---|---|---|---|
| Step150 (seed) | 65.2 | 90.7 | 68.8 | 57.1 | 70.5 |
| Round 1 (rules) | 64.5 | 90.2 | 69.9 | 57.7 | 70.6 |
| Round 2 (LoRA, 1.1K) | 66.5 | 92.3 | 58.6 | 68.0 | 71.4 |
| Round 3 (Full FT, 2.0K) | 65.2 | 91.9 | 55.0 | 71.4 | 70.9 |
| Round 4 (mega-FT, 7.8K) | 65.1 | 92.2 | 56.4 | 72.0 | 71.4 |
| Step150 R4 | 0.1 | 1.5 | 12.4 | +14.9 | 0.9 |
Two observations strengthen the structural reading. First, the TQA HE swap appears at Round 2 (where LoRA training on probes is light) and sharpens at Round 3 / Round 4 despite the dataset growing between Round 2 and Round 4. The asymmetry is therefore not a calibration drift that more data corrects; it is a stable property of the probe distribution itself. Second, the per-round winners differ: Round 2 dominates CF and FEVER, Round 4 dominates HE, while Step150 (no specialization) wins TQA. No single round Pareto-dominates the others on every benchmark — a precondition for any downstream specialist-routing strategy.
What this evidence does and does not establish.
We are explicit about scope. What the per-round data does establish: (i) four rounds of the VerifyReflectProbeRefine loop produce a stable, signed, monotone specialization fingerprint (Table 22, Fig. 3); (ii) the fingerprint matches the Probe stage’s target weakness profile in direction; (iii) the magnitude saturates by Round 2–3 and is not driven by raw sample count, ruling out the data-volume-overfitting reading. What it does not yet establish: (a) that the weakness-guided Probe distribution is strictly necessary for specialization — a same-budget random-probe control is the natural ablation and is out of scope for this submission (§4); (b) that mixing probes from heterogeneous source distributions would recover a generalist rather than reproduce the specialization fingerprint at a different center of mass; (c) that the effect transfers to the 3B GRPO model, where we have not run the self-evolution loop. We treat (a)–(c) as the most informative follow-up experiments and frame the current results as the within-distribution finding that motivates them.
Appendix M Failure Case Studies
Beyond aggregate confusion matrices (Appendix D), we study three qualitative failure modes that surface in Seva-GRPO’s output and that future work should target. Each case is taken verbatim from ClearFacts evaluation traces.
Case F1 — Paraphrase mistaken for scope inflation.
Claim: “The policy was implemented to reduce emissions.” Source: “The new regulation was enacted with the goal of lowering greenhouse gas output.” Gold: Attributable. Seva-GRPO predicts Not Attributable with scope_inflation, arguing that “emissions” is broader than “greenhouse gas output.” This is a textbook false positive driven by the asymmetric reward surface on (§4): the agent receives more reward signal for naming a specific error type than for declaring the claim attributable, so under genuine ambiguity it leans toward Not Attributable. The case is also typical of how the six-category taxonomy is mildly over-fitted to claim-level word substitution: any plausible word-level mapping the agent can name will count as a “diagnosis,” even when the underlying semantic relation is a valid hyponym.
Case F2 — HaluEval over-attribution under negative skew.
Claim: “Penicillin was discovered in 1928 by Marie Curie.” Source (LLM-generated answer): “Penicillin was discovered by Alexander Fleming in 1928.” Gold: Not Attributable (entity substitution). Seva-GRPO correctly predicts Not Attributable, but on a separate HaluEval item with subtler entity drift (“the Nobel Prize was awarded in 1921 to Einstein for the photoelectric effect” vs. source “Einstein received the 1921 Nobel for the photoelectric law”), the agent accepts the paraphrase as Attributable. HaluEval skews positive ( Attributable in our 200-sample slice, but with a long tail of near-paraphrase items the model treats as semantically equivalent), and Seva-GRPO’s F1 on this benchmark traces almost entirely to such near-paraphrase “Attributable but should be NA” calls. A deployment-time fix would tighten the alignment threshold for proper-noun substitutions, where word-level token mismatch should be weighted more heavily than for descriptive phrases.
Case F3 — Self-evolution-induced regression on TruthfulQA.
Claim: “Eating carrots significantly improves night vision in healthy adults.” Source: “Carrots contain vitamin A, which is necessary for normal vision; deficiency causes night blindness, but supplementation in adults with adequate intake does not measurably improve night vision.” Gold: Not Attributable. Seva-7B Step150 predicts Not Attributable (correct), citing the qualifier “significantly improves” is not supported. After Round 3 self-evolution, the same model predicts Attributable on this item: the adversarial probes from the Probe stage train the agent to attend to entity- and number-level perturbations, which makes it more permissive on qualifier-level claims like “significantly improves.” This is the per-claim manifestation of the structural specialization finding: training pressure pushes the decision boundary toward ClearFacts/HaluEval-style failures and away from TruthfulQA-style qualifier scrutiny. The case argues that any future Probe stage should explicitly mix qualifier perturbations to preserve TruthfulQA-side competence, rather than concentrate on the four entity/number/temporal axes that the current six-category taxonomy already covers well.
Failure-mode summary.
The three cases share a common shape: the agent’s reward surface, taxonomy, and probe distribution all pull in the same direction (more negative predictions, more entity-style diagnoses), and each case is a manifestation of that joint pressure at a different point in the pipeline. This makes the fixes coupled — a fairer , a less entity-biased taxonomy, and probe distributions that span qualifier-style errors — rather than independently composable.
Appendix N Compute Budget and Estimated Carbon Footprint
Table 23 extends Table 7 with an order-of-magnitude carbon-footprint estimate, following the methodology of Lacoste et al. (2019) and Strubell et al. (2019) (TDP hours PUE regional grid intensity). We use TDP W, TDP W, PUE (typical academic cluster), and a US-grid carbon intensity of kg CO2e / kWh (eGRID 2023 US-average, U.S. EPA). We do not include adversarial-probe generation via the GPT-4o-mini API, whose carbon attribution depends on opaque hyperscaler accounting; we report it separately as “API-side, not estimated.”
| Stage | GPU | GPU-hr | kWh | kg CO2e |
| 3B SFT | 2Ada | 4 | 1.7 | 0.7 |
| 3B GRPO (350 steps) | 2Ada | 16 | 6.7 | 2.8 |
| 7B SFT (LoRA) | A100 80G | 7 | 3.9 | 1.6 |
| SE R2 (LoRA) | A100 80G | 5 | 2.8 | 1.1 |
| SE R3 (Full FT) | A100 80G | 21 | 11.8 | 4.8 |
| SE R4 (mega-FT) | A100 80G | 72 | 40.3 | 16.5 |
| Evaluation (all bench) | Ada | 2 | 0.8 | 0.3 |
| 3B-only | 28 | 11.7 | 4.8 | |
| Full pipeline | 130 | 68 | 28 | |
| API (probe generation) | GPT-4o-mini | not estimated | ||
The full-pipeline estimate ( kg CO2e) is well below the carbon cost of training a single 7B base model from scratch ( t CO2e for comparable scales (Strubell et al., 2019)); our cost is dominated by Round 4 mega-FT, which provides the strongest evidence for the persistence of the specialization effect at data scale. For practitioners primarily interested in the process-reward GRPO contribution (§2.3), the 3B-only path ( kg CO2e) is a sufficient reproduction target.
Appendix O Statistical Significance and Variance
We quantify the noise floor under which our claims should be read. On large benchmarks (ClearFacts, ) we report paired BCa bootstrap intervals; on the auxiliary slices we report seed-to-seed variance under three stratified-sampling seeds. Effect sizes that survive both tests carry the load-bearing weight of the paper; numbers in the noise band are flagged as trends.
Bootstrap confidence intervals.
On ClearFacts (), we estimate 95% bias-corrected and accelerated (BCa) bootstrap intervals for the headline F1 numbers using resamples of the test set. Seva-SFT reaches F1 with a 95% CI of ; Seva-GRPO reaches F1 with 95% CI . The paired bootstrap (resampling claim-level predictions in tandem so that the same items contribute to both estimates) gives a F1 of with a 95% CI of , well clear of zero. A paired McNemar test on per-claim correctness gives , consistent with the bootstrap result.
Per-benchmark variance.
For the auxiliary benchmarks where we evaluate stratified - or -sample slices (Appendix B), we re-run evaluation with three different seeds for stratified sampling. The seed-to-seed standard deviation of F1 is on FEVER, on TruthfulQA, and on HaluEval — comfortably below the / / effect sizes reported in Table 4. The HaluEval regression () does not survive a paired-bootstrap test (); we therefore interpret it as a trend rather than a significant degradation, but its direction is corroborated by the systematic negative-prediction bias documented in §4.
Self-evolution result stability.
The per-round F1 values in Appendix L are computed on the same - or -sample stratified slices and inherit the same seed variance. The TQA HE swap ( / at Round 4) is far larger than the seed standard deviation, so the asymmetry holds under all three seeds we tested. Round-to-round movement on CF/FEVER (within pp) sits closer to the seed noise band and should be read as “flat” rather than “small win.”
Appendix P Ethics, Bias, and Broader Impact
A verifier is, by construction, a power that decides which model outputs the world sees as “correct.” That power has to be exercised with explicit limits: a clear intended use, a documented bias profile, a stated misuse vector, and a concrete handover protocol to human reviewers. This appendix states each in turn.
Intended use.
Seva is designed as a tool that flags unsupported claims and surfaces structured diagnoses to a downstream consumer — another agent that can self-correct, or a human reviewer who can audit. It is not a stand-alone arbiter of factual truth; it judges whether a claim is supported by a specific source document, which is a narrower question than “is this claim true.” Deployments that conflate these two questions (e.g. using Seva to label public statements as “true” or “false” without reference to a specific source) misuse the system and risk false-confidence harms.
Bias and asymmetric error costs.
The negative-prediction bias documented in §4 (false positives dominate; agent over-predicts Not Attributable) has direct fairness implications. If Seva is deployed downstream of a generative agent whose outputs are routed to different audiences, the bias can disproportionately suppress responses to user populations whose factual claims paraphrase a source rather than copy it verbatim — a population that includes non-native English speakers and users from domains where the canonical source documents are stylistically distant from common phrasing. Mitigations include: (i) reporting both label and structured diagnosis so that downstream systems can re-examine borderline negatives; (ii) calibrating the asymmetric component with label-conditional reward normalization; (iii) auditing deployment logs for false-positive rate disparities across writing styles before promoting the verifier to a blocking position in the agent pipeline.
Dual-use considerations.
A high-quality structured verifier with named error types and fix suggestions could be repurposed for adversarial use: generating fact-perturbed claims that defeat existing verifiers (the same Probe-stage machinery in §2.5). We mitigate this by keeping the adversarial probe-generation prompt focused on six well-defined error types — which an attacker would already have to enumerate manually — and releasing only the trained checkpoints and probe data, not the adversarial-generation LLM weights. The release decision was made jointly with the host institution’s research-compliance reviewer.
Privacy.
All training data is derived from public NLP benchmarks (ANLI, FEVER, TruthfulQA, HaluEval, ClearFacts) and does not contain user-identifiable information. Adversarial probes generated via GPT-4o-mini are operated on these public claims; no PII is transmitted to the API.
Environmental cost.
Pipeline carbon footprint is reported in Appendix N: kg CO2e for the full pipeline, kg CO2e for the 3B-only path. Both are small relative to base-model pre-training but non-zero; the per-iteration cost of self-evolution is the primary driver, and any future work that adds rounds should weigh the carbon cost against the specialization-vs-generalization trade-off that §2.5 surfaces.
Limits we cannot mitigate.
Seva inherits the parametric biases of its Qwen2.5 base, the annotation biases of GPT-4o-mini (the teacher used to generate structured training samples), and the topic distribution of its training benchmarks. We do not claim universal verification competence, and users should evaluate Seva on representative samples from their target domain before deployment.
Appendix Q ICML Reproducibility Checklist
We present a structured reproducibility checklist following the ICML 2026 template.
- •
-
•
Datasets: All four evaluation benchmarks are publicly released (ClearFacts (Seo et al., 2025), FEVER (Thorne et al., 2018), TruthfulQA (Lin et al., 2022), HaluEval (Li et al., 2023)). Statistics in Table 12; stratified-sampling protocol in Appendix B. Training data is built from ANLI (Nie et al., 2020); structured annotations are released alongside code.
-
•
Model: Base models (Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct) are publicly released under the Apache 2.0 license; fine-tuned checkpoints will be released on publication.
- •
- •
-
•
Evaluation: Macro-F1 and accuracy computed against ground-truth labels with greedy decoding (temperature 0); confidence intervals via paired BCa bootstrap with (Appendix O).
- •
-
•
Random seeds: Seed-to-seed standard deviation reported in Appendix O ( F1 on auxiliary benchmarks).
-
•
Licenses: All released artifacts are licensed Apache 2.0 (code) and CC-BY-4.0 (data), consistent with the upstream benchmark licenses.
- •
Reproducibility Statement
To ensure reproducibility:
-
•
Code: Full training and evaluation code — including reward functions, GRPO configuration, and data-processing pipelines — is publicly available at https://github.com/Justin0504/Verifiable_agent.
-
•
Data: The structured Seva training data ( samples), GRPO prompts ( samples), adversarial probes from the self-evolution loop (// for Rounds 2/3/4), and evaluation splits are released alongside the code.
-
•
Models: We use publicly available base models (Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct).
-
•
Hyperparameters: All hyperparameters are listed in Appendix A.
-
•
Compute: Experiments require 28 GPU-hours total (Table 7), accessible to academic labs.
-
•
Evaluation: We report macro F1 and accuracy on standard benchmarks with fixed random seeds. Evaluation uses greedy decoding (temperature 0) for deterministic results.