License: CC BY 4.0
arXiv:2606.27632v1 [cs.CL] 26 Jun 2026

Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety

Yuvion Team, Alibaba Security AGI Lab
Abstract

As large language models are increasingly deployed in real-world systems, safety failures can still lead to harmful outputs and dangerous misuse. We argue that the essence of safety is adversarial: many failures arise not from natural inputs alone, but from strategic attempts to evade model policies and safeguards. However, existing general-purpose model development largely overlooks this adversarial nature and often remains insufficient for realistic safety scenarios involving planning, tool use, and multi-step reasoning, causing measured safety performance to overestimate real deployment robustness. To address this gap, we present Yuvion LLM, a large language model built for adversarially robust content safety and broader AI safety. Yuvion LLM treats adversarial robustness and agentic capability as first-class objectives. Its pipeline combines adversarially aware data construction, knowledge-enhanced continued pretraining, and policy-grounded multi-task safety post-training, including risk-aware supervised fine-tuning and reinforcement learning-based policy optimization, together with safety-aware agentic reinforcement learning for tool use and multi-step reasoning in complex safety scenarios. We further introduce the Yuvion LLM RiskEval (YLRE), a collection of 93 benchmarks across four evaluation categories, covering diverse open and internal evaluations with a focus on safety, adversarial robustness, and real-world capability requirements. Across these evaluations, Yuvion LLM demonstrates clear advantages on safety-focused benchmarks and particularly strong robustness under adversarial conditions, while maintaining solid overall capability. Notably, Yuvion-8B outperforms most state-of-the-art baselines, including substantially larger models such as GPT-5.4 and Qwen3-MAX, on several safety tasks.

Partial open-source release is planned. For trial access, contact honghaiwen.hhw@alibaba-inc.com.
Figure 1: Evaluation overview across benchmark settings: Open-source Safety (Macro F1 on public content safety benchmarks), Adversarial Safety Static and Dynamic (F1 and 100-Combined Score on self-constructed red-team adversarial benchmarks), and Industrial Deployment (in-house capability and business composite). Yuvion-32B achieves the best results across all panels, with scores of 78.2, 94.2, 79.4, and 86.1, outperforming all baselines including GPT-5.4 and Qwen3-MAX. Yuvion-8B also surpasses most baselines and remains competitive against substantially larger models.

1 Introduction

Large language models (LLMs) are increasingly being deployed in real-world systems for content moderation, user interaction, decision support, tool use, and multi-step task execution (Brown et al., 2020; Achiam et al., 2023; Schick et al., 2023; Yao et al., 2023; Wang et al., 2024a; Qwen Team, 2025). As these systems become integrated into social and economic activity, safety failures can lead to harmful content, prohibited transactions, and dangerous information access through jailbreaks or policy circumvention. Recent LLM-based safety systems (Inan et al., 2023; Meta, 2025; Qwen Team, 2025) have helped mitigate many of these risks. However, we identify a fundamental limitation shared by these systems: fragility under human adversarial behavior.

Figure 2 illustrates this limitation concretely. The underlying unsafe intent remains unchanged, yet a general LLM that correctly rejects the original input fails once the request is reframed through euphemism, symbolic substitution, or cross-lingual mixing. This points to the core challenge: unsafe behaviors in deployment arise not from natural inputs alone, but from strategic, adaptive attempts to evade model safeguards. In content safety, harmful intent can be hidden through lexical obfuscation, coded language, contextual disguise, or off-platform traffic obfuscation; in broader AI safety, models can be manipulated through jailbreaks, prompt injection, role-playing, and multi-turn attacks (Perez et al., 2022; Wei et al., 2023; Mazeika and others, 2024). Existing general-purpose models are developed and evaluated largely without accounting for such strategic human behavior, causing measured safety performance to systematically overestimate real deployment robustness.

This adversarial challenge intensifies in agentic settings. Production safety scenarios often involve cascading policy rules, cross-modal evidence gathering, and specialized tool invocation. For example, determining whether a product listing infringes a registered trademark may require following a multi-step decision procedure, invoking image detection tools, and synthesizing cross-modal evidence before rendering a judgment. These operational demands require genuine agentic capabilities: planning multi-step reasoning chains, invoking external tools, retrieving and grounding decisions in evolving policy documents, and adapting execution paths based on intermediate results. However, existing safety-oriented models, including dedicated guard models (Inan et al., 2023; Meta, 2025; Qwen Team, 2025), remain confined to single-turn safety judgment and lack support for tool use, policy retrieval, or multi-step execution. As content safety systems scale from text classification toward multi-modal, multi-tool audit pipelines, agentic capability becomes a structural requirement for practical deployment.

These two gaps, fragility under adversarial evasion and the absence of agentic capability, motivate the central premise of our work: adversarial robustness and agentic safety capability must be built into model development and evaluation by design, rather than added only as post-hoc safeguards. Robust safety requires models that can recognize obfuscated risks, remain aligned under adversarial pressure, and operate reliably in realistic safety scenarios involving planning, policy grounding, tool use, and adaptive multi-step reasoning.

Figure 2: Illustrative examples of adversarial safety evasion. Left: a violent or terror-related request is progressively obfuscated through euphemistic expression and cross-lingual mixing (Pinyin substitution). Right: an extremism-related request escalates from direct attack planning queries to euphemistic framing and authority-based context reframing.

To this end, we present Yuvion LLM, a large language model built for adversarially robust content safety and broader AI safety. Yuvion LLM is designed around two first-class objectives: robust safety under adversarial conditions and agentic safety capability for realistic deployment scenarios. To realize these objectives, Yuvion adopts a progressive safety training paradigm consisting of three stages: knowledge-enhanced continued pretraining, which injects safety-domain knowledge; policy-grounded multi-task safety post-training, which elicits risk understanding, fine-grained risk identification, and policy-consistent decision making under adversarial variation; and safety-aware agentic reinforcement learning, which extends the model to retrieval, tool use, and multi-step reasoning in safety scenarios. Together, these stages strengthen the model’s safety-relevant knowledge, adversarial robustness, policy-grounded decision capability, and trajectory-level reliability in realistic safety scenarios. In this way, Yuvion LLM is intended not merely as a guard classifier, but as a more general safety-oriented model for practical deployment.

We further introduce the Yuvion LLM RiskEval (YLRE), a four-level progressive evaluation framework covering 93 benchmarks across open-source general benchmarks, open-source safety benchmarks, a self-constructed adversarial safety benchmark, and in-house capability and business benchmarks. This framework is designed to jointly evaluate general capability retention, public safety performance, adversarial robustness, and real-world operational value. More specifically, the first two levels measure whether safety specialization preserves general competence and remains competitive on established public benchmarks; the third level stresses adversarial robustness under controlled yet realistic transformation patterns; and the fourth level evaluates performance in practical settings involving agentic capabilities and business-facing requirements. This progressive design enables us to examine not only whether a model understands risk and safety in principle, but also whether it remains reliable and precise in real-world, complex, multi-turn, and adversarially intensive production environments at scale. The main contributions of this work are as follows:

  • We argue that safety should be treated as an inherently adversarial problem, and that both adversarial robustness and agentic capability must be built into model development by design. Guided by this principle, we present Yuvion LLM, a large language model built for adversarially robust content safety and broader AI safety, with a development pipeline integrating adversarially aware data construction, knowledge-enhanced continued pretraining, post-training for safety tasks, and agentic reinforcement learning that equips the model with tool invocation, multi-step reasoning, and task execution capabilities for complex safety scenarios.

  • We introduce the Yuvion LLM RiskEval (YLRE), a four-level evaluation framework covering 93 benchmarks across open-source general benchmarks, open-source safety benchmarks, a self-constructed adversarial safety benchmark, and in-house capability and business benchmarks, enabling systematic assessment from public benchmarks to real-world deployment scenarios.

  • Comprehensive experiments show that Yuvion LLM outperforms both open-source and proprietary baselines. Yuvion-32B achieves 78.2% Macro F1 on content safety (vs. GPT-5.4 72.2%) and 86.1% on industrial deployment (vs. GPT-5.4 80.6%). The gains are also scale-efficient: despite its smaller size, Yuvion-8B surpasses most state-of-the-art baselines, including substantially larger models such as GPT-5.4 and Qwen3-MAX, on several safety tasks, indicating that targeted adversarial and agentic safety training can matter more than model scale alone.

2 Content-Safety-Oriented Data System

2.1 Overview

A core premise of Yuvion is that effective safety modeling requires not only a dedicated training pipeline but also a data system aligned with the adversarial and operational nature of real-world safety tasks. Unlike general-purpose language modeling, safety-oriented data must support policy-grounded judgment, adversarial robustness, and agentic safety behaviors in addition to ordinary risk understanding and identification. This matters because Yuvion is designed both for explicit content classification and for realistic settings in which unsafe intent may be obfuscated, reformulated, or embedded in multi-step interactions.

To support these requirements, Yuvion is trained on a multi-source data system spanning general, safety-specific, adversarial, agentic, and synthetic or expert-constructed data. Rather than treating all samples as a single homogeneous corpus, we organize data by functional role so that different data types support different aspects of capability formation across the training pipeline.

2.2 Data Categories

Table 1 summarizes the main data categories used in Yuvion. At a high level, the data system is designed to balance five goals: preserving general language ability, strengthening safety-domain knowledge, improving robustness to adversarial inputs, supporting agentic capability, and expanding coverage of long-tail and complex scenarios.

| Data Type | Main Role | Representative Content |
| --- | --- | --- |
| General data | Preserve broad language competence and stabilize safety adaptation | General instruction following, question answering, reasoning, reading comprehension, and other common language tasks |
| Safety-domain data | Build core safety knowledge and task capability | Risk understanding and identification, hierarchical safety categories, policy-grounded judgment, and evidence-based responses |
| Adversarial data | Improve robustness to evasive and obfuscated unsafe inputs | Lexical variation, symbol or homophone substitution, semantic camouflage, contextual disguise, and other policy-evasive expressions |
| Agentic data | Support multi-step safety workflows involving reasoning, retrieval, and tool interaction | Tool-use trajectories, search-based reasoning, multi-step decomposition, and retrieval-augmented decision making |
| Synthetic and expert-constructed data | Expand long-tail coverage and provide high-quality supervision for complex tasks | Rare or difficult scenarios, policy-intensive cases, structured reasoning samples, preference data, and reward-oriented optimization data |
Table 1: Overview of the main data categories in the Yuvion safety-oriented data system.
General data.

General-domain data is used to preserve broad language competence and reduce overspecialization during safety adaptation. Although such data is not safety-specific, it plays an important regularizing role by helping the model retain general instruction-following, comprehension, and reasoning ability.

Safety-domain data.

Safety-domain data forms the core of Yuvion’s safety capability learning. It covers a wide range of safety-relevant tasks, including risk understanding, risk identification, policy-sensitive categorization, evidence attribution, and structured decision generation. Compared with conventional safety classification data, this portion places greater emphasis on richer supervision and policy grounding, enabling the model to produce not only labels but also more interpretable and policy-consistent responses when needed.

Adversarial data.

Because safety is inherently adversarial, the Yuvion data system includes a dedicated adversarial subset that explicitly models realistic evasion patterns. These data cover both surface-form perturbations and deeper semantic concealment strategies, helping the model avoid over-reliance on shallow lexical cues and improving robustness to obfuscated unsafe intent.
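To make the notion of a surface-form perturbation concrete, the following is a minimal sketch of label-preserving symbol substitution applied to a benign placeholder string. The substitution map, the `perturb` and `augment` helpers, and the sample text are all hypothetical illustrations; the paper does not describe its actual construction procedure, and the deeper semantic concealment strategies it mentions are not covered here.

```python
import random

# Hypothetical character map; real evasion patterns are far more varied.
SUBSTITUTIONS = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"}


def perturb(text, rate=0.5, seed=0):
    """Replace mapped characters with look-alike symbols with probability `rate`."""
    rng = random.Random(seed)
    return "".join(
        SUBSTITUTIONS[ch] if ch in SUBSTITUTIONS and rng.random() < rate else ch
        for ch in text
    )


def augment(sample, n_variants=3, rate=0.5):
    """Create perturbed copies of a (text, label) pair; the label is kept unchanged."""
    text, label = sample
    return [(perturb(text, rate=rate, seed=s), label) for s in range(n_variants)]


# Benign placeholder sample with an assumed label name.
variants = augment(("sale item", "benign"), n_variants=3)
fully_perturbed = perturb("sale item", rate=1.0)
```

The point of the sketch is that the label travels with every variant, so a model trained on such pairs cannot rely on the exact surface string alone.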

Agentic data.

Agentic data is introduced to support safety scenarios that require more than single-turn classification. This category covers structured trajectories involving multi-step reasoning, retrieval, tool invocation, and action-conditioned responses. It provides training signals for behaviors such as decomposing complex tasks, selecting appropriate tools, interacting with external systems, and synthesizing intermediate observations into grounded decisions. In Yuvion, such data is particularly important for later-stage optimization in realistic safety scenarios.

Synthetic and expert-constructed data.

Synthetic and expert-constructed data are introduced to improve coverage of rare, high-risk, long-tail, or structurally complex scenarios that are insufficiently represented in naturally collected data. They are particularly useful for difficult policy cases, structured safety tasks, and later-stage optimization settings requiring higher-quality supervision, preference signals, or reward-oriented annotations.

2.3 Summary

The Yuvion data system is a multi-source training foundation designed for adversarially robust and deployment-oriented safety modeling. By combining general data, safety-domain data, adversarial data, agentic data, and synthetic or expert-constructed supervision, it covers both ordinary and adversarial safety scenarios while supporting the broader capability requirements of realistic deployment.

3 Yuvion LLM: Progressive Safety Training Paradigm

3.1 Overview

Yuvion LLM is built on the premise that adversarial robustness and agentic safety should be developed as first-class model capabilities rather than appended through post-hoc patching. In realistic deployment, a safety model must go beyond recognizing overtly harmful content: it must internalize safety-domain knowledge, infer latent unsafe intent under obfuscation, remain policy-consistent under adversarial pressure, and operate reliably in structured workflows involving retrieval, tool use, and multi-step decision making. To this end, Yuvion adopts a progressive training paradigm that transforms a general-purpose instruct model into a deployable safety model through staged capability shaping.

The full pipeline consists of three stages: knowledge-enhanced continued pretraining, policy-grounded multi-task safety post-training, and safety-aware agentic reinforcement learning. Stage 1 injects and internalizes safety-domain knowledge through knowledge-enhanced continued pretraining. Stage 2 converts this knowledge into task-level capability for policy-grounded risk understanding, risk identification, and adversarially robust safety decision making. Stage 3 further extends the model from single-turn safety judgments to trajectory-level reasoning and action in structured safety workflows.

Throughout all stages, Yuvion is formulated as a unified autoregressive conditional generator. Given a context $c$, the model generates an output object $z$ according to

$$z \sim p_{\theta}(z \mid c), \tag{1}$$

where $\theta$ denotes the model parameters. The instantiation of $c$ and $z$ varies across stages. In continued pretraining, $c$ is a raw token prefix and $z$ is its continuation. In safety post-training, $c=(x,\mathcal{I})$, where $x$ denotes the input content or task instance and $\mathcal{I}$ denotes the task instruction, while $z$ corresponds to a task output such as a risk label, policy-grounded explanation, or structured decision. In agentic reinforcement learning, $c$ additionally includes interaction history and intermediate observations, and $z$ may represent a multi-step reasoning or action trajectory. Under the autoregressive factorization,

$$p_{\theta}(z \mid c) = \prod_{\ell=1}^{L} p_{\theta}(z_{\ell} \mid z_{<\ell}, c), \tag{2}$$

where $L$ denotes the output length.
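As a small numerical illustration of the factorization in Eq. (2), the sketch below accumulates per-token log-probabilities into a sequence log-probability. The probability values are made up for illustration, and `sequence_log_prob` is a hypothetical helper rather than part of any Yuvion code.

```python
import math


def sequence_log_prob(step_probs):
    """Return log p(z | c) as the sum of per-token log-probabilities (Eq. 2 in log space)."""
    return sum(math.log(p) for p in step_probs)


# Hypothetical conditionals p(z_l | z_<l, c) for an output of length L = 3.
step_probs = [0.5, 0.8, 0.25]

log_p = sequence_log_prob(step_probs)
joint = math.exp(log_p)  # matches the product 0.5 * 0.8 * 0.25 up to float error
```

Working in log space is the usual choice in practice because the product of many small probabilities underflows quickly as $L$ grows.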

As illustrated in Figure 3, this paradigm enables Yuvion to progressively acquire safety-domain knowledge, policy-grounded decision capability, robustness to adversarial manipulation, and agentic competence for realistic safety deployment.

Figure 3: Overview of the Yuvion training pipeline. The model is progressively trained through knowledge-enhanced continued pretraining, policy-grounded multi-task safety post-training, and safety-aware agentic reinforcement learning for structured tool-use and planning workflows.

3.2 Target Capability Design

The Yuvion training paradigm is organized around four tightly coupled capabilities that are central to realistic safety deployment. Rather than treating them as isolated objectives, Yuvion develops them progressively so that each stage provides a stronger foundation for the next.

  • Risk understanding: capturing safety-relevant semantics, latent intent, contextual cues, and the policy meaning of user inputs;

  • Policy-grounded risk identification: producing fine-grained safety judgments, category attribution, and evidence-aware decisions aligned with moderation policy;

  • Adversarial robustness: maintaining stable and policy-consistent behavior under lexical obfuscation, semantic disguise, paraphrastic attacks, and other evasive transformations;

  • Agentic safety capability: supporting structured outputs, multi-step reasoning, retrieval, tool use, and evidence-seeking interaction in complex safety workflows.

These capabilities are developed in a staged manner. Stage 1 focuses on knowledge loading and representation adaptation; Stage 2 elicits policy-grounded risk understanding, instruction-conditioned task execution, and robustness under adversarial variation; Stage 3 extends safety capability from single-turn outputs to trajectory-level reasoning and action.

3.3 Stage 1: Knowledge-Enhanced Continued Pretraining

The first stage performs continued pretraining on the instruct model using a knowledge-enhanced corpus, denoted as $\mathcal{D}_{\mathrm{cp}}$. Rather than relying on raw safety-domain text alone, this corpus is explicitly constructed to facilitate the internalization of safety-domain knowledge. In particular, we leverage large-scale domain knowledge bases and transform them into training instances at multiple granularities, including structured triple-level samples, sentence-level descriptions, and other knowledge-derived textual forms. These knowledge-oriented data are combined with broader safety-domain corpora and a smaller proportion of general-domain data, enabling the model to absorb structured risk knowledge while preserving broad language competence.

The training objective is the standard autoregressive next-token prediction loss:

$$\mathcal{L}_{\mathrm{cp}} = -\sum_{x\in\mathcal{D}_{\mathrm{cp}}}\sum_{t=1}^{|x|}\log p_{\theta}(x_{t}\mid x_{<t}). \tag{3}$$

Within Yuvion, this stage serves as a safety knowledge infusion phase before task-level post-training. By exposing the model to moderation policies, risk taxonomies, violation patterns, knowledge-base-derived facts, and long-tail adversarial expressions, the model internalizes safety-relevant concepts and their relations at the distribution level. The use of multi-granularity knowledge-derived samples is particularly important: structured instances help anchor explicit semantic relations, while sentence-level realizations improve the model’s ability to recognize how such knowledge is expressed in natural language. As a result, this stage provides a stronger initialization for downstream risk understanding, policy-grounded identification, and adversarially robust safety reasoning. The output checkpoint of this stage is denoted as Yuvion-CP.
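One way to picture the multi-granularity construction is to render the same knowledge-base triple both as a structured sample and as a natural-language sentence. The sketch below does this under stated assumptions: the `Triple` type, the relation names, the templates, and the example fact are all hypothetical, and the paper does not specify its actual conversion procedure.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


# Hypothetical sentence templates keyed by relation name.
TEMPLATES = {
    "is_a": "{head} is a type of {tail}.",
    "regulated_by": "{head} is regulated by {tail}.",
}


def triple_level_sample(t: Triple) -> str:
    """Structured rendering that keeps the explicit (head, relation, tail) form."""
    return f"({t.head}, {t.relation}, {t.tail})"


def sentence_level_sample(t: Triple) -> str:
    """Natural-language rendering; falls back to a generic pattern for unknown relations."""
    template = TEMPLATES.get(t.relation, "{head} {relation} {tail}.")
    return template.format(
        head=t.head, relation=t.relation.replace("_", " "), tail=t.tail
    )


fact = Triple("counterfeit listing", "is_a", "intellectual-property violation")
samples = [triple_level_sample(fact), sentence_level_sample(fact)]
```

Both strings would then be treated as ordinary text under the next-token objective in Eq. (3).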

3.4 Stage 2: Policy-Grounded Multi-Task Safety Post-Training

The second stage converts safety-domain knowledge into task-level capability for realistic safety deployment. Its goal is not merely to improve performance on isolated safety tasks, but to elicit policy-grounded risk understanding, fine-grained risk identification, and robust decision making under adversarially manipulated inputs. To this end, Yuvion performs multi-task safety post-training through two complementary components: risk-aware supervised instruction tuning for broad capability initialization, and reinforcement learning-based policy optimization for behavior refinement under ambiguity and attack.

3.4.1 Risk-Aware Supervised Fine-Tuning

Supervised fine-tuning initializes the model on safety-oriented tasks using a multi-task supervised instruction dataset, denoted as $\mathcal{D}_{\mathrm{sft}}$:

$$\mathcal{D}_{\mathrm{sft}} = \{(x^{(i)}, \mathcal{I}^{(i)}, y^{(i)})\}_{i=1}^{N_{\mathrm{sft}}},$$

where $x^{(i)}$ is the input content or task instance, $\mathcal{I}^{(i)}$ is the task instruction, and $y^{(i)}$ is the target output. Under the unified framework, this stage instantiates the context as $c=(x^{(i)},\mathcal{I}^{(i)})$ and optimizes the conditional likelihood of the target response:

$$\mathcal{L}_{\mathrm{sft}} = -\sum_{i=1}^{N_{\mathrm{sft}}}\sum_{\ell=1}^{|y^{(i)}|}\log p_{\theta}\!\left(y_{\ell}^{(i)}\mid y_{<\ell}^{(i)}, x^{(i)}, \mathcal{I}^{(i)}\right). \tag{4}$$
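Eq. (4) sums the loss only over target tokens $y_{\ell}$, while the input $x$ and instruction $\mathcal{I}$ serve purely as conditioning. A minimal sketch of that masking for a single example follows; the probabilities and the mask are invented for illustration, and `sft_loss` is a hypothetical helper.

```python
import math


def sft_loss(token_probs, is_target):
    """Negative log-likelihood summed over target (response) positions only.

    token_probs[i] is the probability the model assigns to the observed token at
    position i of the concatenated (instruction, input, response) sequence.
    is_target[i] is True only for response positions.
    """
    return -sum(math.log(p) for p, m in zip(token_probs, is_target) if m)


# Hypothetical example: two context tokens followed by two response tokens.
token_probs = [0.9, 0.7, 0.5, 0.25]
is_target = [False, False, True, True]

loss = sft_loss(token_probs, is_target)  # -(log 0.5 + log 0.25) = log 8
```

Summing this quantity over all examples in $\mathcal{D}_{\mathrm{sft}}$ gives the dataset-level objective.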

A central design choice is to formulate heterogeneous safety objectives under a unified instruction-following interface. This enables the model to learn not only task-specific prediction, but also instruction-conditioned behavior across diverse safety scenarios, such that it can reliably switch between risk judgment, fine-grained categorization, policy-grounded explanation, safety question answering, and structured decision generation within a single generative framework. More importantly, this formulation encourages outputs that are not only task-appropriate, but also consistently grounded in moderation policy, thereby strengthening policy-conditioned response generation across heterogeneous safety tasks.

Crucially, the training set combines naturally distributed supervision with adversarially constructed examples. These examples include obfuscated, paraphrased, or semantically disguised unsafe inputs that preserve harmful intent while altering surface realization. As a result, the model is encouraged to base its judgments on latent intent, contextual evidence, and policy semantics rather than superficial lexical patterns alone, improving robustness to realistic evasion strategies and stabilizing policy-consistent behavior under adversarial variation. In addition, structured response formats are introduced at this stage to establish a consistent task interface for downstream reinforcement learning and deployment-oriented use cases.

The output checkpoint of this stage is denoted as Yuvion-SFT.

3.4.2 Reinforcement Learning-Based Policy Optimization

While supervised instruction tuning establishes broad multi-task capability, many important safety behaviors are difficult to optimize with single-reference targets alone. This is particularly true for ambiguous cases, adversarial attacks, policy edge cases, and tasks where multiple outputs may be acceptable but differ substantially in policy faithfulness, reasoning quality, or robustness. Yuvion therefore further refines model behavior using reinforcement learning-based policy optimization.

We instantiate this stage with GRPO (Group Relative Policy Optimization). Given a safety context $c=(x,\mathcal{I})$, the current policy samples a group of candidate outputs $\{y^{(g)}\}_{g=1}^{G}$. Each candidate is evaluated by a reward function tailored to safety-specific objectives, and the policy is updated using group-relative advantage estimates:

$$\mathcal{L}_{\mathrm{rl}} = -\mathbb{E}_{(x,\mathcal{I})\sim\mathcal{D}_{\mathrm{rl}},\,\{y^{(g)}\}\sim p_{\theta_{\mathrm{old}}}}\left[\sum_{g=1}^{G}\hat{A}^{(g)}\cdot\frac{p_{\theta}(y^{(g)}\mid x,\mathcal{I})}{p_{\theta_{\mathrm{old}}}(y^{(g)}\mid x,\mathcal{I})}\right], \tag{5}$$

where the normalized advantage of the $g$-th candidate is

$$\hat{A}^{(g)} = \frac{r^{(g)}-\mathrm{mean}(\{r^{(g')}\}_{g'=1}^{G})}{\mathrm{std}(\{r^{(g')}\}_{g'=1}^{G})}, \tag{6}$$

and $r^{(g)}$ denotes the scalar reward assigned to candidate $y^{(g)}$.
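The group-relative normalization in Eq. (6) and the single-group term inside Eq. (5) can be sketched in a few lines. Two details below are assumptions not stated in the text: the small `eps` added to the denominator to avoid division by zero when all rewards in a group are equal, and the use of the population standard deviation. The reward values are illustrative.

```python
import math
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (Eq. 6); eps is an assumed safeguard."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; the choice is an assumption
    return [(r - mean) / (std + eps) for r in rewards]


def surrogate_loss(advantages, logp_new, logp_old):
    """Single-group term of Eq. (5): negative advantage-weighted likelihood ratios."""
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    return -sum(a * r for a, r in zip(advantages, ratios))


# Hypothetical rewards for a group of G = 4 candidates.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# With identical old and new log-probabilities every ratio equals 1.
loss = surrogate_loss(advantages, [-2.0] * 4, [-2.0] * 4)
```

Because advantages are centered within the group, candidates are rewarded only for being better than their siblings on the same input, which removes the need for a separate learned value baseline.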

The reward is designed to capture multiple dimensions of output quality, including final decision correctness, consistency with policy basis, attribution or reasoning quality, and reliability under adversarial or structurally complex inputs. Compared with supervised instruction tuning, this stage is better suited to optimizing behaviors for which a single reference is insufficient, especially when the model must remain stable under ambiguity, obfuscation, or distribution shift. It therefore plays a central role in sharpening policy-grounded decision boundaries and improving robustness in realistic safety scenarios.

3.5 Stage 3: Safety-Aware Agentic Reinforcement Learning

Beyond single-turn safety classification and reasoning, deployable safety models must support structured multi-step workflows involving planning, retrieval, and tool interaction. In Stage 2, GRPO optimizes the quality of individual safety outputs given a fixed input context. However, many real-world safety tasks are inherently interactive: resolving a content moderation case may require querying an external policy knowledge base, invoking a specialized classifier, or retrieving contextual evidence before a final decision can be made. The reward signal in such settings is delayed and sparse, arriving only after a full interaction trajectory completes. This motivates a dedicated agentic RL stage that optimizes trajectory-level decision quality rather than single-turn output quality.

Tool-integrated reasoning.

For tool-use tasks, the model interacts with a tool set $\mathcal{T}=\{t_{1},t_{2},\ldots,t_{n}\}$ under a task context $c$, generating a reasoning trajectory

$$\tau=\big((r_{1},T_{1},o_{1}),\ldots,(r_{k},T_{k},o_{k})\big),$$

where $r_{i}$ denotes step-wise reasoning, $T_{i}\subseteq\mathcal{T}$ denotes the invoked tools, and $o_{i}$ denotes returned observations. At each step, the model must jointly reason, choose tools, and produce valid invocations.

Following ToolRL (Qian et al., 2025), we adopt decomposed rewards to train this behavior. The format reward $R_{\mathrm{format}}\in\{0,1\}$ checks whether the output contains the required structural fields, and the correctness reward $R_{\mathrm{correct}}\in[-3,3]$ evaluates predicted tool calls against ground-truth calls along tool-name, parameter-name, and parameter-value matching dimensions. The final reward is

$$R_{\mathrm{final}}=R_{\mathrm{format}}+R_{\mathrm{correct}}\in[-3,4]. \tag{7}$$

This design provides denser process-level supervision than coarse binary rewards and improves the stability of tool-use optimization. Compared with Stage 2, the optimization target here shifts from single-turn task outputs to full interaction trajectories, while retaining the same GRPO-based reinforcement learning framework. Beyond tool-use proficiency itself, this training also strengthens the model’s general instruction-following capability: the format reward enforces strict adherence to prescribed output structures, and the decomposed correctness reward requires the model to precisely follow tool specifications including exact function signatures and parameter constraints. This structured compliance training transfers broadly, improving the model’s ability to follow complex, format-sensitive instructions in non-tool-use settings as well.
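The decomposed reward in Eq. (7) can be illustrated with a simplified scorer. This is a sketch under stated assumptions, not the exact ToolRL or Yuvion formulation: calls are represented as a mapping from tool name to a parameter dictionary (so at most one call per tool), the required field names `think` and `tool_calls` are hypothetical, and the three matching scores are combined by a linear rescaling to $[-3,3]$.

```python
def format_reward(output):
    """1 if all required structural fields are present, else 0 (field names assumed)."""
    required = ("think", "tool_calls")
    return int(all(key in output for key in required))


def correctness_reward(pred_calls, gold_calls):
    """Score tool-name, parameter-name, and parameter-value matches, rescaled to [-3, 3]."""
    pred_names, gold_names = set(pred_calls), set(gold_calls)
    union = pred_names | gold_names
    name_score = len(pred_names & gold_names) / len(union) if union else 1.0

    key_score, value_score = 0.0, 0.0
    for name, gold_params in gold_calls.items():
        if name not in pred_calls:
            continue
        pred_params = pred_calls[name]
        keys = set(pred_params) | set(gold_params)
        key_score += len(set(pred_params) & set(gold_params)) / len(keys) if keys else 1.0
        value_score += sum(
            1 for k, v in gold_params.items() if k in pred_params and pred_params[k] == v
        )

    # Best achievable raw score: 1 (names) + one per gold tool + one per gold parameter.
    max_score = 1 + len(gold_calls) + sum(len(p) for p in gold_calls.values())
    raw = name_score + key_score + value_score
    return 6.0 * raw / max_score - 3.0


def final_reward(output, gold_calls):
    """R_final = R_format + R_correct, as in Eq. (7)."""
    return format_reward(output) + correctness_reward(output.get("tool_calls", {}), gold_calls)


# Hypothetical ground truth and a partially correct prediction.
gold = {"image_detect": {"url": "u", "threshold": 0.5}}
partial = {"think": "check the logo", "tool_calls": {"image_detect": {"url": "u"}}}
partial_score = final_reward(partial, gold)
```

In this example the prediction names the right tool and one of two parameters, so it lands strictly between the extremes, which is the graded signal a binary reward would not provide.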

Search-augmented reasoning.

Content safety tasks frequently require evidence gathering beyond the model’s parametric knowledge. For example, verifying whether user-generated content constitutes misinformation requires cross-referencing external sources; determining whether a piece of text violates platform policy may depend on retrieving the latest regulatory guidelines; and investigating coordinated abuse patterns demands tracing and synthesizing information across multiple documents. These tasks cannot be resolved by single-turn classification—they inherently require multi-turn search, evidence evaluation, and grounded synthesis.

We train Yuvion as a search-augmented reasoning agent for such scenarios. Given a complex question and the accumulated interaction history, the model iteratively decides whether to issue a new search query $q_{t}$, visit a retrieved page $v_{t}$, or produce a final answer $a$. The objective is to learn effective search strategies, evidence evaluation, and stopping decisions under extended interaction horizons.
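The search, visit, or answer decision loop described above can be sketched as a simple rollout function. Everything here is a hypothetical stand-in: `policy` represents the model, `search` and `visit` represent external tools, and the scripted policy and stub tools exist only so the loop can be run end to end.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str        # "search", "visit", or "answer"
    argument: str      # query text, URL, or final answer
    observation: str = ""


def run_episode(policy, search, visit, question, max_steps=8):
    """Roll out one trajectory; returns (answer or None, list of steps)."""
    history = []
    for _ in range(max_steps):
        action, argument = policy(question, history)
        if action == "answer":
            history.append(Step(action, argument))
            return argument, history
        if action == "search":
            observation = search(argument)
        elif action == "visit":
            observation = visit(argument)
        else:
            raise ValueError(f"unknown action: {action}")
        history.append(Step(action, argument, observation))
    return None, history  # step budget exhausted without a final answer


def scripted_policy(question, history):
    """Stub policy: search once, visit the returned URL, then answer."""
    if not history:
        return "search", question
    if history[-1].action == "search":
        return "visit", history[-1].observation
    return "answer", f"Based on: {history[-1].observation}"


def fake_search(query):
    return "https://example.org/policy"


def fake_visit(url):
    return "Policy text retrieved from " + url


answer, history = run_episode(scripted_policy, fake_search, fake_visit, "Is X allowed?")
```

The `max_steps` budget is an assumed mechanism for bounding the interaction horizon; the stopping decision itself is left to the policy.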

We design a two-component reward for this track. The execution reward $R_{\mathrm{exec}}$ evaluates whether each tool invocation in the trajectory executes successfully: search queries that return no results, visits to invalid or inaccessible URLs, and malformed tool calls all receive negative signals, encouraging the model to formulate precise queries and well-structured invocations. The result reward $R_{\mathrm{result}}$ uses an LLM-as-judge to evaluate the final answer against ground truth for factual correctness and completeness. The combined search reward is:

$$R_{\mathrm{search}}=R_{\mathrm{result}}+\beta\cdot R_{\mathrm{exec}}, \tag{8}$$

where $\beta$ controls the relative weight of execution quality. This decomposition provides denser learning signals than outcome-only rewards: the model receives feedback not only on what it concludes, but also on how well it interacts with the search environment along the way.
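A minimal sketch of Eq. (8) follows. The text does not specify how per-call outcomes are aggregated into $R_{\mathrm{exec}}$ or what value $\beta$ takes, so the averaging of $+1$ and $-1$ signals, the default `beta`, and the assumption that the judge returns a score in $[0,1]$ are all illustrative choices.

```python
def execution_reward(outcomes):
    """Average of +1 for each successful tool call and -1 for each failed one (assumed scheme)."""
    if not outcomes:
        return 0.0
    return sum(1.0 if ok else -1.0 for ok in outcomes) / len(outcomes)


def search_reward(result_reward, outcomes, beta=0.1):
    """R_search = R_result + beta * R_exec, as in Eq. (8); beta value is an assumption."""
    return result_reward + beta * execution_reward(outcomes)


# Hypothetical trajectory: judge score 1.0, four tool calls with one failure.
reward = search_reward(1.0, [True, True, False, True], beta=0.2)
```

Keeping $\beta$ small relative to the result reward would let execution quality shape the trajectory without outweighing answer correctness, though the appropriate balance is an empirical question.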

3.6 Summary

Yuvion transforms a general-purpose instruct model into a deployable safety model through progressive safety capability shaping. The full pipeline consists of knowledge-enhanced continued pretraining, policy-grounded multi-task safety post-training, and safety-aware agentic reinforcement learning. Across these three stages, the model progressively internalizes safety-domain knowledge, learns policy-grounded and adversarially robust decision capability, and develops trajectory-level agentic competence for realistic safety workflows.

4 Evaluation Framework

4.1 Overview

A deployable safety model cannot be adequately assessed by a single benchmark category. Beyond public safety performance, it must retain general capability, remain robust under adversarial and domain-specific conditions, and demonstrate practical value in real workflows. To support such assessment, we establish Yuvion LLM RiskEval (YLRE), a collection of multi-level benchmarks covering Open-source General Benchmarks, Open-source Content Safety Benchmarks, a Self-constructed Adversarial Content Safety Benchmark, and in-house Capability and Business Benchmarks.

The benchmarks follow a progressive logic: Level 1 measures general capability retention, Level 2 measures public safety comparability, Level 3 measures domain-specific and adversarial robustness, and Level 4 measures deployment-oriented capability and business value. Figure 4 provides an overview of the full evaluation hierarchy.

Figure 4: Architecture of the Yuvion LLM RiskEval. The four-level benchmark hierarchy spans open-source general capability (Level 1), open-source content safety (Level 2), static and dynamic adversarial robustness (Level 3), and industrial deployment including business and agent scenarios (Level 4).

4.2 Level 1: Open-source General Benchmarks

Level 1 evaluates whether the model retains broad language competence after safety specialization. This is a necessary baseline, since practical safety workflows require not only risk judgment but also general language understanding, knowledge, and reasoning.

The open-source general benchmark suite includes more than 30 public evaluation sets organized into two groups. The general-purpose benchmark group covers broad knowledge, reasoning, Chinese language capability, and scientific problem solving. General-purpose benchmarks include MMLU (Hendrycks et al., 2021a), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024b), GPQA (Rein et al., 2023), ARC-Challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), TriviaQA (Joshi et al., 2017), Xiezhi-EN, C-Eval (Huang et al., 2023), CMMLU (Li et al., 2023a), C3 (Sun et al., 2020), CHID (Zheng et al., 2019), CLUEWSC (Xu et al., 2020), GSM8K-ZH (Cobbe et al., 2021), BBH (Suzgun et al., 2022), etc. The agentic benchmark group evaluates capabilities relevant to Yuvion’s agentic safety workflows, including tool use, function calling, and multi-step interactive problem solving. Agentic benchmarks include API-Bank (Li et al., 2023b), BFCL (Patil et al., 2024), and Seal-0 (Tongyi DeepResearch Team et al., 2025). For these benchmarks, we report Accuracy as the primary metric where applicable.

4.3 Level 2: Open-source Content Safety Benchmarks

Level 2 evaluates Yuvion on publicly available safety tasks and enables comparison with prior models and reported results. It includes two benchmark groups: content safety benchmarks and guard benchmarks.

The content safety benchmarks focus on the recognition of harmful or policy-violating content. Content safety benchmarks include ChineseHarm (Liu et al., 2025), COLD (Deng et al., 2022), Moderation (Markov et al., 2022), HateXplain (Mathew et al., 2021), ToxiGen (Hartvigsen et al., 2022), Jigsaw (Jigsaw/Conversation AI, 2018), CivilComments (Borkan et al., 2019), and SafetyBench (Zhang et al., 2023), covering risks such as pornography, fraud, offensive language, hate speech, and implicit toxicity. In total, this group contains 8 evaluation sets. The guard benchmarks focus on safety judgment, refusal behavior, and risk-aware response capability, evaluated following the protocol of YuFeng-XGuard (Lin et al., 2026). Guard benchmarks include SEval (Yuan et al., 2024), AEGIS (Ghosh et al., 2024), and more than 20 sub-datasets across five dimensions: prompt classification, response classification, multilingual classification, attack defense, and safe completion. For classification-oriented content safety tasks, we report Macro F1-Score as the primary metric.

4.4 Level 3: Self-Constructed Adversarial Robustness Benchmark

Level 3 complements public benchmarks with a self-constructed adversarial robustness benchmark designed to measure robustness against realistic evasion attacks and distribution shifts that standard benchmarks fail to capture. Although public benchmarks support cross-model comparison, they are limited in their coverage of long-tail risk categories, evolving adversarial expressions, and fine-grained policy taxonomies. The self-constructed benchmark fills this gap by incorporating both naturally occurring human-written evasive content and systematically generated adversarial variants. Seed samples are collected from real-world business scenarios and pre-screened via an LLM-assisted filter to retain instances with clear adversarial transformation patterns, including lexical substitution, homophonic rewriting, character decomposition, symbol insertion, and coded expressions. These seeds are then fed into an automated red-teaming pipeline that generates transformed variants while preserving harmful intent. All retained samples are annotated by five professional content moderation experts under a dual-annotator protocol with third-party adjudication (see Appendix A.3 and D.1 for detailed construction methodology).
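
The dual-annotator protocol with third-party adjudication can be illustrated with a minimal sketch. How the five experts are assigned to the annotator and adjudicator roles is not described in this section, so `resolve_label` and its `adjudicate` callback are hypothetical names used only for illustration.

```python
from typing import Callable


def resolve_label(
    label_a: str,
    label_b: str,
    adjudicate: Callable[[str, str], str],
) -> str:
    """Return the label both annotators agree on, or defer to a third party."""
    if label_a == label_b:
        return label_a
    # Disagreement: a third annotator makes the final decision.
    return adjudicate(label_a, label_b)
```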

The benchmark covers five major risk categories: advertising and traffic diversion, gambling and fraud, abusive content, pornographic content, and spam and flooding. It is divided into static and dynamic evaluation sets, each spanning all five categories. The static evaluation sets focus on relatively stable and canonical risk expressions under standard distribution conditions, measuring baseline domain recognition performance. The dynamic evaluation sets are specifically designed to evaluate adversarial robustness: they include recent, transformed, and evolving expressions constructed through an automated red-teaming framework that generates paraphrasing, camouflage, euphemistic wording, and structurally transformed variants intended to bypass safety filters, thereby measuring robustness under realistic adversarial conditions. For the dynamic sets, we adopt a combined score metric defined as the product of bypass success rate and semantic fidelity score; a lower combined score indicates stronger robustness against adversarial attacks.

4.5 Level 4: In-house Capability and Business Benchmarks

Level 4 evaluates the model in realistic operational settings derived from large-scale industrial deployment and commercial content moderation practice. While the first three levels measure general capability retention, public safety comparability, and adversarial robustness, they do not directly capture whether the model is useful in practical safety scenarios at production scale. The in-house benchmark suite is designed to fill this gap by reflecting the actual task distributions, policy complexity, and quality expectations encountered in real-world commercial platforms serving hundreds of millions of users.

This level includes more than 15 evaluation sets across two groups, all constructed from anonymized production data and validated against real moderation decisions. The capability benchmarks assess abilities beyond simple risk classification, including risk understanding, risk attribution, safety reasoning, and policy-aware judgment—skills that are essential for deployment but rarely tested by academic benchmarks. Capability benchmarks include Political Risk, Political Entity, Knowledge MCQ, Redline Text, Domain Instruction Following, Political NER, Prohibited Content, Insult, Low-Info Text, Porn Text, and Emotion Analysis. The business benchmarks directly mirror end-to-end production workflows such as UGC review assistance, AIGC safety filtering, structured decision support, and moderation suggestion generation, measuring whether the model can serve as a drop-in component in commercial safety infrastructure. Business benchmarks include UGC Moderation, AIGC Moderation, Business Porn Detection, Multi-Scenario Risk Detection, and Data Security NER. For capability tasks, we report task-specific metrics such as risk recognition F1-score and attribution accuracy; for business-oriented tasks, we combine quantitative metrics with workflow-level indicators to reflect both model capability and operational usefulness.

In addition to the standard Yuvion-32B and Yuvion-8B variants, we also evaluate Yuvion-32B (Agent), which is trained with an additional agentic reinforcement learning stage, on the same benchmark suite. This allows us to measure both its dedicated gains on agentic benchmarks and its incremental improvements on realistic in-house scenarios.

4.6 Summary of Evaluation Design

Together, the benchmarks provide a progressive and deployment-oriented view of model quality: from general capability retention, through public safety comparability and adversarial robustness, to in-house operational value. No single benchmark group is sufficient to characterize a safety foundation model in full, and Yuvion LLM RiskEval is designed to assess these complementary dimensions within one unified framework. Detailed descriptions of the benchmarks are provided in Appendix A.

5 Experimental Results and Analysis

5.1 Experimental Setup

Baselines.

We compare Yuvion LLM against a comprehensive set of baseline models spanning general-purpose open-weight models, frontier proprietary models, and publicly released AI safety guard models, in order to provide a thorough and multi-dimensional reference for performance assessment.

General-purpose open-weight models include the Qwen3 family (Qwen3-8B, Qwen3-32B, and Qwen3-30B-A3B-2507) (Qwen Team, 2025), the Qwen3.5 family (Qwen3.5-9B, Qwen3.5-27B, Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, and Qwen3.5-397B-A17B) (Qwen Team, 2026), DeepSeek-R1 (DeepSeek-AI, 2025a), DeepSeek-V3.2 (DeepSeek-AI, 2025b), Kimi-K2.5 (Kimi Team, 2026), MiniMax-M2.5 (MiniMax Team, 2026), and GLM-5 (GLM-5-Team, 2026).

Frontier proprietary models include Qwen3-Max (Qwen Team, 2025), Qwen3.5-Plus (Qwen Team, 2026), Qwen3.6-Plus (Qwen Team and Qwen Team, 2026), and GPT-5.4 (OpenAI, 2026). These models serve as upper-bound reference points for contextualizing the performance of open-weight and domain-specialized systems.

AI safety guard models, where publicly available, are included as additional reference points. This category includes Qwen3Guard-8B (Zhao et al., 2025) and Llama-Guard4-12B (Meta, 2025).

The above list is not exhaustive; additional baselines are included in specific benchmark evaluations where relevant comparisons are available. All baseline models are evaluated under the same prompt format and decoding configuration to ensure fair and consistent comparison across benchmark levels.

Evaluation protocol.

We follow the four-level evaluation framework defined in Section 4. For multiple-choice and general reasoning tasks, we report Accuracy. For content safety classification tasks, we report Macro F1-Score as the primary metric to account for class imbalance. For adversarial robustness evaluation on the dynamic benchmark, we report the combined score (bypass success rate × semantic fidelity; lower is better). For domain capability and business benchmarks, we report task-specific metrics including risk recognition F1, attribution accuracy, and workflow-level indicators as appropriate. Detailed scoring formulas are provided in Appendix B.2 and Appendix B.3.
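
To make the choice of Macro F1 concrete, the following sketch computes it from scratch using the standard definition (the unweighted mean of per-class F1). It is a generic reference implementation, not the authors' evaluation code, and the function name `macro_f1` is chosen here for illustration.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    classes = sorted(set(y_true) | set(y_pred), key=repr)
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true positives contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0
```

For example, with eight benign and two harmful samples, a classifier that labels everything benign reaches 80% accuracy but a Macro F1 of only about 0.44, because the harmful class contributes an F1 of zero. This is why Macro F1 is the more informative metric under class imbalance.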

Implementation details.

For open-weight models, we use the officially released instruction-tuned checkpoints and apply the corresponding chat templates as recommended by each model’s documentation. For proprietary models, we access the models via their official APIs at the versions available at the time of evaluation. For the dynamic adversarial benchmark, adversarial rewrites are generated using an automated red-teaming pipeline, with semantic fidelity assessed by an independent LLM evaluator; rewrites that are over-obfuscated to the point of losing human readability are penalized to a combined score of zero. Training-stage ablations use the same evaluation sets and metrics as the main experiments to ensure consistency. Detailed prompt templates and decoding settings are provided in Appendix B.1.
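
The combined score for the dynamic benchmark can be sketched as below. The exact aggregation is given in the appendices and is not reproduced in this section, so this sketch takes one plausible reading: each rewrite contributes bypass (0 or 1) times its semantic fidelity, unreadable rewrites contribute zero, and the result is averaged over all rewrites. The `Rewrite` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Rewrite:
    bypassed: bool   # True if the target model failed to flag the rewrite
    fidelity: float  # semantic fidelity in [0, 1] from the LLM evaluator
    readable: bool   # False if over-obfuscated beyond human readability


def combined_score(rewrites: Sequence[Rewrite]) -> float:
    """Mean of bypass * fidelity over rewrites, in percent; lower is better."""
    if not rewrites:
        return 0.0
    total = 0.0
    for r in rewrites:
        # Unreadable rewrites and blocked rewrites both contribute zero.
        if r.readable and r.bypassed:
            total += r.fidelity
    return 100.0 * total / len(rewrites)
```

Multiplying by fidelity means an attack only counts to the extent that it still carries the original harmful intent; a rewrite that slips past the model by destroying its own meaning does not inflate the score.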

5.2 Main Results

5.2.1 Open-source General Benchmark Results

We first evaluate Yuvion LLM on the open-source general benchmark suite to verify that domain-specific training does not materially degrade the model’s general language competence. The evaluation is organized into two groups: general-purpose benchmarks covering broad knowledge, reasoning, Chinese language capability, and scientific problem solving, and agentic benchmarks assessing tool use, function calling, and multi-step interactive problem solving. Together, these two groups span more than 30 public evaluation sets.

General-purpose benchmarks.

The general-purpose benchmark group includes 33 representative evaluations covering knowledge understanding, Chinese language understanding, mathematical reasoning, and commonsense and reading comprehension. We compare Yuvion-32B against both same-scale open-weight models, such as Qwen3-32B and Qwen3.5-27B, and substantially larger frontier systems including Qwen3.5-397B-A17B, Qwen3.6-Plus, GLM-5, and Kimi-K2.5. For proprietary models whose parameter counts are not officially disclosed, we report widely cited community estimates only as rough scale references. Results are presented in Table 2.

Table 2: Comparison among Yuvion-32B and representative baseline models on open-source general-purpose benchmarks. Accuracy is reported.
Category Benchmark Yuvion-32B (32B) Qwen3-32B (32B) Qwen3.5-27B (27B) Qwen3.5-397B-A17B (397B) Qwen3.6-Plus (≈600B) GLM-5 (≈700B) Kimi-K2.5 (≈1T)
Knowledge Understanding MMLU 0.8276 0.8476 0.8912 0.9123 0.9249 0.9034 0.9057
MMLU-Redux 0.8263 0.8410 0.8713 0.8940 0.8930 0.8883 0.8950
MMLU-Pro 0.7180 0.7305 0.7880 0.8065 0.8040 0.8225 0.8300
GPQA 0.5051 0.5354 0.4293 0.4343 0.4091 0.5606 0.6111
ARC-Challenge 0.9573 0.9608 0.9735 0.9778 0.9744 0.9701 0.9778
OpenBookQA 0.9420 0.9560 0.9640 0.9780 0.9700 0.9640 0.9680
TriviaQA 0.6580 0.6820 0.7270 0.8290 0.8505 0.8480 0.8675
Xiezhi-EN 0.7010 0.7035 0.7140 0.7340 0.7190 0.7235 0.7420
Chinese Language Understanding C-Eval 0.8216 0.8580 0.8840 0.9115 0.9041 0.8989 0.9301
CMMLU 0.8120 0.8460 0.8840 0.8980 0.8935 0.8900 0.9250
C3 0.9540 0.9589 0.9819 0.9814 0.9830 0.9764 0.9825
CHID 0.8561 0.8876 0.9181 0.9156 0.9316 0.9036 0.9071
CLUEWSC 0.9047 0.9221 0.9529 0.9529 0.9467 0.9283 0.9488
OCNLI 0.7369 0.7544 0.8067 0.7897 0.7861 0.8246 0.7948
CSEM 0.9212 0.9364 0.9568 0.9508 0.9492 0.9424 0.9576
Xiezhi-CN 0.7900 0.8140 0.8250 0.8160 0.8120 0.8130 0.8125
Mathematical Reasoning GSM8K-ZH 0.9219 0.9249 0.9439 0.9477 0.9416 0.9386 0.9500
MATH 0.7750 0.7725 0.6990 0.7385 0.7265 0.7795 0.8390
APE210K 0.8785 0.8730 0.9105 0.9070 0.9005 0.9010 0.9215
TAL-SCQ5K-CN 0.8280 0.8010 0.8560 0.8675 0.8345 0.8870 0.9190
TAL-SCQ5K-EN 0.9135 0.9165 0.9175 0.9255 0.9190 0.9355 0.9315
TheoremQA 0.4612 0.4375 0.4662 0.4863 0.4788 0.4838 0.5363
Commonsense & Reading Comprehension BoolQ 0.8853 0.8768 0.8612 0.8862 0.8859 0.8789 0.8676
CommonsenseQA 0.8468 0.8477 0.8747 0.8935 0.8812 0.8600 0.8608
HellaSwag 0.8457 0.8953 0.9527 0.9550 0.9602 0.9176 0.9385
PIQA 0.8939 0.9064 0.9499 0.9505 0.9587 0.9456 0.9461
SIQA 0.7677 0.7897 0.8158 0.8188 0.8245 0.7912 0.7994
WinoGrande 0.7758 0.7987 0.8958 0.9242 0.9361 0.8934 0.9116
DROP 0.9075 0.8960 0.9365 0.9320 0.9365 0.9275 0.9490
SQuAD 2.0 0.7565 0.7740 0.8275 0.8465 0.8515 0.7950 0.8065
StoryCloze 0.9887 0.9874 0.9954 0.9954 0.9927 0.9921 0.9927
BBH 0.7635 0.7555 0.8685 0.8770 0.8890 0.9155 0.9020
WPLC 0.2175 0.2410 0.2820 0.2775 0.3335 0.2900 0.3490
Average (all benchmarks) 0.7988 0.8099 0.8370 0.8488 0.8485 0.8482 0.8629

Yuvion-32B achieves an average accuracy of 0.7988 across all 33 benchmarks, compared to 0.8099 for Qwen3-32B, a gap of roughly 1.1 points. This indicates that general capability is largely preserved after knowledge-enhanced continued pretraining and policy-grounded multi-task safety post-training. While substantially larger frontier models still achieve higher absolute scores, Yuvion remains broadly comparable to same-scale general-purpose models, which is the more relevant reference point for evaluating capability retention. Yuvion-32B also outperforms Qwen3-32B on several individual benchmarks, including TAL-SCQ5K-CN, TheoremQA, DROP, BoolQ, and StoryCloze, showing that safety specialization does not uniformly reduce general capability. Taken together, these results indicate that Yuvion preserves broad general utility at its model scale while specializing for safety-critical deployment.

Agentic benchmarks.

We further evaluate Yuvion-32B on agentic benchmarks to examine the effectiveness of the safety-aware agentic reinforcement learning stage (Section 3.5). As shown in Table 3, the evaluation covers two categories: tool-use benchmarks and search-agent benchmarks.

The tool-use benchmark suite includes API-Bank (Li et al., 2023b), which assesses tool selection and execution in multi-turn dialogues over 73 API tools, and the Berkeley Function Calling Leaderboard (BFCL; Patil et al., 2024), which evaluates function-calling capability across dimensions such as AST accuracy, execution accuracy, live API interactions, multi-turn conversations, and relevance detection. For search-augmented reasoning, we evaluate on Seal-0, a benchmark built on the Tongyi DeepResearch framework (Tongyi DeepResearch Team et al., 2025), which measures the model’s effectiveness in orchestrating multi-step search actions for complex information-seeking queries.

As reported in Table 3, Yuvion-32B attains the top score on API-Bank (90.45%) and an average accuracy of 65.72%, close to that of substantially larger proprietary models such as Qwen3-Max (66.09%). Relative to its base model Qwen3-32B, Yuvion-32B delivers consistent gains of +2.85, +1.47, and +9.91 points on API-Bank, BFCL, and Seal-0, respectively, translating into a +4.75-point improvement in average accuracy. The notable lift on Seal-0 demonstrates that the search-agent RL stage substantially enhances multi-turn search planning and evidence synthesis, whereas the gains on API-Bank and BFCL show that the tool-use RL track sharpens fine-grained function-calling capability without being eroded by safety-oriented specialization.

Table 3: Comparison among Yuvion-32B and baseline models on agentic capability benchmarks. Accuracy (%) is reported for all benchmarks. The highest score in each column is in bold font, and the second is underlined.
Model API-Bank BFCL Seal-0 Avg.
Yuvion-32B 90.45 66.16 40.54 65.72
Qwen3-32B 87.60 64.69 30.63 60.97
Qwen3-Max 89.28 68.44 40.54 66.09
Qwen3.5-Plus 84.09 74.41 46.90 68.47
Qwen3.6-Plus 87.44 70.42 54.05 70.64
DeepSeek-V3.2 76.05 55.08 50.45 60.53
Kimi-K2.5 80.40 68.27 43.24 63.97
GLM-5 83.25 64.79 44.14 64.06
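
The per-benchmark gains of Yuvion-32B over Qwen3-32B can be recomputed directly from the two corresponding rows of Table 3. The sketch below is only a worked arithmetic check; it assumes the average gain is taken between averages already rounded to two decimals, which is what matches the reported figure.

```python
# Accuracy (%) copied from the Yuvion-32B and Qwen3-32B rows of Table 3.
TABLE3 = {
    "Yuvion-32B": {"API-Bank": 90.45, "BFCL": 66.16, "Seal-0": 40.54},
    "Qwen3-32B": {"API-Bank": 87.60, "BFCL": 64.69, "Seal-0": 30.63},
}


def rounded_average(scores: dict) -> float:
    """Average over benchmarks, rounded to two decimals as in the table."""
    return round(sum(scores.values()) / len(scores), 2)


def gains(model: dict, base: dict) -> dict:
    """Per-benchmark and average gains of `model` over `base`, in points."""
    result = {name: round(model[name] - base[name], 2) for name in model}
    result["Avg."] = round(rounded_average(model) - rounded_average(base), 2)
    return result


print(gains(TABLE3["Yuvion-32B"], TABLE3["Qwen3-32B"]))
```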

5.2.2 Open Content Safety Benchmark Results

This evaluation covers two benchmark groups: content safety benchmarks and guard benchmarks. The content safety benchmarks focus on harmful or policy-violating content recognition, containing 8 evaluation sets spanning both Chinese and English. The guard benchmarks focus on safety judgment, refusal behavior, and safety-aligned response capability, including benchmarks such as SEval and AEGIS and containing more than 20 evaluation sets.

Content safety benchmarks.

Table 4 reports Macro F1-Score across 8 content safety evaluation sets spanning both Chinese and English, covering risks such as pornography, fraud, offensive language, hate speech, and implicit toxicity. We evaluate Yuvion-8B and Yuvion-32B alongside representative general-purpose baselines including GPT-5.4, Qwen3-32B, Qwen3.5-27B, Qwen3-Max, and DeepSeek-R1.

Yuvion-32B achieves the highest average Macro F1 of 78.2%, surpassing all baselines including GPT-5.4 (72.2%) and Qwen3-Max (73.9%). It obtains the best results on 5 out of 8 benchmarks: ChineseHarm (97.9%), HateXplain (63.6%), ToxiGen (86.0%), Jigsaw (76.0%), and CivilComments (65.4%). At the same model scale, Yuvion-32B outperforms Qwen3-32B (69.1%) by 9.1 points, while even Yuvion-8B reaches 73.3%, exceeding several much larger general-purpose baselines. These results show that Yuvion achieves state-of-the-art performance in the safety domain and exhibits a clear cross-scale advantage.

Yuvion-32B also maintains balanced bilingual performance across Chinese and English safety tasks, whereas several general-purpose baselines show stronger cross-lingual imbalance or over-moderation. Overall, the results confirm that Yuvion’s safety specialization yields substantial gains on public content safety benchmarks without sacrificing robustness across languages.

Table 4: Comparison among Yuvion variants and baseline models on open content safety benchmarks. Macro F1-Score (%) is reported. The highest score in each row is in bold font, and the second is underlined.
Benchmark Yuvion-32B Yuvion-8B Qwen3-32B Qwen3-8B GPT-5.4 Qwen3-Max Qwen3.5-27B DeepSeek-R1
ChineseHarm 97.9 96.8 93.2 90.8 85.3 87.9 81.2 54.4
COLD 72.6 69.6 64.4 61.2 74.8 73.5 71.0 70.8
Moderation 76.6 76.5 67.1 79.3 72.9 77.8 77.1 74.0
HateXplain 63.6 58.5 50.7 51.8 55.6 58.3 63.4 58.6
ToxiGen 86.0 83.3 73.3 73.1 73.9 77.4 75.8 34.5
Jigsaw 76.0 68.5 67.2 68.7 69.2 74.6 72.2 38.0
CivilComments 65.4 51.2 50.3 51.7 56.9 63.2 61.3 56.2
SafetyBench 87.5 81.6 86.2 82.4 89.3 78.4 87.7 85.9
Avg. 78.2 73.3 69.1 69.9 72.2 73.9 73.7 59.0
Guard benchmarks.

To evaluate Yuvion’s safety detection capability, we adopt the guard benchmark protocol of YuFeng-XGuard (Lin et al., 2026), which covers 28 sub-datasets across five dimensions: prompt classification, response classification, multilingual classification, attack defense, and safe completion. We also measure the false positive rate (FPR) on benign instruction-following datasets (Alpaca, Belle) to assess over-refusal. We compare each Yuvion model against its corresponding Qwen3 base model to isolate the effect of safety-oriented training.

Table 5: Comparison between Yuvion models and their corresponding base models on guard benchmarks. Average F1-Score (%) and false positive rate (FPR, %) are reported. Δ denotes the absolute change of Yuvion over the corresponding base model. For FPR, lower is better (↓).
Model Prompt Cls. Response Cls. Multilingual Attack Def. Safe Compl. Avg. FPR (↓)
Qwen3-8B 13.0 20.5 10.9 35.8 8.1 17.6 0.35
Yuvion-8B 74.4 78.9 71.4 79.4 71.5 75.1 0.29
Δ +61.4 +58.4 +60.5 +43.6 +63.4 +57.5 -0.06
Qwen3-32B 73.1 59.1 67.2 77.6 76.2 70.6 2.91
Yuvion-32B 77.6 82.3 70.8 84.7 79.1 78.9 0.18
Δ +4.5 +23.2 +3.6 +7.1 +2.9 +8.3 -2.73
Figure 5: Average pairwise win rate (%) on guard benchmarks across 28 sub-datasets (51 evaluation instances covering query-only and query-response settings). Each model’s win rate is computed against all other models in the comparison set.

As shown in Table 5, the safety training pipeline brings consistent and substantial gains across all dimensions. Yuvion-8B improves over Qwen3-8B by an average of 57.5 percentage points (from 17.6% to 75.1%), with the largest gains on safe completion (+63.4 points) and prompt classification (+61.4 points). Yuvion-32B improves over the already stronger Qwen3-32B by 8.3 percentage points on average (from 70.6% to 78.9%), with particularly notable gains on response classification (+23.2 points). Crucially, the improved detection does not come at the cost of over-refusal: Yuvion-32B achieves an average FPR of only 0.18%, substantially lower than Qwen3-32B (2.91%), and Yuvion-8B (0.29%) also outperforms Qwen3-8B (0.35%). These results demonstrate that the safety training pipeline effectively equips the model with robust guard capability while maintaining low false positive rates.
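
Because the over-refusal check uses benign instruction-following data only, every flagged prompt is a false positive. A minimal sketch of the FPR computation under that assumption follows; the function name and the boolean input format are illustrative choices, not part of the YuFeng-XGuard protocol.

```python
from typing import Sequence


def false_positive_rate(flagged: Sequence[bool]) -> float:
    """FPR (%) on a benign-only set: each flagged prompt is a false positive."""
    if not flagged:
        raise ValueError("need at least one benign example")
    return 100.0 * sum(flagged) / len(flagged)
```

For instance, flagging 3 of 1,000 benign prompts corresponds to an FPR of 0.3%, the same order of magnitude as the Yuvion figures in Table 5.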

Comparison with dedicated guard models.

To further contextualize Yuvion’s guard capability, we conduct a pairwise win rate analysis against a comprehensive set of dedicated guard models across 51 evaluation instances (derived from 28 sub-datasets, where some datasets are evaluated under both query-only and query+response settings). As shown in Figure 5, Yuvion-32B achieves a win rate of 71.7%, ranking second overall. Yuvion-8B attains 54.1%, surpassing several guard models (WildGuard-7B, NemoGuard-V2, LlamaGuard3/4, ShieldGemma-9B) but falling below others such as Qwen3Guard-8B (58.2%) and GPT-OSS-Guard-20B (64.0%). This positioning reflects a deliberate design choice: unlike dedicated guard models that concentrate their entire training budget on safety classification, Yuvion distributes its capacity across domain knowledge acquisition, adversarial robustness, agentic capabilities, and business-oriented workflows, so guard-style judgment is only one of many training objectives.
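
An average pairwise win rate of this kind can be computed as sketched below. The tie-handling rule is not stated in the text, so counting a tie as half a win is an assumption made here, and the function and argument names are illustrative.

```python
from typing import Dict, List


def average_pairwise_win_rate(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Win rate (%) of each model against all other models.

    scores[model][i] is the model's metric on evaluation instance i.
    Ties count as half a win (an assumption, not stated in the paper).
    """
    result = {}
    for model, own in scores.items():
        wins, comparisons = 0.0, 0
        for other, theirs in scores.items():
            if other == model:
                continue
            for a, b in zip(own, theirs):
                wins += 1.0 if a > b else 0.5 if a == b else 0.0
                comparisons += 1
        result[model] = 100.0 * wins / comparisons if comparisons else 0.0
    return result
```

A win rate aggregates rank information only, so it complements the absolute F1 scores in Table 5 rather than replacing them.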

Notably, when the same training paradigm is applied with the data pipeline concentrated on AI safety and guard-related tasks (incorporating a dynamic policy mechanism that enables runtime policy adjustment without retraining), the resulting model, Yuvion-Guard-8B (Lin et al., 2026), achieves the highest win rate of 85.6%, surpassing all dedicated guard models by a substantial margin. This confirms that the gap between Yuvion-8B and top guard models is attributable to training data allocation rather than a methodological limitation. Yuvion LLM prioritizes comprehensive content safety competence at a modest cost to guard-specific classification, while the same methodology readily produces a state-of-the-art dedicated guard model when narrower specialization is desired. The weights of Yuvion-Guard-8B have been publicly released to support community research and deployment (https://huggingface.co/Alibaba-AAIG/YuFeng-XGuard-Reason-8B).

5.2.3 Self-Constructed Adversarial Robustness Benchmark Results

We evaluate Yuvion LLM on the self-constructed adversarial robustness benchmark described in Section 4.4, covering five major risk categories across both static and dynamic evaluation sets.

Static evaluation sets.

Table 6 reports classification performance (Macro F1-Score, %) on the static evaluation sets, which assess standard domain recognition capability under canonical risk expression distributions across five major risk categories.

Table 6: Comparison among Yuvion variants and baseline models on the static content safety evaluation sets. Macro F1-Score (%) is reported. Avg. denotes the average over all five categories, and Avg. (excl. Spam) denotes the average excluding the Spam & Flooding category. The highest score in each column is in bold font, and the second is underlined.
Model Adv. & Traffic Gambling & Fraud Abusive Pornographic Spam & Flooding Avg. Avg. (excl. Spam)
Qwen3-8B 81.3 83.0 79.6 80.4 60.9 77.0 81.1
Qwen3-32B 84.0 84.9 83.8 83.6 75.7 82.4 84.1
Qwen3-Max 88.7 88.3 87.9 88.7 72.8 85.3 88.4
Qwen3.5-27B 88.0 90.0 87.5 87.2 56.8 81.9 88.2
Qwen3.5-122B-A10B 91.1 93.6 90.6 88.8 51.5 83.1 91.0
DeepSeek-R1 89.0 89.5 88.6 86.9 74.1 85.6 88.5
GPT-5.4 88.9 90.5 90.8 87.7 68.0 85.2 89.5
Yuvion-8B 94.3 95.3 93.8 88.9 41.9 82.8 93.1
Yuvion-32B 93.8 95.7 95.3 92.3 57.2 86.8 94.2

Yuvion-32B achieves the highest overall average of 86.8% and the highest average excluding Spam & Flooding (94.2%). It obtains the best score among all models on three of the five categories (Gambling & Fraud, Abusive, and Pornographic) and also exceeds every non-Yuvion baseline on Advertising & Traffic. Yuvion-8B achieves the best score on Advertising & Traffic (94.3%) and reaches an average excluding Spam & Flooding of 93.1%, surpassing every baseline on this measure, including substantially larger models such as Qwen3.5-122B-A10B (91.0%), Qwen3-Max (88.4%), DeepSeek-R1 (88.5%), and GPT-5.4 (89.5%). These results indicate strong cross-scale competitiveness of Yuvion models on core content safety recognition tasks. Performance on Spam & Flooding is notably unstable for all models, largely because certain e-commerce platform interaction content shares overlapping surface features with Spam & Flooding instances, creating inherent boundary ambiguity in both annotation and evaluation; excluding it, Yuvion’s advantage becomes more pronounced.
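
The two averages reported in Table 6 can be recomputed from the per-category scores, as in the worked check below for two rows. The table appears to average unrounded underlying scores, so values recomputed from the rounded entries may differ from the reported ones in the last digit.

```python
# Per-category Macro F1 (%) copied from two rows of Table 6.
ROWS = {
    "Qwen3-8B": {
        "Adv. & Traffic": 81.3, "Gambling & Fraud": 83.0, "Abusive": 79.6,
        "Pornographic": 80.4, "Spam & Flooding": 60.9,
    },
    "Yuvion-8B": {
        "Adv. & Traffic": 94.3, "Gambling & Fraud": 95.3, "Abusive": 93.8,
        "Pornographic": 88.9, "Spam & Flooding": 41.9,
    },
}


def averages(f1_by_category: dict, excluded: str = "Spam & Flooding") -> tuple:
    """Return (average over all categories, average without `excluded`)."""
    all_values = list(f1_by_category.values())
    kept = [v for k, v in f1_by_category.items() if k != excluded]
    return sum(all_values) / len(all_values), sum(kept) / len(kept)


for model, row in ROWS.items():
    avg_all, avg_excl = averages(row)
    print(f"{model}: all={avg_all:.2f}, excl. spam={avg_excl:.2f}")
```

The gap between the two averages is largest for the models with the weakest Spam & Flooding scores, which is why the exclusion changes the ranking of several rows.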

Dynamic evaluation sets.

Table 7 reports the adversarial robustness evaluation on the dynamic sets. The combined score is defined as the product of bypass success rate and semantic fidelity score; a lower combined score indicates stronger robustness against adversarial attacks. Due to the higher computational cost of the automated red-teaming pipeline, the dynamic evaluation is conducted on a representative subset of baseline models.

Table 7: Comparison among Yuvion variants and baseline models on the dynamic content safety benchmark. Combined score (%) is reported; lower is better (↓). The best (lowest) score in each column is in bold font, and the second-best is underlined.
Model Adv. & Traffic Pornographic Abusive Spam & Flooding Gambling & Fraud Overall
Qwen3-8B 16.5 42.7 25.6 32.6 54.8 34.4
Qwen3-32B 15.5 35.3 17.5 29.3 40.9 27.7
Qwen3-Max 9.9 31.4 17.1 28.3 35.2 24.4
GPT-5.4 7.0 29.6 10.7 24.5 38.5 22.3
Yuvion-8B 13.9 33.8 21.3 23.5 48.3 28.1
Yuvion-32B 6.7 16.8 16.7 29.1 33.7 20.6

Yuvion-32B achieves the lowest overall combined score of 20.6%, outperforming all compared models, followed by GPT-5.4 (22.3%), Qwen3-Max (24.4%), Qwen3-32B (27.7%), Yuvion-8B (28.1%), and Qwen3-8B (34.4%). It delivers the best performance on 3 out of 5 individual categories (advertising & traffic diversion, pornographic, and gambling & fraud), with particularly strong robustness on advertising & traffic (6.7%). GPT-5.4 also shows strong adversarial robustness, ranking second overall and achieving the best result on abusive content (10.7%). By contrast, gambling & fraud remains the most difficult category across all models, indicating that adversarial rewrites in this domain more easily exploit semantic gray areas. Importantly, Yuvion-8B achieves the best score on spam & flooding (23.5%) and an overall score (28.1%) on par with the much larger Qwen3-32B (27.7%), although it still trails Qwen3-Max (24.4%); this suggests that safety-specialized training enables compact models to approach larger general-purpose models at a fraction of the parameter cost. Overall, these results show that Yuvion’s advantage extends beyond standard classification to realistic adversarial robustness.

5.2.4 In-house Capability and Business Benchmark Results

We evaluate Yuvion on the in-house capability and business benchmarks described in Section 4.5, covering more than 15 evaluation sets across domain capability benchmarks and business-oriented benchmarks. Results are reported for Yuvion-8B, Yuvion-32B, and Yuvion-32B (Agent), together with a broad set of open-weight, proprietary, and guard-model baselines. This benchmark suite is designed to evaluate not only generic safety judgment, but also domain-specific policy understanding, fine-grained content risk recognition, and practical deployment utility in realistic moderation workflows.

Tables 8 and 9 report the full in-house evaluation results on domain capability and business-oriented benchmarks, respectively. Overall, Yuvion shows consistently strong performance across both benchmark groups. In particular, Yuvion-32B achieves 85.78 on the in-house domain benchmark composite and 86.34 on the business benchmark composite, while Yuvion-8B reaches 82.38 and 82.72, respectively. Yuvion-32B (Agent) further improves these results to 86.10 and 87.34, showing that agentic RL yields modest but consistent gains on in-house workflows in addition to its larger improvements on dedicated agentic benchmarks. For a unified summary across these two benchmark groups, we additionally report an overall composite score in Appendix C.

Domain capability results.

On the in-house domain capability benchmarks, Yuvion-32B achieves an overall score of 85.78, outperforming the strongest proprietary baselines such as Qwen3-Max (81.41), Qwen3.6-Plus (81.83), and GPT-5.4 (80.73), as well as strong open-weight baselines including DeepSeek-R1 (80.54) and GLM-5 (79.84). Yuvion-32B shows particularly strong results on Political Risk (95.00), Knowledge MCQ (80.50), Redline Text (78.65), Prohibited Content (92.76), Low-Info Text (77.93), and Porn Text (94.56). Yuvion-8B also performs strongly, reaching 82.38 and surpassing all open-weight and proprietary baselines in this benchmark group. Yuvion-32B (Agent) further improves the overall score to 86.10, indicating that agentic RL does not harm core safety capability and can bring additional gains in structured domain tasks.

Business benchmark results.

On the in-house business benchmarks, Yuvion-32B achieves an overall score of 86.34, outperforming all evaluated open-weight and proprietary baselines. In particular, it obtains 97.21 on UGC Moderation, 83.28 on AIGC Moderation, 74.32 on Business Porn Detection, 85.83 on Multi-Scenario Risk Detection, and 91.04 on Data Security NER. Yuvion-8B also performs competitively, achieving 82.72 and exceeding all open-weight baselines. Yuvion-32B (Agent) further improves the business overall score to 87.34, with gains on AIGC Moderation, Multi-Scenario Risk Detection, and Data Security NER (the last rising from 91.04 to 97.45) alongside small decreases on UGC Moderation and Business Porn Detection, suggesting that agentic RL is particularly helpful for structured and workflow-oriented business tasks.
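As a quick sanity check on how the Overall column relates to the per-benchmark scores, the sketch below recomputes an unweighted mean for a few rows of Table 9. The helper names are illustrative, and the equal-weight assumption is ours: it is inferred from the reported numbers rather than stated in the text.

```python
# Per-benchmark business scores copied from Table 9, in the order
# UGC Moderation, AIGC Moderation, Business Porn Detection,
# Multi-Scenario Risk Detection, Data Security NER,
# paired with the reported Overall value.
TABLE9 = {
    "Yuvion-8B": ([95.94, 82.43, 69.39, 83.28, 82.57], 82.72),
    "Yuvion-32B": ([97.21, 83.28, 74.32, 85.83, 91.04], 86.34),
    "Yuvion-32B (Agent)": ([96.34, 83.44, 73.51, 85.98, 97.45], 87.34),
    "Qwen3-Max": ([83.09, 77.38, 78.75, 78.64, 97.12], 83.00),
}


def unweighted_composite(scores):
    """Equal-weight average of per-benchmark scores (an assumption, see above)."""
    return sum(scores) / len(scores)


def composite_gaps(table):
    """Absolute gap between the recomputed mean and the reported Overall."""
    return {
        model: abs(unweighted_composite(scores) - overall)
        for model, (scores, overall) in table.items()
    }


if __name__ == "__main__":
    for model, gap in composite_gaps(TABLE9).items():
        print(f"{model}: gap to reported Overall = {gap:.3f}")
```

For these four rows the recomputed mean agrees with the reported Overall to within rounding at two decimals, which suggests the business composite weights all five benchmarks equally.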

Table 8: Comparison among Yuvion variants and representative open-source, proprietary, and guard baselines on in-house domain capability benchmarks. Overall denotes the composite score. The highest score in each column is in bold font, and the second is underlined.
| Category | Model | Overall | Political Risk | Political Entity | Knowledge MCQ | Redline Text | Domain Instr. Follow. | Political NER | Prohibited Content | Insult | Low-Info Text | Porn Text | Emotion Analysis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-source | Qwen3-8B | 73.71 | 81.26 | 89.09 | 69.46 | 57.51 | 68.35 | 65.29 | 80.88 | 90.65 | 61.82 | 83.10 | 63.42 |
| | Qwen3-32B | 79.79 | 88.40 | 91.41 | 73.53 | 66.27 | 74.76 | 79.77 | 86.02 | 94.22 | 66.33 | 90.77 | 66.24 |
| | Qwen3-30B-A3B-2507 | 77.14 | 87.31 | 92.16 | 71.48 | 68.50 | 71.08 | 77.60 | 80.72 | 87.87 | 60.86 | 88.91 | 62.01 |
| | Qwen3.5-9B | 72.68 | 78.48 | 84.55 | 68.48 | 64.55 | 69.38 | 48.06 | 82.40 | 87.73 | 65.13 | 84.20 | 66.49 |
| | Qwen3.5-27B | 77.17 | 86.55 | 90.66 | 72.57 | 68.73 | 77.54 | 58.92 | 84.06 | 92.74 | 68.20 | 83.70 | 65.20 |
| | Qwen3.5-35B-A3B | 75.46 | 83.41 | 90.36 | 71.02 | 64.32 | 75.85 | 55.48 | 83.06 | 90.81 | 67.21 | 81.53 | 67.03 |
| | Qwen3.5-122B-A10B | 77.30 | 84.65 | 91.11 | 73.79 | 67.96 | 76.33 | 59.54 | 84.14 | 94.50 | 67.85 | 85.53 | 64.93 |
| | Qwen3.5-397B-A17B | 77.45 | 88.00 | 91.94 | 77.39 | 62.11 | 81.35 | 51.77 | 81.67 | 93.99 | 72.47 | 85.21 | 66.09 |
| | DeepSeek-R1 | 80.54 | 88.00 | 91.45 | 77.06 | 67.94 | 82.81 | 75.02 | 88.86 | 94.79 | 71.98 | 88.08 | 60.00 |
| | DeepSeek-V3.2 | 76.27 | 85.20 | 88.95 | 68.75 | 49.80 | 71.62 | 78.04 | 82.89 | 94.38 | 70.73 | 83.26 | 65.37 |
| | KIMI-K2.5 | 74.61 | 87.52 | 91.87 | 58.48 | 38.79 | 82.71 | 76.06 | 79.61 | 91.26 | 71.91 | 74.72 | 67.82 |
| | Minimax-M2.5 | 72.60 | 88.10 | 91.35 | 55.34 | 61.10 | 74.40 | 79.26 | 40.80 | 91.18 | 71.03 | 83.26 | 62.76 |
| | GLM-5 | 79.84 | 84.57 | 90.37 | 77.20 | 62.36 | 82.56 | 82.42 | 87.36 | 92.47 | 70.47 | 88.61 | 59.90 |
| Proprietary | Qwen3-Max | 81.41 | 89.30 | 90.42 | 76.74 | 68.77 | 83.14 | 80.78 | 87.74 | 93.31 | 69.05 | 89.78 | 66.49 |
| | Qwen3.5-Plus | 77.92 | 89.75 | 92.38 | 78.91 | 61.02 | 82.79 | 53.49 | 79.80 | 94.19 | 73.78 | 85.19 | 65.79 |
| | Qwen3.6-Plus | 81.83 | 87.13 | 92.34 | 78.11 | 73.13 | 83.12 | 71.18 | 86.23 | 94.74 | 72.26 | 92.46 | 69.44 |
| | GPT-5.4 | 80.73 | 87.90 | 93.56 | 79.77 | 64.22 | 75.61 | 82.69 | 85.16 | 94.77 | 76.80 | 77.53 | 70.00 |
| Guard | Qwen3Guard-8B | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | Llama-Guard4-12B | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Yuvion (Ours) | Yuvion-8B | 82.38 | 93.20 | 87.31 | 73.51 | 77.54 | 73.53 | 82.05 | 91.54 | 91.48 | 71.30 | 96.73 | 67.96 |
| | Yuvion-32B | 85.78 | 95.00 | 90.56 | 80.50 | 78.65 | 80.97 | 82.39 | 92.76 | 94.55 | 77.93 | 94.56 | 75.66 |
| | Yuvion-32B (Agent) | 86.10 | 95.35 | 91.21 | 81.16 | 78.95 | 83.66 | 83.52 | 92.49 | 94.53 | 77.78 | 94.78 | 73.67 |
Table 9: Comparison among Yuvion variants and representative open-weight, proprietary, and guard baselines on in-house business benchmarks. Overall denotes the business composite score. The highest score in each column is in bold font, and the second is underlined.
| Category | Model | Overall | UGC Moderation | AIGC Moderation | Business Porn Detection | Multi-Scenario Risk Detection | Data Security NER |
|---|---|---|---|---|---|---|---|
| Open-source | Qwen3-8B | 67.57 | 26.38 | 66.03 | 78.50 | 81.96 | 84.98 |
| | Qwen3-32B | 68.10 | 32.25 | 56.11 | 74.83 | 82.87 | 94.43 |
| | Qwen3-30B-A3B-2507 | 56.14 | 9.27 | 36.34 | 72.74 | 68.66 | 93.67 |
| | Qwen3.5-9B | 69.89 | 55.44 | 66.13 | 63.82 | 80.80 | 83.28 |
| | Qwen3.5-27B | 73.92 | 66.83 | 71.13 | 63.39 | 79.62 | 88.65 |
| | Qwen3.5-35B-A3B | 72.75 | 69.80 | 72.17 | 74.27 | 81.60 | 65.89 |
| | Qwen3.5-122B-A10B | 76.61 | 65.72 | 74.19 | 76.45 | 80.30 | 86.41 |
| | Qwen3.5-397B-A17B | 72.62 | 72.65 | 72.50 | 42.51 | 83.44 | 91.98 |
| | DeepSeek-R1 | 69.85 | 54.65 | 62.72 | 81.30 | 80.94 | 69.62 |
| | DeepSeek-V3.2 | 69.88 | 55.43 | 62.37 | 75.01 | 75.00 | 81.60 |
| | KIMI-K2.5 | 79.84 | 84.91 | 77.83 | 69.92 | 84.98 | 81.56 |
| | Minimax-M2.5 | 56.32 | 60.84 | 66.16 | 47.45 | 50.27 | 56.86 |
| | GLM-5 | 73.96 | 47.28 | 64.35 | 77.32 | 84.42 | 96.42 |
| Proprietary | Qwen3-Max | 83.00 | 83.09 | 77.38 | 78.75 | 78.64 | 97.12 |
| | Qwen3.5-Plus | 73.38 | 79.17 | 75.35 | 31.01 | 83.89 | 97.47 |
| | Qwen3.6-Plus | 78.44 | 73.77 | 72.04 | 78.54 | 82.31 | 85.56 |
| | GPT-5.4 | 80.40 | 73.85 | 70.86 | 80.51 | 78.95 | 97.83 |
| Guard | Qwen3Guard-8B | 0.01 | 0.03 | 0.03 | 0.00 | 0.00 | 0.00 |
| | Llama-Guard4-12B | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 |
| Yuvion (Ours) | Yuvion-8B | 82.72 | 95.94 | 82.43 | 69.39 | 83.28 | 82.57 |
| | Yuvion-32B | 86.34 | 97.21 | 83.28 | 74.32 | 85.83 | 91.04 |
| | Yuvion-32B (Agent) | 87.34 | 96.34 | 83.44 | 73.51 | 85.98 | 97.45 |
Comparison with open-weight and proprietary baselines.

A notable result is that Yuvion establishes a clear cross-scale advantage over both open-weight and proprietary baselines on in-house safety evaluations. Yuvion-8B already surpasses all evaluated general-purpose baselines, including frontier proprietary models, on the in-house domain benchmarks and remains highly competitive on business benchmarks, while Yuvion-32B and Yuvion-32B (Agent) achieve the strongest overall results. This is particularly striking given model scale: Yuvion-32B substantially outperforms much larger open-weight baselines such as Qwen3.5-122B-A10B and Qwen3.5-397B-A17B, and even Yuvion-8B exceeds these larger models as well as several proprietary systems on the in-house domain benchmarks. These results indicate that for realistic content safety deployment, targeted safety-oriented training can outweigh raw model scale and even close the gap to, or surpass, stronger proprietary general-purpose models.

Comparison with guard models.

The near-zero scores of Qwen3Guard-8B and Llama-Guard4-12B on both domain capability and business benchmarks reveal a fundamental limitation of guard-style models in realistic deployment settings. Although such models are designed for generic safety judgment, they lack the domain-specific knowledge, fine-grained policy understanding, and structured output capability required by practical moderation and business workflows. This large performance gap highlights the importance of a dedicated training paradigm such as Yuvion, which targets deployable content safety competence rather than only generic safety filtering behavior.

5.3 Ablation Study

5.3.1 Ablation on Domain Knowledge Data

To investigate the contribution of domain-specific knowledge data in the continued pretraining stage, we conduct an ablation study comparing two intermediate versions of Yuvion LLM: one trained without domain knowledge data and one with a partial set of knowledge descriptions incorporated. Both models share identical configurations across all other training stages; the only difference lies in whether a curated subset of domain knowledge corpora (covering content safety knowledge and policies, regulatory guidelines, and domain-specific annotations) is included during continued pretraining. Note that both models are intermediate checkpoints in the iterative development pipeline rather than the final released version of Yuvion LLM; specifically, both checkpoints share the same model parameter scale as the final release, but are trained on a reduced SFT dataset as part of an earlier experimental iteration. Table 10 reports results across the in-house capability benchmark suite.

Table 10: Ablation study on domain knowledge data. Results are reported for intermediate Yuvion checkpoints during iterative development. Domain Composite denotes the weighted average score over all domain capability benchmarks.
| Model | Domain Composite | Pol. Risk | Pol. Entity | Know. Text | Red-line | UCMF EN | Porn Txt v1 | NER Politics | Prohib. Cls | Insult | Meaningless | Biz Porn | Emotion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| w/o knowledge data | 79.68 | 94.46 | 91.04 | 78.80 | 72.35 | 79.26 | 60.65 | 82.96 | 92.30 | 70.64 | 66.68 | 96.37 | 70.60 |
| w/ knowledge data | 83.64 | 95.40 | 90.67 | 80.60 | 78.29 | 78.89 | 62.13 | 82.21 | 92.79 | 94.84 | 78.12 | 94.31 | 75.38 |
| Δ | +3.96 | +0.94 | -0.37 | +1.80 | +5.94 | -0.37 | +1.48 | -0.75 | +0.49 | +24.20 | +11.44 | -2.06 | +4.78 |

The results show that incorporating even a partial set of domain knowledge descriptions during continued pretraining improves 8 of the 12 domain capability benchmarks. Overall, adding knowledge data raises the domain composite score by 3.96 points, with particularly pronounced improvements on the Insult benchmark (+24.20) and the Meaningless benchmark (+11.44), two categories that require nuanced semantic understanding of domain-specific expressions that are inherently difficult to acquire from general corpora alone. Notable gains are also observed on Red-line (+5.94), Emotion (+4.78), and Knowledge Text (+1.80), further indicating that even a partial injection of structured domain knowledge can effectively anchor the model’s understanding of fine-grained, domain-sensitive concepts.

Four benchmarks show marginal regressions of between 0.37 and 2.06 points (Pol. Entity, UCMF EN, NER Politics, and Biz Porn), which we attribute to distribution shift introduced by the knowledge corpora. Importantly, these regressions remain limited in scale and are largely mitigated in subsequent training stages. Overall, these findings indicate that incorporating structured domain knowledge descriptions, even partially, is a valuable component of the continued pretraining stage, providing semantic grounding that meaningfully improves the model’s domain capability and motivating the inclusion of a more comprehensive knowledge corpus in the full training pipeline.
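The per-benchmark deltas in Table 10 can be recomputed directly from the two rows. The sketch below does so and separates improved from regressed benchmarks; the variable and function names are illustrative, and the composite is taken here as a plain mean of the twelve benchmark scores, which is our assumption about how the reported composite is formed.

```python
# Scores copied from Table 10 (same benchmark order as the table header).
BENCHMARKS = [
    "Pol. Risk", "Pol. Entity", "Know. Text", "Red-line", "UCMF EN",
    "Porn Txt v1", "NER Politics", "Prohib. Cls", "Insult", "Meaningless",
    "Biz Porn", "Emotion",
]
WITHOUT_KNOWLEDGE = [94.46, 91.04, 78.80, 72.35, 79.26, 60.65,
                     82.96, 92.30, 70.64, 66.68, 96.37, 70.60]
WITH_KNOWLEDGE = [95.40, 90.67, 80.60, 78.29, 78.89, 62.13,
                  82.21, 92.79, 94.84, 78.12, 94.31, 75.38]


def benchmark_deltas(names, before, after):
    """Map each benchmark name to (after - before), rounded to two decimals."""
    return {n: round(a - b, 2) for n, b, a in zip(names, before, after)}


def mean(values):
    return sum(values) / len(values)


DELTAS = benchmark_deltas(BENCHMARKS, WITHOUT_KNOWLEDGE, WITH_KNOWLEDGE)
IMPROVED = [n for n, d in DELTAS.items() if d > 0]
REGRESSED = [n for n, d in DELTAS.items() if d < 0]
COMPOSITE_DELTA = round(mean(WITH_KNOWLEDGE) - mean(WITHOUT_KNOWLEDGE), 2)

if __name__ == "__main__":
    print("improved:", IMPROVED)
    print("regressed:", REGRESSED)
    print("composite delta:", COMPOSITE_DELTA)
```

Under the plain-mean assumption, the recomputed composite gain matches the +3.96 reported in the table, and the sign pattern shows eight improved and four regressed benchmarks.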

5.3.2 Ablation on Reinforcement Learning Design

To isolate the contribution of each training stage to the model’s agentic capabilities, we conduct a progressive ablation comparing three checkpoints: (1) the model after safety-oriented SFT only, (2) after adding tool-use RL, and (3) after further adding search-agent RL (the full Yuvion-32B Agent). All checkpoints share identical configurations for continued pretraining and SFT; the only difference is the scope of agentic RL training applied. Evaluation is conducted on the agentic benchmarks (API-Bank, BFCL, Seal-0), as shown in Table 11.

Table 11: Progressive ablation on agentic RL training stages. Accuracy (%) on agentic benchmarks.
| Training Stage | API-Bank | BFCL | Seal-0 |
|---|---|---|---|
| SFT only | 83.75 | 45.07 | 19.82 |
| + Tool Use RL | 88.78 | 54.64 | 31.53 |
| + Tool Use RL + Search Agent RL | 90.45 | 66.16 | 40.54 |

The SFT-only model exhibits degraded agentic performance compared with the base Qwen3-32B, reflecting a partial loss of general-purpose tool-use and search capabilities after intensive safety-oriented specialization. Adding tool-use RL yields substantial improvements on API-Bank (83.75 to 88.78) and BFCL (45.07 to 54.64), which directly evaluate function calling and tool selection, and also lifts Seal-0 (19.82 to 31.53). Further adding search-agent RL produces another large gain on Seal-0 (to 40.54) while further improving API-Bank (to 90.45) and BFCL (to 66.16) rather than merely preserving the earlier tool-use gains. Together, these results show that agentic RL primarily restores and extends dedicated agentic capability, while also contributing modest but consistent improvements on realistic in-house workflows, as seen in the in-house benchmark results above.
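The stage-to-stage gains discussed above follow from simple differences over Table 11. The sketch below computes them; the dictionary layout and the function name are illustrative choices, not part of the evaluation code.

```python
# Accuracy (%) copied from Table 11, ordered by training stage:
# SFT only, + Tool Use RL, + Tool Use RL + Search Agent RL.
ACCURACY = {
    "API-Bank": [83.75, 88.78, 90.45],
    "BFCL": [45.07, 54.64, 66.16],
    "Seal-0": [19.82, 31.53, 40.54],
}


def stage_gains(series):
    """Gain contributed by each added stage, in accuracy points."""
    return [round(later - earlier, 2) for earlier, later in zip(series, series[1:])]


GAINS = {bench: stage_gains(series) for bench, series in ACCURACY.items()}

if __name__ == "__main__":
    for bench, (tool_rl, search_rl) in GAINS.items():
        print(f"{bench}: tool-use RL +{tool_rl}, search-agent RL +{search_rl}")
```

Both RL stages add accuracy on all three benchmarks; on BFCL the second stage contributes a larger gain than the first.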

5.4 Case Study

5.4.1 Adversarial Evasion Robustness

To qualitatively validate our quantitative findings, we present representative cases spanning two violation categories (Drug Trafficking and Gambling & Fraud), each examined across three progressively adversarial variants: an explicit original, a lexically camouflaged variant, and a fully disguised adversarial variant. As illustrated in Figure 6, general-purpose models perform adequately on explicit violations but exhibit systematic degradation as evasion sophistication increases. In the Drug Trafficking category, replacing the drug term with the slang expression “liu bing” is detectable by both models; however, encoding the same intent via a numeral–emoji substitution of the colloquial expression within a venue recommendation query is sufficient to mislead the general-purpose model into treating the inquiry as a routine leisure activity consultation. In the Gambling & Fraud category, both the explicit inquiry and the euphemistic variant (“poker entertainment auxiliary device”) are successfully blocked; however, once the product-specific keyword “poker” is removed and the request is reframed as a neutral inquiry about “entertainment auxiliary devices” available through “relevant channels,” the general-purpose model returns a Pass decision, citing insufficient evidence of violation.

Figure 6: Case study comparing General LLM and Yuvion LLM across three progressively adversarial variants in two violation categories: Drug Trafficking and Gambling & Fraud. Gray, orange, and red headers indicate increasing levels of evasion sophistication. Blocked and Pass denote the model decision; colored tags indicate the predicted policy category assigned by Yuvion LLM.

Yuvion LLM maintains consistent and policy-grounded detection across all six cases. In the Drug Trafficking category, it reconstructs the numeral–emoji obfuscation as a colloquial reference to illicit substance use and correctly categorizes the content under Drug Trafficking regardless of the recreational framing. In the Gambling & Fraud category, it identifies the covert combination of category-neutral device referral and implicit channel solicitation as indicative of illegal goods promotion under the Gambling Devices policy category, even when no product-specific keyword is present. These qualitative observations confirm that the adversarially-aware training paradigm of Yuvion LLM—spanning domain-adaptive continued pretraining, risk-aware SFT, and RL-based policy optimization—equips the model with the semantic generalization necessary to detect violations that evade general-purpose models, while consistently grounding its decisions in explicit domain policy categories.

5.4.2 Agentic Task Execution

To qualitatively illustrate the behavioral impact of agentic reinforcement learning, we present representative cases comparing model outputs before and after agentic RL training. We examine two complementary scenarios—open-domain search and domain-specific business tool calling—to demonstrate that agentic RL induces systematic improvements in both general retrieval strategies and specialized workflow execution.

Case 1: Search avoidance → active decomposition.

Given the query “Which Vietnamese professor explained the mysterious case of two monks said to have been meditating for over 400 years?,” the pre-RL model returns a refusal: “could not be identified through the available search results. Further direct access … would be required.” It fails to decompose the query into actionable sub-searches and terminates prematurely. After agentic RL, the model instead decomposes the problem into sequential sub-queries—first identifying the cultural phenomenon (embalmed monks at Dau Pagoda), then narrowing to the specific scholar—ultimately retrieving the correct answer (Professor Nguyen Lan Cuong) through iterative evidence accumulation.

Case 2: Tool bypass → systematic tool-assisted verification.

In an intellectual property infringement review workflow, the model receives a trademark complaint case containing the product’s textual information (title, attributes, promotional text), multiple categories of product images (main images, detail images, SKU images), and the complainant’s registered trademark. The audit rules follow a cascading structure: Rule 1 checks whether the trademark appears verbatim in the product’s text fields; if not, Rule 2 checks whether it appears in the product images; subsequent rules handle special cases such as mixed Chinese–English trademarks and case-insensitive matching. The system provides check_image_tool (an image-based trademark detection tool) and finish (for submitting the final judgment), among other tools. The pre-RL model bypasses the tool invocation step entirely, directly calling finish with a judgment (“pass—the trademark was found in the product’s text information”), attempting to reason about trademark matching using only its own interpretation of the text fields and short-circuiting the image verification required by the cascading rules. The post-RL model correctly invokes check_image_tool with the relevant image categories (product main images and detail images) to perform trademark detection before making any judgment, following the prescribed evidence-first workflow.
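The cascading structure described in Case 2 can be sketched as ordinary control flow. This is a minimal illustration under stated assumptions, not the production workflow: the `TrademarkCase` type, the outcome strings, and the argument lists of `check_image_tool` and `finish` are hypothetical, since the text names the tools but does not give their signatures. Only Rules 1 and 2 are modeled; the special-case rules are left as a deferred outcome.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TrademarkCase:
    trademark: str
    text_fields: Dict[str, str]    # e.g. title, attributes, promotional text
    images: Dict[str, List[str]]   # image category -> image identifiers


# Hypothetical tool signatures (assumptions for illustration only).
ImageTool = Callable[[str, List[str]], bool]
FinishTool = Callable[[str], str]


def review_case(case: TrademarkCase,
                check_image_tool: ImageTool,
                finish: FinishTool) -> str:
    # Rule 1: does the trademark appear verbatim in any text field?
    if any(case.trademark in value for value in case.text_fields.values()):
        return finish("rule_1_text_match")

    # Rule 2: Rule 1 did not fire, so image evidence must be gathered
    # through the tool before any judgment is submitted.
    categories = [name for name, ids in case.images.items() if ids]
    if categories and check_image_tool(case.trademark, categories):
        return finish("rule_2_image_match")

    # Later rules (mixed-language trademarks, case-insensitive matching)
    # are not modeled in this sketch.
    return finish("defer_to_later_rules")
```

The point of the sketch is the ordering constraint: when the text check fails, `finish` is reachable only after `check_image_tool` has been called, which is the evidence-first behavior the post-RL model exhibits.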

Discussion.

These two cases illustrate complementary dimensions of behavioral change induced by agentic RL. In open-domain search (Case 1), the pre-RL model defaults to refusal when parametric recall is insufficient, whereas the post-RL model treats search as the primary reasoning mechanism—actively decomposing complex questions and grounding answers in retrieved evidence. In domain-specific business workflows (Case 2), the pre-RL model attempts to short-circuit rule-based processes by producing direct judgments without tool-assisted verification, whereas the post-RL model adheres to the prescribed workflow by invoking the appropriate tools to gather evidence before reaching a conclusion. Together, these cases demonstrate that agentic RL not only improves search and retrieval strategies but also instills disciplined tool-use behavior in structured business processes—a critical capability for real-world content safety deployment.

5.5 Summary of Evaluation Results

Taken together, the multi-level Yuvion LLM RiskEval provides a progressive and deployment-oriented assessment of Yuvion LLM, covering general capability retention, public safety comparability, domain-specific robustness, and real-world deployment value. Results across all levels consistently validate the effectiveness of Yuvion’s staged training paradigm.

A clear pattern emerges from the full evaluation. On open-source general benchmarks, Yuvion preserves broad general capability and remains broadly comparable to same-scale general-purpose open-weight models, showing that safety specialization does not materially undermine general utility. On safety-oriented evaluations, however, Yuvion achieves state-of-the-art performance with a clear cross-scale advantage. On open content safety benchmarks, Yuvion-32B reaches an average Macro F1 of 78.2% across 8 evaluation sets, surpassing the strongest baselines including Qwen3-Max (73.9%). On the self-constructed benchmark, Yuvion-32B achieves an Avg. of 94.2% on static evaluation sets and the lowest adversarial combined score of 20.6% on dynamic evaluation sets. On the in-house benchmarks, Yuvion-32B achieves 85.78 on the in-house domain benchmark composite and 86.34 on the business benchmark composite, while Yuvion-32B (Agent) further improves these to 86.10 and 87.34.

Two findings stand out. First, targeted safety-domain training consistently outperforms raw model scale: both Yuvion-8B and Yuvion-32B surpass substantially larger general-purpose models across public, adversarial, and in-house safety evaluations. Second, existing guard models are not sufficient for real-world content safety deployment: the near-zero scores of Qwen3Guard-8B and Llama-Guard4-12B on the in-house benchmarks reveal a large gap between generic guard capability and the fine-grained domain competence required in operational safety workflows. Overall, Yuvion demonstrates that it is possible to preserve broad general capability while achieving state-of-the-art performance in the safety domain.

6 Closed-Loop Iteration for Content Safety Model Evolution

Content safety and AI safety are an ongoing adversarial game: violating content producers continuously adapt their evasion strategies, creating an arms race in which any fixed model will inevitably fall behind. Yuvion is therefore embedded within a closed-loop iteration framework (Figure 7) that integrates four mechanisms—knowledge injection, adversarial game-playing, agentic capability reinforcement, and deployment-driven feedback—into a unified cycle operating across the full model lifecycle.

Figure 7: Closed-loop iteration framework for Yuvion model evolution. The four core mechanisms form a continuous cycle, supported by domain expert oversight and validated through the evaluation system before re-deployment.
Knowledge Injection.

A continuous pipeline feeds updated policy documents, regulatory guidelines, emerging risk case analyses, and expert annotations into the model through incremental continued pretraining. Domain experts identify knowledge gaps via systematic capability audits, construct targeted corpora, and validate that injected knowledge is absorbed without degrading existing competence.

Adversarial Game-Playing.

Rather than passively waiting for new evasion patterns to appear in production, the system proactively generates adversarial variants through three channels: (1) red-team self-play, where an attacker model generates evasion transformations against the current Yuvion checkpoint; (2) automated adversarial search over lexical substitution, semantic paraphrasing, code-switching, and contextual reframing; and (3) adversarial patterns extracted from real business traffic. Successful evasion examples are incorporated into the GRPO-based RL training loop, creating a virtuous cycle in which each iteration is intended to yield a model that is harder to evade than its predecessor.
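The selection step shared by these channels, keeping only variants that evade the current checkpoint, can be sketched as follows. Everything here is a toy stand-in: the `is_blocked` callable represents the current detector, the transforms represent attacker strategies, and the function name is hypothetical. The GRPO training step that would consume the mined examples is not shown.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Transform = Callable[[str], str]


def mine_evasions(seeds: Iterable[str],
                  transforms: Dict[str, Transform],
                  is_blocked: Callable[[str], bool]) -> List[Tuple[str, str, str]]:
    """Collect (seed, variant, transform_name) triples where the detector
    blocks the seed but misses the transformed variant."""
    found = []
    for seed in seeds:
        if not is_blocked(seed):
            continue  # only seeds the detector already catches are informative
        for name, transform in transforms.items():
            variant = transform(seed)
            if variant != seed and not is_blocked(variant):
                found.append((seed, variant, name))
    return found


if __name__ == "__main__":
    # Toy detector and transforms using a harmless placeholder keyword.
    detector = lambda text: "forbidden" in text.lower()
    toy_transforms = {
        "leet": lambda s: s.replace("o", "0"),
        "upper": str.upper,
    }
    for triple in mine_evasions(["buy forbidden item"], toy_transforms, detector):
        print(triple)
```

In this toy run the character substitution slips past the keyword detector while the case change does not, so only the former would be added to the pool of hard examples.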

Agentic Capability Reinforcement.

As Yuvion is deployed in workflows requiring tool invocation and multi-step planning, the framework continuously refines agentic skills by constructing realistic environment simulations, collecting feedback from production tool interactions to address failure modes in tool selection and parameter formulation, and progressively extending task complexity from single-tool invocations to multi-step investigation workflows.

Deployment-Driven Feedback.

Production deployment generates three feedback streams: human operator corrections, downstream QA flags (false positives/negatives), and adversarial drift signals indicating shifts in violation distributions. Feedback cases undergo failure analysis and are routed to the appropriate upstream mechanism—knowledge gaps trigger new injection, novel evasion patterns trigger adversarial training, and agentic failures trigger capability reinforcement. Updated models are validated on Yuvion LLM RiskEval and the specific failure patterns before re-deployment.
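The routing rule in this paragraph amounts to a small dispatch table. The sketch below encodes it; the enum members, queue names, and case identifiers are illustrative labels of our own, and the failure-analysis step that assigns a failure type to each case is assumed to have already happened.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class FailureType(Enum):
    KNOWLEDGE_GAP = "knowledge_gap"
    NOVEL_EVASION = "novel_evasion"
    AGENTIC_FAILURE = "agentic_failure"


# Failure type -> upstream mechanism, following the text above.
ROUTES: Dict[FailureType, str] = {
    FailureType.KNOWLEDGE_GAP: "knowledge_injection",
    FailureType.NOVEL_EVASION: "adversarial_training",
    FailureType.AGENTIC_FAILURE: "agentic_capability_reinforcement",
}


def route_feedback(cases: Iterable[Tuple[str, FailureType]]) -> Dict[str, List[str]]:
    """Group analyzed feedback case ids by the mechanism that should handle them."""
    queues: Dict[str, List[str]] = defaultdict(list)
    for case_id, failure in cases:
        queues[ROUTES[failure]].append(case_id)
    return dict(queues)


if __name__ == "__main__":
    analyzed = [
        ("case-001", FailureType.NOVEL_EVASION),
        ("case-002", FailureType.KNOWLEDGE_GAP),
        ("case-003", FailureType.NOVEL_EVASION),
    ]
    print(route_feedback(analyzed))
```

Keeping the mapping in one table makes it explicit that each diagnosed failure type feeds exactly one upstream mechanism before the updated model is re-validated.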

The four mechanisms correspond to the fundamental challenges of content safety co-evolution: the knowledge landscape changes, evasion strategies evolve, operational complexity grows, and production conditions reveal blind spots invisible during offline training. Each iteration cycle enriches the training corpus with real-world failure cases and expert-curated knowledge, enabling Yuvion to transition from reactive adaptation toward proactive defense.

7 Ethical Considerations

This work involves the construction and modeling of content safety data that by nature includes harmful and policy-violating material spanning categories such as pornography, violence, extremism, politically sensitive content, and fraud. All sensitive data are collected from real-world moderation workflows under strict access control, stored in isolated environments, and not released publicly. The adversarial variants constructed for dynamic benchmark evaluation are used exclusively for robustness assessment under controlled conditions and do not constitute a practical evasion toolkit.

8 Related Work

LLM training.

Modern large language models are typically developed through a two-phase paradigm: large-scale unsupervised pretraining on broad text corpora (Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023; Dubey et al., 2024; Qwen Team, 2025), followed by post-training that aligns model behavior with human intent through supervised fine-tuning and reinforcement learning from human feedback (Ouyang et al., 2022; Bai et al., 2022; Rafailov et al., 2023; He et al., 2025a,b; Xu et al., 2026b). This paradigm has been extended to domain-specific settings, where continued pretraining on domain corpora adapts general-purpose representations to specialized knowledge (Gururangan et al., 2020; Raffel et al., 2020; Xu et al., 2026a), and domain-targeted reinforcement learning further sharpens task-specific reasoning (Shao et al., 2024). Yuvion follows this staged design but differs in two key respects: first, its continued pretraining and post-training are optimized around adversarial content safety rather than general helpfulness; second, it introduces a dedicated agentic RL stage that extends the model beyond single-turn classification to multi-step, tool-augmented safety workflows, a capability largely absent from prior domain-adaptation pipelines.

LLM safety.

As large language models (LLMs) are increasingly deployed in real-world systems, safety has become a core requirement alongside capability and reliability. Existing efforts have improved LLM safety through alignment training, refusal shaping, constitutional or policy-based supervision, and dedicated guard models for harmful content detection and policy enforcement (Ouyang et al., 2022; Bai et al., 2022; Ganguli et al., 2022; Inan et al., 2023; Meta, 2025). These approaches have substantially improved performance on public benchmarks and practical moderation tasks, but are often optimized for clean and explicit inputs rather than robustness under adaptive human attack. Our work is motivated by this gap and treats safety as an inherently adversarial problem.

Adversarial robustness.

A growing body of work shows that LLM safety failures often arise through adversarial interaction rather than only naturally occurring inputs. Jailbreak attacks, prompt injection, role-playing, suffix attacks, and multi-turn manipulation can induce unsafe behavior even in strongly aligned models (Perez and Ribeiro, 2022; Perez et al., 2022; Wei et al., 2023; Zou et al., 2023; Wallace et al., 2024; Mazeika et al., 2024). In content safety settings, harmful intent can also be hidden through obfuscation, euphemism, coded language, or contextual disguise, creating a mismatch between human readability and model detectability. These findings suggest that safety cannot be understood purely as static classification on naturally distributed data, and motivate explicitly incorporating adversarial considerations into both training and evaluation.

Safety evaluation.

A number of benchmarks have been proposed to assess toxicity, harmfulness, jailbreak robustness, and guard-model performance, including RealToxicityPrompts, HarmBench, and LlamaGuard-style safety benchmarks (Gehman et al., 2020; Perez et al., 2022; Inan et al., 2023; Mazeika et al., 2024; Meta, 2025). These resources have played an important role in standardizing safety measurement. At the same time, prior work has noted that benchmark performance may substantially overestimate deployment robustness, especially when models face transformed, indirect, or interaction-based attacks (Wei et al., 2023; Zou et al., 2023; Mazeika et al., 2024). This motivates evaluation frameworks that go beyond clean public benchmarks and include adversarially transformed and deployment-oriented settings.

Agentic LLMs.

Recent work has extended LLMs from single-turn text generation to agentic settings involving planning, tools and skills (Liang et al., 2026), retrieval, and multi-step interaction (Schick et al., 2023; Yao et al., 2023; Paranjape et al., 2023; Wang et al., 2024a). These capabilities are increasingly relevant to realistic safety workflows, where models may need to retrieve policies, invoke tools, analyze evidence, or coordinate decisions across multiple steps. Reinforcement learning has also been explored for improving tool use and trajectory-level decision making in such environments (Zeng et al., 2023). Our work is related to this line of research, but differs in focusing specifically on safety-oriented agentic capability: we seek not only general task completion, but robust and policy-consistent behavior in adversarial and deployment-oriented safety scenarios.

9 Conclusion

We present Yuvion LLM, a large language model built for adversarially robust content safety and broader AI safety. Motivated by the fundamental mismatch between general-purpose LLM training assumptions and the adversarial, policy-grounded demands of real-world content safety, we develop a progressive safety training paradigm spanning adversarially aware data construction, knowledge-enhanced continued pretraining, policy-grounded multi-task safety post-training, and safety-aware agentic reinforcement learning. We further introduce Yuvion LLM RiskEval (YLRE), a multi-level evaluation framework that systematically measures performance from general capability retention to adversarial robustness and real-world deployment value.

Experimental results show that Yuvion LLM significantly outperforms both general-purpose models and existing guard models under adversarial conditions, while maintaining strong general language competence. More broadly, this work suggests that content safety should be treated as a specialized foundation-model domain with its own training paradigm, evaluation methodology, and deployment constraints. The adversarial gap exposed here is not merely a minor engineering issue, but a more fundamental limitation rooted in how general-purpose models are trained and assessed. We hope Yuvion LLM and the accompanying framework serve as a useful reference for future research and practice in risk governance.

10 Limitations and Future Work

Despite promising results, Yuvion LLM has several limitations that warrant acknowledgment. First, while our adversarial training significantly improves robustness against known evasion strategies, the arms race between moderation systems and violating content producers is continuous—novel evasion patterns that fall outside the distribution of our training data may still pose challenges. Second, the current evaluation is conducted primarily in Chinese- and English-language safety scenarios; generalization to broader multilingual and cross-cultural moderation contexts remains to be systematically assessed. Third, although Yuvion LLM retains competitive general language competence, a modest performance gap relative to the base model persists on certain general benchmarks, suggesting room for further improvement in balancing domain specialization and general capability retention.

Future work will focus on three directions: (1) building a continuous red-teaming and closed-loop data refresh pipeline to keep pace with evolving evasion strategies in production environments; (2) extending the training and evaluation framework to multilingual safety scenarios; and (3) further strengthening the model’s agentic capabilities to support more complex, multi-step safety governance workflows, including policy retrieval, evidence attribution, and cross-system tool invocation.

Authors

Core Contributors:

Ting Ma, Xiufeng Huang, Benlei Cui, Xiaowen Xu, Shikai Qiu, Ruijie Jian, Hongxing Li, Guanghui Wang, Longtao Huang, Haiwen Hong†∗ (correspondence to: honghaiwen.hhw@alibaba-inc.com)

( denotes equal contribution, and denotes the corresponding author and project lead.)

Contributors:

Haolei Xu, Wenjing Jiang, Ziwen Xu, Zhaoyu Fan, Shaoxuan He, Chuxi Xiao, Yujian Li, Xinyue Chen, Chunyang Chai, Wenxuan Liu, Ziheng Wang, Dongjie Zhang, Yangfan Zhou, Libin Dong, Yupeng Cao, Xiaoqian Xia, Jing Wang, Zhe Jiang, Zhenan Ye, Guang Yang, Bin Liu, Wei Peng, Ziqiang Zhu, Meihui Lian, Kaiwen Lv Kacuila, Haidong Ding, Bingyu Zhu, Yan Wang, Hai Zhao, Xuan Jin, Wei Zhao, Pengfei Sun, Wei Wang, Huiming Zhang, Bin Li, Hui Xue

References

  • J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. Aleman, D. Almeida, J. Altenschmidt, et al. (2023) GPT-4 technical report. Technical report OpenAI. Note: arXiv:2303.08774 Cited by: §1.
  • Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, D. Ganguli, T. Henighan, N. Joseph, et al. (2022) Constitutional ai: harmlessness from ai feedback. arXiv preprint arXiv:2212.08073. Cited by: §8, §8.
  • D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman (2019) Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of The 2019 World Wide Web Conference (WWW), pp. 491–500. Cited by: 7th item, §4.3.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1877–1901. Cited by: §1, §8.
  • A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. (2022) PaLM: scaling language modeling with pathways. arXiv preprint arXiv:2204.02311. Cited by: §8.
  • C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019) BoolQ: exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 2924–2936. Cited by: 4th item.
  • P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457. Cited by: 1st item, §4.2.
  • K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: 3rd item, §4.2.
  • OpenCompass Contributors (2023) OpenCompass: a universal evaluation platform for foundation models. Note: https://github.com/open-compass/opencompass Cited by: §B.1.
  • DeepSeek-AI (2025a) DeepSeek-r1: incentivizing reasoning capability in llms via reinforcement learning. External Links: Link Cited by: §5.1.
  • DeepSeek-AI (2025b) DeepSeek-v3.2: pushing the frontier of open large language models. External Links: Link Cited by: §5.1.
  • J. Deng, J. Zhou, H. Sun, C. Zheng, F. Mi, H. Meng, and M. Huang (2022) COLD: a benchmark for chinese offensive language detection. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 11580–11599. Cited by: 2nd item, §4.3.
  • D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 2368–2378. Cited by: 4th item.
  • A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: §8.
  • D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al. (2022) Red teaming language models to reduce harms: methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858. Cited by: §8.
  • S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith (2020) RealToxicityPrompts: evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356–3369. Cited by: §8.
  • A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao, X. Du, et al. (2024) Are we done with mmlu?. arXiv preprint arXiv:2406.04127. Cited by: 1st item, §4.2.
  • S. Ghosh, P. Varshney, E. Galinkin, and C. Parisien (2024) AEGIS: online adaptive ai content safety moderation with ensemble of llm experts. arXiv preprint arXiv:2404.05993. Cited by: §4.3.
  • GLM-5-Team (2026) GLM-5: from Vibe Coding to Agentic Engineering. External Links: 2602.15763, Link Cited by: §5.1.
  • S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith (2020) Don’t stop pretraining: adapt language models to domains and tasks. In Proceedings of ACL, Cited by: §8.
  • T. Hartvigsen, S. Gabriel, H. Palangi, M. Sap, D. Ray, and E. Kamar (2022) ToxiGen: a large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3309–3326. Cited by: 5th item, §4.3.
  • B. He, N. Ding, C. Qian, J. Deng, G. Cui, L. Yuan, H. Hong, H. Gao, L. Huang, H. Chen, et al. (2025a) The right time matters: data arrangement affects zero-shot generalization in instruction tuning. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 222–243. Cited by: §8.
  • B. He, W. Zhang, J. Song, C. Qian, Z. Fu, B. Sun, N. Ding, H. Hong, L. Huang, H. Xue, et al. (2025b) Air: a systematic analysis of annotations, instructions, and response pairs in preference dataset. arXiv preprint arXiv:2504.03612. Cited by: §8.
  • D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021a) Measuring massive multitask language understanding. International Conference on Learning Representations. Cited by: 1st item, §4.2.
  • D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021b) Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874. Cited by: 3rd item.
  • Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, Y. Fu, M. Sun, and J. He (2023) C-eval: a multi-level multi-discipline chinese evaluation suite for foundation models. In Advances in Neural Information Processing Systems, Vol. 36, pp. 62991–63010. Cited by: 2nd item, §4.2.
  • H. Inan, K. Upasani, J. Chi, et al. (2023) Llama guard: llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674. Cited by: §1, §1, §8, §8.
  • Jigsaw/Conversation AI (2018) Toxic comment classification challenge. Note: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge Cited by: 6th item, §4.3.
  • M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1601–1611. Cited by: 1st item, §4.2.
  • Kimi Team (2026) Kimi K2.5: Visual Agentic Intelligence. External Links: 2602.02276, Link Cited by: §5.1.
  • H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin (2023a) CMMLU: measuring massive multitask language understanding in chinese. arXiv preprint arXiv:2306.09212. Cited by: 2nd item, §4.2.
  • M. Li, F. Song, B. Yu, H. Yu, Z. Li, F. Huang, and Y. Li (2023b) API-Bank: a comprehensive benchmark for tool-augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3102–3116. Cited by: 1st item, §4.2, §5.2.1.
  • Y. Liang, R. Zhong, H. Xu, C. Jiang, Y. Zhong, R. Fang, J. Gu, S. Deng, Y. Yao, M. Wang, et al. (2026) Skillnet: create, evaluate, and connect ai skills. arXiv preprint arXiv:2603.04448. Cited by: §8.
  • J. Lin, M. Liu, X. Huang, J. Li, H. Hong, X. Yuan, Y. Chen, L. Huang, H. Xue, R. Duan, Z. Chen, Y. Fu, D. Li, L. Gao, and Y. Yang (2026) YuFeng-xguard: a reasoning-centric, interpretable, and flexible guardrail model for large language models. arXiv preprint arXiv:2601.15588. Cited by: §A.2, 3rd item, §4.3, §5.2.2, §5.2.2.
  • K. Liu, S. Cheng, B. Tian, X. Liang, Y. Yin, M. Han, N. Zhang, B. Hooi, X. Chen, and S. Deng (2025) ChineseHarm-bench: a chinese harmful content detection benchmark. arXiv preprint arXiv:2506.10960. Cited by: 1st item, §4.3.
  • T. Markov, C. Zhang, S. Agarwal, T. Eloundou, T. Lee, S. Adler, A. Jiang, and L. Weng (2022) A holistic approach to undesired content detection. arXiv preprint arXiv:2208.03274. Cited by: 3rd item, §4.3.
  • B. Mathew, P. Saha, S. M. Yimam, C. Biemann, P. Goyal, and A. Mukherjee (2021) HateXplain: a benchmark dataset for explainable hate speech detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 14867–14875. Cited by: 4th item, §4.3.
  • M. Mazeika et al. (2024) HarmBench: a standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249. Cited by: §1, §8, §8.
  • Meta (2025) Llama guard 4 model card. Note: https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-4/ Cited by: §1, §1, §5.1, §8, §8.
  • T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018) Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2381–2391. Cited by: 1st item, §4.2.
  • MiniMax Team (2026) MiniMax-M2.5. External Links: Link Cited by: §5.1.
  • OpenAI (2026) GPT-5.4. Note: https://openai.com/ Cited by: §5.1.
  • L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe (2022) Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, Vol. 35, pp. 27730–27744. Cited by: §8, §8.
  • A. Paranjape, S. Lundberg, M. T. Ribeiro, H. Hajishirzi, et al. (2023) ART: automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014. Cited by: §8.
  • S. G. Patil, H. Mao, C. Cheng-Jie Ji, F. Yan, V. Suresh, I. Stoica, and J. E. Gonzalez (2024) The berkeley function calling leaderboard (bfcl): from tool use to agentic evaluation of large language models. In Advances in Neural Information Processing Systems, Cited by: 2nd item, §4.2, §5.2.1.
  • E. Perez, S. Ringer, K. Lukošiūtė, K. Nguyen, E. Chen, S. Heiner, L. Zettlemoyer, et al. (2022) Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251. Cited by: §1, §8, §8.
  • F. Perez and I. Ribeiro (2022) Ignore previous prompt: attack techniques for language models. arXiv preprint arXiv:2211.09527. Cited by: §8.
  • C. Qian, E. C. Acikgoz, Q. He, H. Wang, X. Chen, D. Hakkani-Tür, G. Tur, and H. Ji (2025) ToolRL: reward is all tool learning needs. arXiv preprint arXiv:2504.13958. Cited by: §3.5.
  • Qwen Team and Qwen Team (2026) Qwen3.6-Plus: towards real world agents. External Links: Link Cited by: §5.1.
  • Qwen Team (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: §1, §1, §5.1, §5.1, §8.
  • Qwen Team (2026) Qwen3.5: towards native multimodal agents. External Links: Link Cited by: §5.1, §5.1.
  • R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. In Advances in Neural Information Processing Systems, Vol. 36, pp. 53728–53741. Cited by: §8.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. Cited by: §8.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 784–789. Cited by: 4th item.
  • D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023) GPQA: a graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022. Cited by: 1st item, §4.2.
  • K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi (2020) WinoGrande: an adversarial winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8732–8740. Cited by: 4th item.
  • T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023) Toolformer: language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, Vol. 36, pp. 68539–68551. Cited by: §1, §8.
  • Z. Shao, P. Wang, Q. Zhu, et al. (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §8.
  • K. Sun, D. Yu, D. Yu, and C. Cardie (2020) Investigating prior knowledge for challenging chinese machine reading comprehension. Transactions of the Association for Computational Linguistics 8, pp. 141–155. Cited by: 2nd item, §4.2.
  • M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei (2022) Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261. Cited by: 4th item, §4.2.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158. Cited by: 4th item.
  • Tongyi DeepResearch Team, B. Li, B. Zhang, D. Zhang, F. Huang, G. Li, G. Chen, H. Yin, J. Wu, J. Zhou, et al. (2025) Tongyi DeepResearch technical report. arXiv preprint arXiv:2510.24701. Cited by: 3rd item, §4.2, §5.2.1.
  • H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. Cited by: §8.
  • E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel (2024) The instruction hierarchy: training llms to prioritize privileged instructions. arXiv preprint arXiv:2404.13208. Cited by: §8.
  • X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji (2024a) Executable code actions elicit better llm agents. arXiv preprint arXiv:2402.01030. Cited by: §1, §8.
  • Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. (2024b) MMLU-pro: a more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574. Cited by: 1st item, §4.2.
  • A. Wei, N. Haghtalab, and J. Steinhardt (2023) Jailbroken: how does llm safety training fail?. In Advances in Neural Information Processing Systems, Vol. 36, pp. 80079–80110. Cited by: §1, §8, §8.
  • L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, et al. (2020) CLUE: a chinese language understanding evaluation benchmark. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pp. 4762–4772. Cited by: 2nd item, §4.2.
  • Z. Xu, H. Hong, L. Yu, B. Cui, L. Huang, H. Xue, and N. Zhang (2026a) How lora remembers? a parametric memory law for llm finetuning. arXiv preprint arXiv:2605.30260. Cited by: §8.
  • Z. Xu, C. Wu, H. Sun, H. Hong, M. Wang, Y. Yao, L. Huang, H. Xue, S. Deng, Z. Chu, et al. (2026b) Why steering works: toward a unified view of language model parameter dynamics. arXiv preprint arXiv:2602.02343. Cited by: §8.
  • S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023) ReAct: synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), Cited by: §1, §8.
  • X. Yuan, J. Li, D. Wang, Y. Chen, X. Mao, L. Huang, J. Chen, H. Xue, X. Liu, W. Wang, K. Ren, and J. Wang (2024) S-eval: automatic and adaptive test generation for benchmarking safety evaluation of large language models. arXiv preprint arXiv:2405.14191. Cited by: §4.3.
  • R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 4791–4800. Cited by: 4th item.
  • A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang (2023) AgentTuning: enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823. Cited by: §8.
  • Z. Zhang, L. Lei, L. Wu, R. Sun, Y. Huang, C. Long, X. Liu, X. Lei, J. Tang, and M. Huang (2023) SafetyBench: evaluating the safety of large language models. arXiv preprint arXiv:2309.07045. Cited by: 8th item, §4.3.
  • H. Zhao, C. Yuan, F. Huang, X. Hu, Y. Zhang, A. Yang, B. Yu, D. Liu, J. Zhou, J. Lin, et al. (2025) Qwen3Guard technical report. arXiv preprint arXiv:2510.14276. Cited by: §5.1.
  • C. Zheng, M. Huang, and A. Sun (2019) ChID: a large-scale chinese idiom dataset for cloze test. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 778–787. Cited by: 2nd item, §4.2.
  • A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson (2023) Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043. Cited by: §8, §8.

Appendix A Detailed Benchmark Descriptions

This appendix provides additional details on the benchmark suites used in the Yuvion LLM RiskEval. For each benchmark group, we summarize its evaluation purpose, task format, metric, and the aspect of model quality it is intended to measure.

A.1 Open-source General Benchmarks

The open-source general benchmark suite is used to evaluate whether Yuvion preserves broad general-purpose capability after domain-specific continued pretraining and safety-oriented post-training. The suite contains more than 30 public evaluation sets and is organized into two groups: general-purpose benchmarks and agentic benchmarks. Unless otherwise specified, we report Accuracy as the primary metric.

General-purpose benchmarks.

The general-purpose benchmark group covers four capability dimensions: knowledge understanding, Chinese language understanding, mathematical reasoning, and commonsense and reading comprehension. These benchmarks are intended to test whether safety specialization preserves the broad utility of the base model.

  • Knowledge understanding. This group includes benchmarks such as MMLU (Hendrycks et al., 2021a), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024b), GPQA (Rein et al., 2023), ARC-Challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), TriviaQA (Joshi et al., 2017), and Xiezhi-EN. These tasks measure factual knowledge, professional-domain reasoning, and general scientific understanding in English.

  • Chinese language understanding. This group includes C-Eval (Huang et al., 2023), CMMLU (Li et al., 2023a), C3 (Sun et al., 2020), CHID (Zheng et al., 2019), CLUEWSC (Xu et al., 2020), OCNLI, CSEM, and Xiezhi-CN. These tasks evaluate Chinese knowledge understanding, idiom completion, reading comprehension, natural language inference, and commonsense reasoning.

  • Mathematical reasoning. This group includes GSM8K-ZH (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), APE210K, TAL-SCQ5K-CN, TAL-SCQ5K-EN, and TheoremQA. These tasks evaluate arithmetic problem solving, formal mathematical reasoning, and theorem-related question answering in both Chinese and English.

  • Commonsense and reading comprehension. This group includes BoolQ (Clark et al., 2019), CommonsenseQA (Talmor et al., 2019), HellaSwag (Zellers et al., 2019), PIQA, SIQA, WinoGrande (Sakaguchi et al., 2020), DROP (Dua et al., 2019), SQuAD 2.0 (Rajpurkar et al., 2018), StoryCloze, BBH (Suzgun et al., 2022), and WPLC. These tasks test commonsense reasoning, reading comprehension, multi-step inference, and robustness on open-ended or ambiguous problems.

Taken together, these benchmarks are used to assess whether Yuvion remains broadly comparable to same-scale general-purpose models after safety-oriented specialization.

Agentic benchmarks.

The agentic benchmark group evaluates capabilities relevant to Yuvion’s agentic safety workflows, including tool use, function calling, and multi-step interactive problem solving. This group is used primarily to measure the effect of the dedicated agentic RL stage.

We use three representative benchmarks:

  • API-Bank (Li et al., 2023b). API-Bank evaluates tool selection and tool application capability in multi-turn dialogue settings. The benchmark spans 73 API tools and contains tasks of increasing difficulty, requiring the model to choose appropriate tools and generate valid tool calls under natural language instructions.

  • BFCL (Patil et al., 2024). The Berkeley Function Calling Leaderboard evaluates function-calling capability across multiple dimensions, including abstract syntax tree (AST) matching, execution accuracy, live API interaction, relevance detection, and multi-turn function-calling behavior.

  • Seal-0. Seal-0 is a search-agent benchmark built on the Tongyi DeepResearch framework (Tongyi DeepResearch Team et al., 2025). It evaluates whether the model can decompose complex information-seeking questions, issue effective search actions, gather evidence across multiple turns, and produce a final answer grounded in retrieved information.

Compared with the general-purpose benchmark group, the agentic benchmark group emphasizes structured action generation and trajectory-level decision making rather than static one-shot question answering.
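To make the idea of AST-based matching of function calls concrete, the following is a minimal sketch that parses a model-generated call with Python's standard `ast` module and compares the function name and keyword arguments against an expected call. It is an illustration only and not BFCL's actual checker; the helper names `parse_call` and `ast_match`, the example tool `get_weather`, and the choice to reject positional arguments are all assumptions.

```python
import ast

def parse_call(call_source):
    """Parse a single call such as f(a=1) into (name, keyword-argument dict)."""
    node = ast.parse(call_source, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    if node.args:
        # Assumption for this sketch: only keyword arguments are accepted.
        raise ValueError("positional arguments are not supported")
    name = ast.unparse(node.func)  # requires Python 3.9+
    # literal_eval raises ValueError for non-literal values (e.g. variables).
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs

def ast_match(call_source, expected_name, expected_args):
    """Return True if the call parses and matches the expected name and arguments."""
    try:
        name, kwargs = parse_call(call_source)
    except (SyntaxError, ValueError):
        return False
    return name == expected_name and kwargs == expected_args

# Hypothetical tool call and reference answer.
expected = {"city": "Paris", "days": 3}
print(ast_match('get_weather(city="Paris", days=3)', "get_weather", expected))
print(ast_match('get_weather(city="Paris", days="3")', "get_weather", expected))
```

Comparing parsed structures rather than raw strings makes the check insensitive to whitespace and argument order, while still catching type errors such as passing `"3"` where an integer is expected.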

A.2 Open Content Safety Benchmarks

The open content safety benchmark suite evaluates Yuvion on public safety tasks and supports direct comparison with existing models. It is divided into two groups: content safety benchmarks and guard benchmarks. For classification-oriented tasks, we report Macro F1-Score as the primary metric.

Content safety benchmarks.

The content safety benchmarks focus on recognizing harmful, unsafe, or policy-violating content. This group contains 8 evaluation sets spanning both Chinese and English, and covers risks such as pornography, fraud, offensive language, hate speech, and implicit toxicity.

The benchmarks used in the main paper include:

  • ChineseHarm (Liu et al., 2025), a Chinese-language benchmark for harmful content recognition.

  • COLD (Deng et al., 2022), which evaluates detection of toxic, offensive, or unsafe content patterns under a public safety taxonomy.

  • Moderation (Markov et al., 2022), a benchmark focused on general harmful-content moderation.

  • HateXplain (Mathew et al., 2021), which evaluates hate speech and offensive content detection in English.

  • ToxiGen (Hartvigsen et al., 2022), which focuses on implicit and adversarially phrased toxic content.

  • Jigsaw (Jigsaw/Conversation AI, 2018), a widely used English toxicity benchmark.

  • CivilComments (Borkan et al., 2019), which evaluates toxicity detection under a more naturally distributed comment corpus.

  • SafetyBench (Zhang et al., 2023), which measures broader safety recognition capability under public safety prompts.

These datasets differ in label space and risk taxonomy. In our evaluation protocol, we map model outputs to the benchmark-specific label space and report Macro F1 to reduce the impact of class imbalance and to better reflect balanced recognition performance across categories.
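As an illustration of this protocol, the sketch below maps raw model outputs into a benchmark label space and computes Macro F1 as the unweighted mean of per-class F1. The mapping table `LABEL_MAP`, the `"invalid"` fallback for unparseable outputs, and the toy data are assumptions made for the example; they are not the mappings used in the paper.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 over the benchmark's label set."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no gold and no predicted instances contributes 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical mapping from free-form model outputs to a binary label space.
LABEL_MAP = {
    "toxic": "unsafe", "offensive": "unsafe", "unsafe": "unsafe",
    "benign": "safe", "safe": "safe",
}

def map_output(raw):
    """Normalize a raw output; unknown strings become 'invalid' (always wrong)."""
    return LABEL_MAP.get(raw.strip().lower(), "invalid")

gold = ["unsafe", "unsafe", "safe", "safe", "safe", "safe"]
raw_outputs = ["Toxic", "benign", "safe", "safe", "offensive", "safe"]
pred = [map_output(r) for r in raw_outputs]
print(macro_f1(gold, pred, labels=["safe", "unsafe"]))  # 0.625
```

In this toy case the minority class (F1 = 0.5) and the majority class (F1 = 0.75) count equally, which is why Macro F1 is less dominated by the frequent class than plain accuracy.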

Guard benchmarks.

The guard benchmarks focus on safety judgment, refusal behavior, and safety-aligned response capability. This group contains more than 20 evaluation sets and is evaluated following the guard benchmark protocol used in YuFeng-XGuard (Lin et al., 2026).

The guard suite covers five dimensions:

  • Prompt classification, which evaluates whether the model can detect unsafe user prompts.

  • Response classification, which evaluates whether the model can detect unsafe model responses.

  • Multilingual classification, which extends safety detection to non-English and mixed-language settings.

  • Attack defense, which tests robustness to jailbreaks, prompt attacks, or adversarial reformulations.

  • Safe completion, which evaluates whether the model can generate safe and policy-aligned responses under risky inputs.

In addition, we report the false positive rate (FPR) on benign instruction-following datasets such as Alpaca and Belle to quantify over-refusal. This complements F1-style detection metrics by measuring whether the model remains usable on harmless user requests.
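Because every input in a benign dataset is harmless by construction, the false positive rate reduces to the fraction of inputs that the model flags as unsafe. A minimal sketch, assuming predictions have already been normalized to the strings `"safe"` and `"unsafe"` (the label names and the toy predictions are illustrative):

```python
def false_positive_rate(predictions, positive_label="unsafe"):
    """FPR on an all-benign set: flagged inputs divided by total inputs."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    flagged = sum(1 for p in predictions if p == positive_label)
    return flagged / len(predictions)

# Hypothetical judgments on eight harmless instruction-following prompts.
benign_predictions = ["safe", "safe", "unsafe", "safe", "safe", "safe", "safe", "safe"]
print(false_positive_rate(benign_predictions))  # 0.125
```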

A.3 Self-Constructed Adversarial Robustness Benchmark

Public benchmarks are useful for cross-model comparison, but they do not fully capture the policy granularity and adversarial distribution shifts encountered in real deployment. To address this gap, we construct our own adversarial robustness benchmark, designed to better reflect realistic domain-specific moderation challenges.

This benchmark covers five major risk categories:

  • Advertising & Traffic Diversion

  • Gambling & Fraud

  • Abusive Content

  • Pornographic Content

  • Spam & Flooding

The benchmark is divided into two parts: static evaluation sets and dynamic evaluation sets.

Static evaluation sets.

The static evaluation sets focus on relatively stable and canonical risk expressions under standard-distribution conditions. They are intended to measure baseline domain recognition capability when the model is presented with standard unsafe expressions that are semantically clear and annotation-consistent.

For the static sets, we report Macro F1-Score. In addition to the full average across all five categories, we also report Avg., which excludes the Spam & Flooding category. This exclusion is motivated by the fact that flooding behavior often exhibits ambiguous boundaries with benign low-information or repetitive content, making it substantially noisier than the other four categories.
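The two averages differ only in whether the Spam & Flooding category is included. The sketch below shows the arithmetic; the per-category values are made-up placeholders for illustration and are not results from the paper, and the helper name `mean_score` is an assumption.

```python
# Hypothetical per-category Macro F1 values (placeholders, not paper results).
static_f1 = {
    "Advertising & Traffic Diversion": 0.90,
    "Gambling & Fraud": 0.80,
    "Abusive Content": 0.70,
    "Pornographic Content": 0.80,
    "Spam & Flooding": 0.50,
}

def mean_score(scores, exclude=()):
    """Average category scores, optionally leaving out some categories."""
    kept = [value for name, value in scores.items() if name not in exclude]
    if not kept:
        raise ValueError("no categories left to average")
    return sum(kept) / len(kept)

avg_all = mean_score(static_f1)                                # five categories
avg_core = mean_score(static_f1, exclude={"Spam & Flooding"})  # four categories
print(round(avg_all, 2), round(avg_core, 2))  # 0.74 0.8
```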

Dynamic evaluation sets.

The dynamic evaluation sets are designed to evaluate adversarial robustness under more realistic attack conditions. They contain recent, transformed, adversarial, and evolving unsafe expressions generated through an automated red-teaming framework. Compared with the static sets, these examples are less canonical and more likely to contain paraphrasing, camouflage, euphemistic wording, or structurally transformed expressions intended to bypass safety filters.

For the dynamic sets, we use a combined score metric defined as:

$$\text{Combined Score}=\text{Bypass Success Rate}\times\text{Semantic Fidelity},$$

where a lower score indicates better robustness. This metric captures two aspects simultaneously: whether the attack successfully bypasses the model’s safety judgment, and whether the adversarial rewrite preserves the original unsafe meaning. Over-obfuscated or semantically distorted rewrites are penalized, since they do not represent realistic successful attacks.
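A small worked example may help. The text does not specify how the two factors are aggregated, so the sketch below assumes that the bypass success rate is the fraction of adversarial rewrites the model judges as safe, and that semantic fidelity is the mean of per-rewrite fidelity scores in [0, 1] produced by some external judge. Both the aggregation and the toy numbers are assumptions.

```python
def combined_score(bypassed, fidelity):
    """Bypass success rate times mean semantic fidelity (lower is better)."""
    if not bypassed or len(bypassed) != len(fidelity):
        raise ValueError("inputs must be non-empty and of equal length")
    bypass_rate = sum(1 for b in bypassed if b) / len(bypassed)
    mean_fidelity = sum(fidelity) / len(fidelity)
    return bypass_rate * mean_fidelity

# Four hypothetical adversarial rewrites of unsafe content.
bypassed = [True, False, True, False]  # did the rewrite evade detection?
fidelity = [0.9, 0.8, 0.5, 0.6]        # how much unsafe meaning is preserved
print(round(combined_score(bypassed, fidelity), 2))  # 0.35
```

Here half of the attacks bypass the model and the rewrites preserve 70% of the original meaning on average, giving 0.5 × 0.7 = 0.35. An attack set that bypasses detection only by destroying the unsafe meaning would have low fidelity and therefore a low combined score.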

The dynamic benchmark is intended to test whether a model can generalize beyond memorized lexical patterns and remain robust to adversarially evolved unsafe content.

A.4 In-house Capability and Business Benchmarks

The in-house benchmark suite is designed to evaluate whether Yuvion is usable in realistic operational safety workflows. Unlike public safety benchmarks, which mainly measure generic detection or refusal capability, the in-house suite focuses on domain-specific competence, policy-grounded judgment, and workflow-level usefulness in production settings.

This suite contains more than 15 evaluation sets organized into two groups: capability benchmarks and business benchmarks.

Capability benchmarks.

The capability benchmarks evaluate domain competence beyond simple harmful-content classification. These tasks test whether the model can perform fine-grained risk understanding, policy-aware judgment, and structured safety reasoning under realistic moderation requirements.

The capability benchmarks used in the main paper include:

  • Political Risk, which evaluates recognition of political risk categories under domain policy definitions.

  • Political Entity, which evaluates political-entity-related understanding and classification.

  • Knowledge MCQ, a knowledge-oriented multiple-choice benchmark for safety-domain concepts and policy knowledge.

  • Redline Text, which evaluates recognition of redline or prohibited expressions under policy-specific standards.

  • Domain Instruction Following, which measures whether the model follows domain-specific moderation instructions and output requirements.

  • Political NER, which evaluates named entity recognition for political entities.

  • Prohibited Content, which measures fine-grained prohibited-content classification.

  • Insult, which evaluates abusive or insulting content recognition.

  • Low-Info Text, which measures recognition of low-information, meaningless, or low-quality text.

  • Porn Text, which evaluates fine-grained recognition of pornographic text content.

  • Emotion Analysis, which evaluates affective or emotional interpretation relevant to domain decision-making.

These benchmarks are aggregated into an Overall score in the main paper.

Business benchmarks.

The business benchmarks focus on workflow-level applications and are designed to measure whether the model is practically useful in operational safety systems. Compared with the capability benchmarks, these tasks place greater emphasis on structured outputs, operational judgments, and decision support in real review pipelines.

The business benchmarks used in the main paper include:

  • UGC Moderation, which evaluates moderation performance on user-generated content review workflows.

  • AIGC Moderation, which evaluates moderation capability on AI-generated content scenarios.

  • Business Porn Detection, which evaluates pornography-related recognition under a business moderation setting.

  • Multi-Scenario Risk Detection, which evaluates risk recognition across heterogeneous business scenes and content types.

  • Data Security NER, which evaluates named entity recognition for data-security-sensitive content.

These benchmarks are aggregated into a business Overall score in the main paper.

Role in the evaluation framework.

Together, the capability and business benchmarks measure whether the model has moved beyond generic public safety competence to become a deployable safety model. They are therefore used as the most deployment-oriented evaluation layer in Yuvion LLM RiskEval. In addition to Yuvion-8B and Yuvion-32B, we also evaluate Yuvion-32B (Agent) on this suite to measure whether the dedicated agentic RL stage yields incremental improvements on realistic safety workflows.

Appendix B Evaluation Protocol Details

This appendix provides additional details on the evaluation protocol used in the Yuvion LLM RiskEval, including prompt construction, decoding settings, metric definitions, and composite-score computation. Unless otherwise specified, all models are evaluated under the same task formulation and decoding regime within each benchmark to ensure fair comparison.

B.1 Prompt Templates and Decoding Settings

General principle.

For the general-purpose benchmarks, we adopt the OpenCompass evaluation protocol (Contributors, 2023) to ensure standardized and comparable testing. For open-weight models, we use the officially released instruction-tuned checkpoints together with the matching tokenizer and prompt format. For proprietary models, we access the models via their official APIs and use the closest available instruction-following interface at evaluation time.

Prompting strategy by benchmark type.

We use benchmark-type-specific prompting rather than a single universal prompt.

  • Multiple-choice and general reasoning tasks. For open-source general benchmarks such as MMLU, C-Eval, GPQA, and related tasks, the prompt presents the original question and candidate options in benchmark-standard format. The model is instructed to output the final answer in a constrained form compatible with automatic answer extraction. Where applicable, we request only the option label or final short answer to reduce variance from verbose generation.

  • Content safety classification tasks. For open content safety benchmarks and the static self-constructed benchmark, the prompt presents the input text together with a concise task instruction describing the target label space. The model is required to output a single class label or a structured label string corresponding to the benchmark-specific taxonomy. For datasets whose original labels differ from our internal taxonomy, we preserve the benchmark’s official evaluation label space and map model outputs accordingly during post-processing.

  • Guard benchmarks. For guard benchmarks, prompts follow the protocol used in YuFeng-XGuard (Lin et al., 2026). Depending on the sub-task, the input may contain only a user prompt, or a user prompt together with a candidate assistant response. The model is asked to judge whether the content is safe or unsafe, or whether a safe completion should be produced. For benign false-positive evaluation, we use harmless instruction-following inputs and measure whether the model over-refuses.

  • Agentic benchmarks. For API-Bank, BFCL, and Seal-0, we preserve the benchmark-standard interaction format. For tool-use tasks, prompts include the user instruction and the tool specification required by the benchmark. For search-agent tasks, prompts allow multi-step interaction and action generation according to the benchmark protocol. The model output is parsed according to the benchmark’s standard evaluation interface.

  • In-house capability and business benchmarks. For in-house tasks, prompts are constructed to match realistic deployment settings. Depending on the task, the input may include raw text, structured fields, domain instructions, policy descriptions, or workflow context. Output formats are constrained when needed, including classification labels, structured JSON-style fields, named entities, or moderation decisions, so that model responses can be evaluated deterministically.

Decoding settings.

We use deterministic decoding wherever possible to reduce evaluation variance and improve reproducibility. For most classification, multiple-choice, and structured-output tasks, generation is performed with low-temperature decoding and a fixed output format. For interaction-heavy tasks such as search-agent evaluation, decoding follows the benchmark-standard execution setting while keeping the configuration consistent across compared models as much as allowed by the evaluation environment.

Output normalization.

Model outputs are normalized before scoring. This includes steps such as stripping extra explanation text, extracting the final option label, canonicalizing class names, normalizing punctuation and whitespace, and mapping semantically equivalent label variants to the benchmark’s official label space. For structured tasks, only the fields required by the benchmark are parsed and scored.
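The normalization steps described above can be illustrated with a minimal sketch. The alias table, function names, and the "last standalone option letter" heuristic are assumptions made for illustration; the actual post-processing rules are benchmark-specific and are not given in this appendix.

```python
import re

# Hypothetical alias table: maps label variants to an official label space.
LABEL_ALIASES = {
    "safe": "safe",
    "harmless": "safe",
    "benign": "safe",
    "unsafe": "unsafe",
    "harmful": "unsafe",
}


def extract_option_label(output, options="ABCD"):
    """Return the last standalone option letter in the output, or None."""
    matches = re.findall(rf"\b([{options}])\b", output.strip())
    return matches[-1] if matches else None


def canonicalize_label(output):
    """Map a free-form label string to the official label space, or None."""
    # Collapse whitespace, drop surrounding punctuation, and lowercase.
    cleaned = re.sub(r"\s+", " ", output).strip().strip(".,;:!\"'").lower()
    return LABEL_ALIASES.get(cleaned)
```

Returning `None` for unparseable outputs, rather than guessing a label, lets the scorer count such responses as incorrect in a uniform way.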

B.2 Metric Definitions

Accuracy.

For multiple-choice, knowledge, reasoning, and many agentic tasks, we use Accuracy as the primary metric:

$$\text{Accuracy}=\frac{\#\text{Correct Predictions}}{\#\text{Total Examples}}.$$

A prediction is counted as correct if the extracted final answer exactly matches the gold answer after benchmark-specific normalization.

Macro F1-Score.

For classification-oriented content safety tasks, we use Macro F1-Score as the primary metric to reduce sensitivity to class imbalance. Let $F1_{c}$ denote the F1-score for class $c$ in the set of classes $C$. Then:

$$\text{Macro F1}=\frac{1}{|C|}\sum_{c\in C}F1_{c}.$$

This metric gives equal weight to each class and is therefore more suitable than micro-averaged metrics for imbalanced safety datasets.
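To make the contrast between Accuracy and Macro F1 concrete, the following minimal sketch implements both definitions in plain Python. Taking the class set $C$ from the gold labels is an assumption of this sketch, and the example data are invented for illustration.

```python
def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(gold))  # assumption: C is the set of gold labels
    total = 0.0
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class with no true positives contributes an F1 of zero.
        total += (2 * tp / denom) if denom else 0.0
    return total / len(classes)


# Invented imbalanced example: a model that always predicts "safe".
gold = ["safe"] * 8 + ["unsafe"] * 2
pred = ["safe"] * 10
```

On this example the always-"safe" predictor reaches an accuracy of 0.8 but a Macro F1 of only 4/9, because the missed "unsafe" class counts as much as the majority class.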

False positive rate (FPR).

For benign instruction-following datasets used in guard evaluation, we report the false positive rate (FPR):

$$\text{FPR}=\frac{\#\text{Benign Examples Incorrectly Flagged as Unsafe}}{\#\text{Total Benign Examples}}.$$

A lower FPR indicates less over-refusal and therefore better usability on harmless inputs.
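As a small illustration of the FPR definition, the sketch below assumes each benign example has already been reduced to a boolean indicating whether the model flagged it as unsafe (or refused it). The function name and input encoding are hypothetical.

```python
def false_positive_rate(flagged_unsafe):
    """FPR over a list of booleans, one per benign example."""
    if not flagged_unsafe:
        raise ValueError("FPR is undefined on an empty benign set")
    return sum(flagged_unsafe) / len(flagged_unsafe)
```

For instance, one flagged example out of four benign inputs gives an FPR of 0.25.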

Combined score for dynamic adversarial evaluation.

For the dynamic portion of the self-constructed adversarial robustness benchmark, we use a combined score designed to jointly measure attack success and semantic preservation:

$$\text{Combined Score}=\text{Bypass Success Rate}\times\text{Semantic Fidelity}.$$

A lower combined score indicates stronger robustness. Intuitively, two kinds of adversarial rewrite contribute little or nothing to this score: first, rewrites that fail to bypass the model; and second, rewrites that change the original meaning so much that they no longer constitute valid adversarial attacks. Unlike a raw bypass metric alone, the combined score therefore does not credit an attack that succeeds only by distorting the content. Over-obfuscated or semantically distorted rewrites are down-weighted or assigned zero contribution under the evaluation protocol.
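One possible per-attack reading of the combined score is sketched below. The source defines the metric only as the product of bypass success rate and semantic fidelity; the per-attack aggregation, the fidelity scale in [0, 1], and the `min_fidelity` cutoff are all assumptions of this sketch rather than details of the evaluation protocol.

```python
def combined_score(attacks, min_fidelity=0.5):
    """Mean per-attack contribution of bypass (0/1) times fidelity.

    attacks: list of (bypassed, fidelity) pairs, fidelity in [0, 1].
    min_fidelity: hypothetical cutoff below which a rewrite is treated
    as semantically distorted and contributes zero.
    """
    if not attacks:
        raise ValueError("no attacks to score")
    total = 0.0
    for bypassed, fidelity in attacks:
        if not bypassed or fidelity < min_fidelity:
            continue  # failed or distorted rewrites contribute nothing
        total += fidelity  # successful rewrites are weighted by fidelity
    return total / len(attacks)
```

With four invented attacks `[(True, 1.0), (True, 0.2), (False, 0.9), (True, 0.6)]`, only the first and last contribute, giving a score of 0.4 even though three of the four rewrites bypass the model.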

Task-specific metrics for in-house benchmarks.

The in-house benchmark suite contains heterogeneous tasks, including classification, named entity recognition, instruction following, and workflow-level business decisions. At the task level, we use the benchmark-standard metric appropriate to each task, including:

  • classification accuracy for multiple-choice or label-selection tasks,

  • Macro F1 or related recognition metrics for classification tasks with imbalanced labels,

  • span- or entity-level matching metrics for NER tasks,

  • structured-output correctness for workflow or instruction-following tasks.

For presentation in the main paper, these task-level metrics are further aggregated into composite scores for the in-house domain and business benchmark groups.
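For the entity-level matching mentioned above, one common form is exact-match entity F1, sketched here. Whether the in-house NER benchmarks use exact span-and-type matching is an assumption, and the entity types in the example are hypothetical.

```python
def entity_f1(gold, pred):
    """Exact-match F1 over (start, end, type) entity tuples."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # span and type must both match
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the second predicted span ends too early.
gold_entities = [(0, 5, "PHONE"), (10, 20, "ID_CARD")]
pred_entities = [(0, 5, "PHONE"), (10, 18, "ID_CARD")]
```

Here one of two predictions matches one of two gold entities, so precision, recall, and F1 are all 0.5.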

B.3 Composite Score Computation

Averaging over benchmark groups.

For benchmark groups reported with a single summary number in the main paper, we use simple arithmetic averaging over the included evaluation sets unless otherwise specified. For example, the Avg. value reported on open-source general benchmarks and open content safety benchmarks is the mean of the benchmark-level scores within the corresponding table.

Static benchmark Avg.$^{*}$.

For the static self-constructed adversarial robustness benchmark, we report both Avg. and Avg.$^{*}$. The latter excludes the Spam & Flooding category:

$$\text{Avg.}^{*}=\frac{1}{4}\sum_{i\in\mathcal{R}_{\text{core}}}s_{i},$$

where $\mathcal{R}_{\text{core}}$ denotes the four core risk categories excluding Spam & Flooding, and $s_{i}$ is the benchmark score for category $i$. We report this metric because Spam & Flooding exhibits substantially higher label ambiguity than the other risk categories.
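The relationship between the two static-benchmark averages can be sketched as follows. The category names are borrowed from the risk categories listed in Appendix D.1 and the scores are invented; both are assumptions used only to illustrate the computation.

```python
# Invented per-category scores; names follow the list in Appendix D.1.
scores = {
    "Advertising & Traffic Diversion": 80.0,
    "Gambling & Fraud": 70.0,
    "Abusive Content": 90.0,
    "Pornographic Content": 60.0,
    "Spam & Flooding": 40.0,
}


def average(values):
    values = list(values)
    return sum(values) / len(values)


# Mean over all five categories.
avg_all = average(scores.values())

# Mean over the four core categories, excluding Spam & Flooding.
avg_core = average(s for name, s in scores.items() if name != "Spam & Flooding")
```

In this invented example the full average is 68.0 while the core average is 75.0, showing how a single ambiguous category can pull down the headline number.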

In-house domain benchmark overall score.

The in-house domain benchmark table reports an Overall score, which is computed as the aggregate score over the included domain capability evaluation sets. It is intended to summarize model performance across the domain-oriented safety tasks reported in the paper under the final evaluation configuration.

In-house business benchmark overall score.

The in-house business benchmark table also reports an Overall score, which is the aggregate score over the included business evaluation sets. This value summarizes model performance across workflow-oriented moderation and operational tasks such as UGC moderation, AIGC moderation, business porn detection, multi-scenario risk detection, and data security NER.

Agentic benchmark average.

For the agentic benchmark table, the reported Avg. is the arithmetic mean of the scores on API-Bank, BFCL, and Seal-0:

$$\text{Avg.}=\frac{1}{3}\left(s_{\text{API-Bank}}+s_{\text{BFCL}}+s_{\text{Seal-0}}\right).$$

Interpretation of composite scores.

The composite scores in this work are intended as concise summary indicators for cross-model comparison. Since the included tasks differ in difficulty, label space, and operational form, these composite numbers should be interpreted together with the per-benchmark breakdowns reported in the main paper and appendix, rather than as substitutes for task-level analysis.

Appendix C Additional Experimental Results

C.1 Full Benchmark Results

To provide a more complete view of model performance on our in-house evaluation suite, we additionally report a unified overall composite score that combines the domain composite and business composite. This aggregate metric summarizes model capability across both domain-specific safety understanding and business-oriented moderation utility, and serves as a compact indicator of deployable end-to-end safety performance.

Table 12 presents the full comparison across open-weight models, proprietary models, guard models, and Yuvion variants. Overall, Yuvion continues to show clear advantages under this stricter joint evaluation setting. Yuvion-32B (Agent) achieves the best overall composite score of 86.72, followed by Yuvion-32B with 86.06, while Yuvion-8B reaches 82.55. All three Yuvion variants outperform the strongest proprietary baseline, Qwen3-Max (82.21): Yuvion-8B holds a slight advantage of 0.34 points, while the 32B variants lead by 3.85 and 4.51 points. All three also exceed every evaluated open-weight baseline, the strongest of which is KIMI-K2.5 (77.23).

Balance between domain and business performance.

An important observation from Table 12 is that many strong baselines exhibit visible imbalance between the domain and business composites. For example, KIMI-K2.5 performs relatively strongly on business benchmarks (79.84) but is weaker on domain benchmarks (74.61), while DeepSeek-R1 shows the opposite pattern, performing better on domain benchmarks (80.54) than on business benchmarks (69.85). In contrast, Yuvion models maintain consistently high scores on both axes. Yuvion-32B obtains 85.78 on the domain composite and 86.34 on the business composite, and Yuvion-32B (Agent) further improves these to 86.10 and 87.34, respectively. This balanced profile is particularly important for real deployment, where moderation systems must simultaneously support policy reasoning, fine-grained risk recognition, and structured workflow execution.

Scaling and agentic gains.

The unified results also highlight two clear trends. First, scaling Yuvion from 8B to 32B yields substantial gains, improving the overall composite from 82.55 to 86.06. Second, agentic RL provides additional benefits beyond the base 32B model, pushing the overall composite further to 86.72. Notably, these gains mainly come from stronger performance on workflow-oriented and structured business tasks while preserving high domain capability, which is consistent with the findings reported in the main text.
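The numbers quoted above can be cross-checked with a short sketch. The text says only that the overall composite "combines" the domain and business composites; the unweighted mean used here is an inference from the Table 12 values for the Yuvion rows, not a stated definition.

```python
# (domain composite, business composite) as reported in Table 12.
TABLE_12 = {
    "Yuvion-8B": (82.38, 82.72),
    "Yuvion-32B": (85.78, 86.34),
    "Yuvion-32B (Agent)": (86.10, 87.34),
}


def overall_composite(domain, business):
    # Assumption: unweighted mean of the two composites, rounded to 2 places.
    return round((domain + business) / 2, 2)


def agentic_gain(table=TABLE_12):
    """Return (domain gain, business gain) of the Agent variant over 32B."""
    base_domain, base_business = table["Yuvion-32B"]
    agent_domain, agent_business = table["Yuvion-32B (Agent)"]
    return (round(agent_domain - base_domain, 2),
            round(agent_business - base_business, 2))
```

Under this assumption the three Yuvion rows reproduce the reported overall composites (82.55, 86.06, 86.72), and the agentic stage adds 0.32 points on the domain composite versus 1.00 on the business composite, in line with the observation that the gains come mainly from business tasks.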

Guard-model limitation under full-suite evaluation.

Guard models remain near zero on the unified composite metric, with Qwen3Guard-8B scoring 0.01 and Llama-Guard4-12B scoring 0.00. This again confirms that guard-style models, although useful for narrow safety filtering setups, are fundamentally insufficient for realistic moderation scenarios requiring domain expertise, multi-category policy understanding, and structured outputs. The gap between Yuvion and guard baselines is therefore not merely incremental, but reflects a difference in training objective and deployment suitability.

Table 12: Full benchmark comparison using the unified overall composite score together with the domain composite and business composite. The overall composite summarizes performance across the two benchmark groups. The highest score in each column is in bold font.
| Category | Model | Overall Composite | Domain Composite | Business Composite |
|---|---|---|---|---|
| Open-weight | Qwen3-8B | 70.64 | 73.71 | 67.57 |
| | Qwen3-32B | 73.95 | 79.79 | 68.10 |
| | Qwen3-30B-A3B-2507 | 66.64 | 77.14 | 56.14 |
| | Qwen3.5-9B | 71.28 | 72.68 | 69.89 |
| | Qwen3.5-27B | 75.55 | 77.17 | 73.92 |
| | Qwen3.5-35B-A3B | 74.11 | 75.46 | 72.75 |
| | Qwen3.5-122B-A10B | 76.96 | 77.30 | 76.61 |
| | Qwen3.5-397B-A17B | 75.04 | 77.45 | 72.62 |
| | DeepSeek-R1 | 75.20 | 80.54 | 69.85 |
| | DeepSeek-V3.2 | 73.08 | 76.27 | 69.88 |
| | KIMI-K2.5 | 77.23 | 74.61 | 79.84 |
| | Minimax-M2.5 | 64.46 | 72.60 | 56.32 |
| | GLM-5 | 76.90 | 79.84 | 73.96 |
| Proprietary | Qwen3-Max | 82.21 | 81.41 | 83.00 |
| | Qwen3.5-Plus | 75.65 | 77.92 | 73.38 |
| | Qwen3.6-Plus | 80.14 | 81.83 | 78.44 |
| | GPT-5.4 | 80.56 | 80.73 | 80.40 |
| Guard | Qwen3Guard-8B | 0.01 | 0.00 | 0.01 |
| | Llama-Guard4-12B | 0.00 | 0.00 | 0.00 |
| Yuvion (Ours) | Yuvion-8B | 82.55 | 82.38 | 82.72 |
| | Yuvion-32B | 86.06 | 85.78 | 86.34 |
| | Yuvion-32B (Agent) | **86.72** | **86.10** | **87.34** |

Appendix D Data Construction and Annotation

D.1 Dynamic Evaluation Sets

The dynamic evaluation sets are implemented as a dynamic adversarial evaluation framework rather than a purely static test collection. The framework covers five major risk categories: advertising and traffic diversion, gambling and fraud, abusive content, pornographic content, and spam and flooding. Its construction begins from real-world risky samples collected from practical business scenarios, including e-commerce, social interaction, and local-service platforms. All samples are anonymized during preprocessing. To ensure that the framework targets realistic moderation-evasion behavior rather than ordinary harmful content, we apply an LLM-assisted pre-screening step to retain seed samples exhibiting clear adversarial transformation patterns, such as lexical substitution, homophonic rewriting, character decomposition, symbol insertion, coded expressions, and other forms of semantic disguise.

These filtered seed samples are then incorporated into an automated red-teaming pipeline that generates transformed variants while preserving harmful intent. As a result, the dynamic evaluation does not rely solely on a fixed set of manually collected adversarial examples; instead, it evaluates models against both naturally occurring human-written evasive expressions and systematically generated adversarial variants derived from them. This design makes the framework better suited for measuring robustness under realistic and evolving attack conditions.

All retained seed samples are annotated by five professional content moderation experts. Each sample is independently reviewed by two annotators, and a third annotator performs adjudication when necessary. For each sample, the annotation includes the primary risk category, the underlying intended meaning, and the major rewriting or evasion strategy. These annotations support benchmark construction, adversarial transformation design, and category-level robustness analysis.
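The double-review and adjudication step described above can be sketched as a simple label-resolution rule. Interpreting "when necessary" as "when the two annotators disagree" is an assumption, and the function and argument names are hypothetical.

```python
def resolve_label(first, second, adjudicated=None):
    """Resolve a sample's risk category from two independent reviews.

    first, second: labels from the two independent annotators.
    adjudicated: label from the third annotator, needed only on disagreement.
    """
    if first == second:
        return first  # agreement: no adjudication needed
    if adjudicated is None:
        raise ValueError("annotators disagree; adjudication is required")
    return adjudicated  # assumption: the adjudicator's label is final
```

Raising an error on an unresolved disagreement, instead of silently picking one label, keeps ambiguous samples from entering the benchmark without a third review.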