Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety

Yuvion Team, Alibaba Security AGI Lab

Abstract

As large language models are increasingly deployed in real-world systems, safety failures can still lead to harmful outputs and dangerous misuse. We argue that the essence of safety is adversarial: many failures arise not from natural inputs alone, but from strategic attempts to evade model policies and safeguards. However, existing general-purpose model development largely overlook this adversarial nature, and often remain insufficient for realistic safety scenarios involving planning, tool use, and multi-step reasoning, causing measured safety performance to overestimate real deployment robustness. To address this gap, we present Yuvion LLM, a large language model built for adversarially robust content safety and broader AI safety. Yuvion LLM treats adversarial robustness and agentic capability as first-class objectives. Its pipeline combines adversarially aware data construction, knowledge-enhanced continued pretraining, and policy-grounded multi-task safety post-training, including risk-aware supervised fine-tuning and reinforcement learning-based policy optimization, together with safety-aware agentic reinforcement learning for tool use and multi-step reasoning in complex safety scenarios. We further introduce the Yuvion LLM RiskEval (YLRE), a collection of 93 benchmarks across four evaluation categories, covering diverse open and internal evaluations with a focus on safety, adversarial robustness, and real-world capability requirements. Across these evaluations, Yuvion LLM demonstrates clear advantages on safety-focused benchmarks and particularly strong robustness under adversarial conditions, while maintaining solid overall capability. Notably, Yuvion-8B outperforms most state-of-the-art baselines, including substantially larger models such as GPT-5.4 and Qwen3-MAX, on several safety tasks.

^†^†footnotetext: Partial open-source release is planned. For trial access, contact honghaiwen.hhw@alibaba-inc.com.

Refer to caption — Figure 1: Evaluation overview across benchmark settings: Open-source Safety (Macro F1 on public content safety benchmarks), Adversarial Safety Static and Dynamic (F1 and $100-$ Combined Score on self-constructed red-team adversarial benchmarks), and Industrial Deployment (in-house capability and business composite). Yuvion-32B achieves the best results across all panels, with scores of 78.2, 94.2, 79.4, and 86.1, outperforming all baselines including GPT-5.4 and Qwen3-MAX. Yuvion-8B also surpasses most baselines and remains competitive against substantially larger models.

1 Introduction

Large language models (LLMs) are increasingly being deployed in real-world systems for content moderation, user interaction, decision support, tool use, and multi-step task execution (Brown et al., 2020; Achiam et al., 2023; Schick et al., 2023; Yao et al., 2023; Wang et al., 2024a; Qwen Team, 2025). As these systems become integrated into social and economic activity, safety failures can lead to harmful content, prohibited transactions, and dangerous information access through jailbreaks or policy circumvention. Recent LLM-based safety systems (Inan et al., 2023; Meta, 2025; Qwen Team, 2025) have helped mitigate many of these risks. However, we identify a fundamental limitation shared by these systems: fragility under human adversarial behavior.

Figure 2 illustrates this limitation concretely. The underlying unsafe intent remains unchanged, yet a general LLM that correctly rejects the original input fails once the request is reframed through euphemism, symbolic substitution, or cross-lingual mixing. This points to the core challenge: unsafe behaviors in deployment arise not from natural inputs alone, but from strategic, adaptive attempts to evade model safeguards. In content safety, harmful intent can be hidden through lexical obfuscation, coded language, contextual disguise, or off-platform traffic obfuscation; in broader AI safety, models can be manipulated through jailbreaks, prompt injection, role-playing, and multi-turn attacks (Perez et al., 2022; Wei et al., 2023; Mazeika and others, 2024). Existing general-purpose models are developed and evaluated largely without accounting for such strategic human behavior, causing measured safety performance to systematically overestimate real deployment robustness.

This adversarial challenge intensifies in agentic settings. Production safety scenarios often involve cascading policy rules, cross-modal evidence gathering, and specialized tool invocation. For example, determining whether a product listing infringes a registered trademark may require following a multi-step decision procedure, invoking image detection tools, and synthesizing cross-modal evidence before rendering a judgment. These operational demands require genuine agentic capabilities: planning multi-step reasoning chains, invoking external tools, retrieving and grounding decisions in evolving policy documents, and adapting execution paths based on intermediate results. However, existing safety-oriented models, including dedicated guard models (Inan et al., 2023; Meta, 2025; Qwen Team, 2025), remain confined to single-turn safety judgment and lack support for tool use, policy retrieval, or multi-step execution. As content safety systems scale from text classification toward multi-modal, multi-tool audit pipelines, agentic capability becomes a structural requirement for practical deployment.

These two gaps, fragility under adversarial evasion and the absence of agentic capability, motivate the central premise of our work: adversarial robustness and agentic safety capability must be built into model development and evaluation by design, rather than added only as post-hoc safeguards. Robust safety requires models that can recognize obfuscated risks, remain aligned under adversarial pressure, and operate reliably in realistic safety scenarios involving planning, policy grounding, tool use, and adaptive multi-step reasoning.

To this end, we present Yuvion LLM, a large language model built for adversarially robust content safety and broader AI safety. Yuvion LLM is designed around two first-class objectives: robust safety under adversarial conditions and agentic safety capability for realistic deployment scenarios. To realize these objectives, Yuvion adopts a progressive safety training paradigm consisting of three stages: knowledge-enhanced continued pretraining, which injects safety-domain knowledge; policy-grounded multi-task safety post-training, which elicits risk understanding, fine-grained risk identification, and policy-consistent decision making under adversarial variation; and safety-aware agentic reinforcement learning, which extends the model to retrieval, tool use, and multi-step reasoning in safety scenarios. Together, these stages strengthen the model’s safety-relevant knowledge, adversarial robustness, policy-grounded decision capability, and trajectory-level reliability in realistic safety scenarios. In this way, Yuvion LLM is intended not merely as a guard classifier, but as a more general safety-oriented model for practical deployment.

We further introduce the Yuvion LLM RiskEval (YLRE), a four-level progressive evaluation framework covering 93 benchmarks across open-source general benchmarks, open-source safety benchmarks, a self-constructed adversarial safety benchmark, and in-house capability and business benchmarks. This framework is designed to jointly evaluate general capability retention, public safety performance, adversarial robustness, and real-world operational value. More specifically, the first two levels measure whether safety specialization preserves general competence and remains competitive on established public benchmarks; the third level stresses adversarial robustness under controlled yet realistic transformation patterns; and the fourth level evaluates performance in practical settings involving agentic capabilities and business-facing requirements. This progressive design enables us to examine not only whether a model understands risk and safety in principle, but also whether it remains reliable and precise in real-world, complex, multi-turn, and adversarially intensive production environments at scale. The main contributions of this work are as follows:

•

We argue that safety should be treated as an inherently adversarial problem, and that both adversarial robustness and agentic capability must be built into model development by design. Guided by this principle, we present Yuvion LLM, a large language model built for adversarially robust content safety and broader AI safety, with a development pipeline integrating adversarially aware data construction, knowledge-enhanced continued pretraining, post-training for safety tasks, and agentic reinforcement learning that equips the model with tool invocation, multi-step reasoning, and task execution capabilities for complex safety scenarios.
•

We introduce the Yuvion LLM RiskEval (YLRE), a four-level evaluation framework covering 93 benchmarks across open-source general benchmarks, open-source safety benchmarks, a self-constructed adversarial safety benchmark, and in-house capability and business benchmarks, enabling systematic assessment from public benchmarks to real-world deployment scenarios.
•

Comprehensive experiments show that Yuvion LLM outperforms both open-source and proprietary baselines. Yuvion-32B achieves 78.2% Macro F1 on content safety (vs. GPT-5.4 72.2%) and 86.1% on industrial deployment (vs. GPT-5.4 80.6%). The gains are also scale-efficient: despite its smaller size, Yuvion-8B surpasses most state-of-the-art baselines, including substantially larger models such as GPT-5.4 and Qwen3-MAX, on several safety tasks, indicating that targeted adversarial and agentic safety training can matter more than model scale alone.

2 Content-Safety-Oriented Data System

2.1 Overview

A core premise of Yuvion is that effective safety modeling requires not only a dedicated training pipeline, but also a data system aligned with the adversarial and operational nature of real-world safety tasks. Unlike general-purpose language modeling, safety-oriented data must support not only ordinary risk understanding and identification, but also policy-grounded judgment, adversarial robustness, and agentic safety behaviors. This is particularly important because Yuvion is designed not only for explicit content classification, but also for realistic settings in which unsafe intent may be obfuscated, reformulated, or embedded in multi-step interactions.

To support these requirements, Yuvion is trained on a multi-source data system spanning general, safety-specific, adversarial, agentic, and synthetic or expert-constructed data. Rather than treating all samples as a single homogeneous corpus, we organize data by functional role so that different data types support different aspects of capability formation across the training pipeline.

2.2 Data Categories

Table 1 summarizes the main data categories used in Yuvion. At a high level, the data system is designed to balance five goals: preserving general language ability, strengthening safety-domain knowledge, improving robustness to adversarial inputs, supporting agentic capability, and expanding coverage of long-tail and complex scenarios.

Data Type	Main Role	Representative Content
General data	Preserve broad language competence and stabilize safety adaptation	General instruction following, question answering, reasoning, reading comprehension, and other common language tasks, etc.
Safety-domain data	Build core safety knowledge and task capability	Risk understanding and identification, hierarchical safety categories, policy-grounded judgment, and evidence-based responses, etc.
Adversarial data	Improve robustness to evasive and obfuscated unsafe inputs	Lexical variation, symbol or homophone substitution, semantic camouflage, contextual disguise, and other policy-evasive expressions, etc.
Agentic data	Support multi-step safety workflows involving reasoning, retrieval, and tool interaction	Tool-use trajectories, search-based reasoning, multi-step decomposition, and retrieval-augmented decision making, etc.
Synthetic and expert- constructed data	Expand long-tail coverage and provide high-quality supervision for complex tasks	Rare or difficult scenarios, policy-intensive cases, structured reasoning samples, preference data, and reward-oriented optimization data, etc.

Table 1: Overview of the main data categories in the Yuvion safety-oriented data system.

General data.

General-domain data is used to preserve broad language competence and reduce overspecialization during safety adaptation. Although such data is not safety-specific, it plays an important regularizing role by helping the model retain general instruction-following, comprehension, and reasoning ability.

Safety-domain data.

Safety-domain data forms the core of Yuvion’s safety capability learning. It covers a wide range of safety-relevant tasks, including risk understanding, risk identification, policy-sensitive categorization, evidence attribution, and structured decision generation. Compared with conventional safety classification data, this portion places greater emphasis on richer supervision and policy grounding, enabling the model to produce not only labels but also more interpretable and policy-consistent responses when needed.

Adversarial data.

Because safety is inherently adversarial, the Yuvion data system includes a dedicated adversarial subset that explicitly models realistic evasion patterns. These data cover both surface-form perturbations and deeper semantic concealment strategies, helping the model avoid over-reliance on shallow lexical cues and improving robustness to obfuscated unsafe intent.

Agentic data.

Agentic data is introduced to support safety scenarios that require more than single-turn classification. This category covers structured trajectories involving multi-step reasoning, retrieval, tool invocation, and action-conditioned responses. It provides training signals for behaviors such as decomposing complex tasks, selecting appropriate tools, interacting with external systems, and synthesizing intermediate observations into grounded decisions. In Yuvion, such data is particularly important for later-stage optimization of realistic safety scenarios.

Synthetic and expert-constructed data.

Synthetic and expert-constructed data are introduced to improve coverage of rare, high-risk, long-tail, or structurally complex scenarios that are insufficiently represented in naturally collected data. They are particularly useful for difficult policy cases, structured safety tasks, and later-stage optimization settings requiring higher-quality supervision, preference signals, or reward-oriented annotations.

2.3 Summary

The Yuvion data system is a multi-source training foundation designed for adversarially robust and deployment-oriented safety modeling. By combining general data, safety-domain data, adversarial data, agentic data, and synthetic or expert-constructed supervision, it covers both ordinary and adversarial safety scenarios while supporting the broader capability requirements of realistic safety scenarios.

3 Yuvion LLM: Progressive Safety Training Paradigm

3.1 Overview

Yuvion LLM is built on the premise that adversarial robustness and agentic safety should be developed as first-class model capabilities rather than appended through post-hoc patching. In realistic deployment, a safety model must go beyond recognizing overtly harmful content: it must internalize safety-domain knowledge, infer latent unsafe intent under obfuscation, remain policy-consistent under adversarial pressure, and operate reliably in structured workflows involving retrieval, tool use, and multi-step decision making. To this end, Yuvion adopts a progressive training paradigm that transforms a general-purpose instruct model into a deployable safety model through staged capability shaping.

The full pipeline consists of three stages: knowledge-enhanced continued pretraining, policy-grounded multi-task safety post-training, and safety-aware agentic reinforcement learning. Stage 1 injects and internalizes safety-domain knowledge through knowledge-enhanced continued pretraining. Stage 2 converts this knowledge into task-level capability for policy-grounded risk understanding, risk identification, and adversarially robust safety decision making. Stage 3 further extends the model from single-turn safety judgments to trajectory-level reasoning and action in structured safety workflows.

Throughout all stages, Yuvion is formulated as a unified autoregressive conditional generator. Given a context $c$ , the model generates an output object $z$ according to

z\sim p_{\theta}(z\mid c),

(1)

where $\theta$ denotes the model parameters. The instantiation of $c$ and $z$ varies across stages. In continued pretraining, $c$ is a raw token prefix and $z$ is its continuation. In safety post-training, $c=(x,\mathcal{I})$ , where $x$ denotes the input content or task instance and $\mathcal{I}$ denotes the task instruction, while $z$ corresponds to a task output such as a risk label, policy-grounded explanation, or structured decision. In agentic reinforcement learning, $c$ additionally includes interaction history and intermediate observations, and $z$ may represent a multi-step reasoning or action trajectory. Under the autoregressive factorization,

p_{\theta}(z\mid c)=\prod_{\ell=1}^{L}p_{\theta}(z_{\ell}\mid z_{<\ell},c),

(2)

where $L$ denotes the output length.

As illustrated in Figure 3, this paradigm enables Yuvion to progressively acquire safety-domain knowledge, policy-grounded decision capability, robustness to adversarial manipulation, and agentic competence for realistic safety deployment.

3.2 Target Capability Design

The Yuvion training paradigm is organized around four tightly coupled capabilities that are central to realistic safety deployment. Rather than treating them as isolated objectives, Yuvion develops them progressively so that each stage provides a stronger foundation for the next.

•

Risk understanding: capturing safety-relevant semantics, latent intent, contextual cues, and the policy meaning of user inputs;
•

Policy-grounded risk identification: producing fine-grained safety judgments, category attribution, and evidence-aware decisions aligned with moderation policy;
•

Adversarial robustness: maintaining stable and policy-consistent behavior under lexical obfuscation, semantic disguise, paraphrastic attacks, and other evasive transformations;
•

Agentic safety capability: supporting structured outputs, multi-step reasoning, retrieval, tool use, and evidence-seeking interaction in complex safety workflows.

These capabilities are developed in a staged manner. Stage 1 focuses on knowledge loading and representation adaptation; Stage 2 elicits policy-grounded risk understanding, instruction-conditioned task execution, and robustness under adversarial variation; Stage 3 extends safety capability from single-turn outputs to trajectory-level reasoning and action.

3.3 Stage 1: Knowledge-Enhanced Continued Pretraining

The first stage performs continued pretraining on the instruct model using a knowledge-enhanced corpus, denoted as $\mathcal{D}_{\mathrm{cp}}$ . Rather than relying on raw safety-domain text alone, this corpus is explicitly constructed to facilitate the internalization of safety-domain knowledge. In particular, we leverage large-scale domain knowledge bases and transform them into training instances at multiple granularities, including structured triple-level samples, sentence-level descriptions, and other knowledge-derived textual forms. These knowledge-oriented data are combined with broader safety-domain corpora and a smaller proportion of general-domain data, enabling the model to absorb structured risk knowledge while preserving broad language competence.

The training objective is the standard autoregressive next-token prediction loss:

\mathcal{L}_{\mathrm{cp}}=-\sum_{x\in\mathcal{D}_{\mathrm{cp}}}\sum_{t=1}^{|x|}\log p_{\theta}(x_{t}\mid x_{<t}).

(3)

Within Yuvion, this stage serves as a safety knowledge infusion phase before task-level post-training. By exposing the model to moderation policies, risk taxonomies, violation patterns, knowledge-base-derived facts, and long-tail adversarial expressions, the model internalizes safety-relevant concepts and their relations at the distribution level. The use of multi-granularity knowledge-derived samples is particularly important: structured instances help anchor explicit semantic relations, while sentence-level realizations improve the model’s ability to recognize how such knowledge is expressed in natural language. As a result, this stage provides a stronger initialization for downstream risk understanding, policy-grounded identification, and adversarially robust safety reasoning. The output checkpoint of this stage is denoted as Yuvion-CP.

3.4 Stage 2: Policy-Grounded Multi-Task Safety Post-Training

The second stage converts safety-domain knowledge into task-level capability for realistic safety deployment. Its goal is not merely to improve performance on isolated safety tasks, but to elicit policy-grounded risk understanding, fine-grained risk identification, and robust decision making under adversarially manipulated inputs. To this end, Yuvion performs multi-task safety post-training through two complementary components: risk-aware supervised instruction tuning for broad capability initialization, and reinforcement learning-based policy optimization for behavior refinement under ambiguity and attack.

3.4.1 Risk-Aware Supervised Fine-Tuning

Supervised fine-tuning initializes the model on safety-oriented tasks using a multi-task supervised instruction dataset, denoted as $\mathcal{D}_{\mathrm{sft}}$ :

\mathcal{D}_{\mathrm{sft}}=\{(x^{(i)},\mathcal{I}^{(i)},y^{(i)})\}_{i=1}^{N_{\mathrm{sft}}},

where $x^{(i)}$ is the input content or task instance, $\mathcal{I}^{(i)}$ is the task instruction, and $y^{(i)}$ is the target output. Under the unified framework, this stage instantiates the context as $c=(x^{(i)},\mathcal{I}^{(i)})$ and optimizes the conditional likelihood of the target response:

\mathcal{L}_{\mathrm{sft}}=-\sum_{i=1}^{N_{\mathrm{sft}}}\sum_{\ell=1}^{|y^{(i)}|}\log p_{\theta}\!\left(y_{\ell}^{(i)}\mid y_{<\ell}^{(i)},x^{(i)},\mathcal{I}^{(i)}\right).

(4)

A central design choice is to formulate heterogeneous safety objectives under a unified instruction-following interface. This enables the model to learn not only task-specific prediction, but also instruction-conditioned behavior across diverse safety scenarios, such that it can reliably switch between risk judgment, fine-grained categorization, policy-grounded explanation, safety question answering, and structured decision generation within a single generative framework. More importantly, this formulation encourages outputs that are not only task-appropriate, but also consistently grounded in moderation policy, thereby strengthening policy-conditioned response generation across heterogeneous safety tasks.

Crucially, the training set combines naturally distributed supervision with adversarially constructed examples. These examples include obfuscated, paraphrased, or semantically disguised unsafe inputs that preserve harmful intent while altering surface realization. As a result, the model is encouraged to base its judgments on latent intent, contextual evidence, and policy semantics rather than superficial lexical patterns alone, improving robustness to realistic evasion strategies and stabilizing policy-consistent behavior under adversarial variation. In addition, structured response formats are introduced at this stage to establish a consistent task interface for downstream reinforcement learning and deployment-oriented use cases.

The output checkpoint of this stage is denoted as Yuvion-SFT.

3.4.2 Reinforcement Learning-Based Policy Optimization

While supervised instruction tuning establishes broad multi-task capability, many important safety behaviors are difficult to optimize with single-reference targets alone. This is particularly true for ambiguous cases, adversarial attacks, policy edge cases, and tasks where multiple outputs may be acceptable but differ substantially in policy faithfulness, reasoning quality, or robustness. Yuvion therefore further refines model behavior using reinforcement learning-based policy optimization.

We instantiate this stage with GRPO (Group Relative Policy Optimization). Given a safety context $c=(x,\mathcal{I})$ , the current policy samples a group of candidate outputs $\{y^{(g)}\}_{g=1}^{G}$ . Each candidate is evaluated by a reward function tailored to safety-specific objectives, and the policy is updated using group-relative advantage estimates:

\mathcal{L}_{\mathrm{rl}}=-\mathbb{E}_{(x,\mathcal{I})\sim\mathcal{D}_{\mathrm{rl}},\,\{y^{(g)}\}\sim p_{\theta_{\mathrm{old}}}}\left[\sum_{g=1}^{G}\hat{A}^{(g)}\cdot\frac{p_{\theta}(y^{(g)}\mid x,\mathcal{I})}{p_{\theta_{\mathrm{old}}}(y^{(g)}\mid x,\mathcal{I})}\right],

(5)

where the normalized advantage of the $g$ -th candidate is

\hat{A}^{(g)}=\frac{r^{(g)}-\mathrm{mean}(\{r^{(g^{\prime})}\}_{g^{\prime}=1}^{G})}{\mathrm{std}(\{r^{(g^{\prime})}\}_{g^{\prime}=1}^{G})},

(6)

and $r^{(g)}$ denotes the scalar reward assigned to candidate $y^{(g)}$ .

The reward is designed to capture multiple dimensions of output quality, including final decision correctness, consistency with policy basis, attribution or reasoning quality, and reliability under adversarial or structurally complex inputs. Compared with supervised instruction tuning, this stage is better suited to optimizing behaviors for which a single reference is insufficient, especially when the model must remain stable under ambiguity, obfuscation, or distribution shift. It therefore plays a central role in sharpening policy-grounded decision boundaries and improving robustness in realistic safety scenarios.

3.5 Stage 3: Safety-Aware Agentic Reinforcement Learning

Beyond single-turn safety classification and reasoning, deployable safety models must support structured multi-step workflows involving planning, retrieval, and tool interaction. In Stage 2, GRPO optimizes the quality of individual safety outputs given a fixed input context. However, many real-world safety tasks are inherently interactive: resolving a content moderation case may require querying an external policy knowledge base, invoking a specialized classifier, or retrieving contextual evidence before a final decision can be made. The reward signal in such settings is delayed and sparse, arriving only after a full interaction trajectory completes. This motivates a dedicated agentic RL stage that optimizes trajectory-level decision quality rather than single-turn output quality.

Tool-integrated reasoning.

For tool-use tasks, the model interacts with a tool set $\mathcal{T}=\{t_{1},t_{2},\ldots,t_{n}\}$ under a task context $c$ , generating a reasoning trajectory

\tau=\big((r_{1},T_{1},o_{1}),\ldots,(r_{k},T_{k},o_{k})\big),

where $r_{i}$ denotes step-wise reasoning, $T_{i}\subseteq\mathcal{T}$ denotes the invoked tools, and $o_{i}$ denotes returned observations. At each step, the model must jointly reason, choose tools, and produce valid invocations.

Following ToolRL (Qian et al., 2025), we adopt decomposed rewards to train this behavior. The format reward $R_{\mathrm{format}}\in\{0,1\}$ checks whether the output contains the required structural fields, and the correctness reward $R_{\mathrm{correct}}\in[-3,3]$ evaluates predicted tool calls against ground-truth calls along tool-name, parameter-name, and parameter-value matching dimensions. The final reward is

R_{\mathrm{final}}=R_{\mathrm{format}}+R_{\mathrm{correct}}\in[-3,4].

(7)

This design provides denser process-level supervision than coarse binary rewards and improves the stability of tool-use optimization. Compared with Stage 2, the optimization target here shifts from single-turn task outputs to full interaction trajectories, while retaining the same GRPO-based reinforcement learning framework. Beyond tool-use proficiency itself, this training also strengthens the model’s general instruction-following capability: the format reward enforces strict adherence to prescribed output structures, and the decomposed correctness reward requires the model to precisely follow tool specifications including exact function signatures and parameter constraints. This structured compliance training transfers broadly, improving the model’s ability to follow complex, format-sensitive instructions in non-tool-use settings as well.

Search-augmented reasoning.

Content safety tasks frequently require evidence gathering beyond the model’s parametric knowledge. For example, verifying whether user-generated content constitutes misinformation requires cross-referencing external sources; determining whether a piece of text violates platform policy may depend on retrieving the latest regulatory guidelines; and investigating coordinated abuse patterns demands tracing and synthesizing information across multiple documents. These tasks cannot be resolved by single-turn classification—they inherently require multi-turn search, evidence evaluation, and grounded synthesis.

We train Yuvion as a search-augmented reasoning agent for such scenarios. Given a complex question and the accumulated interaction history, the model iteratively decides whether to issue a new search query $q_{t}$ , visit a retrieved page $v_{t}$ , or produce a final answer $a$ . The objective is to learn effective search strategies, evidence evaluation, and stopping decisions under extended interaction horizons.

We design a two-component reward for this track. The execution reward $R_{\mathrm{exec}}$ evaluates whether each tool invocation in the trajectory executes successfully: search queries that return no results, visits to invalid or inaccessible URLs, and malformed tool calls all receive negative signals, encouraging the model to formulate precise queries and well-structured invocations. The result reward $R_{\mathrm{result}}$ uses an LLM-as-judge to evaluate the final answer against ground truth for factual correctness and completeness. The combined search reward is:

R_{\mathrm{search}}=R_{\mathrm{result}}+\beta\cdot R_{\mathrm{exec}},

(8)

where $\beta$ controls the relative weight of execution quality. This decomposition provides denser learning signals than outcome-only rewards: the model receives feedback not only on what it concludes, but also on how well it interacts with the search environment along the way.

3.6 Summary

Yuvion transforms a general-purpose instruct model into a deployable safety model through progressive safety capability shaping. The full pipeline consists of knowledge-enhanced continued pretraining, policy-grounded multi-task safety post-training, and safety-aware agentic reinforcement learning. Across these three stages, the model progressively internalizes safety-domain knowledge, learns policy-grounded and adversarially robust decision capability, and develops trajectory-level agentic competence for realistic safety workflows.

4 Evaluation Framework

4.1 Overview

A deployable safety model cannot be adequately assessed by a single benchmark category. Beyond public safety performance, it must retain general capability, remain robust under adversarial and domain-specific conditions, and demonstrate practical value in real workflows. To support such assessment, we establish Yuvion LLM RiskEval (YLRE), a collection of multi-level benchmarks covering Open-source General Benchmarks, Open-source Content Safety Benchmarks, a Self-constructed Adversarial Content Safety Benchmark, and in-house Capability and Business Benchmarks.

The benchmarks follow a progressive logic: Level 1 measures general capability retention, Level 2 measures public safety comparability, Level 3 measures domain-specific and adversarial robustness, and Level 4 measures deployment-oriented capability and business value. Figure 4 provides an overview of the full evaluation hierarchy.

4.2 Level 1: Open-source General Benchmarks

Level 1 evaluates whether the model retains broad language competence after safety specialization. This is a necessary baseline, since practical safety workflows require not only risk judgment but also general language understanding, knowledge, and reasoning.

The open-source general benchmark suite includes more than 30 public evaluation sets organized into two groups. The general-purpose benchmark group covers broad knowledge, reasoning, Chinese language capability, and scientific problem solving. General-purpose benchmarks include MMLU (Hendrycks et al., 2021a), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024b), GPQA (Rein et al., 2023), ARC-Challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), TriviaQA (Joshi et al., 2017), Xiezhi-EN, C-Eval (Huang et al., 2023), CMMLU (Li et al., 2023a), C3 (Sun et al., 2020), CHID (Zheng et al., 2019), CLUEWSC (Xu et al., 2020), GSM8K-ZH (Cobbe et al., 2021), BBH (Suzgun et al., 2022), etc. The agentic benchmark group evaluates capabilities relevant to Yuvion’s agentic safety workflows, including tool use, function calling, and multi-step interactive problem solving. Agentic benchmarks include API-Bank (Li et al., 2023b), BFCL (Patil et al., 2024), and Seal-0 (Tongyi DeepResearch Team et al., 2025). For these benchmarks, we report Accuracy as the primary metric where applicable.

4.3 Level 2: Open-source Content Safety Benchmarks

Level 2 evaluates Yuvion on publicly available safety tasks and enables comparison with prior models and reported results. It includes two benchmark groups: content safety benchmarks and guard benchmarks.

The content safety benchmarks focus on the recognition of harmful or policy-violating content. Content safety benchmarks include ChineseHarm (Liu et al., 2025), COLD (Deng et al., 2022), Moderation (Markov et al., 2022), HateXplain (Mathew et al., 2021), ToxiGen (Hartvigsen et al., 2022), Jigsaw (Jigsaw/Conversation AI, 2018), CivilComments (Borkan et al., 2019), and SafetyBench (Zhang et al., 2023), covering risks such as pornography, fraud, offensive language, hate speech, and implicit toxicity. In total, this group contains 8 evaluation sets. The guard benchmarks focus on safety judgment, refusal behavior, and risk-aware response capability, evaluated following the protocol of YuFeng-XGuard (Lin et al., 2026). Guard benchmarks include SEval (Yuan et al., 2024), AEGIS (Ghosh et al., 2024), and more than 20 sub-datasets across five dimensions: prompt classification, response classification, multilingual classification, attack defense, and safe completion. For classification-oriented content safety tasks, we report Macro F1-Score as the primary metric.

4.4 Level 3: Self-Constructed Adversarial Robustness Benchmark

Level 3 complements public benchmarks with a self-constructed adversarial robustness benchmark designed to measure robustness against realistic evasion attacks and distribution shifts that standard benchmarks fail to capture. Although public benchmarks support cross-model comparison, they are limited in their coverage of long-tail risk categories, evolving adversarial expressions, and fine-grained policy taxonomies. The self-constructed benchmark fills this gap by incorporating both naturally occurring human-written evasive content and systematically generated adversarial variants. Seed samples are collected from real-world business scenarios and pre-screened via an LLM-assisted filter to retain instances with clear adversarial transformation patterns, including lexical substitution, homophonic rewriting, character decomposition, symbol insertion, and coded expressions. These seeds are then fed into an automated red-teaming pipeline that generates transformed variants while preserving harmful intent. All retained samples are annotated by five professional content moderation experts under a dual-annotator protocol with third-party adjudication (see Appendix A.3 and D.1 for detailed construction methodology).

The benchmark covers five major risk categories: advertising and traffic diversion, gambling and fraud, abusive content, pornographic content, and spam and flooding. Self-constructed benchmarks include static evaluation sets and dynamic evaluation sets across all five risk categories. It is divided into two parts. The static evaluation sets focus on relatively stable and canonical risk expressions under standard distribution conditions, measuring baseline domain recognition performance. The dynamic evaluation sets are specifically designed to evaluate adversarial robustness: they include recent, transformed, and evolving expressions constructed through an automated red-teaming framework that generates paraphrasing, camouflage, euphemistic wording, and structurally transformed variants intended to bypass safety filters, thereby measuring robustness under realistic adversarial conditions. For the dynamic sets, we adopt a combined score metric defined as the product of bypass success rate and semantic fidelity score; a lower combined score indicates stronger robustness against adversarial attacks.

4.5 Level 4: In-house Capability and Business Benchmarks

Level 4 evaluates the model in realistic operational settings derived from large-scale industrial deployment and commercial content moderation practice. While the first three levels measure general capability retention, public safety comparability, and adversarial robustness, they do not directly capture whether the model is useful in practical safety scenarios at production scale. The in-house benchmark suite is designed to fill this gap by reflecting the actual task distributions, policy complexity, and quality expectations encountered in real-world commercial platforms serving hundreds of millions of users.

This level includes more than 15 evaluation sets across two groups, all constructed from anonymized production data and validated against real moderation decisions. The capability benchmarks assess abilities beyond simple risk classification, including risk understanding, risk attribution, safety reasoning, and policy-aware judgment—skills that are essential for deployment but rarely tested by academic benchmarks. Capability benchmarks include Political Risk, Political Entity, Knowledge MCQ, Redline Text, Domain Instruction Following, Political NER, Prohibited Content, Insult, Low-Info Text, Porn Text, and Emotion Analysis. The business benchmarks directly mirror end-to-end production workflows such as UGC review assistance, AIGC safety filtering, structured decision support, and moderation suggestion generation, measuring whether the model can serve as a drop-in component in commercial safety infrastructure. Business benchmarks include UGC Moderation, AIGC Moderation, Business Porn Detection, Multi-Scenario Risk Detection, and Data Security NER. For capability tasks, we report task-specific metrics such as risk recognition F1-score and attribution accuracy; for business-oriented tasks, we combine quantitative metrics with workflow-level indicators to reflect both model capability and operational usefulness.

In addition to the standard Yuvion-32B and Yuvion-8B variants, we also evaluate Yuvion-32B (Agent), which is trained with an additional agentic reinforcement learning stage, on the same benchmark suite. This allows us to measure both its dedicated gains on agentic benchmarks and its incremental improvements on realistic in-house scenarios.

4.6 Summary of Evaluation Design

Together, the benchmarks provide a progressive and deployment-oriented view of model quality: from general capability retention, through public safety comparability and adversarial robustness, to in-house operational value. No single benchmark group is sufficient to characterize a safety foundation model in full, and Yuvion LLM RiskEval is designed to assess these complementary dimensions within one unified framework. Detailed descriptions of the benchmarks are provided in Appendix A.

5 Experimental Results and Analysis

5.1 Experimental Setup

Baselines.

We compare Yuvion LLM against a comprehensive set of baseline models spanning general-purpose open-weight models, frontier proprietary models, and publicly released AI safety guard models, in order to provide a thorough and multi-dimensional reference for performance assessment.

General-purpose open-weight models include the Qwen3 family (Qwen3-8B, Qwen3-32B, and Qwen3-30B-A3B-2507) (Qwen Team, 2025), the Qwen3.5 family (Qwen3.5-9B, Qwen3.5-27B, Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, and Qwen3.5-397B-A17B) (Qwen Team, 2026), DeepSeek-R1 (DeepSeek-AI, 2025a), DeepSeek-V3.2 (DeepSeek-AI, 2025b), Kimi-K2.5 (Kimi Team, 2026), MiniMax-M2.5 (MiniMax Team, 2026), and GLM-5 (GLM-5-Team, 2026).

Frontier proprietary models include Qwen3-Max (Qwen Team, 2025), Qwen3.5-Plus (Qwen Team, 2026), Qwen3.6-Plus (Qwen Team and Qwen Team, 2026), and GPT-5.4 (OpenAI, 2026). These models serve as upper-bound reference points for contextualizing the performance of open-weight and domain-specialized systems.

AI safety guard models, where publicly available, are included as additional reference points. This category includes Qwen3Guard-8B (Zhao et al., 2025) and Llama-Guard4-12B (Meta, 2025).

The above list is not exhaustive; additional baselines are included in specific benchmark evaluations where relevant comparisons are available. All baseline models are evaluated under the same prompt format and decoding configuration to ensure fair and consistent comparison across benchmark levels.

Evaluation protocol.

We follow the four-level evaluation framework defined in Section 4. For multiple-choice and general reasoning tasks, we report Accuracy. For content safety classification tasks, we report Macro F1-Score as the primary metric to account for class imbalance. For adversarial robustness evaluation on the dynamic benchmark, we report the combined score (bypass success rate $\times$ semantic fidelity; lower is better). For domain capability and business benchmarks, we report task-specific metrics including risk recognition F1, attribution accuracy, and workflow-level indicators as appropriate. Detailed scoring formulas are provided in Appendix B.2 and Appendix B.3.

Implementation details.

For open-weight models, we use the officially released instruction-tuned checkpoints and apply the corresponding chat templates as recommended by each model’s documentation. For proprietary models, we access the models via their official APIs at the versions available at the time of evaluation. For the dynamic adversarial benchmark, adversarial rewrites are generated using an automated red-teaming pipeline, with semantic fidelity assessed by an independent LLM evaluator; rewrites that are over-obfuscated to the point of losing human readability are penalized to a combined score of zero. Training-stage ablations use the same evaluation sets and metrics as the main experiments to ensure consistency. Detailed prompt templates and decoding settings are provided in Appendix B.1.

5.2 Main Results

5.2.1 Open-source General Benchmark Results

We first evaluate Yuvion LLM on the open-source general benchmark suite to verify that domain-specific training does not materially degrade the model’s general language competence. The evaluation is organized into two groups: general-purpose benchmarks covering broad knowledge, reasoning, Chinese language capability, and scientific problem solving, and agentic benchmarks assessing tool use, function calling, and multi-step interactive problem solving. Together, these two groups span more than 30 public evaluation sets.

General-purpose benchmarks.

The general-purpose benchmark group includes 33 representative evaluations covering knowledge understanding, Chinese language understanding, mathematical reasoning, and commonsense and reading comprehension. We compare Yuvion-32B against both same-scale open-weight models, such as Qwen3-32B and Qwen3.5-27B, and substantially larger frontier systems including Qwen3.5-397B-A17B, Qwen3.6-Plus, GLM-5, and Kimi-K2.5. For proprietary models whose parameter counts are not officially disclosed, we report widely cited community estimates only as rough scale references. Results are presented in Table 2.

Table 2: Comparison among Yuvion-32B and representative baseline models on open-source general-purpose benchmarks. Accuracy is reported.

Category	Benchmark	Yuvion-32B (32B)	Qwen3-32B (32B)	Qwen3.5-27B (27B)	Qwen3.5-397B -A17B (397B)	Qwen3.6-Plus ( $\approx$ 600B)	GLM-5 ( $\approx$ 700B)	Kimi-K2.5 ( $\approx$ 1T)
Knowledge Understanding	MMLU	0.8276	0.8476	0.8912	0.9123	0.9249	0.9034	0.9057
	MMLU-Redux	0.8263	0.8410	0.8713	0.8940	0.8930	0.8883	0.8950
	MMLU-Pro	0.7180	0.7305	0.7880	0.8065	0.8040	0.8225	0.8300
	GPQA	0.5051	0.5354	0.4293	0.4343	0.4091	0.5606	0.6111
	ARC-Challenge	0.9573	0.9608	0.9735	0.9778	0.9744	0.9701	0.9778
	OpenBookQA	0.9420	0.9560	0.9640	0.9780	0.9700	0.9640	0.9680
	TriviaQA	0.6580	0.6820	0.7270	0.8290	0.8505	0.8480	0.8675
	Xiezhi-EN	0.7010	0.7035	0.7140	0.7340	0.7190	0.7235	0.7420
Chinese Language Understanding	C-Eval	0.8216	0.8580	0.8840	0.9115	0.9041	0.8989	0.9301
	CMMLU	0.8120	0.8460	0.8840	0.8980	0.8935	0.8900	0.9250
	C3	0.9540	0.9589	0.9819	0.9814	0.9830	0.9764	0.9825
	CHID	0.8561	0.8876	0.9181	0.9156	0.9316	0.9036	0.9071
	CLUEWSC	0.9047	0.9221	0.9529	0.9529	0.9467	0.9283	0.9488
	OCNLI	0.7369	0.7544	0.8067	0.7897	0.7861	0.8246	0.7948
	CSEM	0.9212	0.9364	0.9568	0.9508	0.9492	0.9424	0.9576
	Xiezhi-CN	0.7900	0.8140	0.8250	0.8160	0.8120	0.8130	0.8125
Mathematical Reasoning	GSM8K-ZH	0.9219	0.9249	0.9439	0.9477	0.9416	0.9386	0.9500
	MATH	0.7750	0.7725	0.6990	0.7385	0.7265	0.7795	0.8390
	APE210K	0.8785	0.8730	0.9105	0.9070	0.9005	0.9010	0.9215
	TAL-SCQ5K-CN	0.8280	0.8010	0.8560	0.8675	0.8345	0.8870	0.9190
	TAL-SCQ5K-EN	0.9135	0.9165	0.9175	0.9255	0.9190	0.9355	0.9315
	TheoremQA	0.4612	0.4375	0.4662	0.4863	0.4788	0.4838	0.5363
Commonsense & Reading Comprehension	BoolQ	0.8853	0.8768	0.8612	0.8862	0.8859	0.8789	0.8676
	CommonsenseQA	0.8468	0.8477	0.8747	0.8935	0.8812	0.8600	0.8608
	HellaSwag	0.8457	0.8953	0.9527	0.9550	0.9602	0.9176	0.9385
	PIQA	0.8939	0.9064	0.9499	0.9505	0.9587	0.9456	0.9461
	SIQA	0.7677	0.7897	0.8158	0.8188	0.8245	0.7912	0.7994
	WinoGrande	0.7758	0.7987	0.8958	0.9242	0.9361	0.8934	0.9116
	DROP	0.9075	0.8960	0.9365	0.9320	0.9365	0.9275	0.9490
	SQuAD 2.0	0.7565	0.7740	0.8275	0.8465	0.8515	0.7950	0.8065
	StoryCloze	0.9887	0.9874	0.9954	0.9954	0.9927	0.9921	0.9927
	BBH	0.7635	0.7555	0.8685	0.8770	0.8890	0.9155	0.9020
	WPLC	0.2175	0.2410	0.2820	0.2775	0.3335	0.2900	0.3490
Average (all benchmarks)		0.7988	0.8099	0.8370	0.8488	0.8485	0.8482	0.8629

Overall, Yuvion-32B achieves an average accuracy of 0.7988 across all 33 benchmarks, compared to 0.8099 for Qwen3-32B, indicating that general capability is well preserved after knowledge-enhanced continued pretraining and policy-grounded multi-task safety post-training. While substantially larger frontier models still achieve higher absolute scores, Yuvion remains broadly comparable to same-scale general-purpose models, which is the more relevant reference point for evaluating capability retention. Yuvion-32B also outperforms Qwen3-32B on several individual benchmarks, including TAL-SCQ5K-CN, TheoremQA, DROP, BoolQ, and StoryCloze, showing that safety specialization does not uniformly reduce general capability. Overall, these results indicate that Yuvion preserves broad general utility at its model scale while specializing for safety-critical deployment.

Agentic benchmarks.

We further evaluate Yuvion-32B on agentic benchmarks to examine the effectiveness of the safety-aware agentic reinforcement learning stage (Section 3.5). As shown in Table 3, the evaluation covers two categories: tool-use benchmarks and search-agent benchmarks.

The tool use benchmark suite includes API-Bank (Li et al., 2023b), which assesses tool selection and execution in multi-turn dialogues over 73 API tools, and BFCL (Patil et al., 2024) (Berkeley Function Calling Leaderboard), which evaluates function-calling capability across dimensions such as AST accuracy, execution accuracy, live API interactions, multi-turn conversations, and relevance detection. For search-augmented reasoning, we evaluate on Seal-0, a benchmark built on the Tongyi DeepResearch framework (Tongyi DeepResearch Team et al., 2025), which measures the model’s effectiveness in orchestrating multi-step search actions for complex information-seeking queries.

As reported in Table 3, Yuvion-32B attains the top score on API-Bank (90.45%) and an average accuracy of 65.72%, matching substantially larger proprietary models such as Qwen3-Max (66.09%). Relative to its base model Qwen3-32B, Yuvion-32B delivers consistent gains of $+2.85$ , $+1.47$ , and $+9.91$ points on API-Bank, BFCL, and Seal-0, respectively, translating into a $+4.75$ -point improvement in average accuracy. The notable lift on Seal-0 demonstrates that the search-agent RL stage substantially enhances multi-turn search planning and evidence synthesis, whereas the gains on API-Bank and BFCL show that the tool-use RL track sharpens fine-grained function-calling capability without being eroded by safety-oriented specialization.

Table 3: Comparison among Yuvion-32B and baseline models on agentic capability benchmarks. Accuracy (%) is reported for all benchmarks. The highest score in each column is in bold font, and the second is underlined.

Model	API-Bank	BFCL	Seal-0	Avg.
Yuvion-32B	90.45	66.16	40.54	65.72
Qwen3-32B	87.60	64.69	30.63	60.97
Qwen3-Max	89.28	68.44	40.54	66.09
Qwen3.5-Plus	84.09	74.41	46.90	68.47
Qwen3.6-Plus	87.44	70.42	54.05	70.64
DeepSeek-V3.2	76.05	55.08	50.45	60.53
Kimi-K2.5	80.40	68.27	43.24	63.97
GLM-5	83.25	64.79	44.14	64.06

5.2.2 Open Content Safety Benchmark Results

This evaluation covers two benchmark groups: content safety benchmarks and guard benchmarks. The content safety benchmarks focus on harmful or policy-violating content recognition, containing 8 evaluation sets spanning both Chinese and English. The guard benchmarks focus on safety judgment, refusal behavior, and safety-aligned response capability, including benchmarks such as SEval and AEGIS and containing more than 20 evaluation sets.

Content safety benchmarks.

Table 4 reports Macro F1-Score across 8 content safety evaluation sets spanning both Chinese and English, covering risks such as pornography, fraud, offensive language, hate speech, and implicit toxicity. We evaluate Yuvion-8B and Yuvion-32B alongside representative general-purpose baselines including GPT-5.4, Qwen3-32B, Qwen3.5-27B, Qwen3-Max, and DeepSeek-R1.

Yuvion-32B achieves the highest average Macro F1 of 78.2%, surpassing all baselines including GPT-5.4 (72.2%) and Qwen3-Max (73.9%). It obtains the best results on 5 out of 8 benchmarks: ChineseHarm (97.9%), HateXplain (63.6%), ToxiGen (86.0%), Jigsaw (76.0%), and CivilComments (65.4%). At the same model scale, Yuvion-32B outperforms Qwen3-32B (69.1%) by 9.1 points, while even Yuvion-8B reaches 73.3%, exceeding several much larger general-purpose baselines. These results show that Yuvion achieves state-of-the-art performance in the safety domain and exhibits a clear cross-scale advantage.

Yuvion-32B also maintains balanced bilingual performance across Chinese and English safety tasks, whereas several general-purpose baselines show stronger cross-lingual imbalance or over-moderation. Overall, the results confirm that Yuvion’s safety specialization yields substantial gains on public content safety benchmarks without sacrificing robustness across languages.

Table 4: Comparison among Yuvion variants and baseline models on open content safety benchmarks. Macro F1-Score (%) is reported. The highest score in each row is in bold font, and the second is underlined.

Benchmark	Yuvion-32B	Yuvion-8B	Qwen3-32B	Qwen3-8B	GPT-5.4	Qwen3-Max	Qwen3.5-27B	DeepSeek-R1
ChineseHarm	97.9	96.8	93.2	90.8	85.3	87.9	81.2	54.4
COLD	72.6	69.6	64.4	61.2	74.8	73.5	71.0	70.8
Moderation	76.6	76.5	67.1	79.3	72.9	77.8	77.1	74.0
HateXplain	63.6	58.5	50.7	51.8	55.6	58.3	63.4	58.6
ToxiGen	86.0	83.3	73.3	73.1	73.9	77.4	75.8	34.5
Jigsaw	76.0	68.5	67.2	68.7	69.2	74.6	72.2	38.0
CivilComments	65.4	51.2	50.3	51.7	56.9	63.2	61.3	56.2
SafetyBench	87.5	81.6	86.2	82.4	89.3	78.4	87.7	85.9
Avg.	78.2	73.3	69.1	69.9	72.2	73.9	73.7	59.0

Guard benchmarks.

To evaluate Yuvion’s safety detection capability, we adopt the guard benchmark protocol of YuFeng-XGuard (Lin et al., 2026), which covers 28 sub-datasets across five dimensions: prompt classification, response classification, multilingual classification, attack defense, and safe completion. We also measure the false positive rate (FPR) on benign instruction-following datasets (Alpaca, Belle) to assess over-refusal. We compare each Yuvion model against its corresponding Qwen3 base model to isolate the effect of safety-oriented training.

Table 5: Comparison between Yuvion models and their corresponding base models on guard benchmarks. Average F1-Score (%) and false positive rate (FPR, %) are reported.

\Delta

denotes the absolute change of Yuvion over the corresponding base model. For FPR, lower is better (

\downarrow

Model	Prompt	Response	Multi-	Attack	Safe	Avg.	FPR
	Cls.	Cls.	lingual	Def.	Compl.		( $\downarrow$ )
Qwen3-8B	13.0	20.5	10.9	35.8	8.1	17.6	0.35
Yuvion-8B	74.4	78.9	71.4	79.4	71.5	75.1	0.29
$\Delta$	+61.4	+58.4	+60.5	+43.6	+63.4	+57.5	-0.06
Qwen3-32B	73.1	59.1	67.2	77.6	76.2	70.6	2.91
Yuvion-32B	77.6	82.3	70.8	84.7	79.1	78.9	0.18
$\Delta$	+4.5	+23.2	+3.6	+7.1	+2.9	+8.3	-2.73

As shown in Table 5, the safety training pipeline brings consistent and substantial gains across all dimensions. Yuvion-8B improves over Qwen3-8B by an average of 57.5 percentage points (from 17.6% to 75.1%), with the largest gains on prompt classification (+61.4%) and safe completion (+63.4%). Yuvion-32B improves over the already stronger Qwen3-32B by 8.3 percentage points on average (from 70.6% to 78.9%), with particularly notable gains on response classification (+23.2%). Crucially, the improved detection does not come at the cost of over-refusal: Yuvion-32B achieves an average FPR of only 0.18%, substantially lower than Qwen3-32B (2.91%), and Yuvion-8B (0.29%) also outperforms Qwen3-8B (0.35%). These results demonstrate that the safety training pipeline effectively equips the model with robust guard capability while maintaining low false positive rates.

Comparison with dedicated guard models.

To further contextualize Yuvion’s guard capability, we conduct a pairwise win rate analysis against a comprehensive set of dedicated guard models across 51 sub-datasets. (derived from 28 sub-datasets, where some datasets are evaluated under both query-only and query+response settings). As shown in Figure 5, Yuvion-32B achieves a win rate of 71.7%, ranking second overall. Yuvion-8B attains 54.1%, surpassing several guard models (WildGuard-7B, NemoGuard-V2, LlamaGuard3/4, ShieldGemma-9B) but falling below others such as Qwen3Guard-8B (58.2%) and GPT-OSS-Guard-20B (64.0%). This positioning reflects a deliberate design choice: unlike dedicated guard models that concentrate their entire training budget on safety classification, Yuvion distributes its capacity across domain knowledge acquisition, adversarial robustness, agentic capabilities, and business-oriented workflows—guard-style judgment is only one of many training objectives.

Notably, when the same training paradigm is applied with the data pipeline concentrated on AI safety and guard-related tasks—incorporating a dynamic policy mechanism that enables runtime policy adjustment without retraining—the resulting model, Yuvion-Guard-8B (Lin et al., 2026), achieves the highest win rate of 85.6%, surpassing all dedicated guard models by a substantial margin. This confirms that the gap between Yuvion-8B and top guard models is attributable to training data allocation rather than a methodological limitation. Yuvion LLM prioritizes comprehensive content safety competence at a modest cost to guard-specific classification, while the same methodology readily produces a state-of-the-art dedicated guard model when narrower specialization is desired. The weights of Yuvion-Guard-8B have been publicly released to support community research and deployment.⁰⁰0https://huggingface.co/Alibaba-AAIG/YuFeng-XGuard-Reason-8B

5.2.3 Self-Constructed Adversarial Robustness Benchmark Results

We evaluate Yuvion LLM on the self-constructed adversarial robustness benchmark described in Section 4.4, covering five major risk categories across both static and dynamic evaluation sets.

Static evaluation sets.

Table 6 reports classification performance (Macro F1-Score, %) on the static evaluation sets, which assess standard domain recognition capability under canonical risk expression distributions across five major risk categories.

Table 6: Comparison among Yuvion variants and baseline models on the static content safety evaluation sets. Macro F1-Score (%) is reported. Avg.^∗ denotes the average excluding the Spam & Flooding category. The highest score in each column is in bold font, and the second is underlined.

Model	Adv. & Traffic	Gambling & Fraud	Abusive	Pornographic	Spam & Flooding	Avg.	Avg.^∗
Qwen3-8B	81.3	83.0	79.6	80.4	60.9	77.0	81.1
Qwen3-32B	84.0	84.9	83.8	83.6	75.7	82.4	84.1
Qwen3-Max	88.7	88.3	87.9	88.7	72.8	85.3	88.4
Qwen3.5-27B	88.0	90.0	87.5	87.2	56.8	81.9	88.2
Qwen3.5-122B-A10B	91.1	93.6	90.6	88.8	51.5	83.1	91.0
DeepSeek-R1	89.0	89.5	88.6	86.9	74.1	85.6	88.5
GPT 5.4	88.9	90.5	90.8	87.7	68.0	85.2	89.5
Yuvion-8B	94.3	95.3	93.8	88.9	41.9	82.8	93.1
Yuvion-32B	93.8	95.7	95.3	92.3	57.2	86.8	94.2

Yuvion-32B achieves the highest overall average of 86.8% and an Avg.^∗ of 94.2%, outperforming all compared baselines overall and leading all baselines on three of the five categories. Yuvion-8B achieves the best score on Advertising & Traffic (94.3%) and reaches an Avg.^∗ of 93.1%, surpassing most baselines, including substantially larger models such as Qwen3.5-122B-A10B (91.0%), Qwen3-Max (88.4%), DeepSeek-R1 (88.5%), and GPT 5.4 (89.5%). These results indicate strong cross-scale competitiveness of Yuvion models on core content safety recognition tasks. Performance on Spam & Flooding is notably unstable for all models, largely because certain e-commerce platform interaction content shares overlapping surface features with Spam & Flooding instances, creating inherent boundary ambiguity in both annotation and evaluation; excluding it, Yuvion’s advantage becomes more pronounced.

Dynamic evaluation sets.

Table 7 reports the adversarial robustness evaluation on the dynamic sets. The combined score is defined as the product of bypass success rate and semantic fidelity score; a lower combined score indicates stronger robustness against adversarial attacks. Due to the higher computational cost of the automated red-teaming pipeline, the dynamic evaluation is conducted on a representative subset of baseline models.

Table 7: Comparison among Yuvion variants and baseline models on the dynamic content safety benchmark. Combined score (%) is reported; lower is better (

\downarrow

). The highest score in each column is in bold font, and the second is underlined.

Model	Adv. & Traffic	Pornographic	Abusive	Spam & Flooding	Gambling & Fraud	Overall
Qwen3-8B	16.5	42.7	25.6	32.6	54.8	34.4
Qwen3-32B	15.5	35.3	17.5	29.3	40.9	27.7
Qwen3-Max	9.9	31.4	17.1	28.3	35.2	24.4
GPT-5.4	7.0	29.6	10.7	24.5	38.5	22.3
Yuvion-8B	13.9	33.8	21.3	23.5	48.3	28.1
Yuvion-32B	6.7	16.8	16.7	29.1	33.7	20.6

Yuvion-32B achieves the lowest overall combined score of 20.6%, outperforming all compared models, followed by GPT-5.4 (22.3%), Qwen3-Max (24.4%), Qwen3-32B (27.7%), Yuvion-8B (28.1%), and Qwen3-8B (34.4%). It delivers the best performance on 3 out of 5 individual categories—advertising & traffic diversion, pornographic, and gambling & fraud—with particularly strong robustness on advertising & traffic (6.7%). GPT-5.4 also shows strong adversarial robustness, ranking second overall and achieving the best result on abusive content (10.7%). By contrast, gambling & fraud remains the most difficult category across all models, indicating that adversarial rewrites in this domain more easily exploit semantic gray areas. Importantly, Yuvion-8B achieves the best score on spam & flooding (23.5%) and an overall score (28.1%) comparable to much larger baselines such as Qwen3-32B (27.7%) and Qwen3-Max (24.4%), demonstrating that safety-specialized training enables compact models to rival general-purpose models at a fraction of the parameter cost. Overall, these results show that Yuvion’s advantage extends beyond standard classification to realistic adversarial robustness.

5.2.4 In-house Capability and Business Benchmark Results

We evaluate Yuvion on the in-house capability and business benchmarks described in Section 4.5, covering more than 15 evaluation sets across domain capability benchmarks and business-oriented benchmarks. Results are reported for Yuvion-8B, Yuvion-32B, and Yuvion-32B (Agent), together with a broad set of open-weight, proprietary, and guard-model baselines. This benchmark suite is designed to evaluate not only generic safety judgment, but also domain-specific policy understanding, fine-grained content risk recognition, and practical deployment utility in realistic moderation workflows.

Tables 8 and 9 report the full in-house evaluation results on domain capability and business-oriented benchmarks, respectively. Overall, Yuvion shows consistently strong performance across both benchmark groups. In particular, Yuvion-32B achieves 85.78 on the in-house domain benchmark composite and 86.34 on the business benchmark composite, while Yuvion-8B reaches 82.38 and 82.72, respectively. Yuvion-32B (Agent) further improves these results to 86.10 and 87.34, showing that agentic RL yields modest but consistent gains on in-house workflows in addition to its larger improvements on dedicated agentic benchmarks. For a unified summary across these two benchmark groups, we additionally report an overall composite score in Appendix C.

Domain capability results.

On the in-house domain capability benchmarks, Yuvion-32B achieves an overall score of 85.78, outperforming the strongest proprietary baselines such as Qwen3-Max (81.41), Qwen3.6-Plus (81.83), and GPT-5.4 (80.73), as well as strong open-weight baselines including DeepSeek-R1 (80.54) and GLM-5 (79.84). Yuvion-32B shows particularly strong results on Political Risk (95.00), Knowledge MCQ (80.50), Redline Text (78.65), Prohibited Content (92.76), Low-Info Text (77.93), and Porn Text (94.56). Yuvion-8B also performs strongly, reaching 82.38 and surpassing all open-weight and proprietary baselines in this benchmark group. Yuvion-32B (Agent) further improves the overall score to 86.10, indicating that agentic RL does not harm core safety capability and can bring additional gains in structured domain tasks.

Business benchmark results.

On the in-house business benchmarks, Yuvion-32B achieves the highest overall score of 86.34, outperforming all evaluated open-weight and proprietary baselines. In particular, it obtains 97.21 on UGC Moderation, 83.28 on AIGC Moderation, 74.32 on Business Porn Detection, 85.83 on Multi-Scenario Risk Detection, and 91.04 on Data Security NER. Yuvion-8B also performs competitively, achieving 82.72 and exceeding all open-weight baselines. Yuvion-32B (Agent) further improves the business overall score to 87.34, with gains on AIGC Moderation, Multi-Scenario Risk Detection, and Data Security NER, suggesting that agentic RL is particularly helpful for structured and workflow-oriented business tasks.

Table 8: Comparison among Yuvion variants and representative open-source, proprietary, and guard baselines on in-house domain capability benchmarks. Overall denotes the composite score. The highest score in each column is in bold font, and the second is underlined.

Category	Model	Overall	Political Risk	Political Entity	Knowledge MCQ	Redline Text	Domain Instr. Follow.	Political NER	Prohibited Content	Insult	Low-Info Text	Porn Text	Emotion Analysis
Open-source	Qwen3-8B	73.71	81.26	89.09	69.46	57.51	68.35	65.29	80.88	90.65	61.82	83.10	63.42
	Qwen3-32B	79.79	88.40	91.41	73.53	66.27	74.76	79.77	86.02	94.22	66.33	90.77	66.24
	Qwen3-30B-A3B-2507	77.14	87.31	92.16	71.48	68.50	71.08	77.60	80.72	87.87	60.86	88.91	62.01
	Qwen3.5-9B	72.68	78.48	84.55	68.48	64.55	69.38	48.06	82.40	87.73	65.13	84.20	66.49
	Qwen3.5-27B	77.17	86.55	90.66	72.57	68.73	77.54	58.92	84.06	92.74	68.20	83.70	65.20
	Qwen3.5-35B-A3B	75.46	83.41	90.36	71.02	64.32	75.85	55.48	83.06	90.81	67.21	81.53	67.03
	Qwen3.5-122B-A10B	77.30	84.65	91.11	73.79	67.96	76.33	59.54	84.14	94.50	67.85	85.53	64.93
	Qwen3.5-397B-A17B	77.45	88.00	91.94	77.39	62.11	81.35	51.77	81.67	93.99	72.47	85.21	66.09
	DeepSeek-R1	80.54	88.00	91.45	77.06	67.94	82.81	75.02	88.86	94.79	71.98	88.08	60.00
	DeepSeek-V3.2	76.27	85.20	88.95	68.75	49.80	71.62	78.04	82.89	94.38	70.73	83.26	65.37
	KIMI-K2.5	74.61	87.52	91.87	58.48	38.79	82.71	76.06	79.61	91.26	71.91	74.72	67.82
	Minimax-M2.5	72.60	88.10	91.35	55.34	61.10	74.40	79.26	40.80	91.18	71.03	83.26	62.76
	GLM-5	79.84	84.57	90.37	77.20	62.36	82.56	82.42	87.36	92.47	70.47	88.61	59.90
Proprietary	Qwen3-Max	81.41	89.30	90.42	76.74	68.77	83.14	80.78	87.74	93.31	69.05	89.78	66.49
	Qwen3.5-Plus	77.92	89.75	92.38	78.91	61.02	82.79	53.49	79.80	94.19	73.78	85.19	65.79
	Qwen3.6-Plus	81.83	87.13	92.34	78.11	73.13	83.12	71.18	86.23	94.74	72.26	92.46	69.44
	GPT-5.4	80.73	87.90	93.56	79.77	64.22	75.61	82.69	85.16	94.77	76.80	77.53	70.00
Guard	Qwen3Guard-8B	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Guard	Llama-Guard4-12B	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Yuvion (Ours)	Yuvion-8B	82.38	93.20	87.31	73.51	77.54	73.53	82.05	91.54	91.48	71.30	96.73	67.96
	Yuvion-32B	85.78	95.00	90.56	80.50	78.65	80.97	82.39	92.76	94.55	77.93	94.56	75.66
	Yuvion-32B (Agent)	86.10	95.35	91.21	81.16	78.95	83.66	83.52	92.49	94.53	77.78	94.78	73.67

Table 9: Comparison among Yuvion variants and representative open-weight, proprietary, and guard baselines on in-house business benchmarks. Overall denotes the business composite score. The highest score in each row is in bold font, and the second is underlined.

Category	Model	Overall	UGC Moderation	AIGC Moderation	Business Porn Detection	Multi-Scenario Risk Detection	Data Security NER
Open-source	Qwen3-8B	67.57	26.38	66.03	78.50	81.96	84.98
	Qwen3-32B	68.10	32.25	56.11	74.83	82.87	94.43
	Qwen3-30B-A3B-2507	56.14	9.27	36.34	72.74	68.66	93.67
	Qwen3.5-9B	69.89	55.44	66.13	63.82	80.80	83.28
	Qwen3.5-27B	73.92	66.83	71.13	63.39	79.62	88.65
	Qwen3.5-35B-A3B	72.75	69.80	72.17	74.27	81.60	65.89
	Qwen3.5-122B-A10B	76.61	65.72	74.19	76.45	80.30	86.41
	Qwen3.5-397B-A17B	72.62	72.65	72.50	42.51	83.44	91.98
	DeepSeek-R1	69.85	54.65	62.72	81.30	80.94	69.62
	DeepSeek-V3.2	69.88	55.43	62.37	75.01	75.00	81.60
	KIMI-K2.5	79.84	84.91	77.83	69.92	84.98	81.56
	Minimax-M2.5	56.32	60.84	66.16	47.45	50.27	56.86
	GLM-5	73.96	47.28	64.35	77.32	84.42	96.42
Proprietary	Qwen3-Max	83.00	83.09	77.38	78.75	78.64	97.12
	Qwen3.5-Plus	73.38	79.17	75.35	31.01	83.89	97.47
	Qwen3.6-Plus	78.44	73.77	72.04	78.54	82.31	85.56
	GPT-5.4	80.40	73.85	70.86	80.51	78.95	97.83
Guard	Qwen3Guard-8B	0.01	0.03	0.03	0.00	0.00	0.00
Guard	Llama-Guard4-12B	0.00	0.00	0.01	0.00	0.00	0.00
Yuvion (Ours)	Yuvion-8B	82.72	95.94	82.43	69.39	83.28	82.57
	Yuvion-32B	86.34	97.21	83.28	74.32	85.83	91.04
	Yuvion-32B (Agent)	87.34	96.34	83.44	73.51	85.98	97.45

Comparison with open-weight and proprietary baselines.

A notable result is that Yuvion establishes a clear cross-scale advantage over both open-weight and proprietary baselines on in-house safety evaluations. Yuvion-8B already surpasses all evaluated general-purpose baselines, including frontier proprietary models, on the in-house domain benchmarks and remains highly competitive on business benchmarks, while Yuvion-32B and Yuvion-32B (Agent) achieve the strongest overall results. This is particularly striking given model scale: Yuvion-32B substantially outperforms much larger open-weight baselines such as Qwen3.5-122B-A10B and Qwen3.5-397B-A17B, and even Yuvion-8B exceeds these larger models as well as several proprietary systems on the in-house domain benchmarks. These results indicate that for realistic content safety deployment, targeted safety-oriented training can outweigh raw model scale and even close the gap to, or surpass, stronger proprietary general-purpose models.

Comparison with guard models.

The near-zero scores of Qwen3Guard-8B and Llama-Guard4-12B on both domain capability and business benchmarks reveal a fundamental limitation of guard-style models in realistic deployment settings. Although such models are designed for generic safety judgment, they lack the domain-specific knowledge, fine-grained policy understanding, and structured output capability required by practical moderation and business workflows. This large performance gap highlights the importance of a dedicated training paradigm such as Yuvion, which targets deployable content safety competence rather than only generic safety filtering behavior.

5.3 Ablation Study

5.3.1 Ablation on Domain Knowledge Data

To investigate the contribution of domain-specific knowledge data in the continued pretraining stage, we conduct an ablation study comparing two intermediate versions of Yuvion LLM: one trained without domain knowledge data and one with a partial set of knowledge descriptions incorporated. Both models share identical configurations across all other training stages; the only difference lies in whether a curated subset of domain knowledge corpora—covering content safety knowledge and policies, regulatory guidelines, and domain-specific annotations—are included during continued pretraining. Note that both models represent intermediate checkpoints in the iterative development pipeline rather than the final released version of Yuvion LLM; specifically, both checkpoints share the same model parameter scale as the final release, but are trained on a reduced SFT dataset as part of an earlier experimental iteration. Table 10 reports results across the in-house capability benchmark suite.

Table 10: Ablation study on domain knowledge data. Results are reported for intermediate Yuvion checkpoints during iterative development. Domain Composite denotes the weighted average score over all domain capability benchmarks.

Model	Domain Composite	Domain Capability Benchmarks
Model	Domain Composite	Pol. Risk	Pol. Entity	Know. Text	Red- line	UCMF EN	Porn Txt v1	NER Politics	Prohib. Cls	Insult	Meaning- less	Biz Porn	Emotion
w/o knowledge data	79.68	94.46	91.04	78.80	72.35	79.26	60.65	82.96	92.30	70.64	66.68	96.37	70.60
w/ knowledge data	83.64	95.40	90.67	80.60	78.29	78.89	62.13	82.21	92.79	94.84	78.12	94.31	75.38
$\Delta$	+3.96	+0.94	$-$ 0.37	+1.80	+5.94	$-$ 0.37	+1.48	$-$ 0.75	+0.49	+24.20	+11.44	$-$ 2.06	+4.78

The results demonstrate that incorporating even a partial set of domain knowledge descriptions during continued pretraining yields consistent and meaningful improvements across domain capability benchmarks. Overall, adding knowledge data produces a substantial gain of +3.96% in the domain composite score, with particularly pronounced improvements on the Insult benchmark (+24.20%) and the Meaningless benchmark (+11.44%)—two categories that require nuanced semantic understanding of domain-specific expressions that are inherently difficult to acquire from general corpora alone. Notable gains are also observed on Red-line (+5.94%), Knowledge Text (+1.80%), and Emotion (+4.78%), further confirming that even a partial injection of structured domain knowledge can effectively anchor the model’s understanding of fine-grained, domain-sensitive concepts.

A small number of benchmarks show marginal regressions (e.g., Politics Entity, UCMF EN, Biz Porn), which we attribute to distribution shift introduced by the knowledge corpora. Importantly, these regressions remain limited in scale and are largely mitigated in subsequent training stages. Overall, these findings confirm that incorporating structured domain knowledge descriptions—even partially—is a valuable component of the continued pretraining stage, providing semantic grounding that meaningfully improves the model’s domain capability and motivating the inclusion of a more comprehensive knowledge corpus in the full training pipeline.

5.3.2 Ablation on Reinforcement Learning Design

To isolate the contribution of each training stage to the model’s agentic capabilities, we conduct a progressive ablation comparing three checkpoints: (1) the model after safety-oriented SFT only, (2) after adding tool-use RL, and (3) after further adding search-agent RL (the full Yuvion-32B Agent). All checkpoints share identical configurations for continued pretraining and SFT; the only difference is the scope of agentic RL training applied. Evaluation is conducted on the agentic benchmarks (API-Bank, BFCL, Seal-0), as shown in Table 11.

Table 11: Progressive ablation on agentic RL training stages. Accuracy (%) on agentic benchmarks.

Training Stage	API-Bank	BFCL	Seal-0
SFT only	83.75	45.07	19.82
+ Tool Use RL	88.78	54.64	31.53
+ Tool Use RL + Search Agent RL	90.45	66.16	40.54

The SFT-only model exhibits degraded agentic performance compared with the base Qwen3-32B, reflecting a partial loss of general-purpose tool-use and search capabilities after intensive safety-oriented specialization. Adding tool-use RL yields substantial improvements on API-Bank and BFCL, which directly evaluate function calling and tool selection. Further adding search-agent RL produces a large gain on Seal-0 while preserving the tool-use gains on API-Bank and BFCL. Together, these results show that agentic RL primarily restores and extends dedicated agentic capability, while also contributing modest but consistent improvements on realistic in-house workflows, as seen in the in-house benchmark results above.

5.4 Case Study

5.4.1 Adversarial Evasion Robustness

To qualitatively validate our quantitative findings, we present representative cases spanning two violation categories—Drug Trafficking and Gambling & Fraud —each examined across three progressively adversarial variants: an explicit original, a lexically camouflaged variant, and a fully disguised adversarial variant. As illustrated in Figure 6, general-purpose models perform adequately on explicit violations but exhibit systematic degradation as evasion sophistication increases. In the Drug Trafficking category, replacing the drug term with the slang expression “liu bing” is detectable by both models; however, encoding the same intent via a numeral–emoji substitution of the colloquial expression within a venue recommendation query is sufficient to mislead the general-purpose model into treating the inquiry as a routine leisure activity consultation. In the Gambling & Fraud category, both the explicit inquiry and the euphemistic variant (“poker entertainment auxiliary device”) are successfully blocked; however, once the product-specific keyword “poker” is removed and the request is reframed as a neutral inquiry about “entertainment auxiliary devices” available through “relevant channels,” the general-purpose model returns a Pass decision, citing insufficient evidence of violation.

Yuvion LLM maintains consistent and policy-grounded detection across all six cases. In the Drug Trafficking category, it reconstructs the numeral–emoji obfuscation as a colloquial reference to illicit substance use and correctly categorizes the content under Drug Trafficking regardless of the recreational framing. In the Gambling & Fraud category, it identifies the covert combination of category-neutral device referral and implicit channel solicitation as indicative of illegal goods promotion under the Gambling Devices policy category, even when no product-specific keyword is present. These qualitative observations confirm that the adversarially-aware training paradigm of Yuvion LLM—spanning domain-adaptive continued pretraining, risk-aware SFT, and RL-based policy optimization—equips the model with the semantic generalization necessary to detect violations that evade general-purpose models, while consistently grounding its decisions in explicit domain policy categories.

5.4.2 Agentic Task Execution

To qualitatively illustrate the behavioral impact of agentic reinforcement learning, we present representative cases comparing model outputs before and after agentic RL training. We examine two complementary scenarios—open-domain search and domain-specific business tool calling—to demonstrate that agentic RL induces systematic improvements in both general retrieval strategies and specialized workflow execution.

Case 1: Search avoidance $\rightarrow$ active decomposition.

Given the query “Which Vietnamese professor explained the mysterious case of two monks said to have been meditating for over 400 years?,” the pre-RL model returns a refusal: “could not be identified through the available search results. Further direct access … would be required.” It fails to decompose the query into actionable sub-searches and terminates prematurely. After agentic RL, the model instead decomposes the problem into sequential sub-queries—first identifying the cultural phenomenon (embalmed monks at Dau Pagoda), then narrowing to the specific scholar—ultimately retrieving the correct answer (Professor Nguyen Lan Cuong) through iterative evidence accumulation.

Case 2: Tool bypass $\rightarrow$ systematic tool-assisted verification.

In an intellectual property infringement review workflow, the model receives a trademark complaint case containing the product’s textual information (title, attributes, promotional text), multiple categories of product images (main images, detail images, SKU images), and the complainant’s registered trademark. The audit rules follow a cascading structure: Rule 1 checks whether the trademark appears verbatim in the product’s text fields; if not, Rule 2 checks whether it appears in the product images; subsequent rules handle special cases such as mixed Chinese–English trademarks and case-insensitive matching. The system provides check_image_tool (an image-based trademark detection tool) and finish (for submitting the final judgment), among other tools. The pre-RL model bypasses the tool invocation step entirely, directly calling finish with a judgment (“pass—the trademark was found in the product’s text information”), attempting to reason about trademark matching using only its own interpretation of the text fields and short-circuiting the image verification required by the cascading rules. The post-RL model correctly invokes check_image_tool with the relevant image categories (product main images and detail images) to perform trademark detection before making any judgment, following the prescribed evidence-first workflow.

Discussion.

These two cases illustrate complementary dimensions of behavioral change induced by agentic RL. In open-domain search (Case 1), the pre-RL model defaults to refusal when parametric recall is insufficient, whereas the post-RL model treats search as the primary reasoning mechanism—actively decomposing complex questions and grounding answers in retrieved evidence. In domain-specific business workflows (Case 2), the pre-RL model attempts to short-circuit rule-based processes by producing direct judgments without tool-assisted verification, whereas the post-RL model adheres to the prescribed workflow by invoking the appropriate tools to gather evidence before reaching a conclusion. Together, these cases demonstrate that agentic RL not only improves search and retrieval strategies but also instills disciplined tool-use behavior in structured business processes—a critical capability for real-world content safety deployment.

5.5 Summary of Evaluation Results

Taken together, the multi-level Yuvion LLM RiskEval provides a progressive and deployment-oriented assessment of Yuvion LLM, covering general capability retention, public safety comparability, domain-specific robustness, and real-world deployment value. Results across all levels consistently validate the effectiveness of Yuvion’s staged training paradigm.

A clear pattern emerges from the full evaluation. On open-source general benchmarks, Yuvion preserves broad general capability and remains broadly comparable to same-scale general-purpose open-weight models, showing that safety specialization does not materially undermine general utility. On safety-oriented evaluations, however, Yuvion achieves state-of-the-art performance with a clear cross-scale advantage. On open content safety benchmarks, Yuvion-32B reaches an average Macro F1 of 78.2% across 8 evaluation sets, surpassing the strongest baselines including Qwen3-Max (73.9%). On the self-constructed benchmark, Yuvion-32B achieves an Avg.^∗ of 94.2% on static evaluation sets and the lowest adversarial combined score of 20.6% on dynamic evaluation sets. On the in-house benchmarks, Yuvion-32B achieves 85.78 on the in-house domain benchmark composite and 86.34 on the business benchmark composite, while Yuvion-32B (Agent) further improves these to 86.10 and 87.34.

Two findings stand out. First, targeted safety-domain training consistently outperforms raw model scale: both Yuvion-8B and Yuvion-32B surpass substantially larger general-purpose models across public, adversarial, and in-house safety evaluations. Second, existing guard models are not sufficient for real-world content safety deployment: the near-zero scores of Qwen3Guard-8B and Llama-Guard4-12B on the in-house benchmarks reveal a large gap between generic guard capability and the fine-grained domain competence required in operational safety workflows. Overall, Yuvion demonstrates that it is possible to preserve broad general capability while achieving state-of-the-art performance in the safety domain.

6 Closed-Loop Iteration for Content Safety Model Evolution

Content safety and AI safety are an ongoing adversarial game: violating content producers continuously adapt their evasion strategies, creating an arms race in which any fixed model will inevitably fall behind. Yuvion is therefore embedded within a closed-loop iteration framework (Figure 7) that integrates four mechanisms—knowledge injection, adversarial game-playing, agentic capability reinforcement, and deployment-driven feedback—into a unified cycle operating across the full model lifecycle.

Knowledge Injection.

A continuous pipeline feeds updated policy documents, regulatory guidelines, emerging risk case analyses, and expert annotations into the model through incremental continued pretraining. Domain experts identify knowledge gaps via systematic capability audits, construct targeted corpora, and validate that injected knowledge is absorbed without degrading existing competence.

Adversarial Game-Playing.

Rather than passively waiting for new evasion patterns to appear in production, the system proactively generates adversarial variants through three channels: (1) red-team self-play, where an attacker model generates evasion transformations against the current Yuvion checkpoint; (2) automated adversarial search over lexical substitution, semantic paraphrasing, code-switching, and contextual reframing; and (3) adversarial patterns extracted from real business traffic. Successful evasion examples are incorporated into the GRPO-based RL training loop, creating a virtuous cycle in which each iteration produces a model strictly harder to evade.

Agentic Capability Reinforcement.

As Yuvion is deployed in workflows requiring tool invocation and multi-step planning, the framework continuously refines agentic skills by constructing realistic environment simulations, collecting feedback from production tool interactions to address failure modes in tool selection and parameter formulation, and progressively extending task complexity from single-tool invocations to multi-step investigation workflows.

Deployment-Driven Feedback.

Production deployment generates three feedback streams: human operator corrections, downstream QA flags (false positives/negatives), and adversarial drift signals indicating shifts in violation distributions. Feedback cases undergo failure analysis and are routed to the appropriate upstream mechanism—knowledge gaps trigger new injection, novel evasion patterns trigger adversarial training, and agentic failures trigger capability reinforcement. Updated models are validated on Yuvion LLM RiskEval and the specific failure patterns before re-deployment.

The four mechanisms correspond to the fundamental challenges of content safety co-evolution: the knowledge landscape changes, evasion strategies evolve, operational complexity grows, and production conditions reveal blind spots invisible during offline training. Each iteration cycle enriches the training corpus with real-world failure cases and expert-curated knowledge, enabling Yuvion to transition from reactive adaptation toward proactive defense.

7 Ethical Considerations

This work involves the construction and modeling of content safety data that by nature includes harmful and policy-violating material spanning categories such as pornography, violence, extremism, politically sensitive content, and fraud. All sensitive data are collected from real-world moderation workflows under strict access control, stored in isolated environments, and not released publicly. The adversarial variants constructed for dynamic benchmark evaluation are used exclusively for robustness assessment under controlled conditions and do not constitute a practical evasion toolkit.

8 Related Work

LLM training.

Modern large language models are typically developed through a two-phase paradigm: large-scale unsupervised pretraining on broad text corpora (Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023; Dubey et al., 2024; Qwen Team, 2025), followed by post-training that aligns model behavior with human intent through supervised fine-tuning and reinforcement learning from human feedback (Ouyang et al., 2022; Bai et al., 2022; Rafailov et al., 2023; He et al., 2025a; b; Xu et al., 2026b). This paradigm has been extended to domain-specific settings, where continued pretraining on domain corpora adapts general-purpose representations to specialized knowledge (Gururangan et al., 2020; Raffel et al., 2020; Xu et al., 2026a), and domain-targeted reinforcement learning further sharpens task-specific reasoning (Shao et al., 2024). Yuvion follows this staged design but differs in two key respects: first, its continued pretraining and post-training are optimized around adversarial content safety rather than general helpfulness; second, it introduces a dedicated agentic RL stage that extends the model beyond single-turn classification to multi-step, tool-augmented safety workflows—a capability largely absent from prior domain-adaptation pipelines.

LLM safety.

As large language models (LLMs) are increasingly deployed in real-world systems, safety has become a core requirement alongside capability and reliability. Existing efforts have improved LLM safety through alignment training, refusal shaping, constitutional or policy-based supervision, and dedicated guard models for harmful content detection and policy enforcement (Ouyang et al., 2022; Bai et al., 2022; Ganguli et al., 2022; Inan et al., 2023; Meta, 2025). These approaches have substantially improved performance on public benchmarks and practical moderation tasks, but are often optimized for clean and explicit inputs rather than robustness under adaptive human attack. Our work is motivated by this gap and treats safety as an inherently adversarial problem.

Adversarial robustness.

A growing body of work shows that LLM safety failures often arise through adversarial interaction rather than only naturally occurring inputs. Jailbreak attacks, prompt injection, role-playing, suffix attacks, and multi-turn manipulation can induce unsafe behavior even in strongly aligned models (Perez and Ribeiro, 2022; Perez et al., 2022; Wei et al., 2023; Zou et al., 2023; Wallace et al., 2024; Mazeika and others, 2024). In content safety settings, harmful intent can also be hidden through obfuscation, euphemism, coded language, or contextual disguise, creating a mismatch between human readability and model detectability. These findings suggest that safety cannot be understood purely as static classification on naturally distributed data, and motivate explicitly incorporating adversarial considerations into both training and evaluation.

Safety evaluation.

A number of benchmarks have been proposed to assess toxicity, harmfulness, jailbreak robustness, and guard-model performance, including RealToxicityPrompts, HarmBench, and LlamaGuard-style safety benchmarks (Gehman et al., 2020; Perez et al., 2022; Inan et al., 2023; Mazeika and others, 2024; Meta, 2025). These resources have played an important role in standardizing safety measurement. At the same time, prior work has noted that benchmark performance may substantially overestimate deployment robustness, especially when models face transformed, indirect, or interaction-based attacks (Wei et al., 2023; Zou et al., 2023; Mazeika and others, 2024). This motivates evaluation frameworks that go beyond clean public benchmarks and include adversarially transformed and deployment-oriented settings.

Agentic LLMs.

Recent work has extended LLMs from single-turn text generation to agentic settings involving planning, tools and skills (Liang et al., 2026), retrieval, and multi-step interaction (Schick et al., 2023; Yao et al., 2023; Paranjape et al., 2023; Wang et al., 2024a). These capabilities are increasingly relevant to realistic safety workflows, where models may need to retrieve policies, invoke tools, analyze evidence, or coordinate decisions across multiple steps. Reinforcement learning has also been explored for improving tool use and trajectory-level decision making in such environments (Zeng et al., 2023). Our work is related to this line of research, but differs in focusing specifically on safety-oriented agentic capability: we seek not only general task completion, but robust and policy-consistent behavior in adversarial and deployment-oriented safety scenarios.

9 Conclusion

We present Yuvion LLM, a large language model built for adversarially robust content safety and broader AI safety. Motivated by the fundamental mismatch between general-purpose LLM training assumptions and the adversarial, policy-grounded demands of real-world content safety, we develop a progressive safety training paradigm spanning adversarially aware data construction, knowledge-enhanced continued pretraining, policy-grounded multi-task safety post-training, and safety-aware agentic reinforcement learning. We further introduce Yuvion LLM RiskEval (YLRE), a multi-level evaluation framework that systematically measures performance from general capability retention to adversarial robustness and real-world deployment value.

Experimental results show that Yuvion LLM significantly outperforms both general-purpose models and existing guard models under adversarial conditions, while maintaining strong general language competence. More broadly, this work suggests that content safety should be treated as a specialized foundation-model domain with its own training paradigm, evaluation methodology, and deployment constraints. The adversarial gap exposed here is not merely a minor engineering issue, but a more fundamental limitation rooted in how general-purpose models are trained and assessed. We hope Yuvion LLM and the accompanying framework serve as a useful reference for future research and practice in risk governance.

10 Limitations and Future Work

Despite promising results, Yuvion LLM has several limitations that warrant acknowledgment. First, while our adversarial training significantly improves robustness against known evasion strategies, the arms race between moderation systems and violating content producers is continuous—novel evasion patterns that fall outside the distribution of our training data may still pose challenges. Second, the current evaluation is conducted primarily in Chinese- and English-language safety scenarios; generalization to broader multilingual and cross-cultural moderation contexts remains to be systematically assessed. Third, although Yuvion LLM retains competitive general language competence, a modest performance gap relative to the base model persists on certain general benchmarks, suggesting room for further improvement in balancing domain specialization and general capability retention.

Future work will focus on three directions: (1) building a continuous red-teaming and closed-loop data refresh pipeline to keep pace with evolving evasion strategies in production environments; (2) extending the training and evaluation framework to multilingual safety scenarios; and (3) further strengthening the model’s agentic capabilities to support more complex, multi-step safety governance workflows, including policy retrieval, evidence attribution, and cross-system tool invocation.

Authors

Core Contributors:

Ting Ma^∗ , Xiufeng Huang^∗ , Benlei Cui^∗ , Xiaowen Xu^∗ , Shikai Qiu^∗ , Ruijie Jian^∗ , Hongxing Li^∗ , Guanghui Wang^∗ , Longtao Huang^∗ , Haiwen Hong^†∗¹¹1Correspondence to: honghaiwen.hhw@alibaba-inc.com.

(^∗ denotes equal contribution, and ^† denotes the corresponding author and project lead.)

Contributors:

Haolei Xu, Wenjing Jiang, Ziwen Xu, Zhaoyu Fan, Shaoxuan He, Chuxi Xiao, Yujian Li, Xinyue Chen, Chunyang Chai, Wenxuan Liu, Ziheng Wang, Dongjie Zhang, Yangfan Zhou, Libin Dong, Yupeng Cao, Xiaoqian Xia, Jing Wang, Zhe Jiang, Zhenan Ye, Guang Yang, Bin Liu, Wei Peng, Ziqiang Zhu, Meihui Lian, Kaiwen Lv Kacuila, Haidong Ding, Bingyu Zhu, Yan Wang, Hai Zhao, Xuan Jin, Wei Zhao, Pengfei Sun, Wei Wang, Huiming Zhang, Bin Li, Hui Xue

References

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. Aleman, D. Almeida, J. Altenschmidt, et al. (2023) GPT-4 technical report. Technical report OpenAI. Note: arXiv:2303.08774 Cited by: §1.
Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, D. Ganguli, T. Henighan, N. Joseph, et al. (2022) Constitutional ai: harmlessness from ai feedback. arXiv preprint arXiv:2212.08073. Cited by: §8, §8.
D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman (2019) Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of The 2019 World Wide Web Conference (WWW), pp. 491–500. Cited by: 7th item, §4.3.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1877–1901. Cited by: §1, §8.
A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. (2022) PaLM: scaling language modeling with pathways. arXiv preprint arXiv:2204.02311. Cited by: §8.
C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019) BoolQ: exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 2924–2936. Cited by: 4th item.
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457. Cited by: 1st item, §4.2.
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: 3rd item, §4.2.
O. Contributors (2023) OpenCompass: a universal evaluation platform for foundation models. Note: https://github.com/open-compass/opencompass Cited by: §B.1.
DeepSeek-AI (2025a) DeepSeek-r1: incentivizing reasoning capability in llms via reinforcement learning. External Links: Link Cited by: §5.1.
DeepSeek-AI (2025b) DeepSeek-v3.2: pushing the frontier of open large language models. External Links: Link Cited by: §5.1.
J. Deng, J. Zhou, H. Sun, C. Zheng, F. Mi, H. Meng, and M. Huang (2022) COLD: a benchmark for chinese offensive language detection. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 11580–11599. Cited by: 2nd item, §4.3.
D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 2368–2378. Cited by: 4th item.
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: §8.
D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al. (2022) Red teaming language models to reduce harms: methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858. Cited by: §8.
S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith (2020) RealToxicityPrompts: evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356–3369. Cited by: §8.
A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao, X. Du, et al. (2024) Are we done with mmlu?. arXiv preprint arXiv:2406.04127. Cited by: 1st item, §4.2.
S. Ghosh, P. Varshney, E. Galinkin, and C. Parisien (2024) AEGIS: online adaptive ai content safety moderation with ensemble of llm experts. arXiv preprint arXiv:2404.05993. Cited by: §4.3.
GLM-5-Team (2026) GLM-5: from Vibe Coding to Agentic Engineering. External Links: 2602.15763, Link Cited by: §5.1.
S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith (2020) Don’t stop pretraining: adapt language models to domains and tasks. In Proceedings of ACL, Cited by: §8.
T. Hartvigsen, S. Gabriel, H. Palangi, M. Sap, D. Ray, and E. Kamar (2022) ToxiGen: a large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3309–3326. Cited by: 5th item, §4.3.
B. He, N. Ding, C. Qian, J. Deng, G. Cui, L. Yuan, H. Hong, H. Gao, L. Huang, H. Chen, et al. (2025a) The right time matters: data arrangement affects zero-shot generalization in instruction tuning. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 222–243. Cited by: §8.
B. He, W. Zhang, J. Song, C. Qian, Z. Fu, B. Sun, N. Ding, H. Hong, L. Huang, H. Xue, et al. (2025b) Air: a systematic analysis of annotations, instructions, and response pairs in preference dataset. arXiv preprint arXiv:2504.03612. Cited by: §8.
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021a) Measuring massive multitask language understanding. International Conference on Learning Representations. Cited by: 1st item, §4.2.
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021b) Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874. Cited by: 3rd item.
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, j. lei, Y. Fu, M. Sun, and J. He (2023) C-eval: a multi-level multi-discipline chinese evaluation suite for foundation models. In Advances in Neural Information Processing Systems, Vol. 36, pp. 62991–63010. Cited by: 2nd item, §4.2.
H. Inan, K. Upasani, J. Chi, et al. (2023) Llama guard: llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674. Cited by: §1, §1, §8, §8.
Jigsaw/Conversation AI (2018) Toxic comment classification challenge. Note: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge Cited by: 6th item, §4.3.
M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1601–1611. Cited by: 1st item, §4.2.
Kimi Team (2026) Kimi K2.5: Visual Agentic Intelligence. External Links: 2602.02276, Link Cited by: §5.1.
H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin (2023a) CMMLU: measuring massive multitask language understanding in chinese. arXiv preprint arXiv:2306.09212. Cited by: 2nd item, §4.2.
M. Li, F. Song, B. Yu, H. Yu, Z. Li, F. Huang, and Y. Li (2023b) API-Bank: a comprehensive benchmark for tool-augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3102–3116. Cited by: 1st item, §4.2, §5.2.1.
Y. Liang, R. Zhong, H. Xu, C. Jiang, Y. Zhong, R. Fang, J. Gu, S. Deng, Y. Yao, M. Wang, et al. (2026) Skillnet: create, evaluate, and connect ai skills. arXiv preprint arXiv:2603.04448. Cited by: §8.
J. Lin, M. Liu, X. Huang, J. Li, H. Hong, X. Yuan, Y. Chen, L. Huang, H. Xue, R. Duan, Z. Chen, Y. Fu, D. Li, L. Gao, and Y. Yang (2026) YuFeng-xguard: a reasoning-centric, interpretable, and flexible guardrail model for large language models. arXiv preprint arXiv:2601.15588. Cited by: §A.2, 3rd item, §4.3, §5.2.2, §5.2.2.
K. Liu, S. Cheng, B. Tian, X. Liang, Y. Yin, M. Han, N. Zhang, B. Hooi, X. Chen, and S. Deng (2025) ChineseHarm-bench: a chinese harmful content detection benchmark. arXiv preprint arXiv:2506.10960. Cited by: 1st item, §4.3.
T. Markov, C. Zhang, S. Agarwal, T. Eloundou, T. Lee, S. Adler, A. Jiang, and L. Weng (2022) A holistic approach to undesired content detection. arXiv preprint arXiv:2208.03274. Cited by: 3rd item, §4.3.
B. Mathew, P. Saha, S. M. Yimam, C. Biemann, P. Goyal, and A. Mukherjee (2021) HateXplain: a benchmark dataset for explainable hate speech detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 14867–14875. Cited by: 4th item, §4.3.
M. Mazeika et al. (2024) HarmBench: a standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249. Cited by: §1, §8, §8.
Meta (2025) Llama guard 4 model card. Note: https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-4/ Cited by: §1, §1, §5.1, §8, §8.
T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018) Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2381–2391. Cited by: 1st item, §4.2.
MiniMax Team (2026) MiniMax-M2.5. External Links: Link Cited by: §5.1.
OpenAI (2026) GPT-5.4. Note: https://openai.com/ Cited by: §5.1.
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe (2022) Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, Vol. 35, pp. 27730–27744. Cited by: §8, §8.
A. Paranjape, S. Lundberg, M. T. Ribeiro, H. Hajishirzi, et al. (2023) ART: automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014. Cited by: §8.
S. G. Patil, H. Mao, C. Cheng-Jie Ji, F. Yan, V. Suresh, I. Stoica, and J. E. Gonzalez (2024) The berkeley function calling leaderboard (bfcl): from tool use to agentic evaluation of large language models. In Advances in Neural Information Processing Systems, Cited by: 2nd item, §4.2, §5.2.1.
E. Perez, S. Ringer, K. Lukošiūtė, K. Nguyen, E. Chen, S. Heiner, L. Zettlemoyer, et al. (2022) Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251. Cited by: §1, §8, §8.
F. Perez and I. Ribeiro (2022) Ignore previous prompt: attack techniques for language models. arXiv preprint arXiv:2211.09527. Cited by: §8.
C. Qian, E. C. Acikgoz, Q. He, H. Wang, X. Chen, D. Hakkani-Tür, G. Tur, and H. Ji (2025) ToolRL: reward is all tool learning needs. arXiv preprint arXiv:2504.13958. Cited by: §3.5.
Qwen Team and Qwen Team (2026) Qwen3.6-Plus: towards real world agents. External Links: Link Cited by: §5.1.
Qwen Team (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: §1, §1, §5.1, §5.1, §8.
Qwen Team (2026) Qwen3.5: towards native multimodal agents. External Links: Link Cited by: §5.1, §5.1.
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. In Advances in Neural Information Processing Systems, Vol. 36, pp. 53728–53741. Cited by: §8.
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. Cited by: §8.
P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 784–789. Cited by: 4th item.
D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023) GPQA: a graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022. Cited by: 1st item, §4.2.
K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi (2020) WinoGrande: an adversarial winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8732–8740. Cited by: 4th item.
T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023) Toolformer: language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, Vol. 36, pp. 68539–68551. Cited by: §1, §8.
Z. Shao, P. Wang, Q. Zhu, et al. (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §8.
K. Sun, D. Yu, D. Yu, and C. Cardie (2020) Investigating prior knowledge for challenging chinese machine reading comprehension. Transactions of the Association for Computational Linguistics 8, pp. 141–155. Cited by: 2nd item, §4.2.
M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei (2022) Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261. Cited by: 4th item, §4.2.
A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158. Cited by: 4th item.
Tongyi DeepResearch Team, B. Li, B. Zhang, D. Zhang, F. Huang, G. Li, G. Chen, H. Yin, J. Wu, J. Zhou, et al. (2025) Tongyi DeepResearch technical report. arXiv preprint arXiv:2510.24701. Cited by: 3rd item, §4.2, §5.2.1.
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. Cited by: §8.
E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel (2024) The instruction hierarchy: training llms to prioritize privileged instructions. arXiv preprint arXiv:2404.13208. Cited by: §8.
X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji (2024a) Executable code actions elicit better llm agents. arXiv preprint arXiv:2402.01030. Cited by: §1, §8.
Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. (2024b) MMLU-pro: a more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574. Cited by: 1st item, §4.2.
A. Wei, N. Haghtalab, and J. Steinhardt (2023) Jailbroken: how does llm safety training fail?. In Advances in Neural Information Processing Systems, Vol. 36, pp. 80079–80110. Cited by: §1, §8, §8.
L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, et al. (2020) CLUE: a chinese language understanding evaluation benchmark. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pp. 4762–4772. Cited by: 2nd item, §4.2.
Z. Xu, H. Hong, L. Yu, B. Cui, L. Huang, H. Xue, and N. Zhang (2026a) How lora remembers? a parametric memory law for llm finetuning. arXiv preprint arXiv:2605.30260. Cited by: §8.
Z. Xu, C. Wu, H. Sun, H. Hong, M. Wang, Y. Yao, L. Huang, H. Xue, S. Deng, Z. Chu, et al. (2026b) Why steering works: toward a unified view of language model parameter dynamics. arXiv preprint arXiv:2602.02343. Cited by: §8.
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023) ReAct: synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), Cited by: §1, §8.
X. Yuan, J. Li, D. Wang, Y. Chen, X. Mao, L. Huang, J. Chen, H. Xue, X. Liu, W. Wang, K. Ren, and J. Wang (2024) S-eval: automatic and adaptive test generation for benchmarking safety evaluation of large language models. arXiv preprint arXiv:2405.14191. Cited by: §4.3.
R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 4791–4800. Cited by: 4th item.
A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang (2023) AgentTuning: enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823. Cited by: §8.
Z. Zhang, L. Lei, L. Wu, R. Sun, Y. Huang, C. Long, X. Liu, X. Lei, J. Tang, and M. Huang (2023) SafetyBench: evaluating the safety of large language models. arXiv preprint arXiv:2309.07045. Cited by: 8th item, §4.3.
H. Zhao, C. Yuan, F. Huang, X. Hu, Y. Zhang, A. Yang, B. Yu, D. Liu, J. Zhou, J. Lin, et al. (2025) Qwen3Guard technical report. arXiv preprint arXiv:2510.14276. Cited by: §5.1.
C. Zheng, M. Huang, and A. Sun (2019) ChID: a large-scale chinese idiom dataset for cloze test. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 778–787. Cited by: 2nd item, §4.2.
A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson (2023) Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043. Cited by: §8, §8.

Appendix A Detailed Benchmark Descriptions

This appendix provides additional details on the benchmark suites used in the Yuvion LLM RiskEval. For each benchmark group, we summarize its evaluation purpose, task format, metric, and the aspect of model quality it is intended to measure.

A.1 Open-source General Benchmarks

The open-source general benchmark suite is used to evaluate whether Yuvion preserves broad general-purpose capability after domain-specific continued pretraining and safety-oriented post-training. The suite contains more than 30 public evaluation sets and is organized into two groups: general-purpose benchmarks and agentic benchmarks. Unless otherwise specified, we report Accuracy as the primary metric.