CLEAR-MoE: Shared-Basis Expert Extraction from Frozen Vision Transformers via Calibration-Driven Layer Selection

Md Irtiza Hossain Humaira Ayesha Junaid Ahmed Sifat

Abstract

We present CLEAR-MoE, a four-phase post-training pipeline that converts a frozen pretrained vision transformer (ViT) into a sparse mixture-of-experts (MoE) model without updating backbone weights. The pipeline (i) scores FFN layers by sparsity, clusterability, and output sensitivity; (ii) decomposes selected layers into a shared low-rank SVD basis plus per-cluster residual experts ( $k$ -means); (iii) fits lightweight routers supervised by cluster labels; and (iv) dispatches tokens via pluggable CUDA backends. On Imagenette with DeiT-Small, CLEAR-MoE retains 99.9% of dense accuracy (86.70 $\pm$ 0.02% vs. 86.73%). Our ablations isolate a consistent empirical finding: the shared SVD basis is the dominant accuracy-preserving component. Random routing, learned routing, and three router architectures yield numerically similar results spanning at most 0.06 pp (86.62–86.68%), and accuracy remains stable across SVD rank, expert count $E\in\{2,\ldots,8\}$ , calibration size $N\in\{50,\ldots,500\}$ , and random seed. This finding generalizes across five ViT backbones (DeiT-T/S/B, ViT-S/B, 5.7M–86.6M parameters), with $|\Delta|{\leq}0.10$ pp across all configurations. On a GTX 960, routing and scatter-gather overhead make CLEAR-MoE FFN 1.3–1.7 $\times$ slower than dense. A dispatch microbenchmark indicates that routing (AI = 1.9 FLOPs/B) is an order of magnitude more memory-bound than expert GEMMs (AI = 22.7), identifying fused dispatch kernels as a plausible optimization target.

I Introduction

Vision transformers (ViTs) [4] devote roughly two-thirds of inference FLOPs to feed-forward network (FFN) sub-layers that execute the same MLP computation for every token, regardless of semantic content [10]. Sparse mixture-of-experts (MoE) architectures [5] address this by routing each token to a subset of specialized sub-networks, reducing active compute proportional to $k/E$ . However, existing vision MoE models [10, 2] require training from scratch, which is a resource barrier for practitioners who already hold a pretrained checkpoint.

Post-training expert extraction converts a dense FFN into experts without retraining [13, 11, 1]. Despite recent progress, three gaps remain: (1) most methods convert all FFN layers, ignoring sensitivity variation across depth; (2) disjoint expert assignment discards shared low-level structure, destabilizing downstream transfer; (3) speedup claims are reported in MACs rather than wall-clock latency on real hardware.

CLEAR-MoE addresses all three with the following contributions:

•

Calibration-driven layer scoring: composite score $\mathcal{S}(l)=0.4\,\text{sparsity}+0.4\,\text{clusterability}-0.2\,\text{sensitivity}$ selects FFN layers that cluster well and tolerate perturbation, automatically excluding high-sensitivity layers (e.g., block 0, sensitivity $=$ 0.946).
•

Shared-basis decomposition: fc2 is decomposed into a shared truncated-SVD basis plus per-cluster residuals; fc1 is shared identically across all experts. This preserves common visual structure while allowing per-cluster specialization.
•

Hardware-transparent latency study: all timing on a single GTX 960 (4 GB, 112 GB/s) with cuda.synchronize()-fenced p50 reporting (100 forward passes), indicating that routing overhead (not expert arithmetic) is a likely bottleneck.
•

Comprehensive ablation: decomposition (D0–D8), layer selection (L0–L10), dispatch backends, expert count, SVD rank, router architecture, calibration size, and random seed, establishing which design choices actually matter.

Scope. All measured results use a single GTX 960. No claims are made about data-center GPUs, distributed setups, or larger datasets. Multi-device throughput projections are modelled analytically using PCIe Gen3 $\times$ 16 bandwidth; no NCCL runs were performed.

II Related Work

Post-training expert extraction. MoEfication [13] first showed that clustering neuron activation patterns partitions a pretrained FFN into experts without retraining. D2DMoE [11] extended this to dynamic- $k$ routing, achieving 30% latency reduction on an A100 for ViT-B/ImageNet-1K. Berisha et al. [1] used variance-based neuron grouping to recover 98% of dense performance with 36.3% MAC reduction for DeiT-B. CLEAR-MoE differs by (a) selecting layers via a composite sensitivity-aware score, (b) preserving shared structure via SVD decomposition, and (c) providing a full wall-clock dispatch study on consumer hardware. Table I positions these methods side-by-side. Sparse Upcycling [8] converts dense language-model checkpoints to MoE by cloning FFN weights and training a load-balancing router via continued pre-training; unlike CLEAR-MoE, backbone weights are updated, requiring gradient access and sufficient unlabelled data.

TABLE I: Post-training MoE extraction: method comparison.

\dagger

= MAC reduction (not wall-clock).

\ddagger

= minimal fine-tuning reported by original paper. CLEAR-MoE: zero backbone weight updates.

Method	Backbone	HW	Acc.	Latency	Structure
MoEfication [13]	ViT	N/A	$\sim$ 99%	N/A	Disjoint $k$ -means
D2DMoE [11]^‡	ViT-B/16	A100	$\sim$ 99%	$-$ 30%	Dynamic- $k$ disj.
Berisha [1]^‡	DeiT-B	N/A	98%	$-$ 36.3% $\dagger$	Var.-based disj.
CLEAR-MoE	DeiT/ViT (5)	GTX 960	99.9%	$+$ 1.3–1.7 $\times$	Shared + res.
D2DMoE latency on A100 (2 TB/s); CLEAR-MoE on GTX 960 (112 GB/s). Bandwidth 18 $\times$ lower.

Vision MoE training. V-MoE [10] demonstrated that sparse patch routing matches dense quality at $\sim$ 50% active compute for 15B-parameter ViTs. AdaMV-MoE [2] showed late transformer layers benefit most from expertization, a heuristic our composite score can override when sensitivity signals contradict it. DynamicViT [9] and A-ViT [12] achieve token-level sparsification via halting gates, complementary to our FFN-level approach.

MoE runtime systems. TUTEL [6] achieves 3.11 $\times$ speedup via 2D all-to-all on 128 GPUs. Brainstorm [3] shows profile-guided dispatch yields 5 $\times$ speedup over DeepSpeed for SwinV2-MoE. Lancet [7] reduces non-overlapping communication by 77% via whole-graph overlap. These systems target high-bandwidth multi-GPU clusters; CLEAR-MoE studies the single-consumer-GPU regime where bandwidth is 18 $\times$ lower.

III Method

CLEAR-MoE converts a frozen pretrained ViT into a selectively expertised model through four sequential phases (Fig. 1). No backbone weights are updated.

Refer to caption — Figure 1: CLEAR-MoE pipeline. A frozen pretrained ViT yields calibration activations; scored FFN layers are decomposed into a shared SVD basis plus per-cluster residual experts; lightweight routers are fitted on $k$ -means labels; inference uses pluggable CUDA dispatch backends.

III-A Phase 1: Calibration Pass

Forward $N_{\text{cal}}$ representative images through the frozen model, recording pre-FFN activation tensors at each of the $L$ selected layers via PyTorch hooks. For DeiT-Small ( $d{=}384$ , $N_{\text{cal}}{=}200$ , 197 tokens/image), this produces a $[39{,}400,\,384]$ activation matrix per layer.

III-B Phase 2: Layer Scoring and Selection

Each FFN layer $l$ receives a composite score:

\mathcal{S}(l)=0.4\,\text{sparsity}(l)+0.4\,\text{clusterability}(l)-0.2\,\text{sensitivity}(l)

(1)

where sparsity is the fraction of activation magnitudes below 0.01; clusterability is the Silhouette score of $k$ -means with $k{=}E$ on the calibration activations; and sensitivity is the normalized logit change when the layer’s FFN output is zeroed. Sensitivity is subtracted because high sensitivity implies high degradation risk if expertized. The top- $k{=}L/2$ layers by $\mathcal{S}$ are selected; the remaining layers retain their original dense FFN. Table II reports per-layer scores for DeiT-Small (200-image calibration); block 0 ranks last due to sensitivity $=$ 0.946.

TABLE II: Per-layer composite scores, DeiT-Small (200-image calibration). Composite

=0.4{\cdot}\text{sp}+0.4{\cdot}\text{cl}-0.2{\cdot}\text{se}

. Bold = selected (top-

k{=}6

Block	Sparsity	Clusterab.	Sensitivity	Composite
blocks.1	0.180	0.526	0.171	0.248
blocks.4	0.152	0.518	0.109	0.246
blocks.5	0.152	0.520	0.113	0.246
blocks.3	0.149	0.519	0.113	0.245
blocks.6	0.147	0.522	0.123	0.243
blocks.2	0.141	0.518	0.147	0.234
blocks.7	0.147	0.519	0.162	0.234
blocks.11	0.134	0.516	0.149	0.230
blocks.9	0.135	0.518	0.167	0.228
blocks.8	0.142	0.512	0.175	0.227
blocks.10	0.171	0.514	0.252	0.224
blocks.0	0.210	0.516	0.946	0.101

III-C Phase 3: Expert Extraction

For each selected layer, fc2 ( $W_{2}\in\mathbb{R}^{d\times d_{\text{ffn}}}$ ) is decomposed via truncated SVD; fc1 is shared identically across all experts:

y=\underbrace{W_{\text{shared}}}_{\text{SVD-}r\text{ approx.}}\!\cdot z+\underbrace{W_{e^{*}}^{\text{res}}}_{\text{cluster residual}}\!\cdot z,\;\;z=\text{GELU}(\text{fc1}(x))

(2)

$W_{\text{shared}}=U_{r}\Sigma_{r}V_{r}^{\top}$ retains the top- $r$ singular values (default $r$ = 50% of matrix rank, $r{=}d/2{=}192$ for DeiT-Small where $W_{2}{\in}\mathbb{R}^{384\times 1536}$ has maximum rank 384; materialised as a dense matrix before inference, with no factorised compute reduction). $k$ -means with $k{=}E$ partitions calibration tokens into $E$ clusters; the per-cluster residual is $W_{e}^{\text{res}}=(W_{2}-W_{\text{shared}})\cdot s_{e}$ , where $s_{e}$ scales by the ratio of cluster-mean to global-mean activation norm. Because $s_{e}$ is a scalar, all residual experts share the same directional component $(W_{2}-W_{\text{shared}})$ , differing only in magnitude; a misrouted token therefore receives a different scaling but the same directional correction. During inference, $z$ is computed once and reused for both the shared and residual paths.

III-D Phase 4: Router Fitting

A router $g_{\theta}$ predicts the cluster assignment of each token. The $k$ -means labels from Phase 3 provide supervised targets: router training minimizes cross-entropy between $g_{\theta}(x)$ and cluster label $c$ over the calibration tokens, using AdamW ( $\text{lr}{=}10^{-3}$ , 5 epochs, cosine decay). Three router architectures are evaluated: Linear ( $d\times E$ parameters, 9.2K total), MLP (hidden layer $d/6$ , 149K), and Adaptive (confidence-threshold top-1/top-2 escalation, same parameter count as Linear).

III-E Dispatch Backends

Three CUDA backends are implemented: (1) Naive: boolean-mask loop over experts (reference). (2) Grouped: tokens are sorted by expert index; one GEMM per contiguous sub-batch. No padding; throughput stable under imbalance. (3) cuBLAS: expert sub-batches are padded to the largest and stacked into a 3D tensor dispatched to a batched GEMM call. Fastest at balanced load; degrades at high imbalance from padding waste.

IV Experiments

Setup. DeiT-Small (22M parameters, $d{=}384$ , $d_{\text{ffn}}{=}1536$ , 12 blocks) evaluated on Imagenette (3,925 validation images, 10-class ImageNet subset) with 200-image calibration set, $E{=}4$ experts, AdamW router training (5 epochs, lr $10^{-3}$ , cosine decay). All timing: p50 of 100 forward passes, batch size 8, GTX 960 (4 GB, 112 GB/s), cuda.synchronize()-fenced.

IV-A Decomposition Ablation (D0–D8)

Table III ablates what each architectural choice contributes to accuracy and latency. Configurations D0–D8 span the design space from dense baseline to shared-basis CLEAR-MoE FFN, isolating the effect of the shared basis, the routing mechanism, and the expert assignment strategy.

TABLE III: Decomposition ablation (DeiT-Small, Imagenette,

E{=}4

, last-

k

layers, batch=8, GTX 960).

\Delta

Top-1 relative to D0. D4/D5 differ only in random seed. D6 and D8 are the same architecture: D6 is the cold-kernel first dispatch; D8 is measured after 3 warm-up passes (steady-state). The 21 ms gap reflects CUDA JIT overhead on first run.

ID	Sh. fc1	Sh. fc2	Router	Top-1	$\boldsymbol{\Delta}$ pp	ms (p50)
D0	–	–	Dense	86.73%	$+$ 0.00	59.6
D1	✓	SVD rank- $r$	None	86.55%	$-$ 0.18	59.7
D2	✗	Disjoint	Random	66.42%	$-$ 20.31	49.4
D3	✓	Full res.	None	86.73%	$+$ 0.00	59.9
D4/D5	✓	$k$ -means res.	Random	86.62–65%	$-$ 0.08–11	92.2
D6/D8	✓	$k$ -means res.	Linear	86.65%	$-$ 0.08	78.8–100.3
D7	✗	Disjoint	$k$ -means	65.22%	$-$ 21.51	48.0
✗ = disjoint (no shared fc1); ✓ = shared across all tokens.

The shared SVD basis is the dominant accuracy-preserving component. D3 (shared basis + full global residual, no routing) achieves exact reconstruction: algebraically, $W_{\text{shared}}{\cdot}z+(W_{2}{-}W_{\text{shared}}){\cdot}z=W_{2}{\cdot}z$ , matching D0 at 86.73%. D2 and D7 (disjoint experts, no shared fc1) collapse to 65–66% regardless of routing strategy: disjoint expert weights estimated from 200 calibration images fail to generalize. In contrast, D4/D5 (shared basis, random routing) retain 86.62–65%, only 0.08–0.11 pp below D0; D6/D8 (shared basis, learned routing at 95% accuracy) achieve an identical $-$ 0.08 pp. Routing quality has limited measured impact once the shared basis is present.

No latency reduction on GTX 960. CLEAR-MoE FFN (D4–D8) executes shared fc1+fc2 on all tokens plus residual paths that collectively process all $T$ tokens ( $T/E$ per expert $\times$ $E$ experts), adding approximately one full fc2-equivalent computation; total arithmetic is $\approx\!1.5\times$ dense FFN. On bandwidth-limited memory (112 GB/s), this incurs 1.3–1.7 $\times$ latency overhead (78–100 ms vs. 59.6 ms). Only disjoint experts (D2/D7) are faster (48–49 ms) by eliminating the shared path entirely, at the cost of 21 pp accuracy.

IV-B Layer-Selection Ablation (L0–L10)

Table IV compares all 11 layer-selection strategies on the same MoE configuration ( $E{=}4$ , $k{=}6$ of 12 FFN layers, composite scoring).

TABLE IV: Complete layer-selection ablation (L0–L10). All: DeiT-Small,

E{=}4

k{=}6

layers, linear router, 200-image calibration. L1 avg of 3 seeds.

\Delta

relative to L0.

ID	Strategy	Layers	Top-1	$\boldsymbol{\Delta}$ pp	p50 ms
L0	Dense (no MoE)	–	86.73%	$+$ 0.00	57.0
L1	Random ( $n{=}3$ )	varies	86.72 $\pm$ 0.01%	$-$ 0.01	78.5 $\pm$ 0.8
L2	First $k$	0–5	86.60%	$-$ 0.13	79.3
L3	Last $k$	6–11	86.68%	$-$ 0.05	80.3
L4	Alternating (odd)	1,3,5,7,9,11	86.83%	$+$ 0.10	87.2
L5	Sparsity only	0,1,3,4,5,10	86.73%	$+$ 0.00	99.1
L6	Clusterability	1,2,3,5,6,7	86.73%	$+$ 0.00	88.8
L7	High-sensitivity	0,1,7,8,9,10	86.65%	$-$ 0.08	102.5
L8	Sp.+Cl.	0,1,4,5,6,10	86.62%	$-$ 0.10	88.8
L9	Cl. $-$ Se.	2,3,4,5,6,11	86.68%	$-$ 0.05	94.0
L10	Composite (ours)	1–6	86.68%	$\boldsymbol{-}$ 0.05	82.4

All 11 policies span only 0.21 pp (86.62–86.83%), confirming that accuracy is policy-insensitive. The shared basis preserves quality regardless of which 6 blocks are expertized. The composite score’s practical value is principled block 0 exclusion (sensitivity = 0.946): L2 (first- $k$ ) includes block 0 and incurs the worst $-$ 0.13 pp drop. L10 avoids it, achieving 82.4 ms, comparable to L3 (last- $k$ , 80.3 ms) while additionally excluding high-sensitivity blocks. L7 (high-sensitivity, deliberately selecting the most sensitive blocks as a stress-test baseline) achieves only $-$ 0.08 pp accuracy loss but is the slowest policy at 102.5 ms, demonstrating that layer choice affects latency more than accuracy.

IV-C Dispatch Strategy Benchmark

We benchmark three CUDA backends ( $B{=}8$ , $T{=}1568$ , $E{=}4$ , GTX 960) under four imbalance levels (Table V). At balanced load, cuBLAS peaks at 728 K tok/s (2.79 $\times$ CPU serial). At 80% imbalance, padding waste collapses cuBLAS to 558 K tok/s ( $-$ 23%). Grouped dispatch remains stable at 665–747 K tok/s across all imbalance levels. Recommendation: cuBLAS for predictably balanced load; Grouped otherwise.

TABLE V: Dispatch micro-benchmark: 0% vs. 80% imbalance (

B{=}8

T{=}1568

E{=}4

, GTX 960).

Backend	ms	K tok/s	ms	K tok/s
	0% imb.		80% imb.
CPU Serial	6.01	261	5.96	263
Naive	3.67	428	3.59	437
Grouped	2.36	665	2.14	733
cuBLAS	2.15	728	2.81	558

IV-D Roofline Analysis

Fig. 2 places each CLEAR-MoE operation on the GTX 960 roofline (peak FP32: 2.4 TFLOP/s, BW: 112 GB/s, ridge: 21.4 FLOPs/B). Dense FFN (AI = 74.3) and expert GEMMs (AI = 22.7) are compute-bound. The router gate (AI = 1.9) and token sort (AI = 1.0) are both deep in the memory-bound regime: 11 $\times$ and 21 $\times$ below the ridge point. This indicates that routing and token sort are strongly memory-bound (11–21 $\times$ below the ridge point), making fused dispatch kernels a plausible high-leverage optimization. Expert GEMMs are already near the compute ceiling and offer diminishing returns.

IV-E Hyperparameter Sensitivity

We ablate SVD rank, expert count, router architecture, calibration size, and random seed (Table VI); findings are:

TABLE VI: Hyperparameter sensitivity summary (DeiT-Small, Imagenette). All ranges

\leq

0.41 pp.

${}^{*}E{\in}\{2,4,8\}$ spans 0.15 pp; $E{=}16$ drops 0.26 pp (calibration instability: 200 images insufficient for 16 clusters).
Study	Range tested	Top-1 span	$\boldsymbol{\Delta}$ pp
SVD rank $r$	16–256	86.62–86.75%	0.13
Expert count $E$	2–16	86.39–86.80%^∗	0.41
Router arch.	Lin/MLP/Adaptive	86.68% (all)	0.00
Calibration $N$	50–500	86.55–86.70%	0.15
Random seed	42/123/456	86.70 $\pm$ 0.02%	0.05

SVD rank ( $r\in\{16,32,64,96,128,192,256\}$ ): accuracy range is 86.62–86.75% (0.13 pp). Reconstruction error decreases monotonically from 0.91 to 0.29, but this improvement does not translate to accuracy. Even $r{=}16$ captures the shared basis sufficiently.

Expert count ( $E\in\{2,4,8,16\}$ ): accuracy is 86.65–86.80% for $E\in\{2,4,8\}$ ; $E{=}16$ drops 0.26 pp as 200 calibration images provide insufficient statistics to estimate 16 fine-grained clusters. Latency grows $\sim$ 9 ms per doubling of $E$ (73 ms at $E{=}2$ to 98 ms at $E{=}16$ ). No empty experts appear at any count.

Router architecture: all three routers (Linear, MLP, Adaptive) achieve identical 86.68% Top-1. MLP routing accuracy reaches 0.980 vs. 0.925 for Linear, a 5.5 pp improvement in routing precision with zero downstream benefit. The gap between random routing (D4/D5, Table III) and all learned routers is at most 0.06 pp (roughly two changed predictions out of 3,925). While this numerically exceeds the 3-seed standard deviation of $\pm$ 0.02 pp (Table VI, row “Random seed”), a paired significance test would be needed to confirm a reliable effect; we treat 0.06 pp as practically negligible. Fig. 3 provides a mechanistic view: despite D6 (learned router) reducing mean per-token routing entropy $3.5\times$ versus D5 (random, $\ln 4{\approx}1.386$ nats), the accuracy impact is negligible.

Fig. 4 plots accuracy gap against router training across 13 configurations (D5 at $x{=}0$ , no trained router); no strong monotonic relationship is visible, consistent with the hypothesis that shared-basis quality governs accuracy regardless of routing quality.

Calibration size ( $N\in\{50,100,200,500\}$ , multiple subsets for $N{<}500$ ): accuracy spans 86.55–86.70% across all sizes; $N{=}50$ yields 86.65–86.70% (within 0.15 pp of $N{=}500$ ). Router accuracy climbs from 0.83 to 0.95 as $N$ grows, without accuracy benefit.

Reproducibility: 86.70 $\pm$ 0.02% Top-1 across seeds 42, 123, 456 (full pipeline, composite scoring, $E{=}4$ , $N{=}200$ ).

IV-F Cross-Backbone Generalization

Table VII tests whether the shared-basis finding extends across architectures. We apply the full CLEAR-MoE pipeline to five backbones spanning 5.7–86.6 M parameters and two pretraining lineages (DeiT, ViT); all other settings are identical to the DeiT-S ablation ( $E{=}4$ , $N{=}200$ , seed 42).

TABLE VII: Cross-backbone generalization (Imagenette,

E{=}4

N{=}200

, seed 42). D5 = random routing; D6 = learned linear router.

|\Delta|{\leq}0.10

pp across all 10 configurations.

^†Single-seed result for this backbone; per-backbone seed variance not characterised.
Backbone	Params	Dense	D5 $\Delta$ pp	D6 $\Delta$ pp	D6 Rtr. Acc	D6 Skew
DeiT-T/16	5.7 M	75.92%	$-$ 0.05	$+$ 0.05^†	0.913	0.090
DeiT-S/16	22.1 M	86.73%	$-$ 0.10	$-$ 0.05	0.925	0.099
ViT-S/16	22.1 M	76.23%	$-$ 0.05	$-$ 0.03	0.956	0.139
DeiT-B/16	86.6 M	91.77%	$-$ 0.03	$\pm$ 0.00	0.927	0.094
ViT-B/16	86.6 M	85.38%	$-$ 0.08	$-$ 0.10^‡	0.967	0.150
^‡Highest load skew (0.150); see Discussion.

Routing differences are numerically small. Learned routing (D6) provides a marginal numerical advantage over random routing (D5) in 4 of 5 backbones; however, the per-backbone improvements range from 0.02 to 0.10 pp, too small to confirm without paired significance testing. The one exception is ViT-B/16, where D5 ( $-$ 0.08 pp) outperforms D6 ( $-$ 0.10 pp): The ViT-B result coincides with the highest observed load skew (0.150, vs. 0.094 for DeiT-B), which may contribute to the difference; no ablation was performed to confirm a causal link. Taken together, these results are consistent with the hypothesis that routing quality has limited impact once the shared SVD basis is present; a definitive claim would require paired prediction-level testing across additional seeds. Fig. 5 visualises all 10 configurations.

V Discussion

Why routing differences are small. All residual experts are scalar-conditioned variants of the same matrix $(W_{2}-W_{\text{shared}})$ , so mis-routing changes only the scaling factor applied to a token’s correction, not its direction. The 0.06 pp gap between random and learned routing is the direct consequence. Cross-backbone results (Table VII) are consistent across 4 of 5 architectures; a linear router ( $<$ 10K parameters) appears adequate. Expert Choice routing [14], where experts select their tokens rather than the reverse, achieves better load balance in language models; on DeiT-S, however, random and learned routing differ by at most 0.06 pp, and all cross-backbone random-routing losses remain within 0.10 pp of dense, supporting the hypothesis that routing strategy is secondary to decomposition quality.

Why CLEAR-MoE is slower on the GTX 960. CLEAR-MoE FFN computes shared fc1+fc2 on all tokens, then adds residual paths that each process $T/E$ tokens but collectively span all $T$ tokens across $E$ experts, adding approximately one full fc2-equivalent computation. Total arithmetic is $\approx\!1.5\times$ dense FFN. On a bandwidth-limited GPU (112 GB/s vs. 2 TB/s for an A100), loading shared weights and expert-residual weights across memory dominates. The dispatch micro-benchmark isolates this: routing and token sort (AI = 1.0–1.9) are 11–21 $\times$ below the compute ridge. Disjoint experts (D2/D7) are faster (48 ms) by eliminating shared paths, but sacrifice 21 pp accuracy. Analytical modelling suggests $\geq$ 800 GB/s bandwidth may be needed for latency parity; A100 (2 TB/s) or H100 (3.35 TB/s) class hardware are the plausible candidates.

Practical design guidelines. When to use CLEAR-MoE. CLEAR-MoE may be well suited when accuracy preservation is non-negotiable and backbone fine-tuning is infeasible: the evaluated DeiT and ViT backbones can be converted with 200 calibration images and no GPU cluster. The method is not competitive with disjoint-expert approaches if latency reduction is the primary goal on consumer hardware.

Choosing $E$ and $r$ . Expert count $E{=}4$ provides a reasonable empirical trade-off: accuracy remains stable from $E{=}2$ to $E{=}8$ , and $E{=}4$ balances router overhead with residual expressiveness. For $E{=}16$ , our analytical model suggests $\geq$ 800 calibration images may be needed for stable cluster estimates; this was not directly validated. SVD rank $r$ can be set as low as 16 without consistent measured accuracy reduction on Imagenette; the default $r{=}d/2{\approx}192$ for DeiT-Small is conservative and may waste compute on deep layers.

Choosing the dispatch backend. cuBLAS is optimal only for perfectly balanced expert load, which is common in classification but rare in detection/segmentation. For workloads with variable token-to-expert ratios, Grouped dispatch is approximately 31% faster than cuBLAS at 80% imbalance and remains comparatively stable across imbalance levels. Naive dispatch is a reference implementation only and is not recommended for latency-sensitive deployment.

Limitations. Backbone and dataset scope. DeiT-T/S/B and ViT-S/B generalization is confirmed (Table VII); ViT-L, ImageNet-1K, and hierarchical backbones (Swin, ConvNeXt) are not evaluated. Parameter count, memory footprint, and FLOPs versus a same-hardware baseline are not reported.

Hardware and compute. CLEAR-MoE is 1.3–1.7 $\times$ slower than dense on GTX 960 (112 GB/s). Latency parity on high-bandwidth hardware is modelled analytically only; multi-device projections are simulated (PCIe Gen3 $\times$ 16), not measured NCCL runs.

VI Conclusion

CLEAR-MoE demonstrates that post-training expert extraction preserves ${\geq}99.9$ % of dense ViT accuracy via a shared SVD basis plus per-cluster residuals, requiring only 200 calibration images and a single consumer GPU. This finding is consistent across five ViT backbones spanning 5.7–86.6 M parameters and two pretraining lineages (Table VII), showing that the shared SVD basis is the dominant accuracy-preserving factor while routing quality, SVD rank, expert count ( $E\in\{2,4,8\}$ ), and calibration size ( $N\in\{50,\ldots,500\}$ ) are secondary. Random routing achieves 86.62% while learned routing at 98% accuracy achieves 86.68% (a 0.06 pp difference), simplifying the design space to a lightweight $<$ 10K-parameter linear router and suggesting future work should invest in decomposition quality over router expressiveness. On a bandwidth-constrained GTX 960 (112 GB/s), CLEAR-MoE’s FFN is 1.3–1.7 $\times$ slower than dense because loading shared weights for all tokens dominates; roofline analysis confirms routing/scatter-gather (AI = 1.0–1.9) are 11–21 $\times$ below the ridge point while expert GEMMs (AI = 22.7) approach the compute ceiling, identifying fused dispatch kernels as the primary engineering target. Future research directions include evaluating scaling on ViT-L and ImageNet-1K, implementing Triton-fused router-dispatch kernels, extending the composite scoring to hierarchical backbones (Swin, ConvNeXt) with non-uniform FFN widths, and empirically testing CLEAR-MoE on high-bandwidth hardware (A100/H100 class) to validate projected latency parity.

References

[1] U. Berisha, J. Mehnert, and A. P. Condurache (2025) Efficient data driven mixture-of-expert extraction from trained networks. External Links: 2505.15414, Link Cited by: §I, TABLE I, §II.
[2] T. Chen, X. Chen, X. Du, A. Rashwan, F. Yang, H. Chen, Z. Wang, and Y. Li (2023-10) AdaMV-moe: adaptive multi-task vision mixture-of-experts. pp. 17300–17311. External Links: Document Cited by: §I, §II.
[3] W. Cui, S. Jiao, T. University, Z. Han, M. Research, Y. Wang, N. Zheng, L. Ma, Y. Yang, F. Yang, J. Xue, L. Qiu, L. Zhou, A. Quan, H. Tan, M. Guo, L. Ouyang, and Q. Chen (2023-07) Optimizing dynamic neural networks with brainstorm optimizing dynamic neural networks with brainstorm. pp. . Cited by: §II.
[4] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2020) An image is worth 16x16 words: transformers for image recognition at scale. CoRR abs/2010.11929. External Links: Link, 2010.11929 Cited by: §I.
[5] G. Hinton, N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, and J. Dean (2017-01) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. pp. . External Links: Document Cited by: §I.
[6] Cited by: §II.
[7] C. Jiang, Y. Tian, Z. Jia, S. Zheng, C. Wu, and Y. Wang (2024) Lancet: accelerating mixture-of-experts training via whole graph computation-communication overlapping. ArXiv abs/2404.19429. External Links: Link Cited by: §II.
[8] Cited by: §II.
[9] Cited by: §II.
[10] Cited by: §I, §II.
[11] F. Szatkowski, B. W’ojcik, M. Pi’orczy’nski, and S. Scardapane (2023) Exploiting activation sparsity with dense to dynamic-k mixture-of-experts conversion. Advances in Neural Information Processing Systems 37. External Links: Link Cited by: §I, TABLE I, §II.
[12] H. Yin, A. Vahdat, J. M. Alvarez, A. Mallya, J. Kautz, and P. Molchanov (2022-06) A-vit: adaptive tokens for efficient vision transformer. pp. 10799–10808. External Links: Document Cited by: §II.
[13] Z. Zhang, Y. Lin, Z. Liu, P. Li, M. Sun, and J. Zhou (2022-01) MoEfication: transformer feed-forward layers are mixtures of experts. pp. 877–890. External Links: Document Cited by: §I, TABLE I, §II.
[14] Cited by: §V.

Appendix A Methodology Details

A-A Glossary of Key Terms

Table A1 defines domain-specific terminology used throughout the paper.

TABLE A1: Domain-specific terminology.

Term	Definition
Token	Image patch embedded as $d$ -dimensional vector; 197 tokens per $224{\times}224$ image
FFN	Feed-forward network: two-layer MLP applied per token in each transformer block
Expert	Routed residual branch processing a token subset; in CLEAR-MoE, all residual experts share one direction $(W_{2}{-}W_{\text{shared}})$ and differ only by a cluster-conditioned scalar $s_{e}$
Router/Gate	Lightweight network mapping each token to a probability distribution over experts
MoE layer	FFN replaced by $E$ experts plus a router; only top- $k{<}E$ experts activate per token
SVD	Singular value decomposition; truncated SVD retains top- $r$ singular components
$k$ -means	Partitions $N$ points into $k$ clusters by minimising within-cluster variance
Calibration set	Small representative set used to measure model statistics (no weight updates)
AI	Arithmetic intensity: FLOPs per byte of memory accessed
Ridge point	AI where compute ceiling equals memory bandwidth ceiling (21.4 FLOPs/B on GTX 960)
p50 Latency	Median over 100 repeated forward passes; robust to OS-scheduling outliers
cuBLAS	NVIDIA’s optimised linear-algebra library; batched GEMM calls dispatch through it

A-B Dataset Preparation and EDA

Imagenette. 13,394 raw images were scanned using SHA-256 deduplication, robust z-score filtering ( $\tau{=}3.5$ ), and Isolation Forest (contamination = 0.01). 118 images (0.88%) were removed, yielding a clean set of 9,391 train / 3,885 val (used for EDA analysis only). The standard Imagenette validation split (3,925 images, including the 40 removed as anomalies by our filter) was used for all reported Top-1 accuracy evaluation, ensuring comparability with prior work. ImageNet normalisation and Lanczos-4 resize to $224{\times}224$ are applied.

Distribution shift analysis. DeiT-Small penultimate activations for all 3,885 clean val images (shape $3885{\times}384$ ): PSI = 0.043 ( $<$ 0.1, negligible shift), JSD = 0.018, KS $p$ -values $>$ 0.05 for all 10 classes. Clustering quality. $k{=}10$ on penultimate image-level activations: Silhouette = 0.542, Davies-Bouldin = 1.24, confirming that the feature space supports class-level cluster separation. Note that CLEAR-MoE clusters token-level intermediate activations with $k{=}4$ , which is a complementary but distinct analysis; the image-level EDA establishes feature quality rather than directly validating token cluster assignments. SVD energy. Rank-192 truncation retains 99.2% of penultimate activation matrix variance; top-10 singular values capture 67.8%. The effect of $W_{2}$ truncation rank on accuracy and reconstruction error is evaluated in Table C3.

Table A2 summarises these preprocessing and analysis statistics.

TABLE A2: Dataset preprocessing and EDA statistics.

Statistic	Value
Raw Imagenette images	13,394
Removed (anomalous)	118 (0.88%)
Train / val split	9,391 / 3,885 (clean, EDA only); 3,925 (standard, used for all Top-1)
Train-test PSI	0.043 ( $<$ 0.1 = negligible shift)
Train-test JSD	0.018
$k{=}10$ Silhouette score	0.542
$k{=}10$ Davies-Bouldin	1.24
Rank-192 SVD variance	99.2%
Calibration size	200 images

A-C Algorithm Pseudocode

Algorithms 1–3 give pseudocode for the three core CLEAR-MoE phases.

Algorithm 1 Phase 2: Layer Scoring and Selection

0: Calibration activations

\{h_{l}^{(i)}\}

L

= 12 layers, select

k{=}6

1: for each layer

l

\text{sp}(l)\leftarrow\frac{\sum_{j}\mathbf{1}(|h_{l,j}|<0.01)}{\operatorname{numel}(h_{l})}

\text{cl}(l)\leftarrow\text{Silhouette}(k\text{-means}(h_{l},E))

\text{se}(l)\leftarrow\|\Delta\text{logits}\|/\|\text{logits}\|

(zero FFN output)

\mathcal{S}(l)\leftarrow 0.4\,\text{sp}+0.4\,\text{cl}-0.2\,\text{se}

6: end for

7: Select top-

k

layers by

\mathcal{S}

Algorithm 2 Phase 3: Expert Extraction (fc1 shared; fc2 decomposed)

W_{2}\in\mathbb{R}^{d\times d_{\text{ffn}}}

, activations

z

E

, rank

r

[U,\Sigma,V]\leftarrow\text{SVD}(W_{2})

;

W_{\text{shared}}\leftarrow U_{r}\Sigma_{r}V_{r}^{\top}

\{C_{1},\ldots,C_{E}\}\leftarrow k\text{-means}(z,E)

3: for

e\in\{1,\ldots,E\}

s_{e}\leftarrow\frac{\frac{1}{|C_{e}|}\sum_{i\in C_{e}}\|z_{i}\|_{2}}{\frac{1}{N}\sum_{i=1}^{N}\|z_{i}\|_{2}}

W_{e}^{\text{res}}\leftarrow(W_{2}-W_{\text{shared}})\cdot s_{e}

6: end for

Algorithm 3 CLEAR-MoE FFN Inference (Grouped backend)

0: Tokens

x

, shared fc1,

W_{\text{shared}}

\{W_{e}^{\text{res}}\}

, Router

g

z\leftarrow\text{GELU}(\text{fc1}(x))

(shared, all tokens)

h_{\text{sh}}\leftarrow zW_{\text{shared}}^{\top}

(shared SVD basis)

a\leftarrow\text{argmax}(g(x))

(expert assignment)

4: Sort

z

a

; get sub-batch boundaries via index search

h_{\text{res}}[e]\leftarrow z_{e}W_{e}^{\text{res}\top}

for each expert

e

6: Unshuffle

h_{\text{res}}

;

y\leftarrow h_{\text{sh}}+h_{\text{res}}

Appendix B Experimental Configuration

B-A Hardware and Software Setup

Table B1 lists the complete hardware and software configuration used in all experiments.

TABLE B1: Hardware and software configuration for all experiments.

Item	Specification
OS	Windows 11 Pro
GPU	NVIDIA GeForce GTX 960
GPU memory	4.0 GB GDDR5
Memory bandwidth	112 GB/s
Peak FP32 throughput	2.4 TFLOP/s
Ridge point	21.4 FLOPs/Byte
CUDA	11.8
PyTorch	2.6.0+cu118
Classification backbone	DeiT-Small; 22M parameters; patch size 16 $\times$ 16; $d{=}384$ ; $d_{\text{ffn}}{=}1536$ ; 12 transformer blocks
Segmentation backbone	Not evaluated (planned future work)
Default $E$	4 experts
Default rank $r$	$r{=}d/2{=}192$ for DeiT-Small (50% of matrix rank)
Calibration size	200 images
Router training	5 epochs, AdamW, lr $10^{-3}$ , cosine decay
Latency	p50 of 100 passes, batch = 8, cuda.synchronize()

Appendix C Extended Experimental Results

C-A Full Ablation Tables

Per-layer composite scores. Full composite scores for all 12 DeiT-Small FFN blocks appear in Table II (Section III). Block 0 ranks last (composite = 0.101) due to sensitivity = 0.946.

Full layer-selection policy ablation. The complete L0–L10 ablation now appears in Table IV (Section IV-B).

C-B Hyperparameter Sensitivity Studies

Tables C1–C5 report the full numerical results for the five hyperparameter sensitivity studies summarised in Section IV.

TABLE C1: 3-Seed Robustness. Full composite-score pipeline,

E{=}4

, DeiT-Small.

Seed	Top-1	p50 ms	Router Acc	Skew
42	86.68%	85.05	0.925	0.099
123	86.70%	83.90	0.928	0.116
456	86.73%	78.89	0.928	0.101
Mean $\pm$ Std	86.70 $\pm$ 0.02%	82.6 $\pm$ 2.7	0.927 $\pm$ 0.002	0.105 $\pm$ 0.008

TABLE C2: Calibration-set sensitivity. Multiple random subsets per size.

$N_{\text{cal}}$	Sub.	Top-1	Router Acc	Skew
50	1	86.70%	0.867	0.139
50	2	86.68%	0.833	0.073
50	3	86.65%	0.856	0.139
100	1	86.55%	0.896	0.094
100	2	86.60%	0.908	0.150
100	3	86.70%	0.899	0.153
200	1	86.68%	0.926	0.099
200	2	86.60%	0.928	0.128
500	1	86.70%	0.953	0.098

TABLE C3: SVD rank sweep. Seed 42,

E{=}4

, composite scoring.

Rank $r$	Top-1	p50 ms	Recon Err
16	86.75%	78.13	0.907
32	86.65%	76.79	0.841
64	86.65%	76.60	0.734
96	86.75%	78.07	0.643
128	86.68%	77.82	0.562
192	86.62%	76.01	0.420
256	86.70%	77.36	0.294

TABLE C4: Expert-count study. Seed 42, composite scoring,

k{=}6

$E$	Top-1	p50 ms	Rtr Acc	Skew
2	86.65%	73.4	0.967	0.247
4	86.65%	77.5	0.923	0.074
8	86.80%	84.3	0.893	0.042
16	86.39%	97.8	0.865	0.024

TABLE C5: Router architecture comparison. Seed 42,

E{=}4

, composite scoring. Entropy: aggregate expert-usage entropy (distribution of tokens across experts, per layer; max

=\ln 4\approx 1.386

). Note: this is the load-balance entropy, distinct from the per-token predictive entropy in Fig. 3 (

\approx

0.40 nats), which measures router confidence.

Router	Top-1	p50 ms	Rtr Acc	Skew	Entropy	Params
Linear	86.68%	78.0	0.925	0.099	1.296	9,240
MLP	86.68%	75.6	0.980	0.095	1.300	149,400
Adaptive	86.68%	77.6	0.936	0.099	1.296	9,240

C-C Dispatch Benchmark and Parallel Scaling

Table C6 benchmarks all three CUDA backends under four token-load imbalance levels ( $B{=}8$ , $T{=}1568$ tokens, $E{=}4$ , GTX 960); CPU Serial is included as a baseline, allowing direct GPU-vs-CPU comparison at each imbalance level. The roofline plot appears in Fig. 2 (Section IV). Table C7 projects multi-device throughput analytically.

TABLE C6: Dispatch micro-benchmark: p50 latency (ms) and throughput (K tok/s) across imbalance levels.

T{=}1568=8{\times}196

tokens (class token excluded from expert dispatch; processed on the dense path). Calibration uses 197 tokens/image including the class token.

Backend	ms	K/s	ms	K/s	ms	K/s	ms	K/s
	0% imb.		40% imb.		60% imb.		80% imb.
CPU Serial	6.01	261	7.22	217	5.96	263	5.96	263
Naive	3.67	428	3.55	442	3.50	448	3.59	437
Grouped	2.36	665	2.10	747	2.30	681	2.14	733
cuBLAS	2.15	728	2.29	684	2.45	639	2.81	558

GPU vs. CPU and Parallel Scaling.

GPU vs. CPU speedup at balanced load is readable from the CPU Serial row in Table C6 above: at 0% imbalance, cuBLAS achieves 728 K tok/s vs. 261 K tok/s for CPU Serial ( $2.79\times$ speedup).

Multi-device projection (simulated; no second GPU available). All-to-all and AllReduce costs are modelled analytically using PCIe Gen3 $\times$ 16 bandwidth (16 GB/s). Results are not measured NCCL runs.

TABLE C7: Parallel scaling projection. Expert/Data/Pipeline parallelism are simulated; single-GPU is measured.

All multi-device entries are simulated via PCIe Gen3 $\times$ 16 bandwidth; no NCCL runs performed.
Mode	tok/s	Devices	Speedup vs GPU-1	Efficiency
Full-model single-GPU	225K	1	1.00 $\times$	1.00
Expert Parallel EP-2 (sim.)	175K	2	0.78 $\times$	0.39
Pipeline Parallel PP-2 (sim.)	8K	2	0.03 $\times$	0.02
EP-2 overhead: all-to-all token redistribution (AI = 0.1 F/B).
PP-2: 50% pipeline bubble with 1 micro-batch, 2 stages.
Single-GPU throughput is full-model (distinct from dispatch-only throughput in Table C6).