License: CC BY-NC-SA 4.0
arXiv:2606.28516v1 [cs.CV] 26 Jun 2026

CLEAR-MoE: Shared-Basis Expert Extraction from Frozen Vision Transformers via Calibration-Driven Layer Selection

Md Irtiza Hossain    Humaira Ayesha    Junaid Ahmed Sifat
Abstract

We present CLEAR-MoE, a four-phase post-training pipeline that converts a frozen pretrained vision transformer (ViT) into a sparse mixture-of-experts (MoE) model without updating backbone weights. The pipeline (i) scores FFN layers by sparsity, clusterability, and output sensitivity; (ii) decomposes selected layers into a shared low-rank SVD basis plus per-cluster residual experts (kk-means); (iii) fits lightweight routers supervised by cluster labels; and (iv) dispatches tokens via pluggable CUDA backends. On Imagenette with DeiT-Small, CLEAR-MoE retains 99.9% of dense accuracy (86.70 ±\pm 0.02% vs. 86.73%). Our ablations isolate a consistent empirical finding: the shared SVD basis is the dominant accuracy-preserving component. Random routing, learned routing, and three router architectures yield numerically similar results spanning at most 0.06 pp (86.62–86.68%), and accuracy remains stable across SVD rank, expert count E{2,,8}E\in\{2,\ldots,8\}, calibration size N{50,,500}N\in\{50,\ldots,500\}, and random seed. This finding generalizes across five ViT backbones (DeiT-T/S/B, ViT-S/B, 5.7M–86.6M parameters), with |Δ|0.10|\Delta|{\leq}0.10 pp across all configurations. On a GTX 960, routing and scatter-gather overhead make CLEAR-MoE FFN 1.3–1.7×\times slower than dense. A dispatch microbenchmark indicates that routing (AI = 1.9 FLOPs/B) is an order of magnitude more memory-bound than expert GEMMs (AI = 22.7), identifying fused dispatch kernels as a plausible optimization target.

I Introduction

Vision transformers (ViTs) [4] devote roughly two-thirds of inference FLOPs to feed-forward network (FFN) sub-layers that execute the same MLP computation for every token, regardless of semantic content [10]. Sparse mixture-of-experts (MoE) architectures [5] address this by routing each token to a subset of specialized sub-networks, reducing active compute proportional to k/Ek/E. However, existing vision MoE models [10, 2] require training from scratch, which is a resource barrier for practitioners who already hold a pretrained checkpoint.

Post-training expert extraction converts a dense FFN into experts without retraining [13, 11, 1]. Despite recent progress, three gaps remain: (1) most methods convert all FFN layers, ignoring sensitivity variation across depth; (2) disjoint expert assignment discards shared low-level structure, destabilizing downstream transfer; (3) speedup claims are reported in MACs rather than wall-clock latency on real hardware.

CLEAR-MoE addresses all three with the following contributions:

  • Calibration-driven layer scoring: composite score 𝒮(l)=0.4sparsity+0.4clusterability0.2sensitivity\mathcal{S}(l)=0.4\,\text{sparsity}+0.4\,\text{clusterability}-0.2\,\text{sensitivity} selects FFN layers that cluster well and tolerate perturbation, automatically excluding high-sensitivity layers (e.g., block 0, sensitivity == 0.946).

  • Shared-basis decomposition: fc2 is decomposed into a shared truncated-SVD basis plus per-cluster residuals; fc1 is shared identically across all experts. This preserves common visual structure while allowing per-cluster specialization.

  • Hardware-transparent latency study: all timing on a single GTX 960 (4 GB, 112 GB/s) with cuda.synchronize()-fenced p50 reporting (100 forward passes), indicating that routing overhead (not expert arithmetic) is a likely bottleneck.

  • Comprehensive ablation: decomposition (D0–D8), layer selection (L0–L10), dispatch backends, expert count, SVD rank, router architecture, calibration size, and random seed, establishing which design choices actually matter.

Scope. All measured results use a single GTX 960. No claims are made about data-center GPUs, distributed setups, or larger datasets. Multi-device throughput projections are modelled analytically using PCIe Gen3×\times16 bandwidth; no NCCL runs were performed.

II Related Work

Post-training expert extraction. MoEfication [13] first showed that clustering neuron activation patterns partitions a pretrained FFN into experts without retraining. D2DMoE [11] extended this to dynamic-kk routing, achieving 30% latency reduction on an A100 for ViT-B/ImageNet-1K. Berisha et al. [1] used variance-based neuron grouping to recover 98% of dense performance with 36.3% MAC reduction for DeiT-B. CLEAR-MoE differs by (a) selecting layers via a composite sensitivity-aware score, (b) preserving shared structure via SVD decomposition, and (c) providing a full wall-clock dispatch study on consumer hardware. Table I positions these methods side-by-side. Sparse Upcycling [8] converts dense language-model checkpoints to MoE by cloning FFN weights and training a load-balancing router via continued pre-training; unlike CLEAR-MoE, backbone weights are updated, requiring gradient access and sufficient unlabelled data.

TABLE I: Post-training MoE extraction: method comparison. \dagger = MAC reduction (not wall-clock). \ddagger = minimal fine-tuning reported by original paper. CLEAR-MoE: zero backbone weight updates.
Method Backbone HW Acc. Latency Structure
MoEfication [13] ViT N/A \sim99% N/A Disjoint kk-means
D2DMoE [11] ViT-B/16 A100 \sim99% -30% Dynamic-kk disj.
Berisha [1] DeiT-B N/A 98% -36.3%\dagger Var.-based disj.
CLEAR-MoE DeiT/ViT (5) GTX 960 99.9% ++1.3–1.7×\times Shared + res.
D2DMoE latency on A100 (2 TB/s); CLEAR-MoE on GTX 960 (112 GB/s). Bandwidth 18×\times lower.

Vision MoE training. V-MoE [10] demonstrated that sparse patch routing matches dense quality at \sim50% active compute for 15B-parameter ViTs. AdaMV-MoE [2] showed late transformer layers benefit most from expertization, a heuristic our composite score can override when sensitivity signals contradict it. DynamicViT [9] and A-ViT [12] achieve token-level sparsification via halting gates, complementary to our FFN-level approach.

MoE runtime systems. TUTEL [6] achieves 3.11×\times speedup via 2D all-to-all on 128 GPUs. Brainstorm [3] shows profile-guided dispatch yields 5×\times speedup over DeepSpeed for SwinV2-MoE. Lancet [7] reduces non-overlapping communication by 77% via whole-graph overlap. These systems target high-bandwidth multi-GPU clusters; CLEAR-MoE studies the single-consumer-GPU regime where bandwidth is 18×\times lower.

III Method

CLEAR-MoE converts a frozen pretrained ViT into a selectively expertised model through four sequential phases (Fig. 1). No backbone weights are updated.

Refer to caption
Figure 1: CLEAR-MoE pipeline. A frozen pretrained ViT yields calibration activations; scored FFN layers are decomposed into a shared SVD basis plus per-cluster residual experts; lightweight routers are fitted on kk-means labels; inference uses pluggable CUDA dispatch backends.

III-A Phase 1: Calibration Pass

Forward NcalN_{\text{cal}} representative images through the frozen model, recording pre-FFN activation tensors at each of the LL selected layers via PyTorch hooks. For DeiT-Small (d=384d{=}384, Ncal=200N_{\text{cal}}{=}200, 197 tokens/image), this produces a [39,400, 384][39{,}400,\,384] activation matrix per layer.

III-B Phase 2: Layer Scoring and Selection

Each FFN layer ll receives a composite score:

𝒮(l)=0.4sparsity(l)+0.4clusterability(l)0.2sensitivity(l)\mathcal{S}(l)=0.4\,\text{sparsity}(l)+0.4\,\text{clusterability}(l)-0.2\,\text{sensitivity}(l) (1)

where sparsity is the fraction of activation magnitudes below 0.01; clusterability is the Silhouette score of kk-means with k=Ek{=}E on the calibration activations; and sensitivity is the normalized logit change when the layer’s FFN output is zeroed. Sensitivity is subtracted because high sensitivity implies high degradation risk if expertized. The top-k=L/2k{=}L/2 layers by 𝒮\mathcal{S} are selected; the remaining layers retain their original dense FFN. Table II reports per-layer scores for DeiT-Small (200-image calibration); block 0 ranks last due to sensitivity == 0.946.

TABLE II: Per-layer composite scores, DeiT-Small (200-image calibration). Composite =0.4sp+0.4cl0.2se=0.4{\cdot}\text{sp}+0.4{\cdot}\text{cl}-0.2{\cdot}\text{se}. Bold = selected (top-k=6k{=}6).
Block Sparsity Clusterab. Sensitivity Composite
blocks.1 0.180 0.526 0.171 0.248
blocks.4 0.152 0.518 0.109 0.246
blocks.5 0.152 0.520 0.113 0.246
blocks.3 0.149 0.519 0.113 0.245
blocks.6 0.147 0.522 0.123 0.243
blocks.2 0.141 0.518 0.147 0.234
blocks.7 0.147 0.519 0.162 0.234
blocks.11 0.134 0.516 0.149 0.230
blocks.9 0.135 0.518 0.167 0.228
blocks.8 0.142 0.512 0.175 0.227
blocks.10 0.171 0.514 0.252 0.224
blocks.0 0.210 0.516 0.946 0.101

III-C Phase 3: Expert Extraction

For each selected layer, fc2 (W2d×dffnW_{2}\in\mathbb{R}^{d\times d_{\text{ffn}}}) is decomposed via truncated SVD; fc1 is shared identically across all experts:

y=WsharedSVD-r approx.z+Werescluster residualz,z=GELU(fc1(x))y=\underbrace{W_{\text{shared}}}_{\text{SVD-}r\text{ approx.}}\!\cdot z+\underbrace{W_{e^{*}}^{\text{res}}}_{\text{cluster residual}}\!\cdot z,\;\;z=\text{GELU}(\text{fc1}(x)) (2)

Wshared=UrΣrVrW_{\text{shared}}=U_{r}\Sigma_{r}V_{r}^{\top} retains the top-rr singular values (default rr = 50% of matrix rank, r=d/2=192r{=}d/2{=}192 for DeiT-Small where W2384×1536W_{2}{\in}\mathbb{R}^{384\times 1536} has maximum rank 384; materialised as a dense matrix before inference, with no factorised compute reduction). kk-means with k=Ek{=}E partitions calibration tokens into EE clusters; the per-cluster residual is Weres=(W2Wshared)seW_{e}^{\text{res}}=(W_{2}-W_{\text{shared}})\cdot s_{e}, where ses_{e} scales by the ratio of cluster-mean to global-mean activation norm. Because ses_{e} is a scalar, all residual experts share the same directional component (W2Wshared)(W_{2}-W_{\text{shared}}), differing only in magnitude; a misrouted token therefore receives a different scaling but the same directional correction. During inference, zz is computed once and reused for both the shared and residual paths.

III-D Phase 4: Router Fitting

A router gθg_{\theta} predicts the cluster assignment of each token. The kk-means labels from Phase 3 provide supervised targets: router training minimizes cross-entropy between gθ(x)g_{\theta}(x) and cluster label cc over the calibration tokens, using AdamW (lr=103\text{lr}{=}10^{-3}, 5 epochs, cosine decay). Three router architectures are evaluated: Linear (d×Ed\times E parameters, 9.2K total), MLP (hidden layer d/6d/6, 149K), and Adaptive (confidence-threshold top-1/top-2 escalation, same parameter count as Linear).

III-E Dispatch Backends

Three CUDA backends are implemented: (1) Naive: boolean-mask loop over experts (reference). (2) Grouped: tokens are sorted by expert index; one GEMM per contiguous sub-batch. No padding; throughput stable under imbalance. (3) cuBLAS: expert sub-batches are padded to the largest and stacked into a 3D tensor dispatched to a batched GEMM call. Fastest at balanced load; degrades at high imbalance from padding waste.

IV Experiments

Setup. DeiT-Small (22M parameters, d=384d{=}384, dffn=1536d_{\text{ffn}}{=}1536, 12 blocks) evaluated on Imagenette (3,925 validation images, 10-class ImageNet subset) with 200-image calibration set, E=4E{=}4 experts, AdamW router training (5 epochs, lr 10310^{-3}, cosine decay). All timing: p50 of 100 forward passes, batch size 8, GTX 960 (4 GB, 112 GB/s), cuda.synchronize()-fenced.

IV-A Decomposition Ablation (D0–D8)

Table III ablates what each architectural choice contributes to accuracy and latency. Configurations D0–D8 span the design space from dense baseline to shared-basis CLEAR-MoE FFN, isolating the effect of the shared basis, the routing mechanism, and the expert assignment strategy.

TABLE III: Decomposition ablation (DeiT-Small, Imagenette, E=4E{=}4, last-kk layers, batch=8, GTX 960). Δ\DeltaTop-1 relative to D0. D4/D5 differ only in random seed. D6 and D8 are the same architecture: D6 is the cold-kernel first dispatch; D8 is measured after 3 warm-up passes (steady-state). The 21 ms gap reflects CUDA JIT overhead on first run.
ID Sh. fc1 Sh. fc2 Router Top-1 𝚫\boldsymbol{\Delta}pp ms (p50)
D0 Dense 86.73% ++0.00 59.6
D1 SVD rank-rr None 86.55% -0.18 59.7
D2 Disjoint Random 66.42% -20.31 49.4
D3 Full res. None 86.73% ++0.00 59.9
D4/D5 kk-means res. Random 86.62–65% -0.08–11 92.2
D6/D8 kk-means res. Linear 86.65% -0.08 78.8–100.3
D7 Disjoint kk-means 65.22% -21.51 48.0
✗ = disjoint (no shared fc1); ✓ = shared across all tokens.

The shared SVD basis is the dominant accuracy-preserving component. D3 (shared basis + full global residual, no routing) achieves exact reconstruction: algebraically, Wsharedz+(W2Wshared)z=W2zW_{\text{shared}}{\cdot}z+(W_{2}{-}W_{\text{shared}}){\cdot}z=W_{2}{\cdot}z, matching D0 at 86.73%. D2 and D7 (disjoint experts, no shared fc1) collapse to 65–66% regardless of routing strategy: disjoint expert weights estimated from 200 calibration images fail to generalize. In contrast, D4/D5 (shared basis, random routing) retain 86.62–65%, only 0.08–0.11 pp below D0; D6/D8 (shared basis, learned routing at 95% accuracy) achieve an identical -0.08 pp. Routing quality has limited measured impact once the shared basis is present.

No latency reduction on GTX 960. CLEAR-MoE FFN (D4–D8) executes shared fc1+fc2 on all tokens plus residual paths that collectively process all TT tokens (T/ET/E per expert ×\times EE experts), adding approximately one full fc2-equivalent computation; total arithmetic is 1.5×\approx\!1.5\times dense FFN. On bandwidth-limited memory (112 GB/s), this incurs 1.3–1.7×\times latency overhead (78–100 ms vs. 59.6 ms). Only disjoint experts (D2/D7) are faster (48–49 ms) by eliminating the shared path entirely, at the cost of 21 pp accuracy.

IV-B Layer-Selection Ablation (L0–L10)

Table IV compares all 11 layer-selection strategies on the same MoE configuration (E=4E{=}4, k=6k{=}6 of 12 FFN layers, composite scoring).

TABLE IV: Complete layer-selection ablation (L0–L10). All: DeiT-Small, E=4E{=}4, k=6k{=}6 layers, linear router, 200-image calibration. L1 avg of 3 seeds. Δ\Delta relative to L0.
ID Strategy Layers Top-1 𝚫\boldsymbol{\Delta}pp p50 ms
L0 Dense (no MoE) 86.73% ++0.00 57.0
L1 Random (n=3n{=}3) varies 86.72±\pm0.01% -0.01 78.5±\pm0.8
L2 First kk 0–5 86.60% -0.13 79.3
L3 Last kk 6–11 86.68% -0.05 80.3
L4 Alternating (odd) 1,3,5,7,9,11 86.83% ++0.10 87.2
L5 Sparsity only 0,1,3,4,5,10 86.73% ++0.00 99.1
L6 Clusterability 1,2,3,5,6,7 86.73% ++0.00 88.8
L7 High-sensitivity 0,1,7,8,9,10 86.65% -0.08 102.5
L8 Sp.+Cl. 0,1,4,5,6,10 86.62% -0.10 88.8
L9 Cl.-Se. 2,3,4,5,6,11 86.68% -0.05 94.0
L10 Composite (ours) 1–6 86.68% \boldsymbol{-}0.05 82.4

All 11 policies span only 0.21 pp (86.62–86.83%), confirming that accuracy is policy-insensitive. The shared basis preserves quality regardless of which 6 blocks are expertized. The composite score’s practical value is principled block 0 exclusion (sensitivity = 0.946): L2 (first-kk) includes block 0 and incurs the worst -0.13 pp drop. L10 avoids it, achieving 82.4 ms, comparable to L3 (last-kk, 80.3 ms) while additionally excluding high-sensitivity blocks. L7 (high-sensitivity, deliberately selecting the most sensitive blocks as a stress-test baseline) achieves only -0.08 pp accuracy loss but is the slowest policy at 102.5 ms, demonstrating that layer choice affects latency more than accuracy.

IV-C Dispatch Strategy Benchmark

We benchmark three CUDA backends (B=8B{=}8, T=1568T{=}1568, E=4E{=}4, GTX 960) under four imbalance levels (Table V). At balanced load, cuBLAS peaks at 728 K tok/s (2.79×\times CPU serial). At 80% imbalance, padding waste collapses cuBLAS to 558 K tok/s (-23%). Grouped dispatch remains stable at 665–747 K tok/s across all imbalance levels. Recommendation: cuBLAS for predictably balanced load; Grouped otherwise.

TABLE V: Dispatch micro-benchmark: 0% vs. 80% imbalance (B=8B{=}8, T=1568T{=}1568, E=4E{=}4, GTX 960).
0% imb. 80% imb.
Backend ms K tok/s ms K tok/s
CPU Serial 6.01 261 5.96 263
Naive 3.67 428 3.59 437
Grouped 2.36 665 2.14 733
cuBLAS 2.15 728 2.81 558

IV-D Roofline Analysis

Fig. 2 places each CLEAR-MoE operation on the GTX 960 roofline (peak FP32: 2.4 TFLOP/s, BW: 112 GB/s, ridge: 21.4 FLOPs/B). Dense FFN (AI = 74.3) and expert GEMMs (AI = 22.7) are compute-bound. The router gate (AI = 1.9) and token sort (AI = 1.0) are both deep in the memory-bound regime: 11×\times and 21×\times below the ridge point. This indicates that routing and token sort are strongly memory-bound (11–21×\times below the ridge point), making fused dispatch kernels a plausible high-leverage optimization. Expert GEMMs are already near the compute ceiling and offer diminishing returns.

Refer to caption
Figure 2: Roofline for GTX 960. Blue = compute-bound; orange/red = memory-bound. Router gate (AI = 1.9) is 11×\times below the ridge point (21.4 FLOPs/B); routing, not expert arithmetic, is the primary latency bottleneck.

IV-E Hyperparameter Sensitivity

We ablate SVD rank, expert count, router architecture, calibration size, and random seed (Table VI); findings are:

TABLE VI: Hyperparameter sensitivity summary (DeiT-Small, Imagenette). All ranges \leq0.41 pp.
Study Range tested Top-1 span 𝚫\boldsymbol{\Delta}pp
SVD rank rr 16–256 86.62–86.75% 0.13
Expert count EE 2–16 86.39–86.80% 0.41
Router arch. Lin/MLP/Adaptive 86.68% (all) 0.00
Calibration NN 50–500 86.55–86.70% 0.15
Random seed 42/123/456 86.70±\pm0.02% 0.05
E{2,4,8}{}^{*}E{\in}\{2,4,8\} spans 0.15 pp; E=16E{=}16 drops 0.26 pp (calibration instability: 200 images insufficient for 16 clusters).

SVD rank (r{16,32,64,96,128,192,256}r\in\{16,32,64,96,128,192,256\}): accuracy range is 86.62–86.75% (0.13 pp). Reconstruction error decreases monotonically from 0.91 to 0.29, but this improvement does not translate to accuracy. Even r=16r{=}16 captures the shared basis sufficiently.

Expert count (E{2,4,8,16}E\in\{2,4,8,16\}): accuracy is 86.65–86.80% for E{2,4,8}E\in\{2,4,8\}; E=16E{=}16 drops 0.26 pp as 200 calibration images provide insufficient statistics to estimate 16 fine-grained clusters. Latency grows \sim9 ms per doubling of EE (73 ms at E=2E{=}2 to 98 ms at E=16E{=}16). No empty experts appear at any count.

Router architecture: all three routers (Linear, MLP, Adaptive) achieve identical 86.68% Top-1. MLP routing accuracy reaches 0.980 vs. 0.925 for Linear, a 5.5 pp improvement in routing precision with zero downstream benefit. The gap between random routing (D4/D5, Table III) and all learned routers is at most 0.06 pp (roughly two changed predictions out of 3,925). While this numerically exceeds the 3-seed standard deviation of ±\pm0.02 pp (Table VI, row “Random seed”), a paired significance test would be needed to confirm a reliable effect; we treat 0.06 pp as practically negligible. Fig. 3 provides a mechanistic view: despite D6 (learned router) reducing mean per-token routing entropy 3.5×3.5\times versus D5 (random, ln41.386\ln 4{\approx}1.386 nats), the accuracy impact is negligible.

Refer to caption
Figure 3: Mean per-token routing entropy (nats) per expertized layer. Dashed red = D5 (random routing, max entropy ln41.386\ln 4{\approx}1.386). Blue bars = D6 (learned linear router, mean 0.40{\approx}0.40 nats). Despite 3.5×3.5\times entropy reduction, the D5-vs-D6 accuracy gap is only 0.06 pp, confirming that routing confidence does not drive accuracy; the shared SVD basis does.

Fig. 4 plots accuracy gap against router training across 13 configurations (D5 at x=0x{=}0, no trained router); no strong monotonic relationship is visible, consistent with the hypothesis that shared-basis quality governs accuracy regardless of routing quality.

Refer to caption
Figure 4: Accuracy gap (Δ\Deltapp relative to dense) vs. router training across 13 configurations. D5 (random, no trained router) at x=0x{=}0; D6 and cross-backbone routers at measured routing accuracy. No strong monotonic relationship is visible, consistent with the hypothesis that shared-basis quality governs accuracy regardless of routing precision.

Calibration size (N{50,100,200,500}N\in\{50,100,200,500\}, multiple subsets for N<500N{<}500): accuracy spans 86.55–86.70% across all sizes; N=50N{=}50 yields 86.65–86.70% (within 0.15 pp of N=500N{=}500). Router accuracy climbs from 0.83 to 0.95 as NN grows, without accuracy benefit.

Reproducibility: 86.70 ±\pm 0.02% Top-1 across seeds 42, 123, 456 (full pipeline, composite scoring, E=4E{=}4, N=200N{=}200).

IV-F Cross-Backbone Generalization

Table VII tests whether the shared-basis finding extends across architectures. We apply the full CLEAR-MoE pipeline to five backbones spanning 5.7–86.6 M parameters and two pretraining lineages (DeiT, ViT); all other settings are identical to the DeiT-S ablation (E=4E{=}4, N=200N{=}200, seed 42).

TABLE VII: Cross-backbone generalization (Imagenette, E=4E{=}4, N=200N{=}200, seed 42). D5 = random routing; D6 = learned linear router. |Δ|0.10|\Delta|{\leq}0.10 pp across all 10 configurations.
Backbone Params Dense D5 Δ\Delta pp D6 Δ\Delta pp D6 Rtr. Acc D6 Skew
DeiT-T/16 5.7 M 75.92% -0.05 ++0.05 0.913 0.090
DeiT-S/16 22.1 M 86.73% -0.10 -0.05 0.925 0.099
ViT-S/16 22.1 M 76.23% -0.05 -0.03 0.956 0.139
DeiT-B/16 86.6 M 91.77% -0.03 ±\pm0.00 0.927 0.094
ViT-B/16 86.6 M 85.38% -0.08 -0.10 0.967 0.150
Single-seed result for this backbone; per-backbone seed variance not characterised.
Highest load skew (0.150); see Discussion.

Routing differences are numerically small. Learned routing (D6) provides a marginal numerical advantage over random routing (D5) in 4 of 5 backbones; however, the per-backbone improvements range from 0.02 to 0.10 pp, too small to confirm without paired significance testing. The one exception is ViT-B/16, where D5 (-0.08 pp) outperforms D6 (-0.10 pp): The ViT-B result coincides with the highest observed load skew (0.150, vs. 0.094 for DeiT-B), which may contribute to the difference; no ablation was performed to confirm a causal link. Taken together, these results are consistent with the hypothesis that routing quality has limited impact once the shared SVD basis is present; a definitive claim would require paired prediction-level testing across additional seeds. Fig. 5 visualises all 10 configurations.

Refer to caption
Figure 5: Accuracy Δ\Delta vs. dense for five ViT backbones under D5 (random) and D6 (learned linear) routing (Imagenette, E=4E{=}4, N=200N{=}200, seed 42). All deltas within |Δ|0.10|\Delta|{\leq}0.10 pp. D6 provides marginal numerical advantage in 4 of 5 backbones; D5 outperforms D6 only on ViT-B (highest load skew, 0.150).

V Discussion

Why routing differences are small. All residual experts are scalar-conditioned variants of the same matrix (W2Wshared)(W_{2}-W_{\text{shared}}), so mis-routing changes only the scaling factor applied to a token’s correction, not its direction. The 0.06 pp gap between random and learned routing is the direct consequence. Cross-backbone results (Table VII) are consistent across 4 of 5 architectures; a linear router (<<10K parameters) appears adequate. Expert Choice routing [14], where experts select their tokens rather than the reverse, achieves better load balance in language models; on DeiT-S, however, random and learned routing differ by at most 0.06 pp, and all cross-backbone random-routing losses remain within 0.10 pp of dense, supporting the hypothesis that routing strategy is secondary to decomposition quality.

Why CLEAR-MoE is slower on the GTX 960. CLEAR-MoE FFN computes shared fc1+fc2 on all tokens, then adds residual paths that each process T/ET/E tokens but collectively span all TT tokens across EE experts, adding approximately one full fc2-equivalent computation. Total arithmetic is 1.5×\approx\!1.5\times dense FFN. On a bandwidth-limited GPU (112 GB/s vs. 2 TB/s for an A100), loading shared weights and expert-residual weights across memory dominates. The dispatch micro-benchmark isolates this: routing and token sort (AI = 1.0–1.9) are 11–21×\times below the compute ridge. Disjoint experts (D2/D7) are faster (48 ms) by eliminating shared paths, but sacrifice 21 pp accuracy. Analytical modelling suggests \geq800 GB/s bandwidth may be needed for latency parity; A100 (2 TB/s) or H100 (3.35 TB/s) class hardware are the plausible candidates.

Practical design guidelines. When to use CLEAR-MoE. CLEAR-MoE may be well suited when accuracy preservation is non-negotiable and backbone fine-tuning is infeasible: the evaluated DeiT and ViT backbones can be converted with 200 calibration images and no GPU cluster. The method is not competitive with disjoint-expert approaches if latency reduction is the primary goal on consumer hardware.

Choosing EE and rr. Expert count E=4E{=}4 provides a reasonable empirical trade-off: accuracy remains stable from E=2E{=}2 to E=8E{=}8, and E=4E{=}4 balances router overhead with residual expressiveness. For E=16E{=}16, our analytical model suggests \geq800 calibration images may be needed for stable cluster estimates; this was not directly validated. SVD rank rr can be set as low as 16 without consistent measured accuracy reduction on Imagenette; the default r=d/2192r{=}d/2{\approx}192 for DeiT-Small is conservative and may waste compute on deep layers.

Choosing the dispatch backend. cuBLAS is optimal only for perfectly balanced expert load, which is common in classification but rare in detection/segmentation. For workloads with variable token-to-expert ratios, Grouped dispatch is approximately 31% faster than cuBLAS at 80% imbalance and remains comparatively stable across imbalance levels. Naive dispatch is a reference implementation only and is not recommended for latency-sensitive deployment.

Limitations. Backbone and dataset scope. DeiT-T/S/B and ViT-S/B generalization is confirmed (Table VII); ViT-L, ImageNet-1K, and hierarchical backbones (Swin, ConvNeXt) are not evaluated. Parameter count, memory footprint, and FLOPs versus a same-hardware baseline are not reported.

Hardware and compute. CLEAR-MoE is 1.3–1.7×\times slower than dense on GTX 960 (112 GB/s). Latency parity on high-bandwidth hardware is modelled analytically only; multi-device projections are simulated (PCIe Gen3×\times16), not measured NCCL runs.

VI Conclusion

CLEAR-MoE demonstrates that post-training expert extraction preserves 99.9{\geq}99.9% of dense ViT accuracy via a shared SVD basis plus per-cluster residuals, requiring only 200 calibration images and a single consumer GPU. This finding is consistent across five ViT backbones spanning 5.7–86.6 M parameters and two pretraining lineages (Table VII), showing that the shared SVD basis is the dominant accuracy-preserving factor while routing quality, SVD rank, expert count (E{2,4,8}E\in\{2,4,8\}), and calibration size (N{50,,500}N\in\{50,\ldots,500\}) are secondary. Random routing achieves 86.62% while learned routing at 98% accuracy achieves 86.68% (a 0.06 pp difference), simplifying the design space to a lightweight <<10K-parameter linear router and suggesting future work should invest in decomposition quality over router expressiveness. On a bandwidth-constrained GTX 960 (112 GB/s), CLEAR-MoE’s FFN is 1.3–1.7×\times slower than dense because loading shared weights for all tokens dominates; roofline analysis confirms routing/scatter-gather (AI = 1.0–1.9) are 11–21×\times below the ridge point while expert GEMMs (AI = 22.7) approach the compute ceiling, identifying fused dispatch kernels as the primary engineering target. Future research directions include evaluating scaling on ViT-L and ImageNet-1K, implementing Triton-fused router-dispatch kernels, extending the composite scoring to hierarchical backbones (Swin, ConvNeXt) with non-uniform FFN widths, and empirically testing CLEAR-MoE on high-bandwidth hardware (A100/H100 class) to validate projected latency parity.

References

  • [1] U. Berisha, J. Mehnert, and A. P. Condurache (2025) Efficient data driven mixture-of-expert extraction from trained networks. External Links: 2505.15414, Link Cited by: §I, TABLE I, §II.
  • [2] T. Chen, X. Chen, X. Du, A. Rashwan, F. Yang, H. Chen, Z. Wang, and Y. Li (2023-10) AdaMV-moe: adaptive multi-task vision mixture-of-experts. pp. 17300–17311. External Links: Document Cited by: §I, §II.
  • [3] W. Cui, S. Jiao, T. University, Z. Han, M. Research, Y. Wang, N. Zheng, L. Ma, Y. Yang, F. Yang, J. Xue, L. Qiu, L. Zhou, A. Quan, H. Tan, M. Guo, L. Ouyang, and Q. Chen (2023-07) Optimizing dynamic neural networks with brainstorm optimizing dynamic neural networks with brainstorm. pp. . Cited by: §II.
  • [4] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2020) An image is worth 16x16 words: transformers for image recognition at scale. CoRR abs/2010.11929. External Links: Link, 2010.11929 Cited by: §I.
  • [5] G. Hinton, N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, and J. Dean (2017-01) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. pp. . External Links: Document Cited by: §I.
  • [6] Cited by: §II.
  • [7] C. Jiang, Y. Tian, Z. Jia, S. Zheng, C. Wu, and Y. Wang (2024) Lancet: accelerating mixture-of-experts training via whole graph computation-communication overlapping. ArXiv abs/2404.19429. External Links: Link Cited by: §II.
  • [8] Cited by: §II.
  • [9] Cited by: §II.
  • [10] Cited by: §I, §II.
  • [11] F. Szatkowski, B. W’ojcik, M. Pi’orczy’nski, and S. Scardapane (2023) Exploiting activation sparsity with dense to dynamic-k mixture-of-experts conversion. Advances in Neural Information Processing Systems 37. External Links: Link Cited by: §I, TABLE I, §II.
  • [12] H. Yin, A. Vahdat, J. M. Alvarez, A. Mallya, J. Kautz, and P. Molchanov (2022-06) A-vit: adaptive tokens for efficient vision transformer. pp. 10799–10808. External Links: Document Cited by: §II.
  • [13] Z. Zhang, Y. Lin, Z. Liu, P. Li, M. Sun, and J. Zhou (2022-01) MoEfication: transformer feed-forward layers are mixtures of experts. pp. 877–890. External Links: Document Cited by: §I, TABLE I, §II.
  • [14] Cited by: §V.

Appendix A Methodology Details

A-A Glossary of Key Terms

Table A1 defines domain-specific terminology used throughout the paper.

TABLE A1: Domain-specific terminology.
Term Definition
Token Image patch embedded as dd-dimensional vector; 197 tokens per 224×224224{\times}224 image
FFN Feed-forward network: two-layer MLP applied per token in each transformer block
Expert Routed residual branch processing a token subset; in CLEAR-MoE, all residual experts share one direction (W2Wshared)(W_{2}{-}W_{\text{shared}}) and differ only by a cluster-conditioned scalar ses_{e}
Router/Gate Lightweight network mapping each token to a probability distribution over experts
MoE layer FFN replaced by EE experts plus a router; only top-k<Ek{<}E experts activate per token
SVD Singular value decomposition; truncated SVD retains top-rr singular components
kk-means Partitions NN points into kk clusters by minimising within-cluster variance
Calibration set Small representative set used to measure model statistics (no weight updates)
AI Arithmetic intensity: FLOPs per byte of memory accessed
Ridge point AI where compute ceiling equals memory bandwidth ceiling (21.4 FLOPs/B on GTX 960)
p50 Latency Median over 100 repeated forward passes; robust to OS-scheduling outliers
cuBLAS NVIDIA’s optimised linear-algebra library; batched GEMM calls dispatch through it

A-B Dataset Preparation and EDA

Imagenette. 13,394 raw images were scanned using SHA-256 deduplication, robust z-score filtering (τ=3.5\tau{=}3.5), and Isolation Forest (contamination = 0.01). 118 images (0.88%) were removed, yielding a clean set of 9,391 train / 3,885 val (used for EDA analysis only). The standard Imagenette validation split (3,925 images, including the 40 removed as anomalies by our filter) was used for all reported Top-1 accuracy evaluation, ensuring comparability with prior work. ImageNet normalisation and Lanczos-4 resize to 224×224224{\times}224 are applied.

Distribution shift analysis. DeiT-Small penultimate activations for all 3,885 clean val images (shape 3885×3843885{\times}384): PSI = 0.043 (<<0.1, negligible shift), JSD = 0.018, KS pp-values >>0.05 for all 10 classes. Clustering quality. k=10k{=}10 on penultimate image-level activations: Silhouette = 0.542, Davies-Bouldin = 1.24, confirming that the feature space supports class-level cluster separation. Note that CLEAR-MoE clusters token-level intermediate activations with k=4k{=}4, which is a complementary but distinct analysis; the image-level EDA establishes feature quality rather than directly validating token cluster assignments. SVD energy. Rank-192 truncation retains 99.2% of penultimate activation matrix variance; top-10 singular values capture 67.8%. The effect of W2W_{2} truncation rank on accuracy and reconstruction error is evaluated in Table C3.

Table A2 summarises these preprocessing and analysis statistics.

TABLE A2: Dataset preprocessing and EDA statistics.
Statistic Value
Raw Imagenette images 13,394
Removed (anomalous) 118 (0.88%)
Train / val split 9,391 / 3,885 (clean, EDA only); 3,925 (standard, used for all Top-1)
Train-test PSI 0.043 (<<0.1 = negligible shift)
Train-test JSD 0.018
k=10k{=}10 Silhouette score 0.542
k=10k{=}10 Davies-Bouldin 1.24
Rank-192 SVD variance 99.2%
Calibration size 200 images

A-C Algorithm Pseudocode

Algorithms 13 give pseudocode for the three core CLEAR-MoE phases.

Algorithm 1 Phase 2: Layer Scoring and Selection
0: Calibration activations {hl(i)}\{h_{l}^{(i)}\}, LL = 12 layers, select k=6k{=}6
1:for each layer ll do
2:  sp(l)j𝟏(|hl,j|<0.01)numel(hl)\text{sp}(l)\leftarrow\frac{\sum_{j}\mathbf{1}(|h_{l,j}|<0.01)}{\operatorname{numel}(h_{l})}
3:  cl(l)Silhouette(k-means(hl,E))\text{cl}(l)\leftarrow\text{Silhouette}(k\text{-means}(h_{l},E))
4:  se(l)Δlogits/logits\text{se}(l)\leftarrow\|\Delta\text{logits}\|/\|\text{logits}\| (zero FFN output)
5:  𝒮(l)0.4sp+0.4cl0.2se\mathcal{S}(l)\leftarrow 0.4\,\text{sp}+0.4\,\text{cl}-0.2\,\text{se}
6:end for
7: Select top-kk layers by 𝒮\mathcal{S}

Algorithm 2 Phase 3: Expert Extraction (fc1 shared; fc2 decomposed)
0:W2d×dffnW_{2}\in\mathbb{R}^{d\times d_{\text{ffn}}}, activations zz, EE, rank rr
1:[U,Σ,V]SVD(W2)[U,\Sigma,V]\leftarrow\text{SVD}(W_{2});   WsharedUrΣrVrW_{\text{shared}}\leftarrow U_{r}\Sigma_{r}V_{r}^{\top}
2:{C1,,CE}k-means(z,E)\{C_{1},\ldots,C_{E}\}\leftarrow k\text{-means}(z,E)
3:for e{1,,E}e\in\{1,\ldots,E\} do
4:  se1|Ce|iCezi21Ni=1Nzi2s_{e}\leftarrow\frac{\frac{1}{|C_{e}|}\sum_{i\in C_{e}}\|z_{i}\|_{2}}{\frac{1}{N}\sum_{i=1}^{N}\|z_{i}\|_{2}}
5:  Weres(W2Wshared)seW_{e}^{\text{res}}\leftarrow(W_{2}-W_{\text{shared}})\cdot s_{e}
6:end for
Algorithm 3 CLEAR-MoE FFN Inference (Grouped backend)
0: Tokens xx, shared fc1, WsharedW_{\text{shared}}, {Weres}\{W_{e}^{\text{res}}\}, Router gg
1:zGELU(fc1(x))z\leftarrow\text{GELU}(\text{fc1}(x)) (shared, all tokens)
2:hshzWsharedh_{\text{sh}}\leftarrow zW_{\text{shared}}^{\top} (shared SVD basis)
3:aargmax(g(x))a\leftarrow\text{argmax}(g(x)) (expert assignment)
4: Sort zz by aa; get sub-batch boundaries via index search
5:hres[e]zeWeresh_{\text{res}}[e]\leftarrow z_{e}W_{e}^{\text{res}\top} for each expert ee
6: Unshuffle hresh_{\text{res}}; yhsh+hresy\leftarrow h_{\text{sh}}+h_{\text{res}}

Appendix B Experimental Configuration

B-A Hardware and Software Setup

Table B1 lists the complete hardware and software configuration used in all experiments.

TABLE B1: Hardware and software configuration for all experiments.
Item Specification
OS Windows 11 Pro
GPU NVIDIA GeForce GTX 960
GPU memory 4.0 GB GDDR5
Memory bandwidth 112 GB/s
Peak FP32 throughput 2.4 TFLOP/s
Ridge point 21.4 FLOPs/Byte
CUDA 11.8
PyTorch 2.6.0+cu118
Classification backbone DeiT-Small; 22M parameters; patch size 16×\times16; d=384d{=}384; dffn=1536d_{\text{ffn}}{=}1536; 12 transformer blocks
Segmentation backbone Not evaluated (planned future work)
Default EE 4 experts
Default rank rr r=d/2=192r{=}d/2{=}192 for DeiT-Small (50% of matrix rank)
Calibration size 200 images
Router training 5 epochs, AdamW, lr 10310^{-3}, cosine decay
Latency p50 of 100 passes, batch = 8, cuda.synchronize()

Appendix C Extended Experimental Results

C-A Full Ablation Tables

Per-layer composite scores. Full composite scores for all 12 DeiT-Small FFN blocks appear in Table II (Section III). Block 0 ranks last (composite = 0.101) due to sensitivity = 0.946.

Full layer-selection policy ablation. The complete L0–L10 ablation now appears in Table IV (Section IV-B).

C-B Hyperparameter Sensitivity Studies

Tables C1C5 report the full numerical results for the five hyperparameter sensitivity studies summarised in Section IV.

TABLE C1: 3-Seed Robustness. Full composite-score pipeline, E=4E{=}4, DeiT-Small.
Seed Top-1 p50 ms Router Acc Skew
42 86.68% 85.05 0.925 0.099
123 86.70% 83.90 0.928 0.116
456 86.73% 78.89 0.928 0.101
Mean±\pmStd 86.70±\pm0.02% 82.6±\pm2.7 0.927±\pm0.002 0.105±\pm0.008
TABLE C2: Calibration-set sensitivity. Multiple random subsets per size.
NcalN_{\text{cal}} Sub. Top-1 Router Acc Skew
50 1 86.70% 0.867 0.139
50 2 86.68% 0.833 0.073
50 3 86.65% 0.856 0.139
100 1 86.55% 0.896 0.094
100 2 86.60% 0.908 0.150
100 3 86.70% 0.899 0.153
200 1 86.68% 0.926 0.099
200 2 86.60% 0.928 0.128
500 1 86.70% 0.953 0.098
TABLE C3: SVD rank sweep. Seed 42, E=4E{=}4, composite scoring.
Rank rr Top-1 p50 ms Recon Err
16 86.75% 78.13 0.907
32 86.65% 76.79 0.841
64 86.65% 76.60 0.734
96 86.75% 78.07 0.643
128 86.68% 77.82 0.562
192 86.62% 76.01 0.420
256 86.70% 77.36 0.294
Refer to caption
Figure C1: SVD rank vs. Top-1 accuracy (left axis, blue) and reconstruction error (right axis, red). Accuracy spans 0.13 pp across r{16,,256}r\in\{16,\ldots,256\} while reconstruction error drops from 0.907 to 0.294, confirming rank does not gate accuracy once the shared basis captures the dominant singular subspace.
TABLE C4: Expert-count study. Seed 42, composite scoring, k=6k{=}6.
EE Top-1 p50 ms Rtr Acc Skew Empty
2 86.65% 73.4 0.967 0.247 0
4 86.65% 77.5 0.923 0.074 0
8 86.80% 84.3 0.893 0.042 0
16 86.39% 97.8 0.865 0.024 0
TABLE C5: Router architecture comparison. Seed 42, E=4E{=}4, composite scoring. Entropy: aggregate expert-usage entropy (distribution of tokens across experts, per layer; max =ln41.386=\ln 4\approx 1.386). Note: this is the load-balance entropy, distinct from the per-token predictive entropy in Fig. 3 (\approx0.40 nats), which measures router confidence.
Router Top-1 p50 ms Rtr Acc Skew Entropy Params
Linear 86.68% 78.0 0.925 0.099 1.296 9,240
MLP 86.68% 75.6 0.980 0.095 1.300 149,400
Adaptive 86.68% 77.6 0.936 0.099 1.296 9,240

C-C Dispatch Benchmark and Parallel Scaling

Table C6 benchmarks all three CUDA backends under four token-load imbalance levels (B=8B{=}8, T=1568T{=}1568 tokens, E=4E{=}4, GTX 960); CPU Serial is included as a baseline, allowing direct GPU-vs-CPU comparison at each imbalance level. The roofline plot appears in Fig. 2 (Section IV). Table C7 projects multi-device throughput analytically.

TABLE C6: Dispatch micro-benchmark: p50 latency (ms) and throughput (K tok/s) across imbalance levels. T=1568=8×196T{=}1568=8{\times}196 tokens (class token excluded from expert dispatch; processed on the dense path). Calibration uses 197 tokens/image including the class token.
0% imb. 40% imb. 60% imb. 80% imb.
Backend ms K/s ms K/s ms K/s ms K/s
CPU Serial 6.01 261 7.22 217 5.96 263 5.96 263
Naive 3.67 428 3.55 442 3.50 448 3.59 437
Grouped 2.36 665 2.10 747 2.30 681 2.14 733
cuBLAS 2.15 728 2.29 684 2.45 639 2.81 558
Refer to caption
Figure C2: Dispatch throughput vs. token-load imbalance for three CUDA backends (B=8B{=}8, T=1568T{=}1568, E=4E{=}4, GTX 960). Grouped is stable across all imbalance levels (665–747 K tok/s); cuBLAS degrades 23% from 0% to 80% imbalance due to padding waste. For variable-load workloads, Grouped is the recommended backend.

GPU vs. CPU and Parallel Scaling.

GPU vs. CPU speedup at balanced load is readable from the CPU Serial row in Table C6 above: at 0% imbalance, cuBLAS achieves 728 K tok/s vs. 261 K tok/s for CPU Serial (2.79×2.79\times speedup).

Multi-device projection (simulated; no second GPU available). All-to-all and AllReduce costs are modelled analytically using PCIe Gen3×\times16 bandwidth (16 GB/s). Results are not measured NCCL runs.

TABLE C7: Parallel scaling projection. Expert/Data/Pipeline parallelism are simulated; single-GPU is measured.
Mode tok/s Devices Speedup vs GPU-1 Efficiency
Full-model single-GPU 225K 1 1.00×\times 1.00
Expert Parallel EP-2 (sim.) 175K 2 0.78×\times 0.39
Pipeline Parallel PP-2 (sim.) 8K 2 0.03×\times 0.02
All multi-device entries are simulated via PCIe Gen3×\times16 bandwidth; no NCCL runs performed.
EP-2 overhead: all-to-all token redistribution (AI = 0.1 F/B).
PP-2: 50% pipeline bubble with 1 micro-batch, 2 stages.
Single-GPU throughput is full-model (distinct from dispatch-only throughput in Table C6).