Distributed, Parallel, and Cluster Computing

Showing new listings for Tuesday, 30 June 2026

Total of 46 entries

[1] arXiv:2606.28889 [pdf, other]: Title: Concurrent Splay-Based Tree

Vitaly Aksenov, Rene van Bevern, Artem Shilkin

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)

Most work on efficient concurrent ordered indices, such as concurrent binary search trees, B-trees, skip lists, etc., has focused on data structures that provide good worst-case guarantees. In real workloads, objects are often accessed at different rates, since access distributions may be non-uniform. Many efficient distribution-adaptive data structures exist in the sequential case; however, they are often complicated to make efficient in the concurrent case.
The most prominent distribution-adaptive data structure is the Splay Tree. Its most important advantage is that it does not store any balancing information and provides a reasonable performance improvement on extremely skewed workloads, such as Zipfian workloads. This paper proposes a splay-like rotation design for concurrent binary search trees. Instead of moving an accessed node to the root, rotations use two depth thresholds that are based on the static-optimality complexity computed from the number of accesses to the node: a node is rotated only when it is substantially deeper than the upper threshold, and rotations of the node stop before reaching the lower threshold. This design aims to preserve the main practical benefit of splaying on skewed workloads while reducing contention near the root.
We present two variants of the rotation design: one using an exact 64-bit access counter per node and one using a 6-bit approximate counter. We prove static optimality for the corresponding sequential read-only tree and evaluate both rotation designs by implementing them on top of the concurrent AVL tree of Bronson et al. Our experiments show that the approach can improve throughput on several skewed workloads.
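The two-threshold rule described in this abstract can be pictured with a small sketch. It is not the authors' algorithm: it assumes the static-optimality target depth of a node is log2(total accesses / node accesses), and the `upper_slack` parameter and both function names are hypothetical choices made for illustration only.

```python
import math


def ideal_depth(node_accesses: int, total_accesses: int) -> float:
    """Assumed static-optimality target depth: log2(total / node accesses)."""
    return math.log2(total_accesses / node_accesses)


def should_rotate(depth: int, node_accesses: int, total_accesses: int,
                  upper_slack: float = 2.0) -> bool:
    """Hypothetical trigger: rotate only when the node sits well below its target.

    upper_slack stands in for "substantially deeper than the upper threshold";
    its value here is an assumption, not a number from the paper.
    """
    return depth > ideal_depth(node_accesses, total_accesses) + upper_slack
```

Under these assumptions, a node holding a quarter of all accesses has target depth 2, so it would be rotated upward from depth 10 but left alone at depth 3, which keeps hot nodes from all competing for the root.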
[2] arXiv:2606.28972 [pdf, html, other]: Title: Five Ways to Build a Concurrent Linked List: From Coarse-Grain Locking to Lock-Free Algorithms

Zeeshan Mohammed Rangrej

Comments: 9 pages, 5 pseudocode optimization techniques

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)

Linked lists are one of the most basic data structures in computer science. But when many threads try to use the same linked list at the same time, things get complicated. In this paper, we look at five different ways to make a linked list work correctly and efficiently with multiple threads running at once. We start with the simplest approach -- one big lock for the whole list -- and step by step improve it, ending with a lock-free design that uses no locks at all. We implemented all five versions in C++ and measured how fast each one is across different workloads (read-heavy, balanced, and write-heavy) and different list sizes. Our results show that the right choice of algorithm depends heavily on how the list is used: the coarse-grain and lazy lists win under read-heavy workloads with small key ranges, while the lock-free list becomes competitive when key ranges are large and more threads are running. Fine-grain locking, despite its theoretical appeal, pays a heavy cost from per-node lock overhead and consistently performs the worst in our tests.
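As a rough illustration of the simplest of the five designs, the sketch below shows a sorted linked-list set guarded by one lock for the whole list. It is written in Python rather than the paper's C++, and the class and method names (`CoarseLockedList`, `snapshot`, and so on) are assumptions made for this example, not the authors' code.

```python
import threading


class _Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


class CoarseLockedList:
    """Sorted set on a singly linked list; every operation takes one global lock."""

    def __init__(self):
        self._head = _Node(None)        # sentinel node, never holds a key
        self._lock = threading.Lock()

    def _find_pred(self, key):
        # Caller must hold the lock. Returns the last node whose key is < key.
        pred = self._head
        while pred.next is not None and pred.next.key < key:
            pred = pred.next
        return pred

    def add(self, key):
        with self._lock:
            pred = self._find_pred(key)
            if pred.next is not None and pred.next.key == key:
                return False            # already present
            pred.next = _Node(key, pred.next)
            return True

    def remove(self, key):
        with self._lock:
            pred = self._find_pred(key)
            curr = pred.next
            if curr is None or curr.key != key:
                return False            # not present
            pred.next = curr.next
            return True

    def contains(self, key):
        with self._lock:
            curr = self._find_pred(key).next
            return curr is not None and curr.key == key

    def snapshot(self):
        with self._lock:
            keys, node = [], self._head.next
            while node is not None:
                keys.append(node.key)
                node = node.next
            return keys
```

Because traversal and update both happen under the same lock, correctness is easy to argue, but no two operations ever overlap. The finer-grained and lock-free variants in the abstract trade that simplicity for more parallelism.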
[3] arXiv:2606.29052 [pdf, html, other]: Title: Importance-Aware Resource Allocation for Collaborative Task-Oriented Semantic Communication

Kaiyi Lei, Yuanzhe Peng, Letian Zhang, Jie Xu

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Task-oriented semantic communication must allocate scarce radio resources to semantic features under fast fading wireless conditions and strict end-to-end latency budgets. Existing solutions are either optimization-heavy, leading to prohibitive computational overhead during online operation, or rely on end-to-end retraining procedures together with slowly varying channel assumptions. We propose iCoTASC (importance-aware Collaborative Task-Oriented Semantic Communication), a hybrid offline-online framework designed for collaborative multi-device semantic communication systems. iCoTASC leverages attribution-based importance to guide per-dimension embedding selection as a practical communication control signal, models diminishing semantic returns of quantization through a data-driven utility function, and precomputes per-transmitter utility lookup tables offline, which together enable lightweight online scheduling via table lookup and low-complexity refinement under time-varying channels. The proposed framework supports real-time, channel-adaptive semantic resource allocation in distributed systems without requiring retraining of the underlying task inference model.
[4] arXiv:2606.29078 [pdf, html, other]: Title: Are There Manufacturer Differences in Hard-Drive Reliability?

Christoph Siemroth, Yeomyung Park

Comments: Accepted to IEEE Transactions on Cloud Computing. Copyright 2026 IEEE

Journal-ref: in IEEE Transactions on Cloud Computing, vol. 14, no. 2, pp. 1015-1024, April-June 2026

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)

Based on the Backblaze hard disk drive (HDD) dataset, we analyze whether the four major HDD manufacturers represented in the dataset -- HGST, Seagate, Toshiba, Western Digital (WD) -- show differences in short- to medium-term HDD failure rates. Using two different duration regression models, we find -- holding constant drive age, capacity, form-factor, and drive temperature -- that Toshiba's failure rate is slightly above Seagate's. HGST HDD failure rates are the lowest, about 41% of Seagate's. WD HDD failure rates are significantly above HGST's, but still only about 52% of Seagate's. We also document the effects of age, capacity, temperature and drive location on failure rates.
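The relative rates quoted above can be combined with a line of arithmetic. The sketch below normalizes Seagate's failure rate to 1.0 and derives the WD-to-HGST ratio from the two rounded percentages in the abstract; that derived ratio is an inference from rounded figures, not a number reported by the authors.

```python
# Failure rates relative to Seagate (normalized to 1.0), as quoted in the abstract.
seagate = 1.0
hgst = 0.41 * seagate   # "about 41% of Seagate's"
wd = 0.52 * seagate     # "about 52% of Seagate's"

# Derived here, not reported in the abstract: WD relative to HGST.
wd_over_hgst = wd / hgst   # roughly 1.27
```

On these rounded inputs, WD's rate comes out about 27% above HGST's while both stay well below Seagate's.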
[5] arXiv:2606.29207 [pdf, html, other]: Title: KernelFlume: Elastic Core-Attention Scaling for Agentic Long-Context Decoding

Guangyu Xiang, Xueze Kang, Lin Zhang, Wenxiang Lin, Shaohuai Shi, Yuxin Wang, Xiaowen Chu

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

LLM serving is increasingly dominated by long and dynamic decode workloads from agents, reasoning models, and extended conversations. When bursty long-context demand exceeds deployed capacity, existing serving systems typically scale out by launching additional serving instances with model replicas. This instance-level elasticity increases KV capacity only by provisioning another full copy of the model, inheriting startup latency, memory overhead, and batch fragmentation.
We present KernelFlume, a decode-centric architecture that disaggregates the stable projection/FFN path from core-attention computation: weight nodes execute dense projection/FFN kernels, while weightless attention nodes store token-range KV partitions and scale with request-state demand. To make this separation elastic, KernelFlume maintains a routing table that maps token ranges to attention-node endpoints. It updates routes at token boundaries and uses host-visible graph signals to drive pre-registered UCX endpoint communication outside the captured CUDA Graph. To preserve low per-token latency after disaggregation, KernelFlume combines query-first core-attention dispatch with inter-layer kernel pipelining, overlapping remote attention and communication with local projection/FFN work. On real GPU testbeds (intra-node A6000 and cross-node H100), under a dynamic long-context agentic workload serving Llama-3.1-8B, KernelFlume sustains flat p99 TPOTs of ~74 ms on A6000 and ~34 ms on H100, while lowering cost per million output tokens by up to 32% and 61%, respectively, relative to full-instance elastic scaling with ServerlessLLM, a state-of-the-art instance-startup method. Replaying the same trace at larger model scale in simulation projects a 56--66% cost reduction over ServerlessLLM, widening to 80--85% with cheaper heterogeneous attention-node hardware and persisting into the million-token context range.
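A routing table that maps token ranges to attention-node endpoints could look like the sketch below. It is not KernelFlume's data structure: the class name, the endpoint strings, and the convention that each range runs from its start offset up to the next registered start are all assumptions made for illustration.

```python
import bisect


class TokenRangeRouter:
    """Maps a token position to the endpoint owning the range that contains it."""

    def __init__(self):
        self._starts = []      # sorted start offsets of token ranges
        self._endpoints = []   # endpoint for the range starting at the same index

    def add_range(self, start: int, endpoint: str) -> None:
        i = bisect.bisect_left(self._starts, start)
        if i < len(self._starts) and self._starts[i] == start:
            self._endpoints[i] = endpoint      # re-route an existing range
        else:
            self._starts.insert(i, start)
            self._endpoints.insert(i, endpoint)

    def lookup(self, token_pos: int) -> str:
        # Rightmost range whose start is <= token_pos.
        i = bisect.bisect_right(self._starts, token_pos) - 1
        if i < 0:
            raise KeyError(f"no range covers token {token_pos}")
        return self._endpoints[i]
```

Lookups are a binary search, and re-routing a range is a single slot update, which fits the idea of changing routes only at token boundaries.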
[6] arXiv:2606.29483 [pdf, html, other]: Title: Fog Computing and Large Language Models: A vision for the mutual beneficiaries

Satish Narayana Srirama

Comments: Paper accepted for publication at IEEE Computer Magazine

Journal-ref: IEEE Computer, ISSN: 0018-9162, 2026

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Fog computing utilizes proximal computational resources for sensor data processing and actuation, and addresses the latency, network load, and privacy issues of the cloud-centric Internet of Things. Large Language Models (LLMs), on the other hand, are deep learning models trained on enormous text corpora that perform natural language processing tasks such as translation, question answering, text summarization, and code generation. LLMs are generally cloud-centric, requiring abundant GPU memory and computing capabilities, and so face the same issues that led to fog computing. This creates a need for LLM support in the proximity, on fog infrastructure, which requires LLM optimizations such as parameter-weight quantization, pruning, and low-rank adaptation. Meanwhile, fog computing benefits from LLMs' code generation ability in the dynamic deployment of fog-based applications. The paper addresses how fog computing and LLMs can be mutual beneficiaries, discussing the state of the art and future research scope.
[7] arXiv:2606.29629 [pdf, html, other]: Title: Energy-Efficient Multimodal Inference Serving with Tri-serve

Ziyang Jia, Sara Rashidi Golrouye, Laxmi Bhuyan, Benjamin Kubwimana, Devashree Tripathy, Zexin Li, Cong Liu, Daniel Wong

Comments: 9 pages, 9 figures. Submitted to ICCD 2026

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Multimodal model inference creates substantial energy demand with growing performance requirements. Within GPUs, power is autonomously managed by an on-board power management unit (PMU), which makes frequency boosting/throttling decisions. However, we find that these hardware-managed frequency decisions can cause significant power inefficiency.
This work identifies three classes of power inefficiencies within modern multimodal inference serving: (1) inter-stage dependency stalls run at near maximum frequency despite being idle; (2) anti-correlation between auto-boost frequency and arithmetic intensity (A.I.) results in compute-bound phases (e.g., prefill) running at lower frequency and vice versa; and (3) thermal throttling degrades SM frequency and throughput.
We propose Tri-serve, a software-based DVFS controller that jointly accounts for three classes of inefficiency -- inter-stage Dependency stalls, the Arithmetic-intensity effect on frequency and power, and the Thermal-throttling effect of high A.I. phases -- to deliver energy-efficient multimodal serving on commodity GPUs. We show that Tri-serve achieves 22% energy efficiency improvement with no latency or throughput impacts.
[8] arXiv:2606.29651 [pdf, html, other]: Title: NI-ORCA: A Parallel Algorithm for Counting the Orbits of Non-Induced Graphlets up to K4

Syed Ibtisam Tauhidi, Arindam Karmakar, Thai Son Mai, Hans Vandierendonck

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Counting the orbits of graphlets in a network is a vital tool for understanding the structural roles of vertices in various graph analytics tasks. While existing algorithms efficiently compute orbits of induced graphlets, many real-world applications require non-induced orbit counts. However, no current method offers exact, scalable, and parallel support for non-induced orbit counting. This paper presents NI-ORCA, a parallel algorithm to efficiently compute the orbits of non-induced graphlets up to size four (4-clique). NI-ORCA extends the ORCA framework for non-induced orbit counting by reformulating a system of linear equations. The algorithm consists of three stages: triangle counting, 4-clique enumeration, and orbit solving. We design and implement stage-specific parallelisation strategies using thread and vertex-local memory models and data structures, minimising contention and balancing workload. We further analyse the impact of scheduling policies, chunk sizes, and affinity strategies on performance. Experimental analysis on eight real-world datasets and a series of synthetic Erdős-Rényi graphs demonstrates that a mixed mode combining stage-specific data structures with dynamic scheduling and small chunk sizes delivers consistent speedup and effective load balancing. Our results show that NI-ORCA significantly outperforms state-of-the-art sequential algorithms, achieving up to 30x speedups.
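The difference between induced and non-induced counts can be seen on the smallest case, a 3-vertex path centered at a vertex. The sketch below is a sequential toy, not NI-ORCA: the function names and the example graph are assumptions. It uses the fact that the non-induced count at a vertex of degree d is C(d, 2), while the induced count excludes neighbor pairs that are themselves adjacent (that is, triangles).

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def triangles_at(adj, v):
    """Number of triangles containing v: adjacent pairs among v's neighbors."""
    return sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])


def noninduced_paths_centered_at(adj, v):
    """Every pair of neighbors forms a 3-vertex path through v: C(deg, 2)."""
    d = len(adj[v])
    return d * (d - 1) // 2


def induced_paths_centered_at(adj, v):
    """Induced paths exclude neighbor pairs that close a triangle."""
    return noninduced_paths_centered_at(adj, v) - triangles_at(adj, v)


# Example graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
EDGES = [(0, 1), (1, 2), (0, 2), (2, 3)]
ADJ = build_adjacency(EDGES)
```

For vertex 2 in this graph the non-induced count is 3 and the induced count is 2, because the pair (0, 1) closes a triangle. Relations of this kind between counts are what a linear-equation formulation exploits.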
[9] arXiv:2606.29708 [pdf, html, other]: Title: Demystifying the Design Space and Best Practices for Heterogeneous LLM Inference and Serving

Zhixin Wang, Zhengbo Wang, Fangcheng Fu, Yinhui Lu, Jinlong Hou, Yijie Chen, Xiaowei Shen, He Liu, Xiangbin Li, Jun Chen, Ruya Gu, Dian Wang, Zhou Tan, Yuan Cheng, Hongzhou Zhang, Xiangjun Huang, Ping Zhang, Xiaohe Hu

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Heterogeneous prefill-decode (PD) inference is now in production: prefill on cost-efficient or supply-available accelerators, decode on bandwidth-strong ones, and KV state crossing mixed interconnects in mixed numerical formats. Each deployment makes these decisions on its own. What is missing is the picture across configurations -- which decisions must be made jointly at the PD boundary, and which can be made independently. We propose a design space organized along four design axes -- accelerator, precision, interconnect, and KV residency -- and the workload regime (stage pressure) they respond to. We show that only a subset of interactions among these factors become binding constraints once PD inference becomes heterogeneous. These interactions surface through three recurring boundary decisions: compute placement, KV representation, and KV ownership. The resulting analysis yields concrete guidance. Precision policy belongs to runtime roles rather than to a single system-wide setting, because the same low-bit format relieves different bottlenecks on each side of the boundary. KV transfer engines move bytes rather than tensor semantics, making representation compatibility an explicit boundary concern whenever producer and consumer differ. The KV handoff also carries a lifecycle -- reservation, release, and failure recovery -- that spans prefill and decode and requires explicit ownership. Two further interactions remain open. Cross-vendor and interconnect-related claims are stated as design guidance grounded in industrial deployment observations and source-code inspection of the runtimes involved.
[10] arXiv:2606.29775 [pdf, html, other]: Title: SMART-MIG: A Learning Framework for Scalable and Energy-Efficient GPU Scheduling

Wenqing Yu, Neel Karia, Tanvi Hisaria, Clifford Stein, Olivier Tardieu, Asser Tantawi

Comments: 14 pages, 13 figures, paper accepted at 40th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2026)

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

The emergence of Multi-Instance GPU (MIG) technology enables us to run smaller machine learning models on partitions of a GPU rather than the entire device, thus improving utilization and reducing energy consumption, albeit with potential performance trade-offs. Meanwhile, the growing energy demands of GPU-equipped data centers motivate the development of online partitioning and scheduling schemes that not only ensure fast job processing but also achieve high energy efficiency. However, achieving energy-tardiness efficiency with manageable algorithmic complexity in large-scale scheduling remains a great challenge, due to the dual objectives of deciding on the GPU partitions and scheduling jobs onto the slices of the heterogeneous partitions. To address this challenge, we propose SMART-MIG, a parallel computing system that combines Mean-Field Multi-Agent Reinforcement Learning (MF-MARL) for large-scale MIG repartitioning with tailored heuristic algorithms for job scheduling. We demonstrate that the complexity of the repartitioning component remains constant even as the number of jobs and GPUs increases. We also establish theoretical lower bounds on energy consumption and tardiness to rigorously benchmark system performance. Finally, extensive experiments show that SMART-MIG improves the energy-tardiness efficiency by 18% compared to its corresponding static-partitioning counterpart, while being only 27% above the theoretical lower bound on energy consumption.
[11] arXiv:2606.29982 [pdf, html, other]: Title: Beyond Uniform Experts: Cost-Aware Expert Execution for Efficient Multi-Device MoE Inference

Hui Zang, Pengfei Xia, Hong Liu, Jiajia Chu, Tuo Hao, Minghao Chen, Rui Zhang, Ziyang Zhang

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Mixture-of-Experts (MoE) architectures enable language models to achieve unprecedented scale via sparse activation. However, their inference performance is often limited by data movement bottlenecks. Two coupled challenges exacerbate this limitation: (1) Importance-Agnostic Cost: Low-contribution experts incur nearly uniform memory and transfer costs, resulting in a low cost-to-benefit ratio and wasting critical bandwidth; (2) System-Level Imbalance: Multi-device deployments are universally bottlenecked by the slowest device, meaning that local reductions on one device may yield no improvement in end-to-end latency. We propose Cost-Aware Expert Execution (CAEE), a hardware-guided runtime framework that jointly optimizes for token-level expert importance and system-level execution cost. CAEE uses lightweight, calibrated cost models to estimate hardware overhead, selectively prunes low-importance, high-cost experts, and redistributes their contributions via a low-overhead compensation mechanism, avoiding extra data movement. Evaluations on the 671B DeepSeek-R1 model show that CAEE can reduce end-to-end inference latency by 8%-18% across diverse deployment settings, including expert offloading and on-device execution on multi-device systems, while maintaining a model accuracy drop of less than 1%.
[12] arXiv:2606.30197 [pdf, other]: Title: FBench: A Flexible Benchmark for CFG-Based What-If Exploration of HPC I/O Patterns

Zhaobin Zhu, Chen Wang, Kathryn Mohror, Sarah Neuwirth

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)

The I/O performance of large-scale HPC applications depends on a complex interplay of access patterns, middleware optimizations, and file system configurations. To systematically explore these effects without repeatedly rerunning full applications, we introduce FBench, a flexible and code-transparent benchmarking tool for what-if analysis and I/O performance exploration. FBench leverages context-free grammars (CFGs) derived from Recorder traces to either generate simplified global configuration files for benchmark execution or replay I/O patterns on-the-fly without additional preprocessing. It supports both POSIX and MPI-IO interfaces and allows users to inject optimization hints via JSON configuration files, enabling rapid experimentation with I/O settings without code changes. Our evaluation shows that FBench accurately reproduces I/O behavior for both synthetic and real workloads, capturing access patterns and performance trends across diverse optimizations and file system settings. For IOR and HACC-IO, FBench closely matches scaling behavior and sensitivity to Lustre striping parameters. For FLASH Sedov, it reveals that collective I/O on Lustre can yield up to 30x lower write bandwidth than independent I/O, largely independent of striping, and that switching to a burst buffer file system increases non-collective write bandwidth by about 1.5x without additional tuning. The evaluation with LAMMPS shows that FBench can significantly reduce the time required for what-if analyses and, with simple tuning, enable improvements of up to 8x.
[13] arXiv:2606.30391 [pdf, html, other]: Title: Energy-Aware Scheduling for Serverless LLM Serving on Shared GPUs

Tianyu Wang, Gourav Rattihalli, Aditya Dhakal, Longfei Shangguan, Dejan Milojicic

Comments: 13 pages body and 5 pages appendix, 19 pages total

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

As LLM inference becomes a major cloud workload, its growing energy footprint makes cluster-wide energy optimization increasingly important. Serverless LLM serving helps platforms absorb traffic volatility by elastically sharing GPU resources across models, but this sharing also makes energy optimization difficult. Multiple co-resident models run under one device-wide operating point, while their resource demands and latency slack change across execution phases and load conditions. As a result, minimizing energy requires coordinated scheduling across request placement, runtime resource adaptation, and workload consolidation.
We present Festina, a profiling-guided, power-aware control plane to minimize cluster-wide energy for serverless LLM serving. Unlike common global-local schedulers that focus on throughput or tail latency, Festina makes energy-first decisions by jointly coordinating request placement, SM partitioning, and GPU operating points under TTFT/TBT SLOs. In our system, a lightweight global scheduler performs fast, SLO-safe, energy-aware placement using constant-time lookups from offline profiles and GPU state summaries. On each GPU, a phase-aware local scheduler continuously adapts task batching and compute resources to minimize power consumption. Festina further performs energy-aware workload consolidation to reduce GPUs' static power consumption via SLO-aware migration. Comparison with four SOTA LLM serving systems and one DVFS-augmented system demonstrates that Festina reduces energy consumption by up to 56% while maintaining parity in SLO attainment (within a 2% margin).
[14] arXiv:2606.30419 [pdf, html, other]: Title: Analyzing Linearizability in Relativistic Distributed Systems

Kahbod Aeini, Wojciech Golab

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Einstein's theory of relativity correctly predicted that time is relative, and subject to both kinematic and gravitational dilation. Therefore, executions of distributed systems cannot always be modeled as sequences of events totally ordered according to wall clock time. To address this fundamental problem, Gilbert and Golab formulated a generalization of Herlihy and Wing's linearizability property for shared objects, which they called relativistic linearizability, and introduced a collection of theoretical tools to facilitate rigorous analysis. While they conjectured that several widely-studied classically linearizable algorithms are also relativistically linearizable, their work stopped short of presenting formal proofs of correctness, as pointed out recently by Jayanti. In this paper, we explain how Gilbert and Golab's techniques can be used to establish relativistic linearizability for a replicated state machine, as well as variations of the widely studied read/write register construction of Attiya, Bar-Noy and Dolev (ABD). Our results establish a stronger form of relativistic linearizability than Jayanti's central theorem for these asynchronous algorithms.
[15] arXiv:2606.30497 [pdf, other]: Title: GPU Parallelization Strategies for Forward and Backward Propagation in Shallow Neural Networks: A CUDA-Based Comparative Study

Rania Zitouni, Nadine Bousdjira, Sarah Hasnaoui, Amel Sadoun, Fatma Salhi

Comments: 7 pages, 5 figures. Technical report, ESI Algiers, 2025--2026

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

We present a comparative study of CUDA optimization strategies applied to forward and backward propagation in a shallow neural network. Three stacked optimizations are evaluated: (1) tiled shared memory with bank-conflict elimination via +1-column padding, (2) pre-transposed weight matrices for coalesced global memory access, and (3) a fused MatMul+ReLU kernel that eliminates intermediate global-memory round-trips. Experiments on an NVIDIA Tesla T4 (CUDA 13.0) across three dataset sizes show that the fully optimized implementation achieves a 1.41x speedup over the baseline CUDA version on the large dataset (25,600 samples), reducing execution time from 21.0s to 14.8s. Results are compared against a sequential CPU baseline and an OpenMP parallel implementation, demonstrating the effectiveness of memory-access optimization in GPU-accelerated deep learning primitives.
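The effect of the +1-column padding mentioned in this abstract can be checked with plain index arithmetic. The sketch below assumes the usual NVIDIA shared-memory layout of 32 banks of 4-byte words, so the bank of a float is its word index modulo 32; the function name and the 32x32 tile size are assumptions for illustration, not details taken from the report.

```python
BANKS = 32  # assumed: 32 shared-memory banks, one 4-byte word wide each


def banks_hit(row_stride: int, col: int, rows: int = 32) -> set:
    """Banks touched when 32 threads each read one element of the same column.

    Thread r reads tile[r][col], stored at word index r * row_stride + col.
    """
    return {(r * row_stride + col) % BANKS for r in range(rows)}


unpadded = banks_hit(row_stride=32, col=5)   # tile declared as [32][32]
padded = banks_hit(row_stride=33, col=5)     # tile declared as [32][33]
```

With a row stride of 32, all 32 threads land in a single bank and the column read is serialized. With a stride of 33, the bank index becomes (r + col) mod 32, so each thread hits a different bank.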
[16] arXiv:2606.30533 [pdf, other]: Title: Spandana: Reconciling Strict SLOs with Low Cost under Fine-Grained Load Fluctuations

Dilina Dehigama, Shyam Jesalpura, Zeyu Xu, Marton Nemeth, Shengda Zhu, Marios Kogias, Boris Grot

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Cloud-based online services face significant sub-second load fluctuations while needing to meet strict Service Level Objectives (SLOs). Cluster operators often over-provision resources to protect SLOs, sacrificing utilization and cost efficiency. Existing reactive and proactive autoscalers, serverless (FaaS) deployments, and VM/FaaS hybrid systems fail to reconcile strict SLO compliance with low cost and high utilization under fine-grained load fluctuation.
We introduce Spandana, an architecture that addresses this trade-off by decoupling SLO enforcement from cost optimization. A lightweight controller colocated with each application VM enforces SLOs by steering each arriving request between the VM and FaaS. Requests that can meet the SLO stay on the VM; the remaining requests are forwarded to a stock FaaS layer such as AWS Lambda. For cost optimization, Spandana's resource allocator determines the most-efficient VM provisioning by accounting for VM cost, FaaS cost, and traffic volatility, allowing the VM pool to run at high utilization. Our evaluation shows that Spandana maintains strict SLO adherence, achieves 76-86% CPU utilization, and reduces cost by 5-44% over three SOTA baselines.
[17] arXiv:2606.30563 [pdf, html, other]: Title: Data Replication Meets Function Scheduling in the Edge-Cloud Continuum

Matteo Cenzato, Dario d'Abate, Arianna Dragoni, Matteo Briscini, Alessandro Margara

Comments: To be submitted to Journal of Parallel and Distributed Computing

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Serverless computing is an appealing model for the edge-cloud continuum, but its stateless assumption breaks down once functions need persistent data: fetching state from a distant cloud store erases the latency benefit of running at the edge. Keeping data close means replicating it, and replication forces a placement decision that is coupled with where functions execute and with the consistency each application demands. We study this joint problem of function scheduling and data placement under two consistency models, strong and eventual replication. We first formulate it as a Binary Linear Program that yields the optimal placement for a given system snapshot, and use it as a reference point. Because the solver does not scale past a few hundred nodes, we add two heuristics with progressively less information: a Global-View greedy method that works from the same complete snapshot, and an Aggregated-View heuristic in which each node decides from locally observed demand alone. Across a range of system sizes the Global-View heuristic stays within a few percent of the optimum while scaling to over $10^4$ nodes. The Aggregated-View heuristic sacrifices some solution quality, but adapts continuously to each invocation. Under client mobility, centralized policies suffer from stale snapshots and recurring latency spikes, while the Aggregated-View maintains low and stable client-observed latency. Across all experiments, data placement proves more influential than function scheduling in determining the outcome.

[18] arXiv:2606.28516 (cross-list from cs.CV) [pdf, html, other]: Title: CLEAR-MoE: Shared-Basis Expert Extraction from Frozen Vision Transformers via Calibration-Driven Layer Selection

Md Irtiza Hossain, Humaira Ayesha, Junaid Ahmed Sifat

Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

We present CLEAR-MoE, a four-phase post-training pipeline that converts a frozen pretrained Vision Transformer (ViT) into a sparse Mixture-of-Experts (MoE) model without updating backbone weights. The pipeline (i) scores feed-forward network (FFN) layers by sparsity, clusterability, and output sensitivity; (ii) decomposes selected layers into a shared low-rank SVD basis and per-cluster residual experts using k-means clustering; (iii) trains lightweight routers supervised by cluster labels; and (iv) dispatches tokens through pluggable CUDA backends. On Imagenette with DeiT-Small, CLEAR-MoE retains 99.9% of the dense model's accuracy (86.70 +/- 0.02% versus 86.73%). Extensive ablation studies reveal a consistent empirical finding: the shared SVD basis is the primary factor responsible for preserving accuracy. Random routing, learned routing, and three different router architectures produce nearly identical performance, with accuracy varying by at most 0.06 percentage points (86.62%-86.68%). Accuracy also remains stable across different SVD ranks, expert counts (2-8), calibration set sizes (50-500), and random seeds. This behavior generalizes across five ViT backbones (DeiT-Tiny, DeiT-Small, DeiT-Base, ViT-Small, and ViT-Base), covering models from 5.7M to 86.6M parameters, with accuracy differences <= 0.10 percentage points from their dense counterparts. On a GTX 960 GPU, routing and scatter-gather overhead make the CLEAR-MoE FFN 1.3-1.7x slower than the dense implementation. A dispatch microbenchmark further shows that routing is an order of magnitude more memory-bound than expert matrix multiplications, identifying fused dispatch kernels as a promising direction for future optimization.
[19] arXiv:2606.28534 (cross-list from physics.plasm-ph) [pdf, html, other]: Title: High-Performance Resilient Multi-GPU Hybrid Particle-in-Cell Monte Carlo Simulations at Scale

Jeremy J. Williams, Stefan Costea, David Tskhakaya, Leon Kos, Ales Podolnik, Jakub Hromadka, Jordy Trilaksono, Yi Ju, Kallia Chronaki, Evangelos Gkolantas, Vassilis Papaefstathiou, Allen D. Malony, Sameer Shende, Frank Jenko, Erwin Laure, Stefano Markidis

Comments: Accepted by the Euro-Par 2026 workshops (BIGHPC 2026), prepared in the standardized Springer LNCS format and consists of 12 pages, which includes the main text, references, and figures

Subjects: Plasma Physics (physics.plasm-ph); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Computational Physics (physics.comp-ph)

The increasing demand for high-performance computing in plasma physics has driven scalable and resilient simulation methods capable of efficiently exploiting modern multi-GPU architectures. This work extends a portable hybrid MPI+OpenMP implementation of BIT1, focusing on high-performance resilience for accelerated Particle-in-Cell (PIC) Monte Carlo (MC) simulations under both uniform and non-uniform load conditions. Scalable particle load balancing and robust checkpoint/restart mechanisms across Nvidia and AMD accelerators are integrated with standardized I/O using openPMD and ADIOS2. This leverages BP4 for high-performance file-based checkpointing and SST for in-memory data streaming, enabling efficient data movement, resilient large-scale execution, seamless continuation from existing checkpoints, and effective handling of computational and I/O workloads. Advanced HPC profiling and tracing tools, including Nvidia Nsight Systems and AMD ROC-Profiler with Perfetto, provide detailed insights into computation, communication, and system-level behavior for optimization. Performance results on Frontier (OLCF-5), MN5, and LUMI-G demonstrate strong and weak scaling up to 800 GPUs, validating the framework for large-scale PIC MC simulations, while in-situ analysis and visualization using scalable I/O further enhance scientific insight without interrupting multi-GPU execution on current and future exascale systems.
[20] arXiv:2606.28911 (cross-list from cs.LG) [pdf, html, other]: Title: MALOQ: Massively Accelerated Learning of Operators for Quantum Transport

Manasa Kaniselvan, Alexander Maeder, Denghui Lu, Alexandros Nikolaos Ziogas, Mathieu Luisier

Comments: 13 pages, 8 figures

Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Distributed, Parallel, and Cluster Computing (cs.DC); Computational Physics (physics.comp-ph)

Machine-learned (ML) operator models can be trained to predict density functional theory (DFT) Hamiltonian/density matrices at significantly reduced computational cost, thus extending electronic-structure calculations to previously unfeasible scales. Here, we introduce MALOQ (Massively Accelerated Learning of Operators for Quantum Transport), an application built to train on and predict electronic-structure matrices for systems made of few to 100k atoms, described by large basis sets, and covering a wide range of atomic elements. Based on a state-of-the-art, SO(2)-equivariant backbone architecture, MALOQ provides (i) custom data-processing kernels to handle high-rank Hamiltonian matrix data and (ii) a scalable edge-wise distribution of atomic graph(s). Trained on the largest molecular Hamiltonian datasets available today, it reduces time-per-epoch by over 30% compared to a molecule-wise-distributed framework, and enables inference on material graphs of arbitrary size. We demonstrate scalable training and inference for 3,000-12,000 atoms on the Alps supercomputer, up to 192 GPUs and 256 GPUs, respectively.
[21] arXiv:2606.29129 (cross-list from cs.MS) [pdf, html, other]: Title: Improved Scaling for Fast Mode of Ozaki Scheme II

Shota Kawakami, Daisuke Takahashi

Subjects: Mathematical Software (cs.MS); Distributed, Parallel, and Cluster Computing (cs.DC); Numerical Analysis (math.NA)

Ozaki scheme II emulates high-precision matrix multiplication using low-precision integer matrix operations based on the Chinese remainder theorem (CRT). It first scales the high-precision matrices to convert them into integer matrices. For this scaling step, Ozaki scheme II provides two modes: accurate mode, which uses INT8 matrix multiplication to estimate scaling factors, and fast mode, which applies the Cauchy--Schwarz inequality at lower computational cost. We show that the existing formula lacks scale invariance; multiplying the input matrices by a constant changes the effective bit width of the integer matrices in the scaling step, causing accuracy degradation or CRT recovery failure. To address this, we propose a revised scaling formula derived from the CRT uniqueness condition via the Cauchy--Schwarz inequality. The proposed formula is scale-invariant by construction, guarantees that the CRT uniqueness condition is always satisfied, and introduces no additional overhead over the original fast mode. Experiments on an NVIDIA GH200 GPU show that the proposed method achieves accuracy comparable to that of accurate mode while maintaining throughput comparable to that of fast mode. In the accuracy--throughput trade-off, the proposed method overcomes the accuracy limitation of fast mode and the throughput constraint of accurate mode, offering superior accuracy and performance.
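As a rough illustration of the CRT idea mentioned in this abstract, the following minimal Python sketch multiplies two small integer matrices modulo several pairwise coprime moduli and recombines the residues. The moduli, the function names, and the use of plain Python integers are assumptions made for illustration; the sketch does not reproduce the paper's scaling formulas or its INT8 kernels.

```python
from math import prod

# Assumed for illustration: three pairwise coprime moduli that each fit in 8 bits.
MODULI = [251, 241, 239]

def matmul_mod(a, b, m):
    """Integer matrix product with every entry reduced modulo m."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) % m for j in range(cols)]
            for row in a]

def crt_signed(residues, moduli):
    """Recover the integer x with |x| <= (M - 1) / 2 from its residues, M = prod(moduli)."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the modular inverse (Python 3.8+)
    x %= M
    return x - M if x > M // 2 else x

def crt_matmul(a, b, moduli=MODULI):
    """Compute a @ b from per-modulus products; exact only if 2 * |entry| < M."""
    parts = [matmul_mod(a, b, m) for m in moduli]
    rows, cols = len(a), len(b[0])
    return [[crt_signed([p[i][j] for p in parts], moduli) for j in range(cols)]
            for i in range(rows)]

def exact_matmul(a, b):
    """Reference product using ordinary integer arithmetic."""
    return [[sum(row[t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for row in a]
```

Recovery is exact only while every entry of the true product stays inside the symmetric range defined by the product of the moduli. That is why the scaling step matters: inputs scaled too aggressively push entries outside this range, and the recombined value is then wrong.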
[22] arXiv:2606.29337 (cross-list from cs.CV) [pdf, html, other]: Title: W4A4 Quantization for Inference on Wan2.2-I2V-A14B

Yidong Chen, Chengyu Shi, Jiahao Liu

Comments: 4 pages, 8 figures; ICME 2026 Low-Bit-width Large-Model Quantization Challenge submission

Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

We summarize our submission to Sub-Challenge 1: W4A4 Quantization for Inference (HiF4 / MXFP4) of the ICME 2026 Low-Bit-width Large-Model Quantization Challenge. The sub-challenge targets 4-bit weight and 4-bit activation inference on Wan-AI/Wan2.2-I2V-A14B under HiF4 or MXFP4 numerical formats. We adapt two complementary ideas from LLM quantization, MixQ-style mixed precision for sparse activation outliers and SmoothQuant-style per-channel smoothing, together with block-wise HiF4 packing for Wan2.2 feed-forward linear layers. Calibration on representative OpenS2V-5M batches identifies heavy-tailed activation channels; smoothing rebalances dynamic range before W4A4 rounding; and a dual-branch GEMM preserves outlier columns in higher precision while the bulk of channels use strict W4A4. On official VBench I2V metrics, our pipeline stays within 2-3.5 percent of FP16 on most quality axes and improves motion smoothness, outperforming a native HiFloat4 baseline that degrades roughly 5 percent relative to FP16 across all reported scores.
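The per-channel smoothing step referred to above can be sketched in a few lines of Python. This follows the commonly published SmoothQuant formulation (scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)), not necessarily the exact variant used in this submission; the function names, the list-of-lists layout, and the default alpha = 0.5 are assumptions.

```python
def smooth_scales(x, w, alpha=0.5):
    """Per-input-channel scales; x is tokens x channels, w is channels x outputs."""
    scales = []
    for j in range(len(w)):
        act_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        if act_max == 0 or w_max == 0:
            scales.append(1.0)  # leave degenerate channels untouched
        else:
            scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales

def apply_smoothing(x, w, scales):
    """Divide activations and multiply weights by the same scale, channel by channel."""
    x_s = [[v / s for v, s in zip(row, scales)] for row in x]
    w_s = [[v * s for v in row] for row, s in zip(w, scales)]
    return x_s, w_s

def matmul(a, b):
    """Plain matrix product, used here to check that smoothing preserves x @ w."""
    return [[sum(row[t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for row in a]
```

Because the activation channel is divided by the same factor that multiplies the matching weight row, the product is unchanged in exact arithmetic, while the outlier channel's dynamic range shrinks before low-bit rounding.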
[23] arXiv:2606.29574 (cross-list from cs.NI) [pdf, html, other]: Title: Stateless Network-Aware Adaptive Bitrate Streaming over IPFS

Iliya Mirzaei, Shabnam Jafarzade Mojaveri, Amirhossein Najafizadeh

Comments: 6 pages including references, 6 figures

Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC)

Modern content delivery is increasingly decentralized, improving availability, cost, and reach for geographically distributed users. The InterPlanetary File System (IPFS) is a promising approach that uses content-based identifiers distributed across a global peer-to-peer network. Although IPFS improves fault tolerance, resilience, and censorship resistance, its unpredictable environment introduces significant performance variability that limits conventional Adaptive Bitrate (ABR) streaming and degrades Quality of Experience (QoE). Recent network-aware ABR solutions address this by incorporating IPFS-specific information into bitrate decisions. However, they rely on maintaining continuously synchronized state across consumers and providers, which can quickly become stale under peer churn, provider migrations, network partitions, and changing content distributions, making existing policies less effective. We investigate whether network-aware ABR can remain effective without synchronized adaptation state, and present a stateless network-aware ABR policy for IPFS-based video streaming. Our approach replaces provider-stateful adaptation with an observation-driven policy that recomputes the bitrate for each segment using only locally observable request-time signals. To preserve adaptation context without provider-side state, the client embeds its adaptation state in HTTP headers, keeping it under client control and carried transparently across requests. By eliminating cross-provider state synchronization, the framework improves robustness to failures and network reconfigurations while simplifying deployment at scale. Early results show the approach maintains high QoE in faulty conditions, improving it by up to roughly 6x over existing solutions. These findings demonstrate that stateless network-aware adaptation provides a practical and scalable foundation for decentralized video delivery.
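A minimal Python sketch of the stateless pattern described above: the bitrate is recomputed for each segment from a locally measured throughput, and the only carried state travels in a request header. The bitrate ladder, the safety factor, the one-rung-up rule, and the header name `X-ABR-State` are all hypothetical choices for illustration and are not taken from the paper.

```python
import json

LADDER_KBPS = [300, 750, 1500, 3000, 6000]  # hypothetical bitrate ladder

def choose_bitrate(throughput_kbps, prev_kbps, safety=0.8):
    """Pick the highest rung within a safety margin of the measured throughput."""
    budget = safety * throughput_kbps
    feasible = [b for b in LADDER_KBPS if b <= budget]
    target = feasible[-1] if feasible else LADDER_KBPS[0]
    # Hypothetical smoothing rule: climb at most one rung per segment.
    if prev_kbps is not None and target > prev_kbps:
        idx = LADDER_KBPS.index(prev_kbps)
        target = min(target, LADDER_KBPS[min(idx + 1, len(LADDER_KBPS) - 1)])
    return target

def next_request(headers, measured_kbps):
    """Read client-held state from the header, decide, and return updated headers."""
    state = json.loads(headers.get("X-ABR-State", "{}"))
    bitrate = choose_bitrate(measured_kbps, state.get("bitrate"))
    return bitrate, {"X-ABR-State": json.dumps({"bitrate": bitrate})}
```

In this sketch, any provider can serve any request, because everything needed for the decision arrives with the request itself and nothing is stored on the provider side.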
[24] arXiv:2606.29826 (cross-list from cs.CR) [pdf, html, other]: Title: Rethinking Collaborative Trust for Verifiably Decentralized Blockchain Systems

Yunqi Zhang, Shaileshh Bojja Venkatakrishnan

Subjects: Cryptography and Security (cs.CR); Computational Engineering, Finance, and Science (cs.CE); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT); Social and Information Networks (cs.SI)

Despite the promise of decentralization, measurement studies have identified a conspicuous lack of decentralization in blockchains. Centralization has been observed in almost all layers of the blockchain, in decentralized applications, and in decentralized autonomous organizations. In many cases, it is practically impossible to definitively determine the extent of centralization in the system. While multiple works have proposed methods to decrease centralization, by and large blockchains continue to be significantly centralized.
In this paper, we develop a general framework for building verifiably decentralized blockchain systems. Our framework is motivated by the core observation that the richness and diversity of collaborative interactions between users -- rather than resource uniformity -- captures the essence and extent of decentralization in a blockchain system. Existing blockchains do not have any incentive mechanisms to encourage inter-coalition collaboration, which directly contributes to centralization. We propose a novel reward design that incentivizes users to collaborate with other users without forming isolated coalitions. Technically, our method uses a Sybil-resistant asymmetric Shapley value for reward attribution within a collaboration group, and the theory of expander graphs for measuring and enforcing decentralization.
Our framework is general and can be adapted to alleviate centralization in any layer, application, or decentralized organization. It also has important implications beyond the topic of centralization. For example, we show that our solution can naturally address the blockchain scalability problem. We also identify a new class of decentralized collaborative applications that have hitherto been unexplored in blockchains.
[25] arXiv:2606.29986 (cross-list from cs.AR) [pdf, html, other]: Title: HBM Is Not All You Need: Efficient Disaggregated LLM Serving across Memory-heterogeneous Accelerators

Zhixiang Wei, Yun Wang, James Yen, Mingyuan Xia, Zhengwei Qi

Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)

LLM inference comprises a compute-bound prefill phase and a memory-bound decode phase, and recent systems disaggregate them onto separate hardware. Yet today's datacenter GPUs rely on costly HBM whose bandwidth sits almost entirely idle during prefill. LLM serving across memory-heterogeneous accelerators (MemHA) pairs GDDR-based accelerators for prefill with HBM-based GPUs for decode, promising lower cost without sacrificing performance. Pushed to its most economical form, MemHA serving is inherently cross-vendor, since the best-suited chip for each phase may come from a different vendor. This breaks two assumptions that single-vendor disaggregation takes for granted -- a KV format both ends consume natively, and a shared software stack. We present \textbf{HMA-Serve}, a MemHA-centric disaggregated serving system pairing GDDR-based accelerators for prefill with HBM-based GPUs for decode efficiently. HMA-Serve achieves this through (1) phase-wise quantization, applying vendor-native low precision for high-throughput prefill while keeping decode in high-precision BF16, (2) a compute-transfer pipeline that overlaps each layer's KV cache transfer with later-layer prefill to reduce time-to-first-token (TTFT), and (3) deferred dequantization, shipping raw quantized bytes and reconstructing them lazily on the decode GPU to reduce network bandwidth and HBM usage. Across four Qwen3 models (4B--32B) and three production traces, HMA-Serve delivers up to $3.2\times$ higher goodput than state-of-the-art memory-homogeneous methods and $4.8\times$ higher goodput-per-dollar, with no measurable loss on generation-quality benchmarks.
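The benefit of overlapping each layer's KV cache transfer with later-layer prefill can be illustrated with a simple two-stage pipeline model in Python. The model assumes a single transfer link, that a layer's transfer can start only after its prefill compute finishes, and that the first token becomes available once the last layer's KV cache has arrived; the function names and timings are illustrative and not measurements from the paper.

```python
def ttft_sequential(compute, transfer):
    """All prefill compute first, then all KV transfers, with no overlap."""
    return sum(compute) + sum(transfer)

def ttft_pipelined(compute, transfer):
    """Layer i's KV transfer overlaps with the prefill compute of layers after i."""
    compute_done = 0  # time at which the current layer's prefill finishes
    link_free = 0     # time at which the transfer link becomes idle again
    for c, t in zip(compute, transfer):
        compute_done += c
        # A transfer starts once its layer is computed and the link is free.
        link_free = max(link_free, compute_done) + t
    return link_free
```

When per-layer transfer time is below per-layer compute time, only the final transfer remains exposed; when transfer dominates, the link becomes the bottleneck and overlap hides the compute instead.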
[26] arXiv:2606.30460 (cross-list from cs.LG) [pdf, html, other]: Title: HSAP: A Hierarchical Sequence-aware Parallelism for Hybrid-Context Generative Models

Songxin Zhang, Zejian Xie, Zhuoyang Song, Cong lin, Junyu Lu, Jiaxing Zhang, Bingyi Jing

Comments: 10 pages, ACL preprint style

Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

In this paper, we aim to combine the advantages of existing sequence parallelism paradigms and overcome their drawbacks, the most serious of which is the inability to correctly compute causal attention on hybrid-context packed sequences, in a stronger sequence parallelism framework. The practical technique of packing sequences for efficiently pretraining and fine-tuning large language models causes a cross-contamination problem in attention computation, which can be effectively solved when no parallelism along the sequence-length dimension is used. However, in sequence parallelism, existing approaches either ignore the scenario of hybrid-context sequences or sacrifice parallelism degree to support it. To this end, we propose an efficient Sequence-Aware Parallelism algorithm to overcome the obstacles of intensive tensor transmission and partial attention computation across multiple device groups. Our algorithm uses JIT (Just-In-Time) compilation to optimize the communication strategy of all device groups at the NCCL level. Further, we integrate existing sequence parallelism paradigms into a Hierarchical Sequence-Aware Parallelism framework that benefits from our sequence-aware algorithm. We additionally elaborate on the memory and communication overhead management of the hierarchical framework to optimize its performance. Through multiple experiments, we demonstrate that our proposed approach outperforms other state-of-the-art sequence parallelism approaches on multiple metrics.
[27] arXiv:2606.30553 (cross-list from cs.AR) [pdf, html, other]: Title: COSM: A Cooperative Scheduling Framework for Concurrent PIM and CPU Execution on Mobile Devices

Yilong Zhao, Fangxin Liu, Onur Mutlu, Mingyu Gao, Jian Liu, Haibing Guan, Li Jiang

Comments: 18 pages, 13 figures, ISCA'26

Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)

The development of on-device large language models (LLMs) is driven by the need for privacy and fast response times. Energy-intensive data transfer on mobile devices makes Processing-in-Memory (PIM) an effective solution. Due to stringent DRAM cost constraints, limited physical footprint on circuit boards, and the interaction between applications and LLMs, it is imperative for the CPU and PIM to operate concurrently within a shared memory space. However, challenges such as bank conflicts and bus congestion can arise, potentially diminishing the performance and energy benefits of PIM. To address this challenge, we introduce COSM, a cooperative scheduling framework designed to facilitate the concurrent operation of PIM and CPU tasks on mobile platforms. Our key innovations include: 1) a low-interference PIM control interface that generates the maximum number of PIM commands without disrupting CPU memory accesses; 2) an idleness-aware scheduling method that integrates PIM commands into available idle time windows within the CPU's access sequence. COSM not only hides PIM execution latency from the CPU, but also overlaps PIM execution with data transfer. Experiments on concurrent execution of LLMs and mobile workloads, including mobile applications and compute-intensive kernels, demonstrate that COSM improves PIM throughput by up to 2.8x compared to the baseline scheduling method with less than 2.0% CPU performance loss.
[28] arXiv:2606.30554 (cross-list from cs.NI) [pdf, html, other]: Title: SubEdge: A Subscriber-Centric Edge Computing Subsystem in 6G Networks for AI

Abdirazak Ali Asir Rage, Riccardo Pozza, Rahim Tafazolli

Comments: 6 pages, 5 figures

Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC)

Beyond traditional connectivity, 6G is envisioned to transform mobile networks into a distributed fabric that provides native integrated communication, computing, and intelligence services. AI-native terminals (e.g., robots, autonomous vehicles, and smart glasses) require real-time inference from individualised, manufacturer-specific models that cannot be executed on-board nor shared across subscribers, making per-subscriber edge compute the necessary complement to per-subscriber connectivity. Existing Network for AI (Net4AI) architectures provision compute for application providers through shared deployments and do not address per-subscriber provisioning. This paper proposes SubEdge, a Net4AI subsystem that provisions integrated communication and compute resources on a per-subscriber basis, ensuring the coupled migration of both dimensions to maintain service continuity during mobility. SubEdge contributes the computing context--a per-subscriber data structure binding a Subscription Permanent Identifier (SUPI) to its inference container, edge node, and service entitlement--and a mobility-event-driven mechanism that simultaneously migrates the subscriber's compute instance and its traffic-routing policy when the serving cell changes. SubEdge operates as an Application Function over existing Network Exposure Function (NEF) APIs with zero 3GPP core modifications. Experimental evaluation on a real-world testbed shows that SubEdge's mobility-driven joint communication-and-compute migration reduces 95th-percentile latency from 22.9 ms to 12.2 ms with zero packet loss across six mobility events, sustains 99.92% frame delivery for an end-to-end 30 fps inference workload, and completes 1,560 migration operations across batches of up to 50 simultaneously migrating subscribers with 100% success.

[29] arXiv:2505.14914 (replaced) [pdf, html, other]: Title: Sei Giga

Benjamin Marsh, Steven Landers, Jayendra Jog

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR)

We introduce Sei Giga, a layer-one EVM blockchain with multiple concurrent block producers and parallelized execution. In an internal testnet, Giga has achieved >5 gigagas/sec throughput and sub-250 ms finality. Giga uses Autobahn for consensus, with separate DA and consensus layers; a PoA on the DA layer requires f+1 votes before consensus. Giga reaches consensus over ordering and uses async block execution and state agreement to remove execution from the consensus bottleneck.
[30] arXiv:2508.13084 (replaced) [pdf, html, other]: Title: Team Formation and Applications

Yuval Emek, Shay Kutten, Ido Rafael, Gadi Taubenfeld

Comments: An extended abstract of this paper was accepted to DISC 2025. Journal version published in Distributed Computing

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

A novel long-lived distributed problem, called Team Formation (TF), is introduced together with a message- and time-efficient randomized algorithm. The problem is defined over the asynchronous model with a complete communication graph, using bounded size messages, where a certain fraction of the nodes may experience a generalized, strictly stronger, version of initial failures. The goal of a TF algorithm is to assemble tokens injected by the environment, in a distributed manner, into teams of size $\sigma$, where $\sigma$ is a parameter of the problem.
The usefulness of TF is demonstrated by using it to derive efficient algorithms for many distributed problems. Specifically, we show that various (one-shot as well as long-lived) distributed problems reduce to TF. This includes well-known (and extensively studied) distributed problems such as several versions of leader election and threshold detection. For example, we are the first to break the linear message complexity bound for asynchronous implicit leader election. We also improve the time complexity of message-optimal algorithms for asynchronous explicit leader election. Other distributed problems that reduce to TF are new ones, including matching players in online gaming platforms, a generalization of gathering, constructing a perfect matching in an induced subgraph of the complete graph, quorum sensing in message-passing networks, and more. To complement our positive contribution, we establish a tight lower bound on the message complexity of TF algorithms.
[31] arXiv:2511.10480 (replaced) [pdf, html, other]: Title: Scalable Synthesis of distributed LLM workloads through Symbolic Tensor Graphs

Changhai Man, Joongun Park, Hanjiang Wu, Huan Xu, Srinivas Sridharan, Tushar Krishna

Comments: ISCA2026

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. Such modeling is essential for pre-deployment system-level optimizations (e.g., parallelization strategies) and hardware design-space explorations. While recent efforts have proposed collecting execution traces from real systems, access to large-scale infrastructure remains limited to major cloud providers. Moreover, traces capturing execution on a specific platform cannot be easily adapted to study alternate software and/or hardware configurations, especially at scale. We introduce STAGE, a framework that synthesizes high-fidelity execution graphs to accurately model distributed AI workloads (including LLMs and MoEs). STAGE supports a comprehensive set of parallelization strategies, allowing users to systematically explore a wide spectrum of model architectures and system configurations. STAGE demonstrates its scalability by synthesizing high-fidelity LLM traces spanning over 128K GPUs, while preserving tensor-level accuracy in compute, memory, and communication. STAGE is publicly available at this https URL
[32] arXiv:2512.16455 (replaced) [pdf, html, other]: Title: AI4EOSC: a Federated Cloud Platform for Artificial Intelligence in Scientific Research

Ignacio Heredia, Álvaro López García, Fernando Aguilar Gómez, Diego Aguirre, Caterina Alarcón Marín, Khadijeh Alibabaei, Lisana Berberi, Miguel Caballer, Amanda Calatrava, Pedro Castro, Alessandro Costantini, Mario David, Jaime Díez, Stefan Dlugolinsky, Borja Esteban Sanchis, Giacinto Donvito, Leonhard Duda, Saúl Fernandez, Andrés Heredia Canales, Valentin Kozlov, Sergio Langarita, João Machado, Germán Moltó, Daniel San Martín, Martin Šeleng, Giang Nguyen, Marcin Płóciennik, Marta Obregón Ruiz, Susana Rebolledo Ruiz, Vicente Rodriguez, Judith Sáinz-Pardo Díaz, Viet Tran

Journal-ref: Future Generation Computer Systems (2026)

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

The rapid growth of Artificial Intelligence and Machine Learning in scientific research has highlighted a gap between industry-standard MLOps tools and platforms, and the unique requirements of modern and Open Science, particularly regarding the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. This paper presents AI4EOSC, a federated, open-source platform designed to operationalize the full AI/ML lifecycle within the European Open Science Cloud (EOSC) ecosystem. Our methodology tackles the fragmentation of distributed research infrastructures by integrating a modular and distributed architecture comprising an AI development platform, a serverless AI-as-a-Service layer, and a federated orchestration model that is able to integrate heterogeneous compute and storage resources from distributed e-Infrastructures. AI4EOSC also introduces a ``FAIR-by-design'' approach that enforces metadata standardization (via MLDCAT-AP) and W3C PROV-compliant provenance tracking through a platform-integrated CI/CD pipeline. The added value of AI4EOSC is demonstrated through the delivery of a diverse set of community installations, showing consistent and seamless deployment across heterogeneous cloud providers. These installations are validated by a set of scientific cases, showing how our work reduces the manual burden on researchers while ensuring high levels of reproducibility and interoperability and providing a unified environment for development, training, and production of AI/ML models in the EOSC.
[33] arXiv:2601.06903 (replaced) [pdf, html, other]: Title: Divergence-Based Adaptive Aggregation for Byzantine Robust Federated Learning

Bingnan Xiao, Feng Zhu, Jingjing Zhang, Wei Ni, Xin Wang

Comments: 16 pages, 22 figures

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Inherent client drifts caused by data heterogeneity, as well as vulnerability to Byzantine attacks within the system, hinder effective model training and convergence in federated learning (FL). This paper presents two new frameworks, named DiveRgence-based Adaptive aGgregation (DRAG) and Byzantine-Resilient DRAG (BR-DRAG), to mitigate client drifts and resist attacks while expediting training. DRAG designs a reference direction and a metric named divergence of degree to quantify the deviation of local updates. Accordingly, each worker can align its local update via linear calibration without extra communication cost. BR-DRAG refines DRAG under Byzantine attacks by maintaining a vetted root dataset at the server to produce trusted reference directions. The workers' updates can then be calibrated to mitigate divergence caused by malicious attacks. We analytically prove that DRAG and BR-DRAG achieve fast convergence for non-convex models under partial worker participation, data heterogeneity, and Byzantine attacks. Experiments validate the effectiveness of DRAG and its superior performance over state-of-the-art methods in handling client drifts, and highlight the robustness of BR-DRAG in maintaining resilience against data heterogeneity and diverse Byzantine attacks.
[34] arXiv:2604.08242 (replaced) [pdf, html, other]: Title: Scheduling Coflows in Multi-Core OCS Networks with Performance Guarantee

Xin Wang, Hong Shen, Hui Tian, Dong Wang

Comments: 10 pages, 7 figures

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Coflow provides a key application-layer abstraction for capturing communication patterns, enabling the efficient coordination of parallel data flows to reduce job completion times in distributed systems. Modern data center networks (DCNs) are employing multiple independent optical circuit switching (OCS) cores operating concurrently to meet the massive bandwidth demands of application jobs. However, existing coflow scheduling research primarily focuses on the single-core setting, with multi-core fabrics only for EPS (electrical packet switching) networks.
To address this gap, this paper studies the coflow scheduling problem in multi-core OCS networks under the \textit{not-all-stop} reconfiguration model in which one circuit's reconfiguration does not interrupt other circuits. The challenges stem from two aspects: (i) cross-core coupling induced by traffic assignment across heterogeneous cores; and (ii) per-core OCS scheduling constraints, namely \textit{port exclusivity} and \textit{reconfiguration delay}. We propose an approximation algorithm that jointly integrates cross-core flow assignment and per-core circuit scheduling to minimize the total weighted coflow completion time (CCT) and establish a provable worst-case performance guarantee. Trace-driven simulations using real Facebook workloads demonstrate that our algorithm effectively reduces weighted CCT and tail CCT.
[35] arXiv:2604.27767 (replaced) [pdf, html, other]: Title: Monadic Presburger Predicates have Robust Population Protocols

Philipp Czerner, Javier Esparza, Vincent Fischer, Roland Guttenberg, Julian Pins, Simon Reilich

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Population protocols are a model of distributed computation in which a collection of indistinguishable finite-state agents interact randomly in pairs to decide a predicate of their initial configuration. The agents decide by achieving a stable consensus on whether the predicate holds or not. It is known that population protocols can decide exactly the predicates expressible in Presburger arithmetic.
Recently, Lossin et al. have introduced a notion of protocol robustness against adversarial crash failures. They show that all atomic Presburger predicates can be decided by robust protocols, and ask whether the same holds for every Presburger predicate. We make progress towards settling this question by proving that all predicates expressible in monadic Presburger arithmetic have robust protocols. In addition, we analyze the cost of robustness in terms of state complexity. We study the ratio between the number of states of the smallest robust protocol for a given predicate and the smallest protocol for it. We show that the cost of robustness is at least double exponential in the size of the predicate, and prove that the robust protocols by Lossin et al. for threshold predicates x >= k have optimal state complexity.
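For readers unfamiliar with the model, the following Python sketch simulates a classical (non-robust) population protocol for the threshold predicate x >= k using k + 1 states per agent. It illustrates the computation model only and is not the robust construction of Lossin et al. or of this paper; the uniform random scheduler, the step budget, and the function names are assumptions.

```python
import random

def interact(a, b, k):
    """One pairwise interaction; the state k means 'threshold reached'."""
    if a == k or b == k or a + b >= k:
        return k, k      # accepting state is created or spreads
    return a + b, 0      # otherwise merge both counts into one agent

def run_threshold(inputs, k, steps=20000, seed=0):
    """inputs holds 0/1 per agent; returns each agent's output after `steps` interactions."""
    rng = random.Random(seed)
    states = list(inputs)
    n = len(states)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)  # scheduler picks a random ordered pair
        states[i], states[j] = interact(states[i], states[j], k)
    return [s == k for s in states]
```

If fewer than k agents start with input 1, the sum of all counts stays below k, so no agent can ever reach the accepting state. Otherwise, merging eventually produces a pair whose counts sum to at least k, and the accepting state then spreads to every agent, which happens with overwhelming probability within the step budget for small populations.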
[36] arXiv:2606.08761 (replaced) [pdf, html, other]: Title: APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

Hong Guo, Nianhui Guo, Weixing Wang, Jona Otholt, Christoph Meinel, Haojin Yang

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

W4A4 quantization promises full utilization of INT4 Tensor Cores, yet group dequantization overhead on CUDA Cores has driven existing systems to mixed-precision fallbacks. We present the first systematic study of how intra-SM compute balance governs this bottleneck. Through controlled benchmarks across four GPUs from Ampere and Ada architectures, we identify the Tensor Cores to CUDA Cores throughput ratio ($\rho$) as the primary hardware indicator: the W4A4-g128 kernel yields $2.0$--$2.5\times$ speedup on RTX~3090 ($\rho=16$) yet degrades to $0.43$--$0.47\times$ on A100 ($\rho=64$) in compute-bound scenarios, establishing W4A4 viability as platform-dependent rather than universally infeasible. Guided by this finding, we build \textbf{APEX4}, which co-designs pure INT4 GEMM kernels with $\rho$-aware granularity adaptation to mitigate the CUDA Cores dequantization bottleneck. APEX4 achieves perplexity within 0.63 of FP16 on LLaMA-2-70B and outperforms W4Ax Atom-g128 by 4.0\%--4.4\% in zero-shot accuracy. Deployed as a drop-in replacement in unmodified vLLM, it delivers end-to-end speedups of up to $1.66\times$ on L40S ($\rho=8$), $1.78\times$ on RTX~3090 ($\rho=16$), and $2.09\times$ on A40 ($\rho=16$), while recovering A100 ($\rho=64$) to $1.20$--$1.40\times$ via the mixed-granularity mode. Our code is available at this https URL.
[37] arXiv:2606.19832 (replaced) [pdf, html, other]: Title: Ratio-Independent Three-Cycle Decomposition with Optimal Ordered Local-Switch Cost in Six-Regular Non-Axis Eisenstein--Jacobi Networks

Bader Albader

Comments: Preprint also available on Zenodo: this https URL

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)

Six-regular simple Eisenstein--Jacobi (EJ) networks are degree-six quotient-lattice interconnection networks. This paper gives a ratio-independent decomposition of every six-regular simple non-axis EJ network into three edge-disjoint Hamiltonian cycles using a canonical ordered local-switch model based on unit-parallelogram exchanges. The admitted $d=1$ branch needs no switches; $d=2$ has optimal total cost four; and for $d=3$ and $d\ge4$ both modified factors attain the component-counting lower bound $d-1$. Factor-local switches commute, so chronological interleaving does not alter the final factors or cost within the model. Orbit normalization identifies the exact domain and excludes the unique normalized non-axis norm-three degeneration. For $d\ge4$, an equal-coordinate alternating lift removes reduced-ratio dependence from the fine diagonal coordinate. A block-chain invariant, exhaustive interior-template lemma, and parity-specific successor permutations certify the unused complement: rank advances by one modulo $4d-6$, and arc and connector bijections prove complete coverage. The certificate uses $O(d)$ seed records and expands to the full edge lists in $O(N)$ time. Deterministic symbolic and full-quotient audits, including a dictionary-free fine-incidence check for every $4\le d\le201$, are provided in the accompanying reproducibility package and are not proof premises.
[38] arXiv:2606.21401 (replaced) [pdf, html, other]: Title: SwarmX: Agentic Scheduling for Low-Latency Agentic Systems

Yeqi Huang, Yanwei Ye, Guomin Chen, Wenhao Su, Bin Gong, Jialian Li, Zhan Lu, Yangshen Deng, Xuan Sun, Le Xu, Luo Mai

Comments: 14 pages

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

Agentic AI applications compose multiple model calls and tool executions, creating new scheduling challenges for GPU-CPU clusters. Their inference time and model-call structure often depend on prompt semantics, making conventional scheduling approaches ineffective for low-latency serving. This paper presents SwarmX, a system that implements agentic scheduling for low-latency agentic applications. SwarmX uses scheduling-specific neural predictors to capture prompt, device, runtime, and target-model features; exposes distributional predictions to routers and scalers for tail-aware decisions; and provides mechanisms for predictor training and online adaptation. These predictors and mechanisms are integrated into a scheduler-agent framework that provides a common substrate for integration with existing scheduling and model-serving infrastructure. We evaluate SwarmX using production deployment (nearly one thousand GPUs and one million CPU cores) and controlled experiments on a 128-GPU testbed. Across multi-agent code generation, deep research, and multimodal agentic workflows, SwarmX reduces tail latency by up to 61.5% compared to state-of-the-art schedulers and sustains up to 2x the throughput of production schedulers under the same SLO.
[39] arXiv:2606.21784 (replaced) [pdf, html, other]: Title: KineticSim: A Lightweight, High-Performance Execution Engine for Real-Time Market Simulators

Shakya Jayakody, Prarthinie Jayakody

Comments: 12 pages, 7 figures, 5 tables. IEEE format

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Trading and Market Microstructure (q-fin.TR)

Simulating financial markets at scale with multi-agent (Agent-Based) models is critical for market design, regulatory stress-testing, and reinforcement learning, but traditional CPU simulators are bottlenecked by sequential processing while vectorized GPU frameworks suffer from kernel-launch overhead and redundant global-memory round-trips. We formalize, analyze, and evaluate a reusable parallel design pattern: persistent, state-carrying clearing for iterative multi-agent reductions. By caching mutable simulation state in thread-block shared memory across step boundaries, aggregating agent actions via shared-memory atomics, and resolving the clearing function cooperatively, the pattern reduces the per-step critical-path depth from Theta(L+A) for sequential clearing (L price-grid ticks, A agents) to Theta(log L + ceil(A/L)) and makes global-memory traffic independent of the step count. We implement this in KineticSim, a lightweight GPU execution engine that simulates massive ensembles of limit-order books in parallel, reaching a peak throughput of over 54.7 billion agent-events per second. On a fixed workload it delivers speedups of 3406x over CPU (NumPy), 27.8x over PyTorch GPU, 42.8x over JAX GPU, and 8.4x over a naive custom CUDA baseline, while using roughly an order of magnitude less GPU memory than PyTorch. Across 53 configurations the two custom CUDA engines produce bitwise-identical order books, and aggregate statistics match the CPU reference to within 0.1%. The pattern generalizes to other iterative multi-agent workloads requiring state-persistent, block-localized reductions.
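
The aggregate-then-clear structure described in this abstract can be illustrated with a minimal sequential sketch in Python. This is not KineticSim's implementation: the uniform-price clearing rule, the function name `clear_step`, and the unit order sizes are all assumptions made for illustration. The per-tick histograms stand in for the shared-memory atomic aggregation, and the two running sums are the part a GPU could compute as a parallel scan over the L ticks.

```python
from itertools import accumulate

def clear_step(buy_ticks, sell_ticks, num_ticks):
    """One illustrative clearing step on a price grid of num_ticks levels.

    buy_ticks / sell_ticks hold one limit-price tick per agent order
    (unit size). The clearing rule is an assumed uniform-price auction,
    not the rule used in the paper.
    """
    buy_hist = [0] * num_ticks
    sell_hist = [0] * num_ticks
    # Aggregation: on a GPU these increments would be atomic adds.
    for t in buy_ticks:
        buy_hist[t] += 1
    for t in sell_ticks:
        sell_hist[t] += 1
    # demand[p]: buyers willing to pay at least tick p (suffix sum).
    demand = list(accumulate(reversed(buy_hist)))[::-1]
    # supply[p]: sellers willing to accept at most tick p (prefix sum).
    supply = list(accumulate(sell_hist))
    # Matched volume at each candidate clearing tick.
    matched = [min(d, s) for d, s in zip(demand, supply)]
    volume = max(matched)
    # Lowest tick that attains the maximum matched volume.
    return matched.index(volume), volume

tick, volume = clear_step([3, 3, 2, 1], [0, 1, 2, 4], num_ticks=5)
print(tick, volume)
```

In this sketch the histogram loop costs time proportional to the number of agents and the scans cost time proportional to the number of ticks, which corresponds to the sequential Theta(L+A) baseline the abstract mentions.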
[40] arXiv:2606.26297 (replaced) [pdf, html, other]: Title: A Distributed Quantum Approximate Optimization Algorithm Simulator for Engineering Design Optimization

Ali Rajabi, Milad Hasanzadeh, Amin Kargarian

Comments: 37 pages, 7 figures, 5 tables

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computational Engineering, Finance, and Science (cs.CE)

This paper presents a Qiskit-compatible distributed quantum approximate optimization algorithm (DQAOA) simulator for quadratic unconstrained binary optimization (QUBO) problems arising in engineering design and decision applications. The open-source simulator is available through the RAISE LAB website and GitHub repository, with README documentation for installation, input formatting, configurable parameters, and example workflows. The package addresses the need for a reusable simulator that can solve and compare QUBO instances across different QAOA execution modes. It supports monolithic QAOA on a single quantum processing unit (QPU) and distributed QAOA across a user-specified number of QPUs with configurable capacities. The workflow canonicalizes the QUBO model, maps it to a cost Hamiltonian, allocates variables across QPUs, identifies local and cross-QPU couplings, and constructs the corresponding circuits. Runtime optimizations, including parameterized circuit reuse, objective reuse at fixed depth, batched evaluations, and parallel multi-start execution, reduce repeated overhead. A Streamlit graphical user interface is also provided for entering or uploading QUBO instances, configuring solver settings, running selected modes, and visualizing solution-quality metrics without editing Python scripts. The package is demonstrated on standalone QUBO benchmarks and a power generation unit commitment application. In the unit commitment case, brute force, monolithic QAOA, and distributed QAOA recover the same commitment bitstring and operating cost. Across multiple case studies, the simulator produces results consistent with classical monolithic QAOA references in terms of optimal bitstrings and costs. Staged runtime analysis shows substantial runtime reduction across implementation stages, while distributed QAOA remains more demanding because cross-QPU couplings require remote operations.
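
As a rough illustration of the allocation step described in this abstract, the Python sketch below assigns QUBO variables to QPUs with given capacities and splits the couplings into local and cross-QPU sets. It does not use the simulator's API: the function names, the dictionary-based QUBO format, and the greedy index-order allocation are assumptions chosen for brevity.

```python
def allocate_variables(num_vars, capacities):
    """Assign variables 0..num_vars-1 to QPUs in index order (assumed policy)."""
    if num_vars > sum(capacities):
        raise ValueError("total QPU capacity is smaller than the number of variables")
    assignment = {}
    var = 0
    for qpu, capacity in enumerate(capacities):
        for _ in range(capacity):
            if var == num_vars:
                break
            assignment[var] = qpu
            var += 1
    return assignment

def classify_couplings(qubo, assignment):
    """Split off-diagonal QUBO terms into local and cross-QPU couplings."""
    local, cross = {}, {}
    for (i, j), weight in qubo.items():
        if i == j or weight == 0:
            continue  # linear terms and zero weights are not couplings
        target = local if assignment[i] == assignment[j] else cross
        target[(i, j)] = weight
    return local, cross

# Hypothetical 4-variable QUBO as {(i, j): weight}, split across two QPUs.
qubo = {(0, 0): -1, (0, 1): 2, (1, 2): -3, (2, 3): 1}
assignment = allocate_variables(4, capacities=[2, 2])
local, cross = classify_couplings(qubo, assignment)
print(local, cross)
```

Here the single cross-QPU coupling, between variables 1 and 2, is the kind of term that the abstract says requires remote operations.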
[41] arXiv:2511.00870 (replaced) [pdf, html, other]: Title: A Distributed Plug-and-Play MCMC Algorithm for High-Dimensional Inverse Problems

Maxime Bouton, Pierre-Antoine Thouvenin, Audrey Repetti, Pierre Chainais

Comments: accepted for publication in IEEE Trans. Comput. Imag., 2026

Subjects: Methodology (stat.ME); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP)

Markov Chain Monte Carlo (MCMC) algorithms are standard approaches to solve imaging inverse problems and quantify estimation uncertainties, a key requirement in the absence of ground-truth data. To improve estimation quality, Plug-and-Play MCMC algorithms, such as PnP-ULA, have recently been developed to accommodate priors encoded by a denoising neural network. Designing scalable samplers for high-dimensional imaging inverse problems remains a challenge: drawing and storing high-dimensional samples can be prohibitive, especially for high-resolution images. To address this issue, this work proposes a distributed sampler based on approximate data augmentation and PnP-ULA to solve very large problems. The proposed sampler uses a lightweight denoising convolutional neural network to efficiently exploit multiple GPUs on a Single Program Multiple Data architecture. Reconstruction performance and scalability are evaluated on several imaging problems. Communication and computation overheads due to the denoiser are carefully discussed. The proposed distributed approach combines three valuable qualities: it is scalable, enables uncertainty quantification, and achieves reconstruction performance comparable to other PnP methods.
[42] arXiv:2511.15503 (replaced) [pdf, other]: Title: DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures

Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko, Christina Giannoula

Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)

High-performance Host processors can integrate Processing-In-Memory (PIM) devices, which can accelerate memory-intensive kernels of Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging the large memory bandwidth available at PIM cores. However, the Host processor needs consecutive elements distributed across DRAM banks, while PIM cores need consecutive elements within their local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM devices. Current compilation approaches lack systematic optimization for diverse ML kernels and multiple PIM devices, and may largely ignore data rearrangement costs during the compute code optimization step. We show that data rearrangements and compute code optimization are interdependent, and need to be jointly optimized during the tuning process. Therefore, we design DCC, the first data-centric ML compiler for PIM systems that jointly optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction to support multiple PIM backends. DCC enables effective co-optimization of data partitioning strategies with compute loop partitioning schemes. DCC applies PIM-specific code optimizations, and leverages a fast and accurate performance prediction model to select the best-performing code schedule for a given kernel on a target PIM architecture. Our evaluations on various individual ML kernels show that DCC achieves up to 7.68x speedup (2.21x average) on HBM-PIM, and up to 13.17x speedup (3.92x average) on AttAcc PIM, over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by 4.52x average (up to 7.71x in LLaMA-2) over GPU. DCC is open-sourced at this https URL.
[43] arXiv:2511.15517 (replaced) [pdf, other]: Title: Beluga: Block Synchronization for BFT Consensus Protocols

Tasos Kichidis, Lefteris Kokoris-Kogias, Arun Koshy, Ilya Sergey, Alberto Sonnino, Mingwei Tian, Jianting Zhang

Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)

Modern high-throughput BFT consensus protocols use streamlined push-pull mechanisms to disseminate blocks and keep happy-path performance optimal. Yet state-of-the-art designs lack a principled and efficient way to exchange blocks, which leaves them open to targeted attacks and performance collapse under network asynchrony. This work introduces the concept of a block synchronizer, a simple abstraction that drives incremental block retrieval and enforces resource-aware exchange. Its interface and role fit cleanly inside a modern BFT consensus stack. We also uncover a new attack, where an adversary steers honest validators into redundant, uncoordinated pulls that exhaust bandwidth and stall progress. Beluga is a modular and scarcity-aware instantiation of the block synchronizer. It achieves optimal common-case latency while bounding the cost of recovery under faults and adversarial behavior. We integrate Beluga into Mysticeti, the consensus core of the Sui blockchain, and show on a geo-distributed AWS deployment that Beluga sustains optimal performance in the optimistic path and, under attack, delivers up to 3x higher throughput and 25x lower latency than prior designs. The Sui blockchain adopted Beluga in production.
[44] arXiv:2601.13903 (replaced) [pdf, html, other]: Title: Know Your Contract: eIDAS-Based Verifiable Legal Identities for Smart Contracts, Enabling Regulatory-Compliant On-Chain Operations

Awid Vaziry, Sandro Rodriguez Garzon, Christoph Wronka, Axel Küpper

Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC)

Public blockchains provide no native mechanism to verify the legal identity behind a deployed smart contract, which blocks institutional adoption and compliance with EU regulations such as MiCA and AMLR. We present KYC Seal, the first protocol that extends the EU eIDAS trust infrastructure to Ethereum smart contracts by cryptographically binding them to Qualified Electronic Seals issued by Qualified Trust Service Providers (QTSPs). The protocol realizes the full eIDAS trust chain, from the European Commission's List of Trusted Lists through Member-State trusted lists and QTSP-signed X.509 certificates down to the individual smart contract, natively on-chain. An on-chain parser extracts identity fields directly from the QTSP-signed certificate bytes at registration. Both cryptographic verifications, the QTSP issuance signature and the certificate holder's seal signature, are performed once at registration and cached as on-chain state, reducing per-interaction seal verification to a pure state check. A new P-256 elliptic-curve precompile in Ethereum (deployed December 2025) makes these one-time cryptographic steps economical, enabling trustless on-chain verification of eIDAS identities without oracles or runtime intermediaries. A reference implementation, a formal security analysis, and a gas evaluation are the subject of forthcoming work.
[45] arXiv:2605.09708 (replaced) [pdf, html, other]: Title: Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon

Víctor Gallego

Comments: Published at the Fifth Workshop on Deep Learning for Code (DL4C) at ICML 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

We present Metal-Sci, a 10-task benchmark of scientific Apple Silicon Metal compute kernels spanning six optimization regimes (stencils, all-pairs in $n$-body problems, multi-field Boltzmann, neighbor-list molecular dynamics, multi-kernel PDE, FFT). Each task ships a CPU reference, a roofline-anchored fitness function, and a held-out generalization size. We pair the benchmark with a lightweight harness for automatic kernel search that runtime-compiles each candidate, scores it against the roofline across multiple sizes, and feeds structured compile and per-size correctness diagnostics back to a frozen LLM driving a $(1{+}1)$ evolutionary loop. We report matched single-model sweeps of Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5 on M1 Pro: in-distribution self-speedups span $1.00\times$ to $10.7\times$. Beyond raw speedup, our central methodological claim is structural: the held-out gate scoring function $\Phi_\mathcal{T}$ (evaluated once at end-of-run on a configuration the agent never sees during search) functions as a cheap mechanical oversight primitive on this automatic search loop, catching e.g. an Opus template <uint D> HMC win that returns wrong samples at unseen dimensions, and a GPT FFT3D best that wins in-distribution at $2.95\times$ speedup but collapses to $0.23\times$ on a $256^3$ held-out cube, a silent regression that the in-distribution score alone cannot see. Code at this https URL
[46] arXiv:2606.07957 (replaced) [pdf, html, other]: Title: Demand-Driven Vulnerability Detection for Cloud Security Posture Management: Removing Human Rule Authoring from the Disclosure-to-Protection Critical Path

Prashant Kumar Pathak

Comments: 16 pages, 3 figures. Preprint. Under review at IEEE Transactions on Cloud Computing

Subjects: Cryptography and Security (cs.CR); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)

Cloud Security Posture Management (CSPM) systems detect known vulnerabilities by maintaining a rule set, distributing it to customers, and evaluating it against periodically-collected asset inventories. To our knowledge, in publicly documented architectures the rule set is environment-agnostic and curated centrally by the vendor; updates are batched into release cycles and shipped on a cadence ranging from hours to days depending on detection severity. The disclosure-to-protection window -- from a CVE being published to the customer's system being capable of detecting affected assets -- is therefore bounded by the vendor's release cadence for version-match detections, and by additional human authoring time for richer detections incorporating configuration predicates beyond the affected-software string. We propose an architecture in which the rule set is not vendor-distributed but continuously derived, within the customer's tenant, from the intersection of public catalogue feeds and the live asset graph. A rule comes into existence when a catalogue entry and an applicable asset are simultaneously present, and goes out of existence when either input ceases to support it. Derivation is bidirectional: new catalogue entries and new assets both trigger it. It incorporates the full structured-field content of catalogue entries, not only the affected-software predicate. The live rule set is bounded by environment diversity rather than catalogue breadth. Prior systems incrementally evaluate a static rule set; we incrementally derive the rule set itself. We present the threat model, the architecture, formal semantics with an equivalence theorem, complexity analysis, a worked example, and an evaluation methodology. The contribution is the architectural shift and its latency and resource consequences; rule correctness and alert prioritization are out of scope.
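
The rule-derivation idea in this abstract (a rule exists only while a catalogue entry and an applicable asset are both present) can be sketched in Python as follows. This is a toy model, not the paper's architecture: the class and field names are hypothetical, matching uses only product and version, the CVE identifier is a placeholder, and the live set is recomputed on demand rather than maintained incrementally.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CatalogueEntry:
    cve_id: str
    product: str
    affected_versions: frozenset

@dataclass(frozen=True)
class Asset:
    asset_id: str
    product: str
    version: str

class RuleDeriver:
    """Toy model: rules are derived from entries and assets, never shipped."""

    def __init__(self):
        self.entries = {}
        self.assets = {}

    # Derivation is bidirectional: either kind of input can change the rule set.
    def add_entry(self, entry):
        self.entries[entry.cve_id] = entry

    def remove_entry(self, cve_id):
        self.entries.pop(cve_id, None)

    def add_asset(self, asset):
        self.assets[asset.asset_id] = asset

    def remove_asset(self, asset_id):
        self.assets.pop(asset_id, None)

    def live_rules(self):
        """Rules keyed by (cve_id, product, version) present in the environment."""
        return {
            (entry.cve_id, asset.product, asset.version)
            for entry in self.entries.values()
            for asset in self.assets.values()
            if entry.product == asset.product
            and asset.version in entry.affected_versions
        }

deriver = RuleDeriver()
deriver.add_asset(Asset("vm-1", "openssl", "3.0.1"))
no_entry_yet = deriver.live_rules()
# Placeholder identifier, not a real CVE.
deriver.add_entry(CatalogueEntry("CVE-0000-0001", "openssl", frozenset({"3.0.1"})))
with_entry = deriver.live_rules()
deriver.remove_asset("vm-1")
after_removal = deriver.live_rules()
print(no_entry_yet, with_entry, after_removal)
```

Because rules are keyed by the product and version pairs actually deployed, the size of the live set in this sketch tracks the diversity of the environment rather than the size of the catalogue.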

New submissions (showing 17 of 17 entries)

Cross submissions (showing 11 of 11 entries)

Replacement submissions (showing 18 of 18 entries)