Self-Modulating Quantum Fast-Weight Programmers for Efficient Adaptive Sequential Learning†

†The views expressed in this article are those of the authors and do not represent the views of Wells Fargo. This article is for informational purposes only. Nothing contained in this article should be construed as investment advice. Wells Fargo makes no express or implied warranties and expressly disclaims all legal, tax, and accounting implications related to this article.

Samuel Yen-Chi Chen¹, Yifeng Peng², Kuo-Chung Peng⁶, Jiun-Cheng Jiang⁶, Chun-Hua Lin⁶,
Junghoon Justin Park³, Huan-Hsin Tseng⁴, Hsin-Yi Lin⁴, Kuan-Cheng Chen⁵, Chen-Yu Liu⁶, Shinjae Yoo⁴
¹Wells Fargo ²Stevens Institute of Technology ³Seoul National University
⁴Brookhaven National Laboratory ⁵Imperial College London ⁶National Taiwan University

Abstract

Recent advances in quantum machine learning have motivated efficient models for sequential data processing. In this paper, we propose Self-Modulating Quantum Fast Weight Programmers, or Self-Modulating QFWP, which extends Quantum Fast Weight Programmers by introducing adaptive modulation over both newly generated fast-weight updates and historical fast-weight memory. Numerical results show that the proposed mechanism improves convergence stability and prediction performance across varying model settings, including different numbers of qubits and input sequence lengths. We further provide theoretical arguments explaining how self-modulation balances new information injection with memory retention, thereby enhancing temporal information propagation. These results suggest that Self-Modulating QFWP is a compact and effective framework for quantum machine learning on time-series data.

I Introduction

Quantum computing (QC) provides new computational paradigms that may offer advantages for certain problems [56, 65, 30]. Combined with the success of artificial intelligence and machine learning (AI/ML), this has motivated the development of quantum machine learning (QML), which seeks to design learning models based on quantum computational principles [5, 23, 27, 63, 3, 1, 67]. A central framework in QML is the variational quantum algorithm (VQA) [4, 7, 57, 74], which has been widely used in hybrid quantum-classical models for classification, sequence learning, and reinforcement learning [54, 62, 19, 18, 35, 51, 52, 55, 43, 14, 12, 50, 47, 9, 26, 68, 38, 41, 42, 64].

Sequence learning is an important direction in QML because many real-world data are naturally temporal. Existing quantum recurrent models, such as Quantum Long Short-Term Memory (QLSTM) networks, have shown that parameterized quantum circuits can be used to process sequential information [19, 44, 28, 69, 70, 2, 66, 45, 73, 32, 48, 46, 10]. However, many recurrent quantum models rely on fixed temporal update structures, leaving open how quantum architectures should regulate the interaction between newly generated information and memory accumulated from previous time steps.

Quantum Fast Weight Programmers (QFWPs) provide an alternative approach to quantum sequence modeling by dynamically generating fast-weight parameters during recurrent processing [22]. This mechanism allows the model to encode temporal information into evolving fast-weight states. Nevertheless, directly accumulating fast-weight updates may cause unstable parameter evolution or inefficient memory propagation, especially when the input sequence length or model size changes. Therefore, an explicit mechanism for controlling both new fast-weight updates and historical fast-weight memory is needed.

To address this issue, we propose Self-Modulating Quantum Fast Weight Programmers, or Self-Modulating QFWP. The key idea is to introduce adaptive modulation into the fast-weight evolution, allowing the model to regulate how much newly generated information is injected and how much historical fast-weight memory is retained. This self-modulating mechanism provides a more flexible recurrent structure and aims to improve learning robustness across different quantum model settings.

Contribution Statement. The main contributions of this work are threefold. First, we propose a self-modulating extension of QFWP that adaptively regulates fast-weight dynamics for more robust sequence learning. Second, we conduct comprehensive numerical simulations and ablation studies on multiple time-series tasks, evaluating the effects of different numbers of qubits, input sequence lengths, and modulation designs. Third, we provide theoretical arguments explaining why certain forms of self-modulation more effectively balance new information injection and memory retention, thereby improving temporal information propagation.

II Related Works

Sequence learning is a central topic in both classical and quantum machine learning. In the classical setting, the long short-term memory (LSTM) network [31] has long served as the canonical recurrent architecture, using input, forget, and output gates to mitigate vanishing gradients and capture long-range dependencies. Building on this design, the Quantum LSTM (QLSTM) [19] replaces the classical components inside the LSTM cell with variational quantum circuits (VQCs), and has since been applied across natural language processing [44, 28, 69], generative modeling [25], reinforcement learning [20, 21], time-series forecasting [6, 8], solving partial differential equations [24], and many other domains [72, 29, 71, 11, 75, 13, 34, 33, 58]. Recent QKAN-LSTM work further explores quantum-inspired Kolmogorov–Arnold recurrent gates to achieve better scalability [34, 39]. However, QLSTM inherits the structural drawbacks of recurrent designs: its non-linear gated recurrence couples the hidden state across all time steps, forcing gradient computation to proceed sequentially through backpropagation-through-time (BPTT) and precluding parallelization across time steps. This sequential cost is amplified in the quantum setting, where each step requires repeated VQC evaluations for gradient estimation.

The Fast Weight Programmer (FWP) framework [60, 61] addresses sequence learning through a different mechanism: a slow network generates low-rank parameter updates that reprogram a fast network, yielding a linear recurrence in the fast weights. This formulation was later shown to be equivalent to linearized self-attention [59, 40], and extended through recurrent and self-referential variants [37, 36]. The Quantum FWP (QFWP) [22] adapts this paradigm to the hybrid setting, with a classical slow programmer continually reprogramming a VQC fast programmer. Because temporal dependence is encoded through additive parameter updates rather than a recurrent quantum state, the QFWP avoids deep gradient evaluation such as BPTT through the quantum circuit. Subsequent variants such as the Quantum-Train QFWP [49] and observable-aware extensions [17] further reduce parameters and broaden applicability. However, these models share a purely additive update rule that lacks any input-dependent mechanism to forget or reweight accumulated information, in contrast to QLSTM. To close this gap, the Self-Modulating QFWP introduces input-dependent modulation of both the new update and the previously accumulated fast weights. Because the modulation coefficients depend only on the current input, the recurrence remains linear in the fast weights and parallelizable [53], thereby inheriting LSTM-style temporal-memory control without the sequential gradient bottleneck of non-linear recurrence.

III Quantum Neural Networks

A quantum neural network (QNN), also known as a variational quantum circuit (VQC) or parameterized quantum circuit (PQC), consists of a data-encoding circuit $U(\vec{x})$, a trainable variational circuit $W(\Theta)$, and a final measurement stage. Given a classical input $\vec{x}$, the quantum state is prepared as $\ket{\Psi}=W(\Theta)U(\vec{x})\ket{0}^{\otimes n}$, where $n$ is the number of qubits and $\Theta$ denotes the trainable circuit parameters. The model output is obtained by measuring Hermitian observables $\hat{B}_{k}$, yielding expectation values $\langle\hat{B}_{k}\rangle=\bra{\Psi}\hat{B}_{k}\ket{\Psi}$. The QNN therefore defines a function $\vec{f}(\vec{x};\Theta)=(\langle\hat{B}_{1}\rangle,\dots,\langle\hat{B}_{K}\rangle)$.
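The definition above can be made concrete with a small statevector simulation. The following is a minimal sketch, not the circuit used in this work: the two-qubit size, the RY angle encoding for $U(\vec{x})$, the RY-plus-CNOT ansatz for $W(\Theta)$, and the choice of Pauli-Z observables are all assumptions made for illustration.

```python
import numpy as np

def ry(angle):
    """Single-qubit RY rotation matrix (real-valued)."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]])

# CNOT with the first qubit as control and the second as target.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
Z = np.diag([1.0, -1.0])
I2 = np.eye(2)

def qnn(x, theta):
    """Return (<Z_1>, <Z_2>) for |Psi> = W(theta) U(x) |00>."""
    state = np.zeros(4)
    state[0] = 1.0                                        # |00>
    state = np.kron(ry(x[0]), ry(x[1])) @ state           # U(x): angle encoding (assumed)
    state = np.kron(ry(theta[0]), ry(theta[1])) @ state   # trainable rotations (assumed ansatz)
    state = CNOT @ state                                  # entangling gate
    observables = [np.kron(Z, I2), np.kron(I2, Z)]
    # The state is real here, so <B> = state^T B state.
    return np.array([state @ B @ state for B in observables])

f = qnn(x=[0.3, -1.2], theta=[0.5, 0.1])
```

Each component of `f` is an expectation value of a Pauli operator and therefore lies in $[-1,1]$.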

IV Quantum Fast Weight Programmers and Learning to Modulate

Let $L$ be the number of trainable variational layers in the fast quantum circuit and $Q$ be the number of qubits. At time step $t$, the fast circuit state is $\Theta_{t}\in\mathbb{R}^{L\times Q}$, where $(\Theta_{t})_{k,q}$ is the circuit angle assigned to layer $k\in\{1,\ldots,L\}$ and qubit $q\in\{1,\ldots,Q\}$.

Let $\Omega$ denote the collection of trainable parameters of the classical slow controller. These parameters are optimized during training and remain fixed during inference. By contrast, $\Theta_{t}$, $\Delta_{t}$, $M_{t}^{\mathrm{new}}$, and $M_{t}^{\mathrm{old}}$ are time-dependent quantities generated from the input sequence.

Let $h_{t}=\phi_{\Omega_{\phi}}(x_{t})\in\mathbb{R}^{H}$ be the hidden state produced by the classical controller $\phi_{\Omega_{\phi}}$ from the input $x_{t}$. The raw update vectors are $\ell_{t}=W_{\ell}\,h_{t}+b_{\ell}\in\mathbb{R}^{L}$ and $r_{t}=W_{r}\,h_{t}+b_{r}\in\mathbb{R}^{Q}$, and the raw fast-weight update is the rank-1 matrix $\Delta_{t}=\ell_{t}\,r_{t}^{\top}\in\mathbb{R}^{L\times Q}$. The original QFWP [22] then uses the update rule,

$\Theta_{t}=\Theta_{t-1}+\Delta_{t}.$ (1)

Thus, the fast circuit state is obtained by accumulating all previous raw updates. This makes the original QFWP a quantum recurrent fast-weight model: the classical slow programmer generates a low-rank update from the current input, and the quantum circuit uses the accumulated fast parameters to produce quantum expectation-value features.
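As an illustration of Eq. (1), the sketch below accumulates rank-1 updates over a short input sequence. The controller $\phi$ is taken to be a single tanh layer and all weights are random; both are assumptions for illustration only, since in practice these parameters are trained.

```python
import numpy as np

rng = np.random.default_rng(0)
L, Q, H, D = 5, 4, 8, 1   # layers, qubits, controller width, input dim (illustrative)

# Hypothetical slow-controller parameters (Omega); random here, trained in practice.
W_phi, b_phi = rng.normal(size=(H, D)), np.zeros(H)
W_l, b_l = rng.normal(size=(L, H)), np.zeros(L)
W_r, b_r = rng.normal(size=(Q, H)), np.zeros(Q)

def raw_update(x_t):
    """Delta_t = ell_t r_t^T, a matrix of rank at most 1."""
    h_t = np.tanh(W_phi @ x_t + b_phi)   # phi: one tanh layer (assumed form)
    ell_t = W_l @ h_t + b_l
    r_t = W_r @ h_t + b_r
    return np.outer(ell_t, r_t)

def qfwp_rollout(xs):
    """Apply Theta_t = Theta_{t-1} + Delta_t along the sequence, from Theta_0 = 0."""
    theta = np.zeros((L, Q))
    for x_t in xs:
        theta = theta + raw_update(x_t)
    return theta

xs = rng.normal(size=(6, D))
theta_T = qfwp_rollout(xs)   # fast circuit angles after six steps
```

Although each $\Delta_{t}$ has rank at most 1, the accumulated $\Theta_{t}$ can have higher rank because it sums several such updates.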

Figure 1: Architecture of the proposed Self-Modulating QFWP. A classical controller generates $\Delta_{t}$, $M_{t}^{\mathrm{old}}$, and $M_{t}^{\mathrm{new}}$, which are fused to produce the circuit parameters $\Theta_{t}$ for the variational quantum circuit $W(\Theta_{t})$.

IV-A Full Self-Modulating QFWP

The full Self-Modulating QFWP, as illustrated in Figure 1, introduces two input-dependent modulation matrices: one for the current raw update $\Delta_{t}$ and one for the previous fast circuit state $\Theta_{t-1}$. Four linear modulation heads generate

$m_{t}^{\{\mathrm{old,new}\},\{L,Q\}}=W_{\{\mathrm{old,new}\},\{L,Q\}}\,h_{t}+b_{\{\mathrm{old,new}\},\{L,Q\}},$

which compactly represents the following four vectors,

$m_{t}^{\mathrm{new},L}\in\mathbb{R}^{L},\quad m_{t}^{\mathrm{new},Q}\in\mathbb{R}^{Q},\quad m_{t}^{\mathrm{old},L}\in\mathbb{R}^{L},\quad m_{t}^{\mathrm{old},Q}\in\mathbb{R}^{Q},$

with the corresponding $\{\mathrm{old,new}\}$ modulation matrices,

$M_{t}^{\{\mathrm{old,new}\}}=m_{t}^{\{\mathrm{old,new}\},L}\left(m_{t}^{\{\mathrm{old,new}\},Q}\right)^{\top}\in\mathbb{R}^{L\times Q}.$ (2)

We define the full self-modulating update rule as,

$\Theta_{t}=\Delta_{t}\odot M_{t}^{\mathrm{new}}+\Theta_{t-1}\odot M_{t}^{\mathrm{old}},$ (3)

where $\odot$ denotes the element-wise multiplication.

This model controls both the newly generated update and the previously accumulated fast parameters. New modulation regulates how the current update is written, while Old modulation regulates how the previous fast-weight memory is retained, suppressed, amplified, or sign-reversed. Since the modulation matrices are produced by linear layers and outer products, they are not bounded gates unless an additional bounding nonlinearity is introduced.
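The modulation heads and the update of Eq. (3) can be sketched as follows. The head weights are random placeholders and the dictionary layout is a hypothetical convenience, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
L, Q, H = 5, 4, 8   # illustrative sizes

# Four hypothetical linear heads (W, b), one per modulation vector.
heads = {
    (kind, axis): (rng.normal(size=(dim, H)), np.zeros(dim))
    for kind in ("old", "new")
    for axis, dim in (("L", L), ("Q", Q))
}

def modulation(h_t, kind):
    """M_t^{kind} = m_t^{kind,L} (m_t^{kind,Q})^T, as in Eq. (2)."""
    W_L, b_L = heads[(kind, "L")]
    W_Q, b_Q = heads[(kind, "Q")]
    return np.outer(W_L @ h_t + b_L, W_Q @ h_t + b_Q)

def full_update(theta_prev, delta, h_t):
    """Theta_t = Delta_t * M_new + Theta_{t-1} * M_old (element-wise), as in Eq. (3)."""
    return delta * modulation(h_t, "new") + theta_prev * modulation(h_t, "old")

h_t = rng.normal(size=H)
theta_prev = rng.normal(size=(L, Q))
delta = rng.normal(size=(L, Q))
theta_t = full_update(theta_prev, delta, h_t)
```

Because no bounding nonlinearity is applied, entries of either modulation matrix may be negative or larger than one in magnitude, which is what allows the sign reversal and amplification mentioned above.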

IV-B Only-New Self-Modulating QFWP

The Only-New variant is a special case of Eq. (3) in which only the new raw update is modulated and the previous fast parameters are carried over unchanged, i.e., $M_{t}^{\mathrm{old}}$ is fixed to the all-ones matrix: $\Theta_{t}=\Delta_{t}\odot M_{t}^{\mathrm{new}}+\Theta_{t-1}$. This model tests whether input-dependent write modulation alone is sufficient.

IV-C Only-Old Self-Modulating QFWP

Conversely, the Only-Old variant modulates only the previous fast parameters and writes the raw update unchanged, i.e., $M_{t}^{\mathrm{new}}$ is fixed to the all-ones matrix: $\Theta_{t}=\Delta_{t}+\Theta_{t-1}\odot M_{t}^{\mathrm{old}}$. This model tests whether input-dependent memory modulation alone captures most of the benefit of the full model.

TABLE I: Comparison of update rules in QFWP variants.

Model	Update Rule
Original QFWP	$\Theta_{t}=\Theta_{t-1}+\Delta_{t}$
Full Self-Modulating QFWP	$\Theta_{t}=\Delta_{t}\odot M_{t}^{\mathrm{new}}+\Theta_{t-1}\odot M_{t}^{\mathrm{old}}$
Only-New Self-Modulating QFWP	$\Theta_{t}=\Delta_{t}\odot M_{t}^{\mathrm{new}}+\Theta_{t-1}$
Only-Old Self-Modulating QFWP	$\Theta_{t}=\Delta_{t}+\Theta_{t-1}\odot M_{t}^{\mathrm{old}}$
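The four rows of Table I can be read as one rule with different modulation matrices fixed to all-ones. The sketch below expresses that reading; the function and dictionary names are hypothetical.

```python
import numpy as np

def step(theta_prev, delta, m_new=None, m_old=None):
    """Eq. (3); a modulation matrix that is not supplied defaults to all-ones."""
    m_new = np.ones_like(delta) if m_new is None else m_new
    m_old = np.ones_like(theta_prev) if m_old is None else m_old
    return delta * m_new + theta_prev * m_old

# Each variant passes only the modulation matrices it actually uses.
VARIANTS = {
    "standard": lambda th, d, mn, mo: step(th, d),
    "only_new": lambda th, d, mn, mo: step(th, d, m_new=mn),
    "only_old": lambda th, d, mn, mo: step(th, d, m_old=mo),
    "full":     lambda th, d, mn, mo: step(th, d, m_new=mn, m_old=mo),
}

rng = np.random.default_rng(0)
th, d, mn, mo = (rng.normal(size=(5, 4)) for _ in range(4))
results = {name: rule(th, d, mn, mo) for name, rule in VARIANTS.items()}
```

Written this way, an ablation changes which matrices are generated by the controller while the update code itself stays the same.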

V Numerical Results and Discussions

We evaluate four configurations: (1) Self-Modulating QFWP, in which both the new update and the old parameters are modulated; (2) Self-Modulating QFWP (Only Old), in which only the old parameters are modulated; (3) Self-Modulating QFWP (Only New), in which only the new update is modulated; and (4) QFWP (Standard), the original QFWP without any modulation.

To compare performance across model sizes, we train each model with the hidden size, which also equals the number of qubits of the VQC in the fast programmer, set to 4, 6, 8, 10, 12, and 14. The training and evaluation protocol follows [19, 15, 22, 16]. Specifically, the model is trained to predict the $(N+1)$-th value in a sequence given the preceding $N$ observations. For example, with $N=4$, the input at time step $t$ is $[x_{t-4},x_{t-3},x_{t-2},x_{t-1}]$ and the target output is $y_{t}$, which should approximate the ground truth $x_{t}$. To investigate memory performance across sequence lengths, we consider $N\in\{4,8,16,32,64\}$. For each time-series task, each model configuration is therefore evaluated over $6\times 5=30$ combinations, obtained from six hidden-size/qubit-count settings $H\in\{4,6,8,10,12,14\}$ and five input sequence lengths $N\in\{4,8,16,32,64\}$.
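The next-value prediction setup described above amounts to a sliding window over the series. A minimal sketch follows; the helper name `make_windows` is hypothetical, and it assumes the series is longer than the window.

```python
import numpy as np

def make_windows(series, n):
    """Pair each length-n window with the value that immediately follows it."""
    series = np.asarray(series, dtype=float)
    # Window i covers series[i : i + n]; its target is series[i + n].
    inputs = np.stack([series[i:i + n] for i in range(len(series) - n)])
    targets = series[n:]
    return inputs, targets

series = np.arange(10.0)            # toy series 0, 1, ..., 9
X, y = make_windows(series, n=4)
# X[0] is [0, 1, 2, 3] and y[0] is 4, matching [x_{t-4}, ..., x_{t-1}] -> x_t.
```

A series of length $T$ yields $T-N$ input/target pairs, so longer windows leave fewer training examples for the same series.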

We consider five time-series tasks which are used as benchmarks in previous studies [19, 15, 22, 16]: Damped SHM, Bessel function $J_{2}$ , Delayed Quantum Control, NARMA-5, and NARMA-10. All simulations use batch_size=4, number of QNN layers $L=5$ , learning_rate= $10^{-3}$ and the Adam optimizer.
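The stated hyperparameters and the configuration grid can be collected in a small structure. This is a sketch only: the class and the variant labels are hypothetical, and the run count assumes each combination is trained once, which the text does not state.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RunConfig:
    task: str
    variant: str
    hidden_size: int              # also the number of qubits
    seq_len: int                  # input sequence length N
    batch_size: int = 4
    n_layers: int = 5             # number of QNN layers L
    learning_rate: float = 1e-3
    optimizer: str = "adam"

TASKS = ("damped_shm", "bessel_j2", "delayed_quantum_control", "narma_5", "narma_10")
VARIANTS = ("standard", "only_new", "only_old", "full")   # hypothetical labels
HIDDEN_SIZES = (4, 6, 8, 10, 12, 14)
SEQ_LENS = (4, 8, 16, 32, 64)

def grid():
    """Enumerate every task / variant / hidden-size / sequence-length combination."""
    return [RunConfig(t, v, h, n)
            for t, v, h, n in product(TASKS, VARIANTS, HIDDEN_SIZES, SEQ_LENS)]

configs = grid()
per_task_and_variant = [c for c in configs
                        if c.task == "bessel_j2" and c.variant == "full"]
```

Under the single-run assumption, the grid has $5\times 4\times 30=600$ entries, 30 for each task and variant pair.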

In addition to standard metrics such as mean squared error (MSE), we consider the following metrics to better capture the learning performance of the Self-Modulating QFWP variants.

• Relative Improvement:

$\Delta^{\mathrm{rel}}=\frac{M_{\mathrm{standard}}-M_{\mathrm{variant}}}{M_{\mathrm{standard}}+\varepsilon}.$ (4)

Here $M$ denotes the value of a lower-is-better metric such as the final test MSE, not a modulation matrix. This quantity measures how much the variant improves or degrades the metric $M$ relative to the standard baseline, while using $\varepsilon$ to prevent numerical instability when the denominator is small. When $M_{\mathrm{variant}}<M_{\mathrm{standard}}$, the numerator is positive, so $\Delta^{\mathrm{rel}}>0$. This means that the variant achieves a smaller value of the metric and therefore performs better. Accordingly,
  – $\Delta^{\mathrm{rel}}>0$: variant better
  – $\Delta^{\mathrm{rel}}<0$: variant worse
  – $\Delta^{\mathrm{rel}}\approx 0$: both perform similarly
• Relative Strength: We define Relative Strength to quantify whether old-parameter modulation or new-parameter modulation contributes more strongly to performance improvement. Specifically, Relative Strength is defined as the difference between the relative improvements of the Only-Old-Modulated and Only-New-Modulated variants:

$\mathrm{Relative\ Strength}=\Delta_{\mathrm{old}}^{\mathrm{rel}}-\Delta_{\mathrm{new}}^{\mathrm{rel}}.$ (5)

A positive Relative Strength value indicates that old-parameter modulation contributes more strongly than new-parameter modulation, while a negative value indicates the opposite. Values close to zero suggest that the two modulation pathways have comparable effects.

• Synergy: We further define Synergy to measure whether jointly modulating both old and new parameters provides an additional benefit beyond the better single-sided modulation strategy. Formally, Synergy is defined as

$\mathrm{Synergy}=\Delta_{\mathrm{both}}^{\mathrm{rel}}-\max\left(\Delta_{\mathrm{old}}^{\mathrm{rel}},\Delta_{\mathrm{new}}^{\mathrm{rel}}\right).$ (6)

A positive Synergy value indicates that jointly modulating both old and new parameters outperforms the better of the two single-sided modulation variants, suggesting a genuine cooperative effect. A value near zero indicates little additional gain beyond the stronger single-sided modulation, while a negative value suggests that the joint modulation strategy does not surpass the better individual component.
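The three metrics of Eqs. (4) to (6) can be computed directly from final test MSE values. The sketch below uses made-up MSE numbers and an assumed $\varepsilon=10^{-12}$; the text does not specify the value of $\varepsilon$.

```python
def relative_improvement(m_standard, m_variant, eps=1e-12):
    """Eq. (4): positive when the variant has the lower (better) metric."""
    return (m_standard - m_variant) / (m_standard + eps)

def relative_strength(m_standard, m_old, m_new, eps=1e-12):
    """Eq. (5): positive when Only-Old improves more than Only-New."""
    return (relative_improvement(m_standard, m_old, eps)
            - relative_improvement(m_standard, m_new, eps))

def synergy(m_standard, m_both, m_old, m_new, eps=1e-12):
    """Eq. (6): positive when the full model beats the better single-sided variant."""
    best_single = max(relative_improvement(m_standard, m_old, eps),
                      relative_improvement(m_standard, m_new, eps))
    return relative_improvement(m_standard, m_both, eps) - best_single

# Hypothetical final test MSEs for one (hidden size, sequence length) cell.
mse = {"standard": 1e-2, "only_old": 1e-5, "only_new": 5e-3, "both": 2e-5}

strength = relative_strength(mse["standard"], mse["only_old"], mse["only_new"])
syn = synergy(mse["standard"], mse["both"], mse["only_old"], mse["only_new"])
# strength is about 0.499 and syn is about -0.001 for these numbers.
```

In this made-up cell the old-state pathway dominates (large positive Relative Strength), while the full model is marginally behind Only-Old (slightly negative Synergy).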

For the bessel_j2 task with sequence length 32, as shown in Figure 2, the Self-Modulating QFWP exhibits a clearly faster convergence behavior than the standard QFWP. At epoch 1, the standard model still produces an almost flattened response and fails to capture the oscillatory structure of the target sequence, whereas the Self-Modulating variant already aligns reasonably well with the underlying periodicity, phase, and damping trend. By epoch 15, the prediction of the Self-Modulating QFWP nearly overlaps with the ground truth, while the standard QFWP still shows visible amplitude and local shape mismatches. This advantage remains evident in both the training and testing regions, indicating that self-modulation improves not only fitting speed but also early-stage generalization. Although both models eventually achieve strong prediction quality by epoch 100, the main benefit of the Self-Modulating QFWP in this task lies in its substantially faster and more stable learning of the correct temporal dynamics.

Figure 3 shows representative test MSE convergence curves for the bessel_j2 task under several hidden-size (qubit-count) and sequence-length settings. In the easiest setting (hidden size = 4, sequence length = 4), all four variants eventually converge to nearly zero test error, although the modulated variants exhibit faster and smoother early-stage optimization. As the task becomes more challenging, especially at sequence lengths 16 and 64, the differences between methods become substantially more pronounced. The standard QFWP consistently maintains higher test MSE and exhibits noticeable fluctuations throughout training. The Only New Modulated variant improves over the standard model, but typically settles at a visibly higher error floor. In contrast, both the Self-Modulating QFWP and the Only Old Modulated variant rapidly drive the test MSE to near zero and maintain much more stable convergence trajectories, with Only Old Modulated appearing particularly stable across the harder settings. These results suggest that, for oscillatory sequence modeling in bessel_j2, modulation based on historical or old-state information is likely the dominant contributor to the performance gain, while full self-modulation provides similarly strong and robust optimization behavior.

The final-test-MSE heatmaps (Figure 4) further consolidate the trends suggested by the epoch-wise prediction plots and representative convergence curves. For the bessel_j2 task, the standard QFWP maintains low error only in shorter-sequence settings (e.g., sequence lengths 4 and 8), but its performance degrades substantially as the sequence length increases to 16, 32, and 64. In contrast, the Self-Modulating QFWP remains consistently strong across the entire grid of hidden sizes and sequence lengths, with final test errors staying in a uniformly low range. Even more notably, the Only Old Modulated variant appears to be the strongest overall configuration on this task, with most entries remaining at the $10^{-6}$ to $10^{-5}$ level and showing even greater consistency than the full Self-Modulating QFWP. By comparison, the Only New Modulated variant performs reasonably well for short sequences, but deteriorates markedly for longer ones, exhibiting a failure pattern much closer to the standard QFWP. Taken together, these results strongly suggest that, for the oscillatory and history-dependent dynamics of bessel_j2, old-state modulation is the dominant source of performance improvement, while full self-modulation preserves this advantage and provides robust performance across configurations.

The relative-improvement heatmaps (Figure 5) provide a complementary, ratio-based view of the gains achieved by each variant over the standard QFWP. Using the metric defined earlier (Equation 4), where positive values indicate lower MSE than the standard baseline, both the Self-Modulating and Only Old variants exhibit strong positive improvements across most settings on the bessel_j2 task. In particular, for sequence lengths 16, 32, and 64, their relative-improvement values are almost uniformly close to 1, indicating highly significant gains over the standard QFWP. By contrast, the Only New variant shows much weaker and less consistent improvements, with many entries remaining near zero or even becoming negative, suggesting that new-only modulation is insufficient to reproduce the benefits of old-related modulation. A few isolated large negative values appear in the Self-Modulating heatmap; however, these cases mainly arise because the standard QFWP already attains an extremely small final test MSE in those settings, making the ratio-based metric highly sensitive to tiny absolute differences. In practice, the modulating variants in these cases also converge successfully, and their prediction curves are often visually almost indistinguishable from those of the standard model. Therefore, such isolated negative entries should not be interpreted as failures of the modulating variants, but rather read together with the raw final-test-MSE heatmaps and prediction plots. Overall, these results are consistent with the preceding analyses and further support the conclusion that old-state modulation is the dominant source of performance gain for the bessel_j2 task, while full self-modulation preserves this advantage across most configurations.
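The sensitivity of the ratio-based metric to a very small baseline can be seen with a short calculation. The MSE values below are hypothetical and chosen only to illustrate the effect; $\varepsilon=10^{-12}$ is likewise an assumption.

```python
def relative_improvement(m_standard, m_variant, eps=1e-12):
    """Eq. (4), with an assumed eps."""
    return (m_standard - m_variant) / (m_standard + eps)

# Hard setting: the baseline fails (MSE 1e-2) and the variant succeeds (MSE 1e-5).
hard = relative_improvement(1e-2, 1e-5)   # about 0.999

# Easy setting: both succeed, and the variant is worse by only 4e-7 in absolute MSE.
easy = relative_improvement(1e-7, 5e-7)   # about -4
```

An absolute gap of $4\times 10^{-7}$ thus produces a relative value near $-4$, far larger in magnitude than the near-1 value from a genuine failure of the baseline, which is why such entries need to be read alongside the raw MSE heatmaps.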

For the damped_shm task with sequence length 32 (shown in Figure 6), the Self-Modulating QFWP clearly outperforms the standard QFWP. By around epoch 15, the Self-Modulating model already produces predictions that nearly overlap with the ground truth and continues to track the damped oscillatory trajectory accurately in both the training and testing regions. In contrast, the standard QFWP still exhibits noticeable shape and amplitude mismatches even at epoch 100, indicating that it fails to fully capture the underlying temporal dynamics. Overall, this task again supports the benefit of self-modulation for sequence modeling, with an even more pronounced advantage than in bessel_j2.

The representative test MSE convergence curves for damped_shm (Figure 7) show a pattern consistent with, but even stronger than, that of bessel_j2. The standard QFWP maintains substantially higher and more fluctuating test error in the medium-to-long sequence settings, whereas both the Self-Modulating QFWP and the Only Old Modulated variant rapidly converge to near-zero test MSE and remain stable thereafter. The Only-New variant improves over the standard model, but still lags behind the former two. This indicates that damped_shm is especially sensitive to old-state modulation, which directly improves the stability of temporal memory.

The final-test-MSE heatmaps in Figure 8 confirm the same trend across the full configuration grid. The standard QFWP deteriorates substantially in several medium-to-long sequence settings, while the Self-Modulating QFWP remains low-error and stable across hidden sizes and sequence lengths. The Only-Old variant is again among the most consistent performers, whereas the Only-New variant provides partial improvement but remains less robust. This reinforces the conclusion that the dominant gain in damped_shm comes from old-related modulation.

The relative-improvement heatmaps in Figure 9 further show that both the Self-Modulating and Only-Old variants achieve strong positive gains over the standard QFWP across most settings, with many values approaching 1 in the medium-to-long sequence regime. The Only-New variant is less consistent and still contains several near-zero or negative regions. As in the bessel_j2 case, isolated negative values should be interpreted together with the raw MSE heatmaps, since the ratio-based metric can amplify tiny absolute differences when the baseline error is already extremely small.

For the delayed_quantum_control task with sequence length 64 (shown in Figure 10), the contrast between the two models is nearly binary. The Self-Modulating QFWP reproduces the pulse-like decaying target dynamics almost perfectly from very early epochs and remains highly accurate in both the training and testing regions. In contrast, the standard QFWP fails to learn the correct temporal structure throughout training and still exhibits clearly mismatched low-amplitude oscillatory predictions even at epoch 100. This suggests that, in this task, self-modulation does not merely improve convergence speed, but is crucial for successfully learning the delayed temporal dynamics.

The convergence curves in Figure 11 show that old-related modulation is highly effective for this task. Both the Self-Modulating and Only-Old variants remain near zero and highly stable from early training, whereas the standard QFWP exhibits substantially larger errors and stronger fluctuations in medium-to-long sequence settings. The Only-New variant also improves over the standard model, but remains less stable than the Self-Modulating and Only-Old variants. These results indicate that old-state modulation almost eliminates the original optimization difficulty in delayed_quantum_control.

The final-test-MSE heatmaps for delayed_quantum_control (Figure 12) reveal an even sharper separation than in the previous tasks. The standard QFWP exhibits several clearly failed medium-to-long sequence settings with markedly elevated final test MSE, indicating limited ability to model the delayed temporal structure. In contrast, the Self-Modulating QFWP maintains uniformly low and stable error across almost the entire hidden-size and sequence-length grid. The Only-Old variant performs comparably well and shows similarly consistent low-error behavior, while the Only-New variant, although better than the standard model, still retains a few higher-error configurations in harder settings. Overall, these results again suggest that the dominant gain in delayed_quantum_control comes from old-related modulation, while full self-modulation preserves this advantage across the configuration space.

The relative-improvement heatmaps in Figure 13 are consistent with the raw MSE results. Both the Self-Modulating and Only-Old variants achieve strong positive gains over the standard QFWP across most settings, especially in the longer-sequence regime. The Only-New variant also improves over the standard baseline, but its gains are generally less uniform. Overall, these results again support the conclusion that the dominant gain in delayed_quantum_control comes from old-related modulation, while full self-modulation preserves this advantage across the configuration space.

We next examine the NARMA family—narma_5 (sequence length 32) and narma_10 (sequence length 64)—as progressively harder autoregressive memory tasks. The same overall ordering holds in both: Self-Modulating $\approx$ Only-Old $>$ Only-New $>$ standard QFWP, visible across rollouts (Figure 14, Figure 18), convergence curves (Figure 15, Figure 19), final-MSE heatmaps (Figure 16, Figure 20), and relative-improvement heatmaps (Figure 17, Figure 21).

For narma_5 the standard QFWP captures the coarse trend but smooths out local fluctuations and peak–valley structure; in convergence it eventually reaches low error but with greater instability and occasional spikes in the harder settings, while Self-Modulating and Only-Old stay smooth and low. The same ordering becomes more pronounced for narma_10: the standard model is more clearly underfit in the rollouts, fluctuates more sharply in the convergence curves, and degrades more visibly in the heatmaps, while full Self-Modulating and Only-Old variants converge smoothly and remain near-zero loss. The widening gap from narma_5 to narma_10 indicates that old-related modulation becomes increasingly important as autoregressive memory demand grows.

The convergence curves show that all variants can eventually reach relatively low test error, but the standard QFWP exhibits greater instability and occasional spikes in harder settings, whereas the Self-Modulating and Only-Old variants remain more stable overall (Figure 15). The final-test-MSE and relative-improvement heatmaps further confirm this ranking: Self-Modulating and Only-Old achieve the most consistent gains, Only-New provides intermediate improvement, and the standard QFWP is the least robust, especially at longer sequence lengths (Figure 16 and Figure 17).

The same ordering becomes more pronounced for narma_10 with sequence length 64. In the epoch-wise prediction comparison, the standard QFWP remains more underfit and less responsive to local variations, whereas the Self-Modulating variant tracks the target sequence more faithfully from mid training onward (Figure 18). This stronger gap is also visible in the convergence curves, where the standard model exhibits larger fluctuations and sharper spikes in harder settings, while the Self-Modulating and Only-Old variants converge more smoothly and maintain lower test MSE (Figure 19).

The final-test-MSE and relative-improvement heatmaps again identify Self-Modulating and Only-Old as the most reliable configurations across hidden sizes and sequence lengths, with Only-New remaining intermediate and the standard QFWP showing the most obvious degradation (Figure 20 and Figure 21). Compared with narma_5, the stronger separation in narma_10 suggests that old-related modulation becomes increasingly important as the autoregressive memory demand increases.

The task-wise mean final test MSE (Figure 23) confirms the same ranking at the aggregate level across all five benchmarks: standard QFWP is weakest, Only-New gives partial improvement, and full Self-Modulating and the Only-Old variant are strongest. The advantage is most dramatic on the oscillatory and delayed-control tasks, where full Self-Modulating and Only-Old reduce mean MSE by orders of magnitude; on NARMA, absolute errors are smaller but the ordering is unchanged. The ranking, and hence the gain from self-modulation, is therefore consistent across the five benchmarks considered.

The Relative Strength and Synergy heatmaps (Figure 22, Equation 5, Equation 6) interpret these gains mechanistically. Recall that positive Relative Strength means old-state modulation contributes more than new-state modulation, while positive Synergy means the full model exceeds the better single-sided variant. Across nearly all tasks and configurations, Relative Strength is predominantly positive—often strongly so—confirming that the gain is dominated by old-related modulation. By contrast, Synergy values are frequently near zero or negative, especially on bessel_j2 and damped_shm, indicating that the full Self-Modulating model rarely benefits from constructive old/new interaction. Its performance is largely explained by the dominant old-state effect, with new modulation providing only limited or task-dependent additional benefit. NARMA shows a milder, mixed pattern but the same direction. Together with the per-task results, these aggregate views support the conclusion that old-state modulation is the primary driver of improvement, and full self-modulation mainly serves as a robust wrapper preserving this benefit across diverse tasks.

VI Theoretical Discussion

Here we provide an explanation of why the Only-Old model can achieve performance comparable to the full Self-Modulating QFWP.

Basic Notation.

Consider a single layer-qubit coordinate $j=(k,q)$. Define $\theta_{t,j}=[\Theta_{t}]_{j}$, $d_{t,j}=[\Delta_{t}]_{j}$, $c_{t,j}=[M_{t}^{\mathrm{new}}]_{j}$, and $a_{t,j}=[M_{t}^{\mathrm{old}}]_{j}$, where $d_{t,j}$ is the raw update, $c_{t,j}$ is the new-parameter modulation, and $a_{t,j}$ is the old-parameter modulation. The original QFWP is $\theta_{t,j}^{\mathrm{orig}}=\theta_{t-1,j}^{\mathrm{orig}}+d_{t,j}$. The Only-New variant is $\theta_{t,j}^{\mathrm{new}}=\theta_{t-1,j}^{\mathrm{new}}+c_{t,j}d_{t,j}$. The Only-Old variant is $\theta_{t,j}^{\mathrm{old}}=a_{t,j}\theta_{t-1,j}^{\mathrm{old}}+d_{t,j}$. The full Self-Modulating QFWP is $\theta_{t,j}^{\mathrm{full}}=a_{t,j}\theta_{t-1,j}^{\mathrm{full}}+c_{t,j}d_{t,j}$. We define the Old-modulated temporal memory kernel as $K_{s\to t,j}^{\mathrm{old}}=\prod_{u=s+1}^{t}a_{u,j}$, with the empty product defined as $1$.

Dynamics of Updates.

Assuming $\Theta_{0}=0$, the original QFWP unrolls to $\theta_{t,j}^{\mathrm{orig}}=\sum_{s=1}^{t}d_{s,j}$, meaning that all past updates are accumulated with equal weight. The Only-New variant unrolls to $\theta_{t,j}^{\mathrm{new}}=\sum_{s=1}^{t}c_{s,j}d_{s,j}$, meaning that each update is rescaled when it is written, but is not dynamically reweighted afterward. The Only-Old variant unrolls to $\theta_{t,j}^{\mathrm{old}}=\sum_{s=1}^{t}d_{s,j}\prod_{u=s+1}^{t}a_{u,j}=\sum_{s=1}^{t}d_{s,j}K_{s\to t,j}^{\mathrm{old}}$, meaning that each past update is weighted by a temporal memory kernel. The full model unrolls to $\theta_{t,j}^{\mathrm{full}}=\sum_{s=1}^{t}c_{s,j}d_{s,j}\prod_{u=s+1}^{t}a_{u,j}=\sum_{s=1}^{t}c_{s,j}d_{s,j}K_{s\to t,j}^{\mathrm{old}}$. Thus, both the full and Only-Old models share the same temporal memory kernel $K_{s\to t,j}^{\mathrm{old}}$; the full model only adds the instantaneous write multiplier $c_{s,j}$.
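The unrolled forms above can be checked numerically for a single coordinate $j$. The following is a minimal sketch, not the authors' implementation: the function names `run_recurrence` and `run_unrolled` are illustrative, and the random sequences are stand-ins for the learned quantities $d_{s,j}$, $a_{s,j}$, and $c_{s,j}$. Each variant is obtained from the general rule $\theta_{t}=a_{t}\theta_{t-1}+c_{t}d_{t}$ by pinning $a$ and/or $c$ to $1$.

```python
import numpy as np

def run_recurrence(d, a, c):
    """Iterate theta_t = a_t * theta_{t-1} + c_t * d_t, starting from theta_0 = 0."""
    theta = 0.0
    for d_t, a_t, c_t in zip(d, a, c):
        theta = a_t * theta + c_t * d_t
    return theta

def run_unrolled(d, a, c):
    """Evaluate sum_s c_s * d_s * K_{s->t}, where K_{s->t} = prod_{u=s+1}^{t} a_u."""
    total = 0.0
    for i in range(len(d)):          # index i holds time step s = i + 1
        kernel = np.prod(a[i + 1:])  # empty product (s = t) evaluates to 1.0
        total += c[i] * d[i] * kernel
    return total

rng = np.random.default_rng(0)
T = 8
d = rng.normal(size=T)               # stand-in raw updates d_{s,j}
a = rng.uniform(-1.0, 1.0, size=T)   # stand-in old modulation a_{s,j}
c = rng.uniform(0.0, 2.0, size=T)    # stand-in new modulation c_{s,j}
ones = np.ones(T)

# Each variant is the general rule with a and/or c pinned to 1.
variants = {
    "orig": (ones, ones),
    "only_new": (ones, c),
    "only_old": (a, ones),
    "full": (a, c),
}
for name, (a_seq, c_seq) in variants.items():
    recurrent = run_recurrence(d, a_seq, c_seq)
    closed_form = run_unrolled(d, a_seq, c_seq)
    print(f"{name:9s} recurrence={recurrent:+.6f} unrolled={closed_form:+.6f}")
```

For every variant the two columns should agree up to floating-point error, and the `orig` row reduces to the plain sum of the raw updates.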

Difference Between Full and Only-Old.

Suppose both models receive the same $d_{t,j}$ and $a_{t,j}$, so that only $c_{t,j}$ differs; this is a structural comparison, not a comparison of trained parameters. Define $e_{t,j}=\theta_{t,j}^{\mathrm{full}}-\theta_{t,j}^{\mathrm{old}}$. Subtracting the Only-Old recurrence from the full recurrence gives $e_{t,j}=a_{t,j}e_{t-1,j}+(c_{t,j}-1)d_{t,j}$. The first term $a_{t,j}e_{t-1,j}$ only propagates the discrepancy already present at the previous time step, while the second term $(c_{t,j}-1)d_{t,j}$ is the newly injected discrepancy caused by New modulation. Therefore, if the raw update $d_{t,j}$ is moderate in magnitude, or if the Old-modulated memory dynamics attenuate past discrepancies (for example, when $|a_{t,j}|<1$), the full and Only-Old models can remain close even when $c_{t,j}\neq 1$. In matrix form, with $E_{t}=\Theta_{t}^{\mathrm{full}}-\Theta_{t}^{\mathrm{old}}$, the same relation is $E_{t}=M_{t}^{\mathrm{old}}\odot E_{t-1}+(M_{t}^{\mathrm{new}}-\mathbf{1})\odot\Delta_{t}$.
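To make the closeness claim concrete, the discrepancy recurrence can be unrolled in the same way as the update rules, using $e_{0,j}=0$ because both models start from $\Theta_{0}=0$. The bound below is a sketch under two added assumptions that are not part of the original analysis: a uniform contraction $|a_{u,j}|\le\alpha<1$ and a uniform bound $|(c_{s,j}-1)d_{s,j}|\le\delta$. The symbols $\alpha$ and $\delta$ are introduced here for illustration only.

$$
\begin{aligned}
e_{t,j} &= \sum_{s=1}^{t}(c_{s,j}-1)\,d_{s,j}\,K_{s\to t,j}^{\mathrm{old}},\\
|e_{t,j}| &\le \sum_{s=1}^{t}\bigl|(c_{s,j}-1)\,d_{s,j}\bigr|\,\bigl|K_{s\to t,j}^{\mathrm{old}}\bigr|
\le \delta\sum_{s=1}^{t}\alpha^{t-s}
\le \frac{\delta}{1-\alpha}.
\end{aligned}
$$

The middle step uses the fact that $K_{s\to t,j}^{\mathrm{old}}$ is a product of $t-s$ factors, each bounded by $\alpha$ in magnitude. Under these assumptions the gap between the full and Only-Old fast weights stays bounded uniformly in $t$, and it shrinks as $c_{s,j}\to 1$ or as the raw updates become small.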

Role of New Modulation.

There is also a structural reason why New modulation may be less essential than Old modulation. The raw update has the outer-product form $\Delta_{t}=\ell_{t}r_{t}^{\top}$, and the New modulation matrix has the outer-product form $M_{t}^{\mathrm{new}}=m_{t}^{\mathrm{new},L}(m_{t}^{\mathrm{new},Q})^{\top}$. Their element-wise product satisfies $\Delta_{t}\odot M_{t}^{\mathrm{new}}=(\ell_{t}r_{t}^{\top})\odot\left(m_{t}^{\mathrm{new},L}(m_{t}^{\mathrm{new},Q})^{\top}\right)=(\ell_{t}\odot m_{t}^{\mathrm{new},L})(r_{t}\odot m_{t}^{\mathrm{new},Q})^{\top}$. Thus, New modulation can be viewed as redefining the effective write vectors, while the instantaneous update remains rank one. In contrast, Old modulation acts on $\Theta_{t-1}$, which already contains information accumulated from previous inputs. The term $\Theta_{t-1}\odot M_{t}^{\mathrm{old}}$ therefore directly controls the temporal memory state. This history-dependent memory control cannot be easily reproduced by only modifying the current update $\Delta_{t}$, since $\Delta_{t}$ is generated from the current input whereas $\Theta_{t-1}$ summarizes the past.
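The outer-product identity above is easy to confirm numerically. The sketch below uses random vectors as stand-ins for $\ell_{t}$, $r_{t}$, $m_{t}^{\mathrm{new},L}$, and $m_{t}^{\mathrm{new},Q}$; the sizes `n_layers` and `n_qubits` are hypothetical, and reading the superscripts $L$ and $Q$ as the layer and qubit axes is an assumption based on the coordinate $j=(k,q)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, n_qubits = 3, 4             # hypothetical circuit sizes

ell = rng.normal(size=n_layers)       # stand-in for the write vector l_t
r = rng.normal(size=n_qubits)         # stand-in for the write vector r_t
m_L = rng.normal(size=n_layers)       # stand-in for m_t^{new,L} (assumed layer axis)
m_Q = rng.normal(size=n_qubits)       # stand-in for m_t^{new,Q} (assumed qubit axis)

delta = np.outer(ell, r)              # raw rank-one update Delta_t
m_new = np.outer(m_L, m_Q)            # rank-one New modulation matrix

modulated = delta * m_new                 # element-wise (Hadamard) product
factored = np.outer(ell * m_L, r * m_Q)   # outer product of rescaled write vectors

identity_holds = np.allclose(modulated, factored)
# Explicit tolerance so that rounding noise is not counted as extra rank.
rank_raw = np.linalg.matrix_rank(delta, tol=1e-10)
rank_modulated = np.linalg.matrix_rank(modulated, tol=1e-10)
print(identity_holds, rank_raw, rank_modulated)
```

The modulated update equals an outer product of two rescaled vectors, so New modulation changes what is written but not the rank-one structure of the write.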

Interpretation.

The above derivation suggests that Old modulation captures the dominant mechanism of Self-Modulating QFWP: it determines how past quantum fast-weight updates are retained, suppressed, amplified, or sign-reversed through the temporal kernel $K_{s\to t,j}^{\mathrm{old}}$. New modulation can still improve performance by refining the amplitude of newly written updates, which explains why the full model may perform best. However, when the task is primarily limited by memory dynamics rather than instantaneous write scaling, the Only-Old variant can naturally approach the performance of the full model.
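The four kernel behaviours named above (retained, suppressed, amplified, sign-reversed) can be illustrated with a toy calculation. This sketch holds $a_{u,j}$ constant over time, which is a simplifying assumption made only for illustration; in the model the modulation is input-dependent. The helper name `memory_kernel` and the specific values are hypothetical.

```python
import numpy as np

def memory_kernel(a, s, t):
    """Return K_{s->t} = prod_{u=s+1}^{t} a_u for 1-based steps, with a[u-1] holding a_u."""
    return float(np.prod(a[s:t]))  # empty slice (s == t) gives 1.0

T = 5
# Constant modulation values chosen to show one regime each (illustrative only).
regimes = {
    "retained": 1.0,
    "suppressed": 0.5,
    "amplified": 1.5,
    "sign_reversed": -1.0,
}

kernels = {}
for name, value in regimes.items():
    a = np.full(T, value)
    # Weight that the update written at step s carries at the final step T.
    kernels[name] = [memory_kernel(a, s, T) for s in range(1, T + 1)]
    print(name, kernels[name])
```

Each list gives the weights assigned to the updates written at steps $1,\dots,T$ when read at step $T$: constant weights when $a=1$, geometric decay toward older updates when $|a|<1$, geometric growth when $|a|>1$, and alternating signs when $a<0$.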

VII Conclusion

In this work, we proposed Self-Modulating QFWP, a compact quantum fast-weight framework that adaptively regulates both newly generated updates and accumulated fast-weight memory. Across five time-series benchmarks and extensive hidden-size/sequence-length settings, the proposed model consistently improves convergence stability and predictive accuracy over standard QFWP. Our ablation studies further reveal that old-state modulation is the dominant source of improvement, indicating that effective control of historical quantum fast weights is central to robust sequential learning. These results position self-modulation as a simple but powerful mechanism for building adaptive and scalable quantum sequence models.

References

[1] A. Abbas, D. Sutter, C. Zoufal, A. Lucchi, A. Figalli, and S. Woerner (2021) The power of quantum neural networks. Nature computational science 1 (6), pp. 403–409. Cited by: §I.
[2] J. Bausch (2020) Recurrent quantum neural networks. Advances in neural information processing systems 33, pp. 1368–1379. Cited by: §I.
[3] M. Benedetti, E. Lloyd, S. Sack, and M. Fiorentini (2019) Parameterized quantum circuits as machine learning models. Quantum science and technology 4 (4), pp. 043001. Cited by: §I.
[4] K. Bharti, A. Cervera-Lierta, T. H. Kyaw, T. Haug, S. Alperin-Lea, A. Anand, M. Degroote, H. Heimonen, J. S. Kottmann, T. Menke, W. Mok, S. Sim, L. Kwek, and A. Aspuru-Guzik (2022) Noisy intermediate-scale quantum algorithms. Reviews of Modern Physics 94 (1), pp. 015004. External Links: Document, Link Cited by: §I.
[5] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd (2017) Quantum machine learning. Nature 549 (7671), pp. 195–202. External Links: Document, Link Cited by: §I.
[6] Y. Cao, X. Zhou, X. Fei, H. Zhao, W. Liu, and J. Zhao (2023) Linear-layer-enhanced quantum long short-term memory for carbon price forecasting. Quantum Machine Intelligence 5 (2), pp. 26. External Links: Link Cited by: §II.
[7] M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio, and P. J. Coles (2021) Variational quantum algorithms. Nature Reviews Physics 3 (9), pp. 625–644. External Links: Document, Link Cited by: §I.
[8] C. Chen, S. Y. Chen, and Y. Tsai (2025) Benchmarking quantum and classical sequential models for urban telecommunication forecasting. arXiv preprint arXiv:2508.04488. Cited by: §II.
[9] C. Chen and E. Kuo (2025) Quantum adaptive self-attention for quantum transformer models. arXiv preprint arXiv:2504.05336. Cited by: §I.
[10] C. Chen and E. Kuo (2025) Quantum-enhanced channel mixing in rwkv models for time series forecasting. arXiv preprint arXiv:2505.13524. Cited by: §I.
[11] C. Chen, Y. Yang, and W. Jywe (2025) The development of the variational quantum circuits architecture of the quantum long short-term memory model for thermal error compensation in the z-axis of machine tools. The International Journal of Advanced Manufacturing Technology 140 (1), pp. 577–593. External Links: Link Cited by: §II.
[12] K. Chen, S. Y. Chen, C. Liu, and K. K. Leung (2025) Quantum-train-based distributed multi-agent reinforcement learning. In 2025 IEEE Symposium for Multidisciplinary Computational Intelligence Incubators (MCII Companion), pp. 1–5. Cited by: §I.
[13] K. Chen, S. Y. Chen, C. Liu, and K. K. Leung (2025) Toward large-scale distributed quantum long short-term memory with modular quantum computers. In 2025 International Wireless Communications and Mobile Computing (IWCMC), pp. 337–342. External Links: Link Cited by: §II.
[14] K. Chen, T. Li, Y. Wang, S. See, C. Wang, R. Wille, N. Chen, A. Yang, and C. Lin (2025) Validating large-scale quantum machine learning: efficient simulation of quantum support vector machines using tensor networks. Machine Learning: Science and Technology 6 (1), pp. 015047. Cited by: §I.
[15] S. Y. Chen, D. Fry, A. Deshmukh, V. Rastunkov, and C. Stefanski (2022) Reservoir computing via quantum recurrent neural networks. arXiv preprint arXiv:2211.02612. External Links: Link Cited by: §V, §V.
[16] S. Y. Chen and P. Tiwari (2025) Quantum long short-term memory with differentiable architecture search. In 2025 IEEE International Conference on Quantum Artificial Intelligence (QAI), pp. 13–18. External Links: Link Cited by: §V, §V.
[17] S. Y. Chen, H. Tseng, H. Lin, and S. Yoo (2025) Learning to program quantum measurements for machine learning. In 2025 IEEE International Conference on Quantum Computing and Engineering (QCE), Vol. 1, pp. 1826–1836. Cited by: §II.
[18] S. Y. Chen, C. H. Yang, J. Qi, P. Chen, X. Ma, and H. Goan (2020) Variational quantum circuits for deep reinforcement learning. IEEE access 8, pp. 141007–141024. External Links: Link Cited by: §I.
[19] S. Y. Chen, S. Yoo, and Y. L. Fang (2022) Quantum long short-term memory. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8622–8626. External Links: Link Cited by: §I, §I, §II, §V, §V.
[20] S. Y. Chen (2023) Quantum deep recurrent reinforcement learning. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. External Links: Link Cited by: §II.
[21] S. Y. Chen (2024) Efficient quantum recurrent reinforcement learning via quantum reservoir computing. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13186–13190. External Links: Link Cited by: §II.
[22] S. Y. Chen (2024) Learning to program variational quantum circuits with fast weights. In 2024 International Joint Conference on Neural Networks (IJCNN), pp. 1–9. Cited by: §I, §II, §IV, §V, §V.
[23] S. Y. Chen (2026) Quantum artificial intelligence: from quantum neural networks to self-programming architectures [feature]. IEEE Circuits and Systems Magazine 26 (1), pp. 41–66. External Links: Document Cited by: §I.
[24] Y. Chen, A. Khaliq, and K. M. Furati (2025) Quantum recurrent neural networks with encoder-decoder for time-dependent partial differential equations. arXiv preprint arXiv:2502.13370. External Links: Link Cited by: §II.
[25] C. Chu, A. Hastak, and F. Chen (2025) LSTM-qgan: scalable nisq generative adversarial network. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. External Links: Link Cited by: §II.
[26] I. Cong, S. Choi, and M. D. Lukin (2019) Quantum convolutional neural networks. Nature Physics 15 (12), pp. 1273–1278. External Links: Document Cited by: §I.
[27] A. Delgado and K. E. Hamilton (2025) Quantum machine learning: concepts and possibilities. IOP Publishing. Cited by: §I.
[28] R. Di Sipio, J. Huang, S. Y. Chen, S. Mangini, and M. Worring (2022) The dawn of quantum natural language processing. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8612–8616. External Links: Link Cited by: §I, §II.
[29] M. Elsayed and O. A. Dobre (2025) A hybrid quantum-classical machine learning approach for self-interference cancellation in full-duplex transceivers. IEEE Communications Letters. External Links: Link Cited by: §II.
[30] L. K. Grover (1996) A fast quantum mechanical algorithm for database search. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pp. 212–219. Cited by: §I.
[31] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §II.
[32] Y. Hsu, N. Chen, T. Li, P. H. Lee, and K. Chen (2025) Quantum kernel-based long short-term memory for climate time-series forecasting. In 2025 International Conference on Quantum Communications, Networking, and Computing (QCNC), pp. 421–426. Cited by: §I.
[33] Y. Hsu, J. Jiang, C. Lin, W. Chen, K. Peng, P. Tiwari, S. Y. Chen, and E. Kuo (2025) Federated quantum kernel-based long short-term memory for human activity recognition. In 2025 IEEE International Conference on Quantum Computing and Engineering (QCE), Vol. 2, pp. 54–58. External Links: Link Cited by: §II.
[34] Y. Hsu, J. Jiang, C. Lin, K. Peng, N. Chen, S. Y. Chen, E. Kuo, and H. Goan (2025) QKAN-LSTM: quantum-inspired Kolmogorov-Arnold long short-term memory. External Links: 2512.05049, Document, Link Cited by: §II.
[35] T. Hur, L. Kim, and D. K. Park (2022) Quantum convolutional neural network for classical data classification. Quantum Machine Intelligence 4 (1), pp. 3. Cited by: §I.
[36] K. Irie, R. Csordás, and J. Schmidhuber (2023) Practical computational power of linear transformers and their recurrent and self-referential extensions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9455–9465. Cited by: §II.
[37] K. Irie, I. Schlag, R. Csordás, and J. Schmidhuber (2021) Going beyond linear transformers with recurrent fast weight programmers. Advances in neural information processing systems 34, pp. 7703–7717. Cited by: §II.
[38] S. Jerbi, C. Gyurik, S. Marshall, H. J. Briegel, and V. Dunjko (2021) Parametrized quantum policies for reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 34, pp. 28362–28375. Cited by: §I.
[39] J. Jiang, Y. Huang, T. Chen, and H. Goan (2025) Quantum variational activation functions empower Kolmogorov-Arnold networks. arXiv preprint arXiv:2509.14026. External Links: Link Cited by: §II.
[40] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret (2020) Transformers are rnns: fast autoregressive transformers with linear attention. In International conference on machine learning, pp. 5156–5165. Cited by: §II.
[41] A. Kundu, P. Bedełek, M. Ostaszewski, O. Danaci, Y. J. Patel, V. Dunjko, and J. A. Miszczak (2024) Enhancing variational quantum state diagonalization using reinforcement learning techniques. New Journal of Physics 26 (1), pp. 013034. Cited by: §I.
[42] A. Kundu (2024) Reinforcement learning-assisted quantum architecture search for variational quantum algorithms. arXiv preprint arXiv:2402.13754. Cited by: §I.
[43] A. Kutvonen, K. Fujii, and T. Sagawa (2020) Optimizing a quantum reservoir computer for time series prediction. Scientific reports 10 (1), pp. 14687. Cited by: §I.
[44] S. S. Li, X. Zhang, S. Zhou, H. Shu, R. Liang, H. Liu, and L. P. Garcia (2023) PQLM-multilingual decentralized portable quantum language model. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. External Links: Link Cited by: §I, §II.
[45] Y. Li, Z. Wang, R. Han, S. Shi, J. Li, R. Shang, H. Zheng, G. Zhong, and Y. Gu (2023) Quantum recurrent neural networks for sequential learning. Neural Networks 166, pp. 148–161. Cited by: §I.
[46] C. A. Lin, C. Liu, and K. Chen (2024) Quantum-train long short-term memory: application on flood prediction problem. In 2024 IEEE International Conference on Quantum Computing and Engineering (QCE), Vol. 2, pp. 268–273. Cited by: §I.
[47] C. Liu, K. Chen, Y. Chen, S. Y. Chen, W. Huang, W. Huang, and Y. Chang (2025) Quantum-enhanced parameter-efficient learning for typhoon trajectory forecasting. In 2025 IEEE International Conference on Quantum Computing and Engineering (QCE), Vol. 1, pp. 2046–2056. Cited by: §I.
[48] C. Liu, S. Y. Chen, K. Chen, W. Huang, and Y. Chang (2025) Federated quantum-train long short-term memory for gravitational wave signal. In IEEE INFOCOM 2025-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 1–6. Cited by: §I.
[49] C. Liu, S. Y. Chen, K. Chen, W. Huang, and Y. Chang (2025) Programming variational quantum circuits with quantum-train agent. In 2025 International Conference on Quantum Communications, Networking, and Computing (QCNC), pp. 544–548. Cited by: §II.
[50] C. Liu, C. A. Lin, C. H. Yang, K. Chen, and M. Hsieh (2024) Qtrl: toward practical quantum reinforcement learning via quantum-train. In 2024 IEEE International Conference on Quantum Computing and Engineering (QCE), Vol. 2, pp. 317–322. Cited by: §I.
[51] A. Macaluso, F. Orazi, M. Klusch, S. Lodi, and C. Sartori (2022) A variational algorithm for quantum single layer perceptron. In International Conference on Machine Learning, Optimization, and Data Science, pp. 341–356. Cited by: §I.
[52] A. Marchisio, E. Sychiuco, M. Kashif, and M. Shafique (2025) Cutting is all you need: execution of large-scale quantum neural networks on limited-qubit devices. In 2025 IEEE International Conference on Quantum Artificial Intelligence (QAI), pp. 330–336. Cited by: §I.
[53] E. Martin and C. Cundy (2018) Parallelizing linear recurrent neural nets over sequence length. In International Conference on Learning Representations, Cited by: §II.
[54] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii (2018) Quantum circuit learning. Physical Review A 98 (3), pp. 032309. External Links: Document, Link Cited by: §I.
[55] P. Mujal, R. Martínez-Peña, G. L. Giorgi, M. C. Soriano, and R. Zambrini (2023) Time-series quantum reservoir computing with weak and projective measurements. npj Quantum Information 9 (1), pp. 16. Cited by: §I.
[56] M. A. Nielsen and I. L. Chuang (2010) Quantum computation and quantum information. 10th Anniversary edition, Cambridge University Press. External Links: Document, Link Cited by: §I.
[57] A. Peruzzo, J. McClean, P. Shadbolt, M. Yung, X. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O’brien (2014) A variational eigenvalue solver on a photonic quantum processor. Nature communications 5 (1), pp. 4213. Cited by: §I.
[58] A. Rosato, A. Ceschini, F. Succetti, S. Y. Chen, and M. Panella (2025) A study on quantum reservoir recurrent models for time-constrained volatile sequence forecasting. In 2025 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. External Links: Link Cited by: §II.
[59] I. Schlag, K. Irie, and J. Schmidhuber (2021) Linear transformers are secretly fast weight programmers. In International conference on machine learning, pp. 9355–9366. Cited by: §II.
[60] J. Schmidhuber (1992) Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation 4 (1), pp. 131–139. Cited by: §II.
[61] J. Schmidhuber (1993) Reducing the ratio between learning complexity and number of time varying variables in fully recurrent nets. In ICANN’93: Proceedings of the International Conference on Artificial Neural Networks Amsterdam, The Netherlands 13–16 September 1993 3, pp. 460–463. Cited by: §II.
[62] M. Schuld, A. Bocharov, K. M. Svore, and N. Wiebe (2020) Circuit-centric quantum classifiers. Physical Review A 101 (3), pp. 032308. External Links: Link Cited by: §I.
[63] M. Schuld and N. Killoran (2019) Quantum machine learning in feature hilbert spaces. Physical review letters 122 (4), pp. 040504. Cited by: §I.
[64] A. Sequeira, L. P. Santos, and L. S. Barbosa (2023) Policy gradients using variational quantum circuits. Quantum Machine Intelligence 5 (1), pp. 18. External Links: Document Cited by: §I.
[65] P. W. Shor (1994) Algorithms for quantum computation: discrete logarithms and factoring. In Proceedings 35th annual symposium on foundations of computer science, pp. 124–134. Cited by: §I.
[66] M. Siemaszko, A. Buraczewski, B. Le Saux, and M. Stobińska (2023) Rapid training of quantum recurrent neural networks. Quantum Machine Intelligence 5 (2), pp. 31. Cited by: §I.
[67] S. Sim, P. D. Johnson, and A. Aspuru-Guzik (2019) Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms. Advanced Quantum Technologies 2 (12), pp. 1900070. Cited by: §I.
[68] A. Skolik, S. Jerbi, and V. Dunjko (2022) Quantum agents in the gym: a variational quantum algorithm for deep Q-learning. Quantum 6, pp. 720. External Links: Document Cited by: §I.
[69] J. Stein, I. Christ, N. Kraus, M. B. Mansky, R. Müller, and C. Linnhoff-Popien (2023) Applying qnlp to sentiment analysis in finance. In 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), Vol. 2, pp. 20–25. External Links: Link Cited by: §I, §II.
[70] Y. Takaki, K. Mitarai, M. Negoro, K. Fujii, and M. Kitagawa (2021) Learning temporal data with a variational quantum recurrent neural network. Physical Review Research 3 (4), pp. 043140. External Links: Document Cited by: §I.
[71] B. D. Tran, M. Fahim, B. D. McNiven, M. Guizani, H. Shin, and T. Q. Duong (2025) Quantum lstm model for estimation of energy expenditure in human aging using wearable iot healthcare technology. IEEE Internet of Things Journal. External Links: Link Cited by: §II.
[72] S. Tripathi, H. Upadhyay, and J. Soni (2025) Quantum long sort-term memory-based identification of distributed denial of service attacks. In 2025 IEEE 4th International Conference on AI in Cybersecurity (ICAIC), pp. 1–8. External Links: Link Cited by: §II.
[73] J. D. Viqueira, D. Faílde, M. M. Juane, A. Gómez, and D. Mera (2025) Density matrix emulation of quantum recurrent neural networks for multivariate time series prediction. Machine Learning: Science and Technology 6 (1), pp. 015023. Cited by: §I.
[74] S. Wang, E. Fontana, M. Cerezo, K. Sharma, A. Sone, L. Cincio, and P. J. Coles (2021) Noise-induced barren plateaus in variational quantum algorithms. Nature communications 12 (1), pp. 6961. Cited by: §I.
[75] Z. Zhang and X. Ma (2025) Wind turbine fault detection using quantum long-short term memory network. In 2025 30th International Conference on Automation and Computing (ICAC), Vol. , pp. 1–6. External Links: Document, Link Cited by: §II.