Post-processing methods for delay embedding and feature scaling of reservoir computers

Introduction

Reservoir computing is a machine learning method that exploits the dynamics of a reservoir for nonlinear mapping of inputs into a higher-dimensional space1,2. This exploitation of reservoir dynamics means that only the weights coupled to the final prediction output need to be trained. This elementary training approach, combined with the fact that a reservoir can be any dynamical system with finite memory showing consistent input responses, means that reservoir computers are well-suited for hardware implementation. In this paper, we will apply simple post-processing methods to the elementary training scheme. We will then present a post-processing method that improves reservoir computer performance for minimal computational investment. Finally, we translate our post-processing methods to the training scheme of a physical reservoir computer. This reservoir is implemented in photonic hardware using high-speed light signals for information processing.

Our post-processing methods are inspired by delay embedding. This is the process in which delayed replicas of a time series can be used to reconstruct the full state-space dynamics of the underlying dynamical system3. In a reservoir computer, the sampled driving input is nonlinearly mapped, and the resultant higher-dimensional representation is observed as a readout of reservoir states evolving in response to the driving input. Our simple post-processing methods train on delayed versions of these reservoir states, analogous to delay embedding, in order to better capture the driving system and thus improve predictive performance. Our methods can apply the delayed states in a uniform-timeshifting4,5 or random-timeshifting6,7 manner, and will do so only through state rearrangement or replication. This notably preserves the fundamental benefit of using reservoir computers by keeping nonlinear mapping restricted only to the reservoir8.

Uniform-timeshifting of reservoir states was investigated by Marquez et al.4, who concatenated each readout state vector with a delayed replica of itself. They drew a connection to Takens' delay embedding9 and showed that the delays must be chosen optimally. Picco et al. recently demonstrated that such uniformly timeshifted delays need not be evenly spaced10. The work of Del Frate et al.6 introduced random-timeshifting. In this approach, the readout of reservoir states is stored, and the feature vector used for training comprises states randomly sampled between the current time and some appropriately chosen maximal delay. Carroll and Hart7 subsequently connected the performance improvement due to random-timeshifting with an increased covariance rank of the training state matrix.

Reservoir dynamics might also implicitly perform a delay embedding of a time series input. Reservoir memory is particularly important for an optimal delay embedding and is dependent on the driving input's measured precision, the input timescale and the prediction task. In such cases, issues relating to reservoir memory can be difficult to compensate for via hyperparameter optimisation11,12. For example, for a finely discretised input, the optimal delay embedding may be larger than the timescale on which the reservoir's memory decays. Moreover, given that the most commonly used reservoirs are networks of randomly connected nodes, such architectures only allow for indirect tuning of reservoir memory through hyperparameters such as the spectral radius. Conversely, time-multiplexed delay-based reservoirs allow for direct tuning through the reservoir's internal delay. Indeed, we previously demonstrated predictive performance improvement by optimising the internal delay of a time-multiplexed reservoir5. Similar trends are shown in other works13,14,15. However, in this paper, we will focus on post-processing methods that are applied after reservoir states have already been generated, circumventing difficulties associated with reservoir hyperparameter optimisation. Our post-processing focus also contrasts with our previous work, which applied pre-processing methods at the level of the driving input. There we demonstrated that adding an optimally-delayed replica of the driving input as it is fed into the reservoir greatly improved performance. That approach could be considered a hybrid of reservoir computing and nonlinear vector regression, if the additional delayed input is viewed as training feature selection16.

In the following sections, we will establish some simple post-processing methods on our simulated reservoir computer, for a variety of chaotic attractor prediction tasks. These are the aforementioned uniform-timeshifting and random-timeshifting methods. We also expand into multi-uniform-timeshifting that takes multiple uniformly-delayed states for training. Based on key findings regarding predictive performance when we recall past states using these simple post-processing methods, we introduce our multi-random-timeshifting method. This method scales well with larger feature dimensions, is computationally cheap to implement and can outperform all other post-processing methods.

We will then translate our established methods to an experimentally-realised reservoir computing system built with photonic hardware17. Such physical reservoir computers utilise the complex nonlinear dynamics of physical systems for high-speed and energy-efficient analogue computing. Examples of physically implemented reservoirs include optical, optoelectronic and micro-mechanical systems18,19,20,21, among others. Although good computing performance has been demonstrated for a number of tasks10,22,23,24,25, this is often reliant on computationally expensive hyperparameter optimisation. There have been numerous studies into efficient methods of hyperparameter optimisation26,27,28,29,30; however, these often have drawbacks, including computational cost, lack of applicability to hardware reservoirs and lack of generalisability to a wide range of tasks. Fortunately, our post-processing methods are applied after reservoir states have already been generated. We show that our methods are able to improve the predictive performance of physical reservoir computers, and circumvent the difficulties associated with physical reservoir hyperparameter optimisation.

Methods

Reservoir computer model

Figure 1 illustrates the general architecture of our simulated reservoir computer. We feed a 1-dimensional input-series x of K steps into the reservoir, and our output is a prediction of the 1-dimensional target-series y of K steps. For every input-step $x_i \in \boldsymbol{x} \in \mathbb{R}^{K}$ driving the reservoir there are a total of n nodes through which we observe the corresponding reservoir states vector $\boldsymbol{s}_i = (s_{i,1},\ldots,s_{i,n}) \in \mathbb{R}^{n}$. For the entire input-series x of K input-steps (x1, . . . , xK) we generate a total of K sampled reservoir states vectors (s1, . . . , sK).

Fig. 1: General architecture of our simulated reservoir computer.

Input steps $x_i \in \boldsymbol{x} \in \mathbb{R}^{K}$ are mapped through the reservoir of n nodes. O = n features of the state matrix S are used for training. Post-processing may increase the number of features O. The prediction output $\hat{y}_i$ is compared to the target $y_i$ using the normalised root mean square error (NRMSE).


The focal methods of this paper relate to training on, and post-processing of, the state matrix S. To generate the state matrix S, the sampled reservoir state vectors (s1, . . . , sK) are collected into the state matrix

$$\boldsymbol{S}=\begin{bmatrix}\boldsymbol{s}_{1}\\ \boldsymbol{s}_{2}\\ \vdots \\ \boldsymbol{s}_{K-1}\\ \boldsymbol{s}_{K}\end{bmatrix}=\begin{bmatrix}s_{1,1}&\ldots&s_{1,n}&1\\ s_{2,1}&\ldots&s_{2,n}&1\\ \vdots& &\vdots&\vdots\\ s_{K-1,1}&\ldots&s_{K-1,n}&1\\ s_{K,1}&\ldots&s_{K,n}&1\end{bmatrix}$$
(1)

where the i-th row is the vector si of n states (si,1, . . . , si,n) driven by input-step xi. A bias term of one is appended to each row for training. The predicted target-series output of the reservoir $\hat{\boldsymbol{y}}$ is then given by

$$\boldsymbol{S}\,\boldsymbol{w}=\hat{\boldsymbol{y}}$$

where w is the vector of n + 1 output weights. The weights w are found by minimising the difference31 between the prediction $\hat{\boldsymbol{y}}$ and the target y. This minimisation problem is solved analytically using regularised (ridge) least squares in matrix form

$$\boldsymbol{w}={\left(\boldsymbol{S}^{T}\boldsymbol{S}+\lambda \boldsymbol{I}\right)}^{-1}\boldsymbol{S}^{T}\boldsymbol{y}$$

on a training set of k < K sequential rows of the state matrix S, where λ is the regularisation parameter and I is the identity matrix.

To evaluate the predictive performance of our trained model, we use the normalised root mean square error (NRMSE) to quantify how closely the predicted target-series $\hat{\boldsymbol{y}}$ matches the target-series y, for both the training and testing sets. The NRMSE is given by

$$\mathrm{NRMSE}=\sqrt{\frac{\sum_{i=1}^{k}{\left(y_{i}-\hat{y}_{i}\right)}^{2}}{k\,\hat{\sigma}^{2}}}$$

where k is the number of rows in the training or testing set of the state matrix S, and $\hat{\sigma}^{2}$ is the variance of that set's target y, used as an approximation of the true population variance. For simulations that undergo repeat realisations, the median NRMSE ± median absolute deviation (MAD) is recorded as the predictive performance of that realisation set.

The number of columns of the state matrix S used for training is the feature dimension O. For base training without further post-processing, the feature dimension is O = n, since the number of sampled states (si,1, . . . , si,n) per input-step xi is equal to the node dimension n. We will refer to this state matrix S with O = n features described in Eq. (1) as the base state matrix, on which further post-processing methods will be applied.
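As a concrete illustration of this base training scheme, the following minimal Python sketch (not the authors' code; the function names, the ridge value and the train/test split are our own assumptions) builds the base state matrix with its bias column, solves for the output weights as in the expression above, and evaluates the NRMSE.

```python
import numpy as np

def train_output_weights(states, target, ridge=1e-6):
    """Solve w = (S^T S + lambda*I)^{-1} S^T y on the training rows."""
    S = np.hstack([states, np.ones((states.shape[0], 1))])  # append bias column of ones
    w = np.linalg.solve(S.T @ S + ridge * np.eye(S.shape[1]), S.T @ target)
    return w

def predict(states, w):
    """Apply the trained output weights to a (rows, n) block of reservoir states."""
    return np.hstack([states, np.ones((states.shape[0], 1))]) @ w

def nrmse(prediction, target):
    """Normalised root mean square error of a training or testing set."""
    return np.sqrt(np.mean((target - prediction) ** 2) / np.var(target))

# usage sketch: `states` is the (K, n) readout, `y` the target series, k the training length
# w = train_output_weights(states[:k], y[:k])
# print(nrmse(predict(states[k:], w), y[k:]))
```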

Post-processing the state matrix

This paper focuses on several post-processing methods applied to the base state matrix S described by Eq. (1). These methods rearrange and/or replicate the base state matrix S in order to generate a post-processed state matrix. Notably, our post-processing methods keep nonlinear input mapping restricted to the reservoir, thereby preserving the computational advantage of reservoir computing8.

Uniform-timeshifting

The uniform-timeshifting post-processing method generates a state matrix with a larger feature dimension O than the base state matrix S. We start with a uniformly-timeshifted state matrix $\boldsymbol{S}_{d_2}$, which is a replica of the base state matrix S whose O = n features are timeshifted d2 input-steps into the past. This uniformly-timeshifted state matrix $\boldsymbol{S}_{d_2}$ is then horizontally concatenated (denoted by $^{\frown}$) to the base state matrix S, generating the uniform-timeshift state matrix

$$\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_{2}} = \begin{bmatrix}\boldsymbol{s}_{1}\\ \vdots \\ \boldsymbol{s}_{K}\end{bmatrix}^{\frown}\begin{bmatrix}\boldsymbol{s}_{1-d_{2}}\\ \vdots \\ \boldsymbol{s}_{K-d_{2}}\end{bmatrix} = \begin{bmatrix}s_{1,1}&\ldots&s_{1,n}\\ \vdots& &\vdots\\ s_{K,1}&\ldots&s_{K,n}\end{bmatrix}^{\frown}\begin{bmatrix}s_{1-d_{2},1}&\ldots&s_{1-d_{2},n}\\ \vdots& &\vdots\\ s_{K-d_{2},1}&\ldots&s_{K-d_{2},n}\end{bmatrix}.$$

In effect, any feature vector $\boldsymbol{s}_i \in \boldsymbol{S}$ is extended with the feature vector $\boldsymbol{s}_{i-d_2} \in \boldsymbol{S}$, where d2 is the uniform timeshift applied to the feature vector, illustrated by Fig. 2 at N = 2 state matrix replicas. In this way, the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$ that is used for training and testing has an increased feature dimension of O = 2n. Methods of concatenating uniformly timeshifted features have been established previously4,5,32.
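A minimal sketch of this concatenation, assuming the reservoir readout is available as a NumPy array of shape (K, n) without the bias column; how the first d2 rows (which have no past state available) are padded is our assumption and is not specified above.

```python
import numpy as np

def timeshift(states, d):
    """Return a replica of `states` whose rows are delayed by d input-steps."""
    if d == 0:
        return states.copy()
    shifted = np.empty_like(states)
    shifted[d:] = states[:-d]       # row i holds the state from input-step i - d
    shifted[:d] = states[0]         # pad the undefined leading rows (assumption)
    return shifted

def uniform_timeshift_matrix(states, d2):
    """Concatenate the base state matrix with its d2-delayed replica (O = 2n features)."""
    return np.hstack([states, timeshift(states, d2)])
```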

Fig. 2: Post-processing methods on the base state matrix S.

Each entry, reading left-to-right, indicates timeshifting of S under a given post-processing method. N = 1 uniform-timeshifting indicates no post-processing.


Multi-uniform-timeshifting

Inspired by delay embedding, we now take multiple timeshifts in order to better reconstruct the underlying dynamics of the driving system and improve predictive performance. The stepsize between multiple timeshifts can be of the same length, characteristic of delay embedding3, or of varying length as demonstrated by Picco et al. under their “Timesteps Of Interest” algorithm10.

Horizontally concatenating multiple uniformly-timeshifted state matrices generates a multi-uniform-timeshifts state matrix

$$\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_{2}}^{\frown}\boldsymbol{S}_{d_{3}^{\Sigma}}=\begin{bmatrix}\boldsymbol{s}_{1}\\ \vdots \\ \boldsymbol{s}_{K}\end{bmatrix}^{\frown}\begin{bmatrix}\boldsymbol{s}_{1-d_{2}}\\ \vdots \\ \boldsymbol{s}_{K-d_{2}}\end{bmatrix}^{\frown}\begin{bmatrix}\boldsymbol{s}_{1-d_{2}-d_{3}}\\ \vdots \\ \boldsymbol{s}_{K-d_{2}-d_{3}}\end{bmatrix}$$

of O = 3n features, where $d_{3}^{\Sigma}$ indicates the sum of the iterative timeshifts d2 + d3 (Fig. 2 at N = 3 replicas). For example, if the uniform-timeshifts are d2 = 2 and d3 = 8, then the first replica is the base state matrix S, the second replica is the uniformly-timeshifted state matrix $\boldsymbol{S}_{d_2}$ at d2 = 2 input-steps into the past and the third replica is the uniformly-timeshifted state matrix $\boldsymbol{S}_{d_3^{\Sigma}}$ at d2 + d3 = 2 + 8 = 10 input-steps into the past.

Concatenating N state matrix replicas generates a multi-uniform-timeshifts state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}^{\frown}\boldsymbol{S}_{d_3^{\Sigma}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{d_N^{\Sigma}}$ of O = Nn features, with the earliest state matrix replica $\boldsymbol{S}_{d_N^{\Sigma}}$ timeshifted d2 + d3 + . . . + dN input-steps into the past. Allowing for an increased number of delayed states should in general enable an improved embedding. The increased computational cost associated with the increasing number of optimisation hyperparameters will be discussed later.
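The construction generalises to N replicas as in the self-contained sketch below; the cumulative shifts follow the d2 + d3 example given above, and the padding of the leading rows is again our assumption.

```python
import numpy as np

def multi_uniform_timeshift_matrix(states, shifts):
    """shifts = [d2, ..., dN]; returns a (K, N*n) multi-uniform-timeshifts state matrix."""
    blocks = [states]
    total = 0
    for d in shifts:
        total += d                              # cumulative timeshift d_i^Sigma
        delayed = np.empty_like(states)
        delayed[total:] = states[:-total]       # rows delayed by the cumulative shift
        delayed[:total] = states[0]             # pad the undefined leading rows (assumption)
        blocks.append(delayed)
    return np.hstack(blocks)

# example: shifts [2, 8] delays the replicas by 0, 2 and 10 input-steps (O = 3n features)
```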

Random-timeshifting

The random-timeshifting post-processing method is implemented according to published methods6. Here, each of the O = n features of the base state matrix S is timeshifted a number of input-steps into the past. The timeshift for each feature is randomly chosen. This random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ can be described as

$$\boldsymbol{S}_{\boldsymbol{r}^{1}}=\begin{bmatrix}\boldsymbol{s}_{1-\boldsymbol{r}^{1}}\\ \vdots \\ \boldsymbol{s}_{K-\boldsymbol{r}^{1}}\end{bmatrix}=\begin{bmatrix}s_{1-r_{1}^{1},1}&\ldots&s_{1-r_{n}^{1},n}\\ \vdots& &\vdots\\ s_{K-r_{1}^{1},1}&\ldots&s_{K-r_{n}^{1},n}\end{bmatrix}$$

where the random-timeshifts vector r1 consists of random integers $r_{1}^{1},\ldots,r_{n}^{1}$ ranging from 0 to some max-random-timeshift R. Thus, the j-th feature of the base state matrix S is timeshifted $r_{j}^{1} \in \boldsymbol{r}^{1}$ input-steps into the past, illustrated by Fig. 2 at N = 1 state matrix replica. The feature dimension of this random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ remains at O = n.

The max-random-timeshift R parameter sets the range of random-timeshifts $r \in \boldsymbol{r}$ that may be randomly chosen. It is analogous to the method used by Del Frate et al., where the half-life of the autocorrelation function is multiplied by a real scalar6 to determine the range of allowed random timeshifts $r \in \boldsymbol{r}$. The max-random-timeshift R can be optimised with a one-dimensional scan.
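A minimal sketch of random-timeshifting as we read it from the description above; the random number generator and the padding of the leading rows are our assumptions.

```python
import numpy as np

def random_timeshift_matrix(states, R, rng=None):
    """Delay each feature by a random integer from {0, ..., R}; O stays at n."""
    rng = np.random.default_rng() if rng is None else rng
    K, n = states.shape
    r = rng.integers(0, R + 1, size=n)      # one random timeshift per feature
    shifted = np.empty_like(states)
    for j, d in enumerate(r):
        shifted[d:, j] = states[:K - d, j]
        shifted[:d, j] = states[0, j]       # pad the undefined leading rows (assumption)
    return shifted, r
```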

Prediction tasks

Various chaotic time series prediction tasks are performed on the simulated reservoir to establish and quantify our post-processing methods in this manuscript. Our driving input series is a coordinate of one of the Lorenz, Rössler, or Mackey–Glass attractors. Our target series is another coordinate of the attractor for cross-prediction tasks, or the input series coordinates a number of steps ahead for forecasting tasks. For configuring our simulated reservoir computer and establishing our post-processing methods we carried out prediction tasks on the Lorenz attractor.

Lorenz attractor driving input-series x = [x1, . . . , xK] was the Lorenz x-coordinate defined by

$$\begin{aligned}\frac{\mathrm{d}x}{\mathrm{d}t} &= a(y-x),\\ \frac{\mathrm{d}y}{\mathrm{d}t} &= x(b-z)-y,\\ \frac{\mathrm{d}z}{\mathrm{d}t} &= xy-cz,\end{aligned}$$

and cross-prediction target-series y was the Lorenz z-coordinate33.

The simulated reservoir training models established on the Lorenz attractor were corroborated with Mackey–Glass attractor P-to-P 10-step-ahead forecasting tasks (Suppl. Note 3). Mackey–Glass attractor driving input-series x was the Mackey–Glass P-coordinate defined by

$$\frac{\mathrm{d}P}{\mathrm{d}t}=\frac{\beta_{0}\,\theta^{n}\,P(t-\tau_{\mathrm{MG}})}{\theta^{n}+P{(t-\tau_{\mathrm{MG}})}^{n}}-\gamma P(t)$$

and the forecasting target-series y was the 10-step-ahead Mackey–Glass P-coordinate34. All time series coordinates were iterated using 4th-order Runge–Kutta. For the delay terms in the Mackey–Glass equation, cubic Hermite interpolation was used. The system parameters for the Lorenz and Mackey–Glass attractors are given in Table 1. Cross-prediction tasks on the Rössler attractor were also carried out in order to corroborate our established methods (Suppl. Note 4).
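As an illustration, a short Python sketch of generating the Lorenz input and target series with 4th-order Runge–Kutta is given below; the parameter values and integration step are the commonly used ones and are assumptions here, since the Table 1 values are not reproduced in this section.

```python
import numpy as np

def lorenz_rhs(state, a=10.0, b=28.0, c=8.0 / 3.0):
    x, y, z = state
    return np.array([a * (y - x), x * (b - z) - y, x * y - c * z])

def integrate_lorenz(K, dt=0.02, initial=(1.0, 1.0, 1.0)):
    """4th-order Runge-Kutta integration; returns a (K, 3) trajectory."""
    traj = np.empty((K, 3))
    state = np.array(initial, dtype=float)
    for i in range(K):
        k1 = lorenz_rhs(state)
        k2 = lorenz_rhs(state + 0.5 * dt * k1)
        k3 = lorenz_rhs(state + 0.5 * dt * k2)
        k4 = lorenz_rhs(state + dt * k3)
        state = state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        traj[i] = state
    return traj

# x-to-z cross-prediction: input series = traj[:, 0], target series = traj[:, 2]
```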

Table 1 Task and reservoir parameters

The simulated reservoir training models were translated to experimentally-realised readout data acquired from a laser-based photonic reservoir system. Targets were P-to-P 1-step-ahead or 10-step-ahead coordinates of the input series, corresponding to a Mackey–Glass attractor.

Reservoir systems

Simulated reservoir

For our simulated reservoir, we use a time-multiplexed nonlinear delayed map introduced by Jaurigue et al.5,16. This system describes a semiconductor optical amplifier subject to weak self-feedback given by

$$s_{t+\theta}=\frac{a\left(b\,s_{t-\tau}+x_{i}\,m\,g\right)}{1+b\,s_{t-\tau}+x_{i}\,m\,g}$$

with simulated reservoir parameters specified in Table 1. The input-series x is masked (m) and scaled (g). The evolving reservoir state st is observed once during each mask interval θ and then indexed into the state vector (si,1, . . . , si,n) driven by input-step $x_i \in \boldsymbol{x}$. There are a total of n nodes through which we observe the reservoir states $s \in \boldsymbol{s}_i$, distinguished by their unique mask value, so the length of a clock cycle driven by input-step xi is nθ.
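A rough sketch (our own, not the authors' implementation) of how such a time-multiplexed readout can be simulated from the delayed map above; the parameter values, the random mask distribution and the choice of internal delay equal to one clock cycle (τ = nθ) are placeholders for the actual Table 1 values.

```python
import numpy as np

def simulate_reservoir(x, n=50, a=0.9, b=0.1, g=0.5, tau_steps=None, rng=None):
    """Iterate the delayed map on the theta grid; returns a (K, n) state matrix."""
    rng = np.random.default_rng() if rng is None else rng
    m = rng.uniform(-1.0, 1.0, size=n)          # random input mask, one value per virtual node
    tau_steps = n if tau_steps is None else tau_steps   # internal delay in units of theta
    K = len(x)
    s = np.zeros(K * n + tau_steps)             # reservoir state sampled once per theta
    for i in range(K):                          # one clock cycle of n*theta per input-step x_i
        for j in range(n):
            t = tau_steps + i * n + j
            drive = b * s[t - tau_steps] + x[i] * m[j] * g
            s[t] = a * drive / (1.0 + drive)
    return s[tau_steps:].reshape(K, n)          # one row of n sampled node states per x_i
```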

Given that we apply a random mask on the driving input for our simulated reservoir5, each reservoir computer configuration undergoes repeat realisations, with each realisation using a different random mask. When applicable, each realisation also uses a different vector of random timeshifts r. For specific details on our simulated reservoir and the time-multiplexed masking procedure, we refer the reader to previous work5, given that the reservoir has no influence on the post-processing methods which are the focus of this paper.

Physical photonic reservoir

To investigate post-processing methods on a physical reservoir system, we utilised the readout of our previously published photonic reservoir based on a vertical-cavity surface-emitting laser (VCSEL) system17. This system enabled operation with high-speed (GHz-rate) and low-power (sub-mW) optical input signals, and offered a hardware-friendly approach using a single VCSEL requiring very low bias currents and operating at a key optical telecom wavelength. This VCSEL-based photonic reservoir demonstrated good performance in data classification and time series prediction tasks across a wide parameter space. Physical reservoir parameters are given in Table 1. For specific details on the physical reservoir, we refer the reader to our relevant work17.

We used a single hardware-realised readout, rather than a set of realisations each using a different random mask, as is the case in our simulated reservoir computer. Multiple realisations were required in post-processing when using different vectors of random timeshifts r.

Results

Simple post-processing

We established the simulated reservoir computer configurations and post-processing methods by performing Lorenz x-to-z cross-prediction tasks. Key results for characterising the performance of our post-processing methods are summarised in Table 2, which we reference in the following sections.

Table 2 Simulated reservoir computer results of Lorenz x-to-z cross-prediction

In Table 2, "N replicas" refers to the number of state matrix replicas used to generate the final state matrix used for training and testing. "O features" refers to the feature dimension of this state matrix, expressed as the multiple Nn of the node dimension n. Results in the "uniform-timeshifting, $\boldsymbol{S}^{\frown}\ldots^{\frown}\boldsymbol{S}_{d_N^{\Sigma}}$" category are for state matrices with uniformly-timeshifted features, which includes the base state matrix S at N = 1. Results in the "random-timeshifting, $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$" category are for state matrices with randomly-timeshifted features. We present the median NRMSE ± MAD under optimal delay configurations, for increasing replica number N. The reservoir is fixed at n nodes.

Uniform-timeshifting improved performance

We start with the results for uniformly timeshifted state matrices summarised in Table 2. For N = 1 replica, this corresponds to no post-processing of the base state matrix S. As expected, the base training method demonstrates inferior performance when compared to any of the optimised post-processing methods performed on the base state matrix S.

For N = 2 replicas, we horizontally concatenate a uniformly-timeshifted state matrix to the base state matrix in order to generate the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$ of O = Nn = 2n features. The optimal uniform-timeshift d2 is also listed in Table 2 and was determined via a scan, illustrated in Fig. 3a (teal line). A 74% decrease in error was observed. Superior performance was observed even at sub-optimal configurations of the uniform-timeshift d2, which we attribute to the increase in the feature dimension O from n to 2n. The superior performance of training on the optimally-delayed uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$, compared to training only on the base state matrix S (orange dashed line), is consistent with previously published results4,5.

Fig. 3: Timeshifting on the simulated reservoir computer, Lorenz x-to-z cross-prediction.

a Scan of uniform-timeshift d2, for training error (grey line) and testing error (teal line) at N = 2 replicas and O = 2n features. The orange dashed line indicates the result of training on the base state matrix S. The shaded area around each line indicates the MAD of the realisation set. b Scan of multi-uniform-timeshifts d2 vs d3, for testing error at N = 3 replicas and O = 3n features. c Scan of multi-uniform-timeshifts d2 vs d3 vs d4, for testing error at N = 4 replicas and O = 4n features. d Scan of max-random-timeshift R for covariance rank (maroon line), training error (grey line) and testing error (teal line) at N = 1 replica and O = n features. The orange dashed line indicates the result of training on the base state matrix S. The shaded area around each line indicates the MAD of the realisation set. NRMSE normalised root mean square error, MAD median absolute deviation.


In general, reservoir computer performance scales with the size of the internal and readout layers of its reservoir, provided there are sufficient readout nodes for adequately sampling the reservoir. This is reflected in the information processing capacity introduced by Dambre et al.35, where it was shown that the maximal possible information processing capacity is equal to the readout node dimension n, which corresponds to the O = n features of the base state matrix S. The improved performance of the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$ can therefore, in part, be attributed to its increased feature dimension O = 2n, compared to the O = n feature dimension of the base state matrix S. Viewed in this way, uniform-timeshifting is a computationally-inexpensive post-processing alternative to simulating larger reservoirs for higher-dimensional mapping of the input-series36,37, while achieving similar results. In the case of physical reservoir systems, post-processing methods are sometimes the only viable option for increasing state matrix feature dimensions when reservoir parameters, such as the node dimension, are constrained.

Multi-uniform-timeshifting improved performance

From the established uniform-timeshifting post-processing method, we proceed with multi-uniform-timeshifting, which increases the final state matrix feature dimension beyond O = 2n.

With N = 3 state matrix replicas we generate a multi-uniform-timeshifts state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}^{\frown}\boldsymbol{S}_{d_3^{\Sigma}}$ of feature dimension O = 3n. Table 2 lists the NRMSE for N = 3 replicas, referring to the performance at optimal uniform-timeshifts d2 and d3 determined via a 2-dimensional scan. We observed a 60% decrease in error over N = 2 replicas. For N replicas we generate multi-uniform-timeshifts state matrices of feature dimension O = Nn, with the optimal uniform-timeshifts [d2, . . . , dN] listed, determined via (N − 1)-dimensional scans. Going to N = 4, N = 5 and N = 6 replicas yields a 51%, 30% and 13% decrease in error, respectively. Thus, a trend of increasingly improved predictive performance with increasing O = Nn feature dimensions was observed.

In addition to increasing the state matrix features O = Nn, the performance gain with increasing N replicas and optimal uniform-timeshifts d is related to delay embedding4,11,37. The optimal values of the uniform-timeshifts $[d_2,\ldots,d_N] = \boldsymbol{d}$ are not all the same, indicating that the optimal delay embedding for this task occurs when the uniform-timeshifts $d \in \boldsymbol{d}$ are not evenly spaced. In general, not all delay embeddings work equally well when the input data has only finite precision3, as the theoretical guarantees of Takens' delay embedding theorem (where delayed replicas of an input time series can be used to reconstruct the full state-space dynamics of the underlying dynamical system) only hold for infinite precision9. Furthermore, optimal delay embeddings depend on the criteria by which they are judged, which in our case is the NRMSE of the resulting cross-prediction. Moreover, since the reservoir itself performs a delay embedding11,37,38, this affects the optimal delay embedding provided by multi-uniform-timeshifting of the reservoir states. Indeed, we observed that the optimal uniform-timeshift d2 at N = 2 replicas changes with our simulated reservoir delay τ (Suppl. Note 3).

Finally, for multi-uniform-timeshifting, we show that the optimal configurations for the uniform-timeshifts d are constrained to a single zone only. For example, Fig. 3b illustrates the results of a 2-dimensional scan over uniform-timeshift [d2, d3] configurations when concatenating N = 3 replicas. We see superior performance for a single zone (dark teal/black area) of [d2, d3] configurations. Figure 3c also illustrates a single optimal configuration zone (dark teal/black area) for N = 4 replicas, as was observed for N = 3. We do not observe multiple or repeating optimal zone patterns, illustrating that configurations must be specifically chosen for superior performance. This indicates the importance of thorough scans to determine the optimal uniform-timeshift configurations $d \in \boldsymbol{d}$, which becomes computationally intractable with increasing N replicas.

Random-timeshifting improved performance

Given our indications that multi-uniform-timeshifting requires optimally chosen uniform-timeshift configurations $d \in \boldsymbol{d}$ for superior performance, we looked to post-processing methods that perform well with randomly chosen timeshift parameters. Hence, we established the random-timeshifting post-processing method for our simulated reservoir computer. The "random-timeshifting, $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$" column of Table 2 lists the result of training on the random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ at N = 1 replica. The optimal max-random-timeshift R found via a parameter scan is listed alongside the NRMSE, with Fig. 3d illustrating this scan.

We report a 55% decrease in error when training on the optimal random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ (N = 1), compared to the base state matrix S with the same feature dimension of O = n. This finding is consistent with the work of Carroll and Hart7, and Del Frate et al.6, who also demonstrated improved performance after random timeshifting of the base state matrix, measured by a reduced prediction error or a reduction in required reservoir state samplings per input.

The performance improvement as a result of random-timeshifting cannot be explained by a feature dimension increase as it is for uniform-timeshifting, because the random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ shares the same feature dimension of O = n as the base state matrix S. Additionally, although random-timeshifting introduces delayed versions of states, these timeshifted states replace the non-timeshifted base states as training features. Training on delayed replicas of states that are appended to the base state matrix, as we do in uniform-timeshifting, is more analogous to delay embedding.

Instead, a possible explanation for this performance improvement is that random timeshifting increases the covariance rank of the final training matrix, implying higher-dimensional mapping of the input7,39. Figure 3d shows how the covariance rank (maroon line) and NRMSE (teal line) change with increasing max-random-timeshift R. Indeed, we see that the covariance rank is larger at the optimal max-random-timeshift R = 21 compared to the base state matrix without random-timeshifting, consistent with previously published explanations7,39. However, the complete scan shows that the covariance rank of the random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ increases with max-random-timeshift R, and the covariance rank then remains high even as predictive performance degrades past the optimal max-random-timeshift R > 21. Note that changing the prediction task from x-to-z cross-prediction to x-to-x 1-step-ahead forecasting gives the same outcome, where optimal performance is achieved when the covariance rank is minimal (Suppl. Note 2). Thus, a large covariance rank may not guarantee good performance, and the assumption that a state matrix with a larger covariance rank generally outperforms a state matrix with a smaller covariance rank7,12,39,40 warrants further investigation.
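For reference, the covariance-rank diagnostic used here can be computed directly from the training features; the sketch below is our own (with NumPy's default rank tolerance) and excludes the bias column.

```python
import numpy as np

def covariance_rank(states):
    """Rank of the covariance matrix of the state-matrix features (bias column excluded)."""
    cov = np.cov(states, rowvar=False)      # (O, O) covariance across feature columns
    return np.linalg.matrix_rank(cov)
```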

Recall implementation

Here we will report our investigation of recall, which refers to delayed versions of states from the base state matrix S being replicated in post-processing, in order to build up training state matrices of larger feature dimensions. For example, uniform-timeshifting that concatenates N = 2 state matrix replicas will result in states appearing up to N = 2 times in the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$. The recalled state replicas $\boldsymbol{S}_{d_2}$ are delayed versions of those in the base state matrix S, separated by the uniform-timeshift d2.

Reservoir computing for time series prediction seems to work both by delay embedding the input time series11 and by projecting the input data into a higher-dimensional space31. The latter is generally related to the number of nodes of the reservoir. In the context of recall and uniform-timeshifting, we show that reservoirs with a smaller node dimension perform better than reservoirs with a larger node dimension, which goes against the general assumption that a higher-dimensional reservoir results in better computational performance35.

To characterise the recall phenomenon, recall J will indicate how many nodes have their sampled states recalled during post-processing, thereby introducing delayed versions of node states in the final state matrix. We start with an extended state matrix S of feature dimension O = 2n, generated from a larger reservoir of 2n nodes. A possible configuration of timeshifts $[r_1,\ldots,r_{2n}] = \boldsymbol{r}$ applied to this extended state matrix S is illustrated in Fig. 4a at recall J = 0. For this timeshift vector r, the first n timeshifts [r1, . . . , rn] are 0 and the second subset of n timeshifts [rn+1, . . . , r2n] are at the max-timeshift R.

Fig. 4: Recall implementation on the simulated reservoir computer, Lorenz x-to-z cross-prediction.

a Applying recall to the 2n-extended base state matrix S. Null recall J = 0 generates a timeshift matrix Sr. Partial recall J = 0.5n generates a transitional final state matrix. Total recall J = n generates a uniform-timeshift matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$. b Scan of recall J vs uniform-timeshift d2 = 1 (black line) and d2 = 21 (teal line), for testing error at O = 2n features. The orange dashed line indicates the result of training on the extended base state matrix S of node dimension 2n. The shaded area around each line indicates the MAD of the realisation set. c Scan of recall J vs uniform-timeshift d2, for testing error at O = 2n features. d Scan of recall J vs multi-uniform-timeshifts d2 vs d3, for testing error at O = 3n features. NRMSE normalised root mean square error, MAD median absolute deviation.


The constructed timeshifted state matrix Sr in Fig. 4a is similar to the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$ at recall J = n. Here, both post-processed state matrices have O = 2n features, and the uniform-timeshift d2 is equal to the max-timeshift R. The key difference between these state matrices is the number of nodes required to generate them. For the constructed timeshift state matrix Sr, the timeshifted features [n + 1: 2n] are sampled states of the nodes [n + 1: 2n], from a reservoir of 2n nodes. Conversely, for the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$, the timeshifted features [n + 1: 2n] are generated by recalling the sampled states of nodes [1: n], from a smaller reservoir of n nodes. Specifically, the recalled node states are collected at d2 = R input-steps into the past, thereby clamping the required node dimension to n. Thus, in Fig. 4a at null recall J = 0, we generate the constructed timeshift state matrix Sr from a readout layer of 2n nodes. Although the features [n + 1: 2n] are timeshifted by R = d2 input-steps into the past, the final state matrix will have all states from the 2n nodes appearing only once. At total recall J = n, we generate the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$ from a readout layer of n nodes. All n nodes will have delayed replicas of their sampled states appearing in the final state matrix at d2 = R input-steps into the past. A transitional state matrix at recall J = 0.5n illustrates the case where only a third of the 1.5n nodes have their sampled states recalled.
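Our reading of this construction is sketched below; which nodes are recalled first and how the columns are ordered are our assumptions, since only the limiting cases J = 0 and J = n are fully specified above.

```python
import numpy as np

def delayed(block, R):
    """Replica of a (K, m) block with rows delayed by R input-steps (leading rows padded)."""
    out = np.empty_like(block)
    out[R:] = block[:block.shape[0] - R]
    out[:R] = block[:1]                              # pad the undefined leading rows (assumption)
    return out

def recall_state_matrix(extended, n, J, R):
    """`extended` is the (K, 2n) readout of a 2n-node reservoir; the output has 2n features.
    J = 0: all 2n nodes are used and the second half of the features is delayed by R.
    J = n: only nodes 1..n are used; the second half are their R-delayed replicas."""
    first = extended[:, :n]                          # nodes 1..n, no timeshift
    new_nodes = delayed(extended[:, n:2 * n - J], R) # delayed states from extra nodes n+1..2n-J
    recalled = delayed(extended[:, :J], R)           # R-delayed replicas of nodes 1..J
    return np.hstack([first, new_nodes, recalled])
```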

Total recall improves performance even as it reduces node dimension

We have shown how simple post-processing methods can introduce delayed states into the training scheme that lead to predictive performance improvement at optimal timeshift configurations. For uniform-timeshifting and multi-uniform-timeshifting, we attributed this to both an improved delay embedding of the input and an increase in the number of training features O. Recall J indicates how many nodes will have delayed replicas of sampled node states appearing in the final state matrix, and we investigated how recall affects prediction performance with delay embedding and feature dimension scaling effects in mind.

In Fig. 4b, we explored the effect of increasing recall J at specific uniform-timeshift d2 values. The left side of the graph indicates null recall J = 0, and the right side indicates total recall J = n for generating the final state matrix of O = 2n features, corresponding to the left-to-right orientation of the Fig. 4a schematics. We see that at sub-optimal timeshift R = d2 = 1 (black line) performance suffers as recall J increases. This is to be expected, since with increasing recall J values from 0 to n the node dimension of the reservoir decreases from 2n to n. However, the trend reverses unexpectedly at timeshift R = d2 = 21 (teal line), where we reach peak performance at total recall J = n, corresponding to the smallest node dimension n.

To explain this phenomenon, we can look to delay embedding. The optimal delayed states introduced in post-processing are those that are timeshifted by R = d2 = 21 input-steps into the past. Thus, at null recall J = 0 we generate the constructed timeshifted state matrix Sr with max-timeshift R = 21, and observe improved performance over sub-optimal delays. As we increase from null recall J = 0 to total recall J = n, we increase the number of delayed state replicas used for the final training state matrix. By increasing this recall J, performance subsequently improves. Once we are at total recall J = n we generate the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$, which shows the best performance. The uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$ introduces delayed replicas of states at d2 = 21 input-steps into the past, and in this way is a more analogous implementation of delay embedding. This is in contrast to the constructed timeshifted state matrix Sr, which introduces delayed states from new nodes, rather than recalling states from the past. Viewed in this way, we consider recall J as a way to scale the influence of a larger reservoir vs the influence of delay embedding. Therefore, the optimal amount of recall J depends on whether the delay embedding is also optimal.

For completeness in Fig. 4b, we include the result of the state matrix with null recall J = 0 at d2 = 0 (orange dashed line), which is equivalent to a base state matrix derived from a reservoir of node dimension 2n. The inferior performance of the base state matrix derived from 2n nodes demonstrated that introducing even a sub-optimal uniform timeshift leads to better predictive performance.

In Fig. 4c, we see predictive performance when scanning increasing recall J with uniform-timeshift d2 values. We find superior performance (dark teal/black area) at total or near total recall J ≈ n when uniform-timeshift d2 is optimal, even though the reservoir node dimension is at its lowest. Moreover, this superior performance at total or near total recall J ≈ n is demonstrated even when the readout node dimension is not rescaled, but kept at 2n (Suppl. Note 1), indicating that explicit delay embedding effects introduced in post-processing dominate any implicit changes in reservoir memory as a result of changing the node dimension from 2n to n.

Figure 4d illustrates the effect of recall with N = 3 replicas, generating the multi-uniform-timeshifts state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}^{\frown}\boldsymbol{S}_{d_3^{\Sigma}}$ of feature dimension O = 3n. Even in this higher-dimensional case, the optimal configurations of d2 and d3 demonstrate superior predictive performance (dark teal/black area) as we approach total recall J = n. The node dimension subsequently shrinks from 3n to n, indicating that increasing the influence of an optimal delay embedding overcomes any potential negative effect of a smaller reservoir.

Given that increasing recall J reduces the reservoir node dimension n, we evaluated the effect of recall J on the covariance rank of the state matrix, because it has been hypothesised that larger covariance ranks generally lead to superior predictive performance7,12,39,40. We observed that the covariance rank derived from the same realisation set used for Fig. 4c is not correlated with zones of superior performance (Suppl. Note 1). This corroborates Fig. 3d, where the covariance rank of the random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ monotonically increased with max-random-timeshift R.

Multi-random-timeshifting

We now introduce multi-random-timeshifting. The motivation for this post-processing method is to increase the feature dimension of the final training state matrix, allow for better delay embeddings and utilise the lower optimisation cost of randomly-timeshifting features. We were inspired by recall, which scales the influence of delay embedding, and leads to improved performance as if we had increased reservoir size directly.

Multi-random-timeshifting recalls past states and increases the state matrix feature dimension beyond O = n, as is the case for uniform-timeshifting but not for random-timeshifting. N state matrix replicas are each timeshifted by their own random-timeshift vector r, where the j-th component of the i-th random-timeshift vector, $r_{j}^{i} \in \boldsymbol{r}^{i}$, is unique across all random-timeshift vectors [r1, . . . , rN]. The randomly-timeshifted state matrix replicas are then horizontally concatenated. As an example, concatenating N = 2 randomly-timeshifted state matrix replicas

$$\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{2}} = \begin{bmatrix}\boldsymbol{s}_{1-\boldsymbol{r}^{1}}\\ \vdots \\ \boldsymbol{s}_{K-\boldsymbol{r}^{1}}\end{bmatrix}^{\frown}\begin{bmatrix}\boldsymbol{s}_{1-\boldsymbol{r}^{2}}\\ \vdots \\ \boldsymbol{s}_{K-\boldsymbol{r}^{2}}\end{bmatrix} = \begin{bmatrix}s_{1-r_{1}^{1},1}&\ldots&s_{1-r_{n}^{1},n}\\ \vdots& &\vdots\\ s_{K-r_{1}^{1},1}&\ldots&s_{K-r_{n}^{1},n}\end{bmatrix}^{\frown}\begin{bmatrix}s_{1-r_{1}^{2},1}&\ldots&s_{1-r_{n}^{2},n}\\ \vdots& &\vdots\\ s_{K-r_{1}^{2},1}&\ldots&s_{K-r_{n}^{2},n}\end{bmatrix}$$

generates a multi-random-timeshifts state matrix of feature dimension O = 2n, illustrated in Fig. 2 at N = 2 replicas. Multi-random-timeshifting with N replicas means that the states of each node are recalled up to N times, in order to generate the multi-random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$ of O = Nn features.

We can generate any desired feature dimension O using this method. In the case where the number of features n of the base state matrix S is not a divisor of the desired feature dimension O, the N-th random-timeshifts state matrix replica $\boldsymbol{S}_{\boldsymbol{r}^{N}}$ is only a partial replica, omitting the last $\left(n-(O \bmod n)\right)$ features and thus containing only the first $O \bmod n$ features.
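A minimal sketch of multi-random-timeshifting for an arbitrary feature dimension O, under our reading that the shifts used for each node are drawn without replacement from {0, . . . , R} (so no delayed state is duplicated) and that a partial final replica simply takes the first remaining columns; the padding of leading rows follows the earlier sketches and is an assumption.

```python
import numpy as np

def multi_random_timeshift_matrix(states, O, R, rng=None):
    """Build a (K, O) multi-random-timeshifts state matrix from a (K, n) readout."""
    rng = np.random.default_rng() if rng is None else rng
    K, n = states.shape
    N = -(-O // n)                                  # number of (possibly partial) replicas
    assert R >= N - 1, "need at least N distinct shifts per node (R >= N - 1)"
    shifts = [rng.permutation(R + 1)[:N] for _ in range(n)]   # unique shifts per node
    out = np.empty((K, O))
    for col in range(O):
        replica, j = divmod(col, n)                 # replica index and source node
        d = shifts[j][replica]
        out[d:, col] = states[:K - d, j]
        out[:d, col] = states[0, j]                 # pad the undefined leading rows
    return out
```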

Multi-random-timeshifting gives the best performance

The multi-random-timeshifting results in Table 2 list the best performances when concatenating N ≥ 2 randomly-timeshifted state matrix replicas $\boldsymbol{S}_{\boldsymbol{r}^{N}}$. The optimal max-random-timeshift R is also listed. We report a 62% decrease in error when training on N = 2 concatenated randomly-timeshifted state matrices, compared to the single N = 1 randomly-timeshifted state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$. The resulting multi-random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{2}}$ has O = 2n features.

Performance up to N = 6 replicas shows continued improvement, training on feature dimensions up to O = 6n. Additionally, we observe that predictive performance up to N = 6 is the same when implementing either multi-uniform-timeshifting or multi-random-timeshifting.

We proceed with multi-random-timeshifting for N > 6 replicas. We see that performance continues to improve with increasing replica number N, beyond what we achieved with multi-uniform-timeshifting. This improvement is also visible as an increasingly better overlap between the predicted Lorenz z-coordinates $\hat{\boldsymbol{y}}$ and the target y (Suppl. Note 1).

We then investigated the parameter space for optimal max-random-timeshift R configurations, illustrated in Fig. 5a. Here, we scanned the max-random-timeshift R against increasing feature dimension of the multi-random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$. The size of the reservoir is clamped at n nodes, while the number of features O increases in increments of 1. When focusing on any specific max-random-timeshift R, we found that predictive performance generally improves with larger feature dimension O, even at partial multiples of n.

Fig. 5: Multi-random-timeshifting on the simulated reservoir computer, Lorenz x-to-z cross-prediction.

a Scan of feature dimension O (up to O = 13n features) vs max-random-timeshift R, for testing error. Maroon dots indicate optimal max-timeshift R at feature dimension O. The maroon dashed line indicates the minimum value that the max-random-timeshift R may take at a given feature dimension. b Scan of feature dimension O (up to O = 120n features) vs max-random-timeshift R, for testing error (teal line). For O ≤ 20n, data are complete for partial replicas of the state matrix. Maroon dots indicate optimal max-timeshift R at feature dimension O. The maroon dashed line indicates the minimum value that the max-random-timeshift R may take at a given feature dimension. The orange dashed line indicates the result of training on the base state matrix S. The shaded area around each line indicates the MAD of the realisation set. NRMSE normalised root mean square error, MAD median absolute deviation.


Another illustration of these data is given in Fig. 5b, where we observed that the optimal max-random-timeshift R (maroon dots) generally increases as the feature dimension O increases. This could be expected, since a higher number of sampled states from each node must be recalled in order to generate the multi-random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$. With very high replica numbers up to N ≈ 100, multi-random-timeshifting continues to improve (teal line). Beyond N ≈ 100 replicas, we observe the optimal max-random-timeshift R-value consistently at the theoretical minimum R = N − 1, and predictive performance no longer improves.

The clamped reservoir of n nodes at total recall J = n performed just as well as the correspondingly upscaled reservoir of Nn nodes at null recall J = 0, indicating that implementing multi-random-timeshifting with total recall J = n is an effective way to increase the feature dimension O and maintain the superior predictive performance of this method, while also clamping the reservoir size to n nodes (Suppl. Note 1).

The simple post-processing and the multi-random-timeshifting methods were established on the Lorenz attractor. These methods were corroborated with the Mackey–Glass P-to-P 10-step-ahead forecasting (Suppl. Note 3) and Rössler x-to-z cross-prediction (Suppl. Note 4) tasks on the simulated reservoir computer. We consistently observed our post-processing methods improving performance. As expected, increasing the number of replicas N improved performance even further. Moreover, increased recall J resulted in superior performance, while also reducing reservoir node dimension from Nn to n.

Optimising timeshift delays

Finding the optimal timeshift configurations for our post-processing methods is key to an optimal delay embedding. For multi-uniform-timeshifting, we saw that the optimal uniform-timeshifts d vary with increasing N. As the number of replicas N increases, the uniform-timeshift configuration $\boldsymbol{d} = [d_2,\ldots,d_N]$ must be re-optimised by an (N − 1)-dimensional scan. To perform this optimisation for all uniform-timeshifts d, the number of linear regression operations required increases exponentially as

$$d_{\mathrm{scan}}^{\,N-1}$$

where dscan is the maximum of the range of uniform-timeshifts $d \in \boldsymbol{d}$ to be scanned.

For multi-random-timeshifting, only a 1-dimensional scan of the max-random-timeshift R is required even as the number of replicas N increases. The number of linear regression operations required for optimising the max-random-timeshift R increases as

$$R_{\mathrm{scan}}-N+1$$

where Rscan is the maximum of the range of max-random-timeshifts R to be scanned.

Computational time benchmarks on our simulated reservoir computer are given as an example. We require 190 s to compute a random-mask realisation set for a post-processed training matrix of O = 10n features from N = 10 replicas. Optimising the max-random-timeshift R, with a scanning range of Rscan = 180, then requires

$$190\,\mathrm{s}\times(180-10+1)\approx 9\,\mathrm{h}.$$

Comparatively, the computational time required to optimise the multi-uniform-timeshifts $\boldsymbol{d} = [d_2,\ldots,d_{10}]$, with a scanning range of dscan = 20 for each uniform-timeshift such that dscan(N − 1) = 180 matches the max-random-timeshift scanning range Rscan = 180, is

$$190\,\mathrm{s}\times 20^{10-1}\approx 3.1\ \text{million years}.$$
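The scan-count arithmetic above can be reproduced with the values quoted in the text (190 s per realisation set, Rscan = 180, dscan = 20, N = 10):

```python
t_set = 190.0                                   # seconds per random-mask realisation set
R_scan, d_scan, N = 180, 20, 10

t_random = t_set * (R_scan - N + 1)             # multi-random-timeshifting: 1-D scan
t_uniform = t_set * d_scan ** (N - 1)           # multi-uniform-timeshifting: (N-1)-D scan

print(t_random / 3600)                          # ~9 hours
print(t_uniform / (3600 * 24 * 365.25))         # ~3.1 million years
```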

Due to this ease of optimisation for larger Nn feature dimensions, multi-random-timeshifting ultimately outperforms multi-uniform-timeshifting in terms of scalability and predictive performance. By similar reasoning, random-timeshifting is preferable to non-random timeshifting of training features: finding non-random, optimal timeshifts $[r_1,\ldots,r_n] = \boldsymbol{r}$ quickly becomes computationally intractable as n increases, due to the n-dimensional scan required to optimise every timeshift $r \in \boldsymbol{r}$ applied to each feature. First-order approximations of optimal timeshifts have been implemented previously6.

Translation to a physical reservoir

Post-processing methods were established, characterised and corroborated on our simulated reservoir computer. Implementing our post-processing methods improves predictive performance, and they need only be applied after the readout is generated. Thus, in this section we translate our methods to a physical reservoir implemented in a photonic system, enabling high-speed operation using GHz-rate data inputs. To accomplish this, we took the readout of a previously published VCSEL-based photonic reservoir system17 that was used for high-dimensional mapping of the input. Figure 6a illustrates the experimental setup. As in the simulated reservoir, the experimental setup features an internal time delay (τ), as well as masking (m) and scaling (g) of the input. The key difference between the simulated reservoir and the experimental reservoir is the number of nodes n. Despite these differences in the readout, the post-processing methodologies remained unaffected and could be implemented as established.

Fig. 6: Timeshifting on the physical photonic reservoir computer, Mackey–Glass P-to-P 10-step-ahead forecasting.
figure 6

a Experimental setup of our physical reservoir17. The device receives and processes the input signal x(t) (gold components) before feeding into the reservoir that performs nonlinear mapping (orange components). The reservoir state is observed and collected as the physical reservoir’s readout (teal components). b Scan of multi-uniform-timeshifts d2 vs d3, for testing error at N = 3 replicas and O = 3n features. c Scan of feature dimension O (up to O = 20n features) vs max-random-timeshift R, for testing error (teal line). Maroon dots indicate optimal max-timeshift R at feature dimension O. The maroon dashed line indicates the minimum value that the max-random-timeshift R may take at a given feature dimension. The orange dashed line indicates the result of training on the base state matrix S. The shaded area around each line indicates the MAD of the realisation set. d Simulated reservoir comparison to panel (b). NRMSE normalised root mean square error, MAD median absolute deviation.


In addition to the physical reservoir readout, we took our driving P-coordinates of the input series, corresponding to a Mackey–Glass attractor. This enabled us to choose our prediction task, which we set as P-to-P 10-step-ahead forecasting.

In the "uniform-timeshifting, $\boldsymbol{S}^{\frown}\ldots^{\frown}\boldsymbol{S}_{d_N^{\Sigma}}$" column of Table 3, replica number N = 1 corresponds to training on the base state matrix S, without post-processing. We report a 45% decrease in error when training on the N = 2 uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$. This improved performance is consistent with the improvements seen in the simulated reservoir computer. Training on the N = 3 multi-uniform-timeshifts state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}^{\frown}\boldsymbol{S}_{d_3^{\Sigma}}$ gives a further 27% decrease in error from N = 2. Training on the N ≥ 3 multi-uniform-timeshifts state matrices $\boldsymbol{S}^{\frown}\ldots^{\frown}\boldsymbol{S}_{d_N^{\Sigma}}$ shows progressively improving performance. The uniform-timeshifts {d2, . . . , dN} must be re-optimised with increasing N replicas, which is also consistent with the simulated reservoir computer. Additionally, Fig. 6b shows the parameter space of optimal configurations (dark teal/black) for the uniform-timeshifts d2 and d3. The optimal configuration space on the physical reservoir follows a similar configuration space observed on the simulated reservoir computer for the same Mackey–Glass P-to-P 10-step-ahead forecasting task, shown in Fig. 6d.

Table 3 Physical photonic reservoir computer results of Mackey–Glass P-to-P 10-step-ahead forecasting

In the "random-timeshifting, $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$" column of Table 3, starting at N = 1 replica we report that optimal random-timeshifting leads to a 51% decrease in error compared to base training. Consistent with observations on the simulated reservoir system, Fig. 6c illustrates how training on N ≥ 2 multi-random-timeshifts state matrices $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$ progressively improves predictive performance (teal line). We also observed that the optimal max-random-timeshift R (maroon dots) generally increases with replica number N, as was observed on the simulated reservoir computer.

Mackey–Glass P-to-P 1-step-ahead forecasting was also carried out (Suppl. Note 5). Our results for training on the base state matrix S without post-processing are consistent with previously published results for this task17.

Taken together, these consistencies suggest that post-processing methods established using simulated reservoirs can be translated to experimentally realised readout data generated from a physical reservoir system.

Conclusion

We established simple post-processing methods on our simulated reservoir computer for a variety of chaotic attractor prediction tasks. These methods introduced delayed versions of reservoir states for training, as inspired by delay embedding.

Improved performance of training on the optimal uniform-timeshift state matrix, which concatenates a uniformly-timeshifted state matrix to the base state matrix, was partly attributed to its increased feature dimension35. Uniform-timeshifting can thus be viewed as a computationally inexpensive post-processing alternative to increasing the reservoir node dimension for higher-dimensional mapping of the input36,37, while achieving similar results.

We expanded uniform-timeshifting into multi-uniform-timeshifting, which concatenates uniformly-timeshifted state matrices into a multi-uniform-timeshifts state matrix. The improved performance with increasing state matrix replicas is consistent with delay embedding playing a role. Additionally, the optimal uniform timeshifts are not evenly spaced, indicating that a non-uniform delay embedding is ideal.

Random-timeshifting of base state matrix features was established on our simulated computer. The performance improvement when training on the optimal random-timeshifts state matrix was consistent with previous work6,7. However, when scanning max-random-timeshifts at optimal and sub-optimal values, we did not observe that state matrices with larger covariance rank generally outperform state matrices with smaller covariance rank7,12,39,40. A larger covariance rank therefore does not fully explain why optimal random-timeshifting outperformed training on a base state matrix of equal feature dimension.
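The covariance-rank diagnostic referred to here can be computed in a few lines; the centring of the state matrix and the default rank tolerance in this sketch are assumptions rather than choices fixed by the text.

```python
import numpy as np

def covariance_rank(X, tol=None):
    """Numerical rank of the covariance matrix of a training state
    matrix X (rows = time steps, columns = features)."""
    Xc = X - X.mean(axis=0)            # centre each feature
    C = (Xc.T @ Xc) / (len(Xc) - 1)    # feature-by-feature covariance
    return np.linalg.matrix_rank(C, tol=tol)

# As noted above, a larger covariance rank did not always coincide
# with a lower prediction error in our timeshift scans.
```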

Under optimal timeshift delays, we found that recalling past states from existing nodes is superior to collecting extra node states from a larger reservoir. This finding inspired multi-random-timeshifting, which randomly recalls past states of a clamped reservoir in order to increase the number of training features. Multi-random-timeshifting performs just as well as multi-uniform-timeshifting when training on equivalent feature dimensions derived from equivalent reservoir sizes. However, because multi-random-timeshifting is much cheaper to optimise, we were able to reach better performance with this method at feature dimensions that are intractable to optimise for multi-uniform-timeshifting. Multi-random-timeshifting is thus our preferred post-processing method among those investigated in this paper.

We translated our established methods to an experimentally realised physical photonic reservoir system17, which is possible because our post-processing methods are applied to the training scheme after the reservoir states have been generated. We showed that our methods improved predictive performance. We believe that integrating multi-random-timeshifting into the standard training scheme of both simulated and physical reservoir computers can improve performance for minimal computational investment.
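A minimal sketch of how multi-random-timeshifting could slot into a standard ridge-regression training scheme is shown below; the ridge parameter, the stand-in data, and the way targets are aligned for 10-step-ahead forecasting are illustrative assumptions rather than the settings used in our experiments.

```python
import numpy as np

def train_readout(X, Y, ridge=1e-6):
    """Ridge (Tikhonov) regression for the output weights W_out,
    the only trained part of the reservoir computer: X @ W_out ≈ Y."""
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)

rng = np.random.default_rng(0)
S = rng.standard_normal((5000, 50))     # stand-in stored readout states
u = rng.standard_normal(5000)           # stand-in driving input series
R, N, steps_ahead = 20, 3, 10

# Multi-random-timeshifting: N replicas, one random delay per node each.
blocks = []
for _ in range(N):
    d = rng.integers(0, R + 1, size=S.shape[1])
    blocks.append(np.column_stack([S[R - dj: len(S) - dj, j]
                                   for j, dj in enumerate(d)]))
X = np.hstack(blocks)

Y = u[R + steps_ahead:]                 # 10-step-ahead targets
X = X[:len(Y)]                          # align features with targets
W_out = train_readout(X, Y[:, None])    # train exactly as without post-processing
print(W_out.shape)                      # (150, 1)
```

Because the post-processing only rearranges stored states before the regression, the training step itself is unchanged.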

Our implementation of the multi-random-timeshifting algorithm is applicable to autonomous prediction tasks, such as closed-loop prediction of the Lorenz attractor41. Aside from influencing training performance, the multi-random-timeshifting algorithm introduces delay terms into the trained autonomous system when the reservoir is operated in closed loop. The influence of these time delays on the solutions and stability of the trained autonomous system remains to be explored in future work.
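To make the appearance of delay terms concrete, the following sketch shows one closed-loop step in which predictions are formed from buffered past reservoir states. The reservoir_update function, the use of whole-state (per-replica) delays rather than per-node delays, and the single 1-D output readout are simplifying assumptions of this sketch.

```python
from collections import deque
import numpy as np

def closed_loop_step(reservoir_update, W_out, delays, buffer):
    """One step of the trained system run autonomously.

    buffer : deque of the last max(delays)+1 reservoir state vectors,
             so the closed-loop system explicitly depends on delayed states.
    W_out  : 1-D readout weight vector trained on timeshifted features.
    """
    x = np.concatenate([buffer[-1 - d] for d in delays])  # timeshifted features
    u_fb = float(x @ W_out)                  # prediction, fed back as next input
    s_next = reservoir_update(buffer[-1], u_fb)  # advance the reservoir (user-supplied)
    buffer.append(s_next)                    # maxlen deque drops the oldest state
    return u_fb, buffer

# The buffer must be warmed up with max(delays)+1 open-loop states, e.g.
# buffer = deque(warmup_states[-(max(delays) + 1):], maxlen=max(delays) + 1)
```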
