Post-processing methods for delay embedding and feature scaling of reservoir computers

Introduction

Reservoir computing is a machine learning method that exploits the dynamics of a reservoir for nonlinear mapping of inputs into a higher-dimensional space1,2. This exploitation of reservoir dynamics means that only the weights coupled to the final prediction output need to be trained. This elementary training approach, combined with the fact that a reservoir can be any dynamical system with finite memory showing consistent input responses, means that reservoir computers are well-suited for hardware implementation. In this paper, we will apply simple post-processing methods to the elementary training scheme. We will then present a post-processing method that improves reservoir computer performance for minimal computational investment. Finally, we translate our post-processing methods to the training scheme of a physical reservoir computer. This reservoir is implemented in photonic hardware using high-speed light signals for information processing.

Our post-processing methods are inspired by delay embedding. This is the process in which delayed replicas of a time series can be used to reconstruct the full state-space dynamics of the underlying dynamical system3. In a reservoir computer, the sampled driving input is nonlinearly mapped, and the resultant higher-dimensional representation is observed as a readout of reservoir states evolving in response to the driving input. Our simple post-processing methods train on delayed versions of these reservoir states, analogous to delay embedding, in order to better capture the driving system and thus improve predictive performance. Our methods can apply the delayed states in a uniform-timeshifting4,5 or random-timeshifting6,7 manner, and will do so only through state rearrangement or replication. This notably preserves the fundamental benefit of using reservoir computers by keeping nonlinear mapping restricted only to the reservoir8.

Uniform-timeshifting of reservoir states was investigated by Marquez et al.4, who concatenated each readout state vector with a delayed replica of itself. They drew a connection to Takens' delay embedding9 and showed that the delays must be chosen optimally. Picco et al. recently demonstrated that such uniformly timeshifted delays need not be evenly spaced10. The work of Del Frate et al.6 introduced random-timeshifting. In this approach, the readout of reservoir states is stored, and the feature vector used for training comprises states randomly sampled between the current time and some appropriately chosen maximal delay. Carroll and Hart7 subsequently connected the performance improvement due to random-timeshifting with an increased covariance rank of the training state matrix.

Reservoir dynamics might also implicitly perform a delay embedding of a time series input. Reservoir memory is particularly important for an optimal delay embedding and is dependent on the driving input's measured precision, the input timescale and the prediction task. In such cases, issues relating to reservoir memory can be difficult to compensate for via hyperparameter optimisation11,12. For example, for a finely discretised input, the optimal delay embedding may be larger than the timescale on which the reservoir's memory decays. Moreover, given that the most commonly used reservoirs are networks of randomly connected nodes, such architectures only allow for indirect tuning of reservoir memory through hyperparameters such as the spectral radius. Conversely, time-multiplexed delay-based reservoirs allow for direct tuning through the reservoir's internal delay. Indeed, we previously demonstrated predictive performance improvement by optimising the internal delay of a time-multiplexed reservoir5. Similar trends are shown in other works13,14,15. However, in this paper, we will focus on post-processing methods that are applied after reservoir states have already been generated, circumventing difficulties associated with reservoir hyperparameter optimisation. Our post-processing focus also contrasts with our previous work, which applied pre-processing methods at the level of the driving input. There we demonstrated that adding an optimally-delayed replica of the driving input as it is fed into the reservoir greatly improved performance. That approach could be considered a hybrid of reservoir computing and nonlinear vector regression, if the additional delayed input is viewed as training feature selection16.

In the following sections, we will establish some simple post-processing methods on our simulated reservoir computer, for a variety of chaotic attractor prediction tasks. These are the aforementioned uniform-timeshifting and random-timeshifting methods. We also expand into multi-uniform-timeshifting that takes multiple uniformly-delayed states for training. Based on key findings regarding predictive performance when we recall past states using these simple post-processing methods, we introduce our multi-random-timeshifting method. This method scales well with larger feature dimensions, is computationally cheap to implement and can outperform all other post-processing methods.

We will then translate our established methods to an experimentally-realised reservoir computing system built with photonic hardware17. Such physical reservoir computers utilise the complex nonlinear dynamics of physical systems for high-speed and energy-efficient analogue computing. Examples of physically implemented reservoirs include optical, optoelectronic and micro-mechanical systems18,19,20,21, among others. Although good computing performance has been demonstrated for a number of tasks10,22,23,24,25, this is often reliant on computationally expensive hyperparameter optimisation. There have been numerous studies into efficient methods of hyperparameter optimisation26,27,28,29,30; however, these often have drawbacks, including computational cost, lack of applicability to hardware reservoirs and lack of generalisability to a wide range of tasks. Fortunately, our post-processing methods are applied after reservoir states have already been generated. We show that our methods are able to improve the predictive performance of physical reservoir computers, and circumvent the difficulties associated with physical reservoir hyperparameter optimisation.

Methods

Reservoir computer model

Figure 1 illustrates the general architecture of our simulated reservoir computer. We feed a 1-dimensional input-series x of K steps into the reservoir, and our output is a prediction of the 1-dimensional target-series y of K steps. For every input-step $x_i \in \boldsymbol{x} \in \mathbb{R}^{K}$ driving the reservoir there are a total of n nodes through which we observe the corresponding reservoir states vector $\boldsymbol{s}_i = (s_{i,1},\ldots,s_{i,n}) \in \mathbb{R}^{n}$. For the entire input-series x of K input-steps (x1, . . . , xK) we generate a total of K sampled reservoir states vectors (s1, . . . , sK).

Fig. 1: General architecture of our simulated reservoir computer.

Input steps $x_i \in \boldsymbol{x} \in \mathbb{R}^{K}$ are mapped through the reservoir of n nodes. O = n features of the state matrix S are used for training. Post-processing may increase the number of features O. The prediction output $\hat{y}_i$ is compared to the target $y_i$ using the normalised root mean square error (NRMSE).


The focal methods of this paper relate to training on, and post-processing of, the state matrix S. To generate the state matrix S, the sampled reservoir state vectors (s1, . . . , sK) are collected into the state matrix

$$\boldsymbol{S}=\begin{bmatrix}\boldsymbol{s}_{1}\\ \boldsymbol{s}_{2}\\ \vdots \\ \boldsymbol{s}_{K-1}\\ \boldsymbol{s}_{K}\end{bmatrix}=\begin{bmatrix}s_{1,1}&\ldots&s_{1,n}&1\\ s_{2,1}&\ldots&s_{2,n}&1\\ \vdots& &\vdots&\vdots\\ s_{K-1,1}&\ldots&s_{K-1,n}&1\\ s_{K,1}&\ldots&s_{K,n}&1\end{bmatrix}$$
(1)

where the i-th row is the vector si of n states (si,1, . . . , si,n) driven by input-step xi. A bias term of one is appended to each row for training. The predicted target-series output of the reservoir $\hat{\boldsymbol{y}}$ is then given by

$$\boldsymbol{S}\,\boldsymbol{w}=\hat{\boldsymbol{y}}$$

where w is the vector of n + 1 output weights. The weights w are found by minimising the difference31 between the prediction $\hat{\boldsymbol{y}}$ and the target y. This minimisation problem is solved analytically using regularised (ridge) least squares in matrix form

$$\boldsymbol{w}={\left(\boldsymbol{S}^{T}\boldsymbol{S}+\lambda \boldsymbol{I}\right)}^{-1}\boldsymbol{S}^{T}\boldsymbol{y}$$

on a training set of k < K sequential rows of the state matrix S, where λ is the regularisation parameter and I is the identity matrix.

To evaluate the predictive performance of our trained model, we use the normalised root mean square error (NRMSE) to quantify how closely the predicted target-series $\hat{\boldsymbol{y}}$ matches the target-series y, for both the training and testing sets. The NRMSE is given by

$$\mathrm{NRMSE}=\sqrt{\frac{\sum_{i=1}^{k}{\left(y_{i}-\hat{y}_{i}\right)}^{2}}{k\,\hat{\sigma}^{2}}}$$

where k is the number of rows in the training or testing set of the state matrix S, and $\hat{\sigma}^{2}$ is the variance of that set's target y, used as an approximation of the true population variance. For simulations that undergo repeat realisations, the median NRMSE ± median absolute deviation (MAD) is recorded as the predictive performance of that realisation set.

The number of columns of the state matrix S used for training is the feature dimension O. For base training without further post-processing, the feature dimension is O = n, since the number of sampled states (si,1, . . . , si,n) per input-step xi is equal to the node dimension n. We will refer to this state matrix S with O = n features described in Eq. (1) as the base state matrix, on which further post-processing methods will be applied.
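As a concrete illustration of this base training scheme, the following minimal Python sketch (not the authors' code; the function names, the ridge value and the train/test split are our own assumptions) builds the base state matrix with its bias column, solves for the output weights as in the expression above, and evaluates the NRMSE.

```python
import numpy as np

def train_output_weights(states, target, ridge=1e-6):
    """Solve w = (S^T S + lambda*I)^{-1} S^T y on the training rows."""
    S = np.hstack([states, np.ones((states.shape[0], 1))])  # append bias column of ones
    w = np.linalg.solve(S.T @ S + ridge * np.eye(S.shape[1]), S.T @ target)
    return w

def predict(states, w):
    """Apply the trained output weights to a (rows, n) block of reservoir states."""
    return np.hstack([states, np.ones((states.shape[0], 1))]) @ w

def nrmse(prediction, target):
    """Normalised root mean square error of a training or testing set."""
    return np.sqrt(np.mean((target - prediction) ** 2) / np.var(target))

# usage sketch: `states` is the (K, n) readout, `y` the target series, k the training length
# w = train_output_weights(states[:k], y[:k])
# print(nrmse(predict(states[k:], w), y[k:]))
```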

Post-processing the state matrix

This paper focuses on several post-processing methods applied to the base state matrix S described by Eq. (1). These methods rearrange and/or replicate the base state matrix S in order to generate a post-processed state matrix. Notably, our post-processing methods keep nonlinear input mapping restricted to the reservoir, thereby preserving the computational advantage of reservoir computing8.

Uniform-timeshifting

The uniform-timeshifting post-processing method generates a state matrix with a larger feature dimension O than the base state matrix S. We start with a uniformly-timeshifted state matrix $\boldsymbol{S}_{d_2}$, which is a replica of the base state matrix S whose O = n features are timeshifted d2 input-steps into the past. This uniformly-timeshifted state matrix $\boldsymbol{S}_{d_2}$ is then horizontally concatenated (denoted by $^{\frown}$) to the base state matrix S, generating the uniform-timeshift state matrix

$$\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_{2}} = \begin{bmatrix}\boldsymbol{s}_{1}\\ \vdots \\ \boldsymbol{s}_{K}\end{bmatrix}^{\frown}\begin{bmatrix}\boldsymbol{s}_{1-d_{2}}\\ \vdots \\ \boldsymbol{s}_{K-d_{2}}\end{bmatrix} = \begin{bmatrix}s_{1,1}&\ldots&s_{1,n}\\ \vdots& &\vdots\\ s_{K,1}&\ldots&s_{K,n}\end{bmatrix}^{\frown}\begin{bmatrix}s_{1-d_{2},1}&\ldots&s_{1-d_{2},n}\\ \vdots& &\vdots\\ s_{K-d_{2},1}&\ldots&s_{K-d_{2},n}\end{bmatrix}.$$

In effect, any feature vector $\boldsymbol{s}_i \in \boldsymbol{S}$ is extended with the feature vector $\boldsymbol{s}_{i-d_2} \in \boldsymbol{S}$, where d2 is the uniform timeshift applied to the feature vector, illustrated by Fig. 2 at N = 2 state matrix replicas. In this way, the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$ that is used for training and testing has an increased feature dimension of O = 2n. Methods of concatenating uniformly timeshifted features have been established previously4,5,32.
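A minimal sketch of this concatenation, assuming the reservoir readout is available as a NumPy array of shape (K, n) without the bias column; how the first d2 rows (which have no past state available) are padded is our assumption and is not specified above.

```python
import numpy as np

def timeshift(states, d):
    """Return a replica of `states` whose rows are delayed by d input-steps."""
    if d == 0:
        return states.copy()
    shifted = np.empty_like(states)
    shifted[d:] = states[:-d]       # row i holds the state from input-step i - d
    shifted[:d] = states[0]         # pad the undefined leading rows (assumption)
    return shifted

def uniform_timeshift_matrix(states, d2):
    """Concatenate the base state matrix with its d2-delayed replica (O = 2n features)."""
    return np.hstack([states, timeshift(states, d2)])
```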

Fig. 2: Post-processing methods on the base state matrix S.

Each entry, reading left-to-right, indicates timeshifting of S under a given post-processing method. N = 1 uniform-timeshifting indicates no post-processing.


Multi-uniform-timeshifting

Inspired by delay embedding, we now take multiple timeshifts in order to better reconstruct the underlying dynamics of the driving system and improve predictive performance. The stepsize between multiple timeshifts can be of the same length, characteristic of delay embedding3, or of varying length as demonstrated by Picco et al. under their “Timesteps Of Interest” algorithm10.

Horizontally concatenating multiple uniformly-timeshifted state matrices generates a multi-uniform-timeshifts state matrix

$$\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_{2}}^{\frown}\boldsymbol{S}_{d_{3}^{\Sigma}}=\begin{bmatrix}\boldsymbol{s}_{1}\\ \vdots \\ \boldsymbol{s}_{K}\end{bmatrix}^{\frown}\begin{bmatrix}\boldsymbol{s}_{1-d_{2}}\\ \vdots \\ \boldsymbol{s}_{K-d_{2}}\end{bmatrix}^{\frown}\begin{bmatrix}\boldsymbol{s}_{1-d_{2}-d_{3}}\\ \vdots \\ \boldsymbol{s}_{K-d_{2}-d_{3}}\end{bmatrix}$$

of O = 3n features, where $d_{3}^{\Sigma}$ indicates the sum of the iterative timeshifts d2 + d3 (Fig. 2 at N = 3 replicas). For example, if the uniform-timeshifts are d2 = 2 and d3 = 8, then the first replica is the base state matrix S, the second replica is the uniformly-timeshifted state matrix $\boldsymbol{S}_{d_2}$ at d2 = 2 input-steps into the past and the third replica is the uniformly-timeshifted state matrix $\boldsymbol{S}_{d_3^{\Sigma}}$ at d2 + d3 = 2 + 8 = 10 input-steps into the past.

Concatenating N state matrix replicas generates a multi-uniform-timeshifts state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}^{\frown}\boldsymbol{S}_{d_3^{\Sigma}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{d_N^{\Sigma}}$ of O = Nn features, with the earliest state matrix replica $\boldsymbol{S}_{d_N^{\Sigma}}$ timeshifted d2 + d3 + . . . + dN input-steps into the past. Allowing for an increased number of delayed states should in general enable an improved embedding. The increased computational cost associated with the increasing number of optimisation hyperparameters will be discussed later.
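The construction generalises to N replicas as in the self-contained sketch below; the cumulative shifts follow the d2 + d3 example given above, and the padding of the leading rows is again our assumption.

```python
import numpy as np

def multi_uniform_timeshift_matrix(states, shifts):
    """shifts = [d2, ..., dN]; returns a (K, N*n) multi-uniform-timeshifts state matrix."""
    blocks = [states]
    total = 0
    for d in shifts:
        total += d                              # cumulative timeshift d_i^Sigma
        delayed = np.empty_like(states)
        delayed[total:] = states[:-total]       # rows delayed by the cumulative shift
        delayed[:total] = states[0]             # pad the undefined leading rows (assumption)
        blocks.append(delayed)
    return np.hstack(blocks)

# example: shifts [2, 8] delays the replicas by 0, 2 and 10 input-steps (O = 3n features)
```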

Random-timeshifting

The random-timeshifting post-processing method is implemented according to published methods6. Here, each of the O = n features of the base state matrix S is timeshifted a number of input-steps into the past. The timeshift for each feature is randomly chosen. This random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ can be described as

$$\boldsymbol{S}_{\boldsymbol{r}^{1}}=\begin{bmatrix}\boldsymbol{s}_{1-\boldsymbol{r}^{1}}\\ \vdots \\ \boldsymbol{s}_{K-\boldsymbol{r}^{1}}\end{bmatrix}=\begin{bmatrix}s_{1-r_{1}^{1},1}&\ldots&s_{1-r_{n}^{1},n}\\ \vdots& &\vdots\\ s_{K-r_{1}^{1},1}&\ldots&s_{K-r_{n}^{1},n}\end{bmatrix}$$

where the random-timeshifts vector r1 consists of random integers $r_{1}^{1},\ldots,r_{n}^{1}$ ranging from 0 to some max-random-timeshift R. Thus, the j-th feature of the base state matrix S is timeshifted $r_{j}^{1} \in \boldsymbol{r}^{1}$ input-steps into the past, illustrated by Fig. 2 at N = 1 state matrix replica. The feature dimension of this random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ remains at O = n.

The max-random-timeshift R parameter sets the range of random-timeshifts $r \in \boldsymbol{r}$ that may be randomly chosen. It is analogous to the method used by Del Frate et al., where the half-life of the autocorrelation function is multiplied by a real scalar6 to determine the range of allowed random timeshifts $r \in \boldsymbol{r}$. The max-random-timeshift R can be optimised with a one-dimensional scan.
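A minimal sketch of random-timeshifting as we read it from the description above; the random number generator and the padding of the leading rows are our assumptions.

```python
import numpy as np

def random_timeshift_matrix(states, R, rng=None):
    """Delay each feature by a random integer from {0, ..., R}; O stays at n."""
    rng = np.random.default_rng() if rng is None else rng
    K, n = states.shape
    r = rng.integers(0, R + 1, size=n)      # one random timeshift per feature
    shifted = np.empty_like(states)
    for j, d in enumerate(r):
        shifted[d:, j] = states[:K - d, j]
        shifted[:d, j] = states[0, j]       # pad the undefined leading rows (assumption)
    return shifted, r
```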

Prediction tasks

Various chaotic time series prediction tasks are performed on the simulated reservoir to establish and quantify our post-processing methods in this manuscript. Our driving input series is a coordinate of one of the Lorenz, Rössler, or Mackey–Glass attractors. Our target series is another coordinate of the attractor for cross-prediction tasks, or the input series coordinates a number of steps ahead for forecasting tasks. For configuring our simulated reservoir computer and establishing our post-processing methods we carried out prediction tasks on the Lorenz attractor.

Lorenz attractor driving input-series x = [x1, . . . , xK] was the Lorenz x-coordinate defined by

$$\begin{aligned}\frac{\mathrm{d}x}{\mathrm{d}t} &= a(y-x),\\ \frac{\mathrm{d}y}{\mathrm{d}t} &= x(b-z)-y,\\ \frac{\mathrm{d}z}{\mathrm{d}t} &= xy-cz,\end{aligned}$$

and cross-prediction target-series y was the Lorenz z-coordinate33.

The simulated reservoir training models established on the Lorenz attractor were corroborated with Mackey–Glass attractor P-to-P 10-step-ahead forecasting tasks (Suppl. Note 3). Mackey–Glass attractor driving input-series x was the Mackey–Glass P-coordinate defined by

$$\frac{\mathrm{d}P}{\mathrm{d}t}=\frac{\beta_{0}\,\theta^{n}\,P(t-\tau_{\mathrm{MG}})}{\theta^{n}+P{(t-\tau_{\mathrm{MG}})}^{n}}-\gamma P(t)$$

and the forecasting target-series y was the 10-step-ahead Mackey–Glass P-coordinate34. All time series coordinates were iterated using 4th-order Runge–Kutta. For the delay terms in the Mackey–Glass equation, cubic Hermite interpolation was used. The system parameters for the Lorenz and Mackey–Glass attractors are given in Table 1. Cross-prediction tasks on the Rössler attractor were also carried out in order to corroborate our established methods (Suppl. Note 4).
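As an illustration, a short Python sketch of generating the Lorenz input and target series with 4th-order Runge–Kutta is given below; the parameter values and integration step are the commonly used ones and are assumptions here, since the Table 1 values are not reproduced in this section.

```python
import numpy as np

def lorenz_rhs(state, a=10.0, b=28.0, c=8.0 / 3.0):
    x, y, z = state
    return np.array([a * (y - x), x * (b - z) - y, x * y - c * z])

def integrate_lorenz(K, dt=0.02, initial=(1.0, 1.0, 1.0)):
    """4th-order Runge-Kutta integration; returns a (K, 3) trajectory."""
    traj = np.empty((K, 3))
    state = np.array(initial, dtype=float)
    for i in range(K):
        k1 = lorenz_rhs(state)
        k2 = lorenz_rhs(state + 0.5 * dt * k1)
        k3 = lorenz_rhs(state + 0.5 * dt * k2)
        k4 = lorenz_rhs(state + dt * k3)
        state = state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        traj[i] = state
    return traj

# x-to-z cross-prediction: input series = traj[:, 0], target series = traj[:, 2]
```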

Table 1 Task and reservoir parameters

The simulated reservoir training models were translated to experimentally-realised readout data acquired from a laser-based photonic reservoir system. Targets were P-to-P 1-step-ahead or 10-step-ahead coordinates of the input series, corresponding to a Mackey–Glass attractor.

Reservoir systems

Simulated reservoir

For our simulated reservoir, we use a time-multiplexed nonlinear delayed map introduced by Jaurigue et al.5,16. This system describes a semiconductor optical amplifier subject to weak self-feedback given by

$$s_{t+\theta}=\frac{a\left(b\,s_{t-\tau}+x_{i}\,m\,g\right)}{1+b\,s_{t-\tau}+x_{i}\,m\,g}$$

with simulated reservoir parameters specified in Table 1. The input-series x is masked (m) and scaled (g). The evolving reservoir state st is observed once during each mask interval θ and then indexed into the state vector (si,1, . . . , si,n) driven by input-step $x_i \in \boldsymbol{x}$. There are a total of n nodes through which we observe the reservoir states $s \in \boldsymbol{s}_i$, distinguished by their unique mask value, so the length of a clock cycle driven by input-step xi is nθ.
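A rough sketch (our own, not the authors' implementation) of how such a time-multiplexed readout can be simulated from the delayed map above; the parameter values, the random mask distribution and the choice of internal delay equal to one clock cycle (τ = nθ) are placeholders for the actual Table 1 values.

```python
import numpy as np

def simulate_reservoir(x, n=50, a=0.9, b=0.1, g=0.5, tau_steps=None, rng=None):
    """Iterate the delayed map on the theta grid; returns a (K, n) state matrix."""
    rng = np.random.default_rng() if rng is None else rng
    m = rng.uniform(-1.0, 1.0, size=n)          # random input mask, one value per virtual node
    tau_steps = n if tau_steps is None else tau_steps   # internal delay in units of theta
    K = len(x)
    s = np.zeros(K * n + tau_steps)             # reservoir state sampled once per theta
    for i in range(K):                          # one clock cycle of n*theta per input-step x_i
        for j in range(n):
            t = tau_steps + i * n + j
            drive = b * s[t - tau_steps] + x[i] * m[j] * g
            s[t] = a * drive / (1.0 + drive)
    return s[tau_steps:].reshape(K, n)          # one row of n sampled node states per x_i
```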

Given that we apply a random mask on the driving input for our simulated reservoir5, each reservoir computer configuration undergoes repeat realisations, with each realisation using a different random mask. When applicable, each realisation also uses a different vector of random timeshifts r. For specific details on our simulated reservoir and the time-multiplexed masking procedure, we refer the reader to previous work5, given that the reservoir has no influence on the post-processing methods which are the focus of this paper.

Physical photonic reservoir

To investigate post-processing methods on a physical reservoir system, we utilised the readout of our previously published photonic reservoir based on a vertical-cavity surface-emitting laser (VCSEL) system17. This system enabled operation with high-speed (GHz-rate) and low-power (sub-mW) optical input signals, and offered a hardware-friendly approach using a single VCSEL requiring very low bias currents and operating at a key optical telecom wavelength. This VCSEL-based photonic reservoir demonstrated good performance in data classification and time series prediction tasks across a wide parameter space. Physical reservoir parameters are given in Table 1. For specific details on the physical reservoir, we refer the reader to our relevant work17.

We used a single hardware-realised readout, rather than a set of realisations each using a different random mask, as is the case in our simulated reservoir computer. Multiple realisations were required in post-processing when using different vectors of random timeshifts r.

Results

Simple post-processing

We established the simulated reservoir computer configurations and post-processing methods by performing Lorenz x-to-z cross-prediction tasks. Key results for characterising the performance of our post-processing methods are summarised in Table 2, which we reference in the following sections.

Table 2 Simulated reservoir computer results of Lorenz x-to-z cross-prediction

In Table 2, "N replicas" refers to the number of state matrix replicas used to generate the final state matrix used for training and testing. "O features" refers to the feature dimension of this state matrix, expressed as the multiple Nn of the node dimension n. Results in the "uniform-timeshifting, $\boldsymbol{S}^{\frown}\ldots^{\frown}\boldsymbol{S}_{d_N^{\Sigma}}$" category are for state matrices with uniformly-timeshifted features, which includes the base state matrix S at N = 1. Results in the "random-timeshifting, $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$" category are for state matrices with randomly-timeshifted features. We present the median NRMSE ± MAD under optimal delay configurations, for increasing replica number N. The reservoir is fixed at n nodes.

Uniform-timeshifting improved performance

We start with the results for uniformly timeshifted state matrices summarised in Table 2. For N = 1 replica, this corresponds to no post-processing of the base state matrix S. As expected, the base training method demonstrates inferior performance when compared to any of the optimised post-processing methods performed on the base state matrix S.

For N = 2 replicas, we horizontally concatenate a uniformly-timeshifted state matrix to the base state matrix in order to generate the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$ of O = Nn = 2n features. The optimal uniform-timeshift d2 is also listed in Table 2 and was determined via a scan, illustrated in Fig. 3a (teal line). A 74% decrease in error was observed. Superior performance was observed even at sub-optimal configurations of the uniform-timeshift d2, which we attribute to the increase in the feature dimension O from n to 2n. The superior performance of training on the optimally-delayed uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$, compared to training only on the base state matrix S (orange dashed line), is consistent with previously published results4,5.

Fig. 3: Timeshifting on the simulated reservoir computer, Lorenz x-to-z cross-prediction.

a Scan of uniform-timeshift d2, for training error (grey line) and testing error (teal line) at N = 2 replicas and O = 2n features. The orange dashed line indicates the result of training on the base state matrix S. The shaded area around each line indicates the MAD of the realisation set. b Scan of multi-uniform-timeshifts d2 vs d3, for testing error at N = 3 replicas and O = 3n features. c Scan of multi-uniform-timeshifts d2 vs d3 vs d4, for testing error at N = 4 replicas and O = 4n features. d Scan of max-random-timeshift R for covariance rank (maroon line), training error (grey line) and testing error (teal line) at N = 1 replica and O = n features. The orange dashed line indicates the result of training on the base state matrix S. The shaded area around each line indicates the MAD of the realisation set. NRMSE normalised root mean square error, MAD median absolute deviation.


In general, reservoir computer performance scales with the size of the internal and readout layers of its reservoir, provided there are sufficient readout nodes for adequately sampling the reservoir. This is reflected in the information processing capacity introduced by Dambre et al.35, where it was shown that the maximal possible information processing capacity is equal to the readout node dimension n, which corresponds to the O = n features of the base state matrix S. The improved performance of the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$ can therefore, in part, be attributed to its increased feature dimension O = 2n, compared to the O = n feature dimension of the base state matrix S. Viewed in this way, uniform-timeshifting is a computationally-inexpensive post-processing alternative to simulating larger reservoirs for higher-dimensional mapping of the input-series36,37, while achieving similar results. In the case of physical reservoir systems, post-processing methods are sometimes the only viable option for increasing state matrix feature dimensions when reservoir parameters, such as the node dimension, are constrained.

Multi-uniform-timeshifting improved performance

From the established uniform-timeshifting post-processing method, we proceed with multi-uniform-timeshifting, which increases the final state matrix feature dimension beyond O = 2n.

With N = 3 state matrix replicas we generate a multi-uniform-timeshifts state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}^{\frown}\boldsymbol{S}_{d_3^{\Sigma}}$ of feature dimension O = 3n. Table 2 lists the NRMSE for N = 3 replicas, referring to the performance at optimal uniform-timeshifts d2 and d3 determined via a 2-dimensional scan. We observed a 60% decrease in error over N = 2 replicas. For N replicas we generate multi-uniform-timeshifts state matrices of feature dimension O = Nn, with the optimal uniform-timeshifts [d2, . . . , dN] listed, determined via (N − 1)-dimensional scans. Going to N = 4, N = 5 and N = 6 replicas yields a 51%, 30% and 13% decrease in error, respectively. Thus, a trend of increasingly improved predictive performance with increasing O = Nn feature dimensions was observed.

In addition to increasing the state matrix features O = Nn, the performance gain with increasing N replicas and optimal uniform-timeshifts d is related to delay embedding4,11,37. The optimal values of the uniform-timeshifts $[d_2,\ldots,d_N] = \boldsymbol{d}$ are not all the same, indicating that the optimal delay embedding for this task occurs when the uniform-timeshifts $d \in \boldsymbol{d}$ are not evenly spaced. In general, not all delay embeddings work equally well when the input data has only finite precision3, as the theoretical guarantees of Takens' delay embedding theorem (where delayed replicas of an input time series can be used to reconstruct the full state-space dynamics of the underlying dynamical system) only hold for infinite precision9. Furthermore, optimal delay embeddings depend on the criteria by which they are judged, which in our case is the NRMSE of the resulting cross-prediction. Moreover, since the reservoir itself performs a delay embedding11,37,38, this affects the optimal delay embedding provided by multi-uniform-timeshifting of the reservoir states. Indeed, we observed that the optimal uniform-timeshift d2 at N = 2 replicas changes with our simulated reservoir delay τ (Suppl. Note 3).

Finally, for multi-uniform-timeshifting, we show that the optimal configurations for the uniform-timeshifts d are constrained to a single zone only. For example, Fig. 3b illustrates the results of a 2-dimensional scan over uniform-timeshift [d2, d3] configurations when concatenating N = 3 replicas. We see superior performance for a single zone (dark teal/black area) of [d2, d3] configurations. Figure 3c also illustrates a single optimal configuration zone (dark teal/black area) for N = 4 replicas, as was observed for N = 3. We do not observe multiple or repeating optimal zone patterns, illustrating that configurations must be specifically chosen for superior performance. This indicates the importance of thorough scans to determine the optimal uniform-timeshift configurations $d \in \boldsymbol{d}$, which becomes computationally intractable with increasing N replicas.

Random-timeshifting improved performance

Given our indications that multi-uniform-timeshifting requires optimally chosen uniform-timeshift configurations $d \in \boldsymbol{d}$ for superior performance, we looked to post-processing methods that perform well with randomly chosen timeshift parameters. Hence, we established the random-timeshifting post-processing method for our simulated reservoir computer. The "random-timeshifting, $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$" column of Table 2 lists the result of training on the random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ at N = 1 replica. The optimal max-random-timeshift R found via a parameter scan is listed alongside the NRMSE, with Fig. 3d illustrating this scan.

We report a 55% decrease in error when training on the optimal random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ (N = 1), compared to the base state matrix S with the same feature dimension of O = n. This finding is consistent with the work of Carroll and Hart7, and Del Frate et al.6, who also demonstrated improved performance after random timeshifting of the base state matrix, measured by a reduced prediction error or a reduction in required reservoir state samplings per input.

The performance improvement as a result of random-timeshifting cannot be explained by a feature dimension increase as it is for uniform-timeshifting, because the random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ shares the same feature dimension of O = n as the base state matrix S. Additionally, although random-timeshifting introduces delayed versions of states, these timeshifted states replace the non-timeshifted base states as training features. Training on delayed replicas of states that are appended to the base state matrix, as we do in uniform-timeshifting, is more analogous to delay embedding.

Instead, a possible explanation for this performance improvement is that random timeshifting increases the covariance rank of the final training matrix, implying higher-dimensional mapping of the input7,39. Figure 3d shows how the covariance rank (maroon line) and NRMSE (teal line) change with increasing max-random-timeshift R. Indeed, we see that the covariance rank is larger at the optimal max-random-timeshift R = 21 compared to the base state matrix without random-timeshifting, consistent with previously published explanations7,39. However, the complete scan shows that the covariance rank of the random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ increases with max-random-timeshift R, and the covariance rank then remains high even as predictive performance degrades past the optimal max-random-timeshift R > 21. Note that changing the prediction task from x-to-z cross-prediction to x-to-x 1-step-ahead forecasting gives the same outcome, where optimal performance is achieved when the covariance rank is minimal (Suppl. Note 2). Thus, a large covariance rank may not guarantee good performance, and the assumption that a state matrix with a larger covariance rank generally outperforms a state matrix with a smaller covariance rank7,12,39,40 warrants further investigation.
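For reference, the covariance-rank diagnostic used here can be computed directly from the training features; the sketch below is our own (with NumPy's default rank tolerance) and excludes the bias column.

```python
import numpy as np

def covariance_rank(states):
    """Rank of the covariance matrix of the state-matrix features (bias column excluded)."""
    cov = np.cov(states, rowvar=False)      # (O, O) covariance across feature columns
    return np.linalg.matrix_rank(cov)
```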

Recall implementation

Here we will report our investigation of recall, which refers to delayed versions of states from the base state matrix S being replicated in post-processing, in order to build up training state matrices of larger feature dimensions. For example, uniform-timeshifting that concatenates N = 2 state matrix replicas will result in states appearing up to N = 2 times in the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$. The recalled state replicas $\boldsymbol{S}_{d_2}$ are delayed versions of those in the base state matrix S, separated by the uniform-timeshift d2.

Reservoir computing for time series prediction seems to work both by delay embedding the input time series11 and by projecting the input data into a higher-dimensional space31. The latter is generally related to the number of nodes of the reservoir. In the context of recall and uniform-timeshifting, we show that reservoirs with a smaller node dimension perform better than reservoirs with a larger node dimension, which goes against the general assumption that a higher-dimensional reservoir results in better computational performance35.

To characterise the recall phenomenon, recall J will indicate how many nodes have their sampled states recalled during post-processing, thereby introducing delayed versions of node states in the final state matrix. We start with an extended state matrix S of feature dimension O = 2n, generated from a larger reservoir of 2n nodes. A possible configuration of timeshifts $[r_1,\ldots,r_{2n}] = \boldsymbol{r}$ applied to this extended state matrix S is illustrated in Fig. 4a at recall J = 0. For this timeshift vector r, the first n timeshifts [r1, . . . , rn] are 0 and the second subset of n timeshifts [rn+1, . . . , r2n] are at the max-timeshift R.

Fig. 4: Recall implementation on the simulated reservoir computer, Lorenz x-to-z cross-prediction.

a Applying recall to the 2n-extended base state matrix S. Null recall J = 0 generates a timeshift matrix Sr. Partial recall J = 0.5n generates a transitional final state matrix. Total recall J = n generates a uniform-timeshift matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$. b Scan of recall J vs uniform-timeshift d2 = 1 (black line) and d2 = 21 (teal line), for testing error at O = 2n features. The orange dashed line indicates the result of training on the extended base state matrix S of node dimension 2n. The shaded area around each line indicates the MAD of the realisation set. c Scan of recall J vs uniform-timeshift d2, for testing error at O = 2n features. d Scan of recall J vs multi-uniform-timeshifts d2 vs d3, for testing error at O = 3n features. NRMSE normalised root mean square error, MAD median absolute deviation.


The constructed timeshifted state matrix Sr in Fig. 4a is similar to the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$ at recall J = n. Here, both post-processed state matrices have O = 2n features, and the uniform-timeshift d2 is equal to the max-timeshift R. The key difference between these state matrices is the number of nodes required to generate them. For the constructed timeshift state matrix Sr, the timeshifted features [n + 1: 2n] are sampled states of the nodes [n + 1: 2n], from a reservoir of 2n nodes. Conversely, for the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$, the timeshifted features [n + 1: 2n] are generated by recalling the sampled states of nodes [1: n], from a smaller reservoir of n nodes. Specifically, the recalled node states are collected at d2 = R input-steps into the past, thereby clamping the required node dimension to n. Thus, in Fig. 4a at null recall J = 0, we generate the constructed timeshift state matrix Sr from a readout layer of 2n nodes. Although the features [n + 1: 2n] are timeshifted by R = d2 input-steps into the past, the final state matrix will have all states from the 2n nodes appearing only once. At total recall J = n, we generate the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$ from a readout layer of n nodes. All n nodes will have delayed replicas of their sampled states appearing in the final state matrix at d2 = R input-steps into the past. A transitional state matrix at recall J = 0.5n illustrates the case where only a third of the 1.5n nodes have their sampled states recalled.
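Our reading of this construction is sketched below; which nodes are recalled first and how the columns are ordered are our assumptions, since only the limiting cases J = 0 and J = n are fully specified above.

```python
import numpy as np

def delayed(block, R):
    """Replica of a (K, m) block with rows delayed by R input-steps (leading rows padded)."""
    out = np.empty_like(block)
    out[R:] = block[:block.shape[0] - R]
    out[:R] = block[:1]                              # pad the undefined leading rows (assumption)
    return out

def recall_state_matrix(extended, n, J, R):
    """`extended` is the (K, 2n) readout of a 2n-node reservoir; the output has 2n features.
    J = 0: all 2n nodes are used and the second half of the features is delayed by R.
    J = n: only nodes 1..n are used; the second half are their R-delayed replicas."""
    first = extended[:, :n]                          # nodes 1..n, no timeshift
    new_nodes = delayed(extended[:, n:2 * n - J], R) # delayed states from extra nodes n+1..2n-J
    recalled = delayed(extended[:, :J], R)           # R-delayed replicas of nodes 1..J
    return np.hstack([first, new_nodes, recalled])
```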

Total recall improves performance even as it reduces node dimension

We have shown how simple post-processing methods can introduce delayed states into the training scheme that lead to predictive performance improvement at optimal timeshift configurations. For uniform-timeshifting and multi-uniform-timeshifting, we attributed this to both an improved delay embedding of the input and an increase in the number of training features O. Recall J indicates how many nodes will have delayed replicas of sampled node states appearing in the final state matrix, and we investigated how recall affects prediction performance with delay embedding and feature dimension scaling effects in mind.

In Fig. 4b, we explored the effect of increasing recall J at specific uniform-timeshift d2 values. The left side of the graph indicates null recall J = 0, and the right side indicates total recall J = n for generating the final state matrix of O = 2n features, corresponding to the left-to-right orientation of the Fig. 4a schematics. We see that at sub-optimal timeshift R = d2 = 1 (black line) performance suffers as recall J increases. This is to be expected, since with increasing recall J values from 0 to n the node dimension of the reservoir decreases from 2n to n. However, the trend reverses unexpectedly at timeshift R = d2 = 21 (teal line), where we reach peak performance at total recall J = n, corresponding to the smallest node dimension n.

To explain this phenomenon, we can look to delay embedding. The optimal delayed states introduced in post-processing are those that are timeshifted by R = d2 = 21 input-steps into the past. Thus, at null recall J = 0 we generate the constructed timeshifted state matrix Sr with max-timeshift R = 21, and observe improved performance over sub-optimal delays. As we increase from null recall J = 0 to total recall J = n, we increase the number of delayed state replicas used for the final training state matrix. By increasing this recall J, performance subsequently improves. Once we are at total recall J = n we generate the uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$, which shows the best performance. The uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$ introduces delayed replicas of states at d2 = 21 input-steps into the past, and in this way is a more analogous implementation of delay embedding. This is in contrast to the constructed timeshifted state matrix Sr, which introduces delayed states from new nodes, rather than recalling states from the past. Viewed in this way, we consider recall J as a way to scale the influence of a larger reservoir vs the influence of delay embedding. Therefore, the optimal amount of recall J depends on whether the delay embedding is also optimal.

For completeness in Fig. 4b, we include the result of the state matrix with null recall J = 0 at d2 = 0 (orange dashed line), which is equivalent to a base state matrix derived from a reservoir of node dimension 2n. The inferior performance of the base state matrix derived from 2n nodes demonstrated that introducing even a sub-optimal uniform timeshift leads to better predictive performance.

In Fig. 4c, we see predictive performance when scanning increasing recall J with uniform-timeshift d2 values. We find superior performance (dark teal/black area) at total or near total recall J ≈ n when uniform-timeshift d2 is optimal, even though the reservoir node dimension is at its lowest. Moreover, this superior performance at total or near total recall J ≈ n is demonstrated even when the readout node dimension is not rescaled, but kept at 2n (Suppl. Note 1), indicating that explicit delay embedding effects introduced in post-processing dominate any implicit changes in reservoir memory as a result of changing the node dimension from 2n to n.

Figure 4d illustrates the effect of recall with N = 3 replicas, generating the multi-uniform-timeshifts state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}^{\frown}\boldsymbol{S}_{d_3^{\Sigma}}$ of feature dimension O = 3n. Even in this higher-dimensional case, the optimal configurations of d2 and d3 demonstrate superior predictive performance (dark teal/black area) as we approach total recall J = n. The node dimension subsequently shrinks from 3n to n, indicating that increasing the influence of an optimal delay embedding overcomes any potential negative effect of a smaller reservoir.

Given that increasing recall J reduces the reservoir node dimension n, we evaluated the effect of recall J on the covariance rank of the state matrix, because it has been hypothesised that larger covariance ranks generally lead to superior predictive performance7,12,39,40. We observed that the covariance rank derived from the same realisation set used for Fig. 4c is not correlated with zones of superior performance (Suppl. Note 1). This corroborates Fig. 3d, where the covariance rank of the random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$ monotonically increased with max-random-timeshift R.

Multi-random-timeshifting

We now introduce multi-random-timeshifting. The motivation for this post-processing method is to increase the feature dimension of the final training state matrix, allow for better delay embeddings and utilise the lower optimisation cost of randomly-timeshifting features. We were inspired by recall, which scales the influence of delay embedding, and leads to improved performance as if we had increased reservoir size directly.

Multi-random-timeshifting recalls past states and increases the state matrix feature dimension beyond O = n, as is the case for uniform-timeshifting but not for random-timeshifting. N state matrix replicas are each timeshifted by their own random-timeshift vector r, where the j-th component of the i-th random-timeshift vector, $r_{j}^{i} \in \boldsymbol{r}^{i}$, is unique across all random-timeshift vectors [r1, . . . , rN]. The randomly-timeshifted state matrix replicas are then horizontally concatenated. As an example, concatenating N = 2 randomly-timeshifted state matrix replicas

$$\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{2}} = \begin{bmatrix}\boldsymbol{s}_{1-\boldsymbol{r}^{1}}\\ \vdots \\ \boldsymbol{s}_{K-\boldsymbol{r}^{1}}\end{bmatrix}^{\frown}\begin{bmatrix}\boldsymbol{s}_{1-\boldsymbol{r}^{2}}\\ \vdots \\ \boldsymbol{s}_{K-\boldsymbol{r}^{2}}\end{bmatrix} = \begin{bmatrix}s_{1-r_{1}^{1},1}&\ldots&s_{1-r_{n}^{1},n}\\ \vdots& &\vdots\\ s_{K-r_{1}^{1},1}&\ldots&s_{K-r_{n}^{1},n}\end{bmatrix}^{\frown}\begin{bmatrix}s_{1-r_{1}^{2},1}&\ldots&s_{1-r_{n}^{2},n}\\ \vdots& &\vdots\\ s_{K-r_{1}^{2},1}&\ldots&s_{K-r_{n}^{2},n}\end{bmatrix}$$

generates a multi-random-timeshifts state matrix of feature dimension O = 2n, illustrated in Fig. 2 at N = 2 replicas. Multi-random-timeshifting with N replicas means that the states of each node are recalled up to N times, in order to generate the multi-random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$ of O = Nn features.

We can generate any desired feature dimension O using this method. In the case where the number of features n of the base state matrix S is not a divisor of the desired feature dimension O, the N-th random-timeshifts state matrix replica $\boldsymbol{S}_{\boldsymbol{r}^{N}}$ is only a partial replica, omitting the last $\left(n-(O \bmod n)\right)$ features and thus containing only the first $O \bmod n$ features.
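A minimal sketch of multi-random-timeshifting for an arbitrary feature dimension O, under our reading that the shifts used for each node are drawn without replacement from {0, . . . , R} (so no delayed state is duplicated) and that a partial final replica simply takes the first remaining columns; the padding of leading rows follows the earlier sketches and is an assumption.

```python
import numpy as np

def multi_random_timeshift_matrix(states, O, R, rng=None):
    """Build a (K, O) multi-random-timeshifts state matrix from a (K, n) readout."""
    rng = np.random.default_rng() if rng is None else rng
    K, n = states.shape
    N = -(-O // n)                                  # number of (possibly partial) replicas
    assert R >= N - 1, "need at least N distinct shifts per node (R >= N - 1)"
    shifts = [rng.permutation(R + 1)[:N] for _ in range(n)]   # unique shifts per node
    out = np.empty((K, O))
    for col in range(O):
        replica, j = divmod(col, n)                 # replica index and source node
        d = shifts[j][replica]
        out[d:, col] = states[:K - d, j]
        out[:d, col] = states[0, j]                 # pad the undefined leading rows
    return out
```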

Multi-random-timeshifting gives the best performance

The multi-random-timeshifting results in Table 2 list the best performances when concatenating N ≥ 2 randomly-timeshifted state matrix replicas $\boldsymbol{S}_{\boldsymbol{r}^{N}}$. The optimal max-random-timeshift R is also listed. We report a 62% decrease in error when training on N = 2 concatenated randomly-timeshifted state matrices, compared to the single N = 1 randomly-timeshifted state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}$. The resulting multi-random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{2}}$ has O = 2n features.

Performance up to N = 6 replicas shows continued improvement, training on feature dimensions up to O = 6n. Additionally, we observe that predictive performance up to N = 6 is the same when implementing either multi-uniform-timeshifting or multi-random-timeshifting.

We proceed with multi-random-timeshifting for N > 6 replicas. We see that performance continues to improve with increasing replica number N, beyond what we achieved with multi-uniform-timeshifting. This improvement is also visible as an increasingly better overlap between the predicted Lorenz z-coordinates $\hat{\boldsymbol{y}}$ and the target y (Suppl. Note 1).

We then investigated the parameter space for optimal max-random-timeshift R configurations, illustrated in Fig. 5a. Here, we scanned the max-random-timeshift R against increasing feature dimension of the multi-random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$. The size of the reservoir is clamped at n nodes, while the number of features O increases in increments of 1. When focusing on any specific max-random-timeshift R, we found that predictive performance generally improves with larger feature dimension O, even at partial multiples of n.

Fig. 5: Multi-random-timeshifting on the simulated reservoir computer, Lorenz x-to-z cross-prediction.

a Scan of feature dimension O (up to O = 13n features) vs max-random-timeshift R, for testing error. Maroon dots indicate optimal max-timeshift R at feature dimension O. The maroon dashed line indicates the minimum value that the max-random-timeshift R may take at a given feature dimension. b Scan of feature dimension O (up to O = 120n features) vs max-random-timeshift R, for testing error (teal line). For O ≤ 20n, data are complete for partial replicas of the state matrix. Maroon dots indicate optimal max-timeshift R at feature dimension O. The maroon dashed line indicates the minimum value that the max-random-timeshift R may take at a given feature dimension. The orange dashed line indicates the result of training on the base state matrix S. The shaded area around each line indicates the MAD of the realisation set. NRMSE normalised root mean square error, MAD median absolute deviation.


Another illustration of these data is given in Fig. 5b, where we observed that the optimal max-random-timeshift R (maroon dots) generally increases as the feature dimension O increases. This could be expected, since a higher number of sampled states from each node must be recalled in order to generate the multi-random-timeshifts state matrix $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$. With very high replica numbers up to N ≈ 100, multi-random-timeshifting continues to improve (teal line). Beyond N ≈ 100 replicas, we observe the optimal max-random-timeshift R-value consistently at the theoretical minimum R = N − 1, and predictive performance no longer improves.

The clamped reservoir of n nodes at total recall J = n performed just as well as the correspondingly upscaled reservoir of Nn nodes at null recall J = 0, indicating that implementing multi-random-timeshifting with total recall J = n is an effective way to increase the feature dimension O and maintain the superior predictive performance of this method, while also clamping the reservoir size to n nodes (Suppl. Note 1).

The simple post-processing and the multi-random-timeshifting methods were established on the Lorenz attractor. These methods were corroborated with the Mackey–Glass P-to-P 10-step-ahead forecasting (Suppl. Note 3) and Rössler x-to-z cross-prediction (Suppl. Note 4) tasks on the simulated reservoir computer. We consistently observed our post-processing methods improving performance. As expected, increasing the number of replicas N improved performance even further. Moreover, increased recall J resulted in superior performance, while also reducing reservoir node dimension from Nn to n.

Optimising timeshift delays

Finding the optimal timeshift configurations for our post-processing methods is key to an optimal delay embedding. For multi-uniform-timeshifting, we saw that the optimal uniform-timeshifts d vary with increasing N. As the number of replicas N increases, the uniform-timeshift configuration $\boldsymbol{d} = [d_2,\ldots,d_N]$ must be re-optimised by an (N − 1)-dimensional scan. To perform this optimisation for all uniform-timeshifts d, the number of linear regression operations required increases exponentially as

$$d_{\mathrm{scan}}^{\,N-1}$$

where dscan is the maximum of the range of uniform-timeshifts $d \in \boldsymbol{d}$ to be scanned.

For multi-random-timeshifting, only a 1-dimensional scan of the max-random-timeshift R is required even as the number of replicas N increases. The number of linear regression operations required for optimising the max-random-timeshift R increases as

$$R_{\mathrm{scan}}-N+1$$

where Rscan is the maximum of the range of max-random-timeshifts R to be scanned.

Computational time benchmarks on our simulated reservoir computer are given as an example. We require 190 s to compute a random-mask realisation set for a post-processed training matrix of O = 10n features from N = 10 replicas. Optimising the max-random-timeshift R, with a scanning range of Rscan = 180, then requires

$$190\,\mathrm{s}\times(180-10+1)\approx 9\,\mathrm{h}.$$

Comparatively, the computational time required to optimise the multi-uniform-timeshifts $\boldsymbol{d} = [d_2,\ldots,d_{10}]$, with a scanning range of dscan = 20 for each uniform-timeshift such that dscan(N − 1) = 180 matches the max-random-timeshift scanning range Rscan = 180, is

$$190\,\mathrm{s}\times 20^{10-1}\approx 3.1\ \text{million years}.$$
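The scan-count arithmetic above can be reproduced with the values quoted in the text (190 s per realisation set, Rscan = 180, dscan = 20, N = 10):

```python
t_set = 190.0                                   # seconds per random-mask realisation set
R_scan, d_scan, N = 180, 20, 10

t_random = t_set * (R_scan - N + 1)             # multi-random-timeshifting: 1-D scan
t_uniform = t_set * d_scan ** (N - 1)           # multi-uniform-timeshifting: (N-1)-D scan

print(t_random / 3600)                          # ~9 hours
print(t_uniform / (3600 * 24 * 365.25))         # ~3.1 million years
```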

Due to this ease of optimisation for larger Nn feature dimensions, multi-random-timeshifting ultimately outperforms multi-uniform-timeshifting in terms of scalability and predictive performance. By similar reasoning, random-timeshifting is preferable to non-random timeshifting of training features: finding non-random, optimal timeshifts $[r_1,\ldots,r_n] = \boldsymbol{r}$ quickly becomes computationally intractable as n increases, due to the n-dimensional scan required to optimise every timeshift $r \in \boldsymbol{r}$ applied to each feature. First-order approximations of optimal timeshifts have been implemented previously6.

Translation to a physical reservoir

Post-processing methods were established, characterised and corroborated on our simulated reservoir computer. Implementing our post-processing methods improves predictive performance, and they need only be applied after the readout is generated. Thus, in this section we translate our methods to a physical reservoir implemented in a photonic system, enabling high-speed operation using GHz-rate data inputs. To accomplish this, we took the readout of a previously published VCSEL-based photonic reservoir system17 that was used for high-dimensional mapping of the input. Figure 6a illustrates the experimental setup. As in the simulated reservoir, the experimental setup features an internal time delay (τ), as well as masking (m) and scaling (g) of the input. The key difference between the simulated reservoir and the experimental reservoir is the number of nodes n. Despite these differences in the readout, the post-processing methodologies remained unaffected and could be implemented as established.

Fig. 6: Timeshifting on the physical photonic reservoir computer, Mackey–Glass P-to-P 10-step-ahead forecasting.
figure 6

a Experimental setup of our physical reservoir17. The device receives and processes the input signal x(t) (gold components) before feeding into the reservoir that performs nonlinear mapping (orange components). The reservoir state is observed and collected as the physical reservoir’s readout (teal components). b Scan of multi-uniform-timeshifts d2 vs d3, for testing error at N = 3 replicas and O = 3n features. c Scan of feature dimension O (up to O = 20n features) vs max-random-timeshift R, for testing error (teal line). Maroon dots indicate optimal max-timeshift R at feature dimension O. The maroon dashed line indicates the minimum value that the max-random-timeshift R may take at a given feature dimension. The orange dashed line indicates the result of training on the base state matrix S. The shaded area around each line indicates the MAD of the realisation set. d Simulated reservoir comparison to panel (b). NRMSE normalised root mean square error, MAD median absolute deviation.


In addition to the physical reservoir readout, we took our driving P-coordinates of the input series, corresponding to a Mackey–Glass attractor. This enabled us to choose our prediction task, which we set as P-to-P 10-step-ahead forecasting.

In the "uniform-timeshifting, $\boldsymbol{S}^{\frown}\ldots^{\frown}\boldsymbol{S}_{d_N^{\Sigma}}$" column of Table 3, replica number N = 1 corresponds to training on the base state matrix S, without post-processing. We report a 45% decrease in error when training on the N = 2 uniform-timeshift state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}$. This improved performance is consistent with the improvements seen in the simulated reservoir computer. Training on the N = 3 multi-uniform-timeshifts state matrix $\boldsymbol{S}^{\frown}\boldsymbol{S}_{d_2}^{\frown}\boldsymbol{S}_{d_3^{\Sigma}}$ gives a further 27% decrease in error from N = 2. Training on the N ≥ 3 multi-uniform-timeshifts state matrices $\boldsymbol{S}^{\frown}\ldots^{\frown}\boldsymbol{S}_{d_N^{\Sigma}}$ shows progressively improving performance. The uniform-timeshifts {d2, . . . , dN} must be re-optimised with increasing N replicas, which is also consistent with the simulated reservoir computer. Additionally, Fig. 6b shows the parameter space of optimal configurations (dark teal/black) for the uniform-timeshifts d2 and d3. The optimal configuration space on the physical reservoir follows a similar configuration space observed on the simulated reservoir computer for the same Mackey–Glass P-to-P 10-step-ahead forecasting task, shown in Fig. 6d.

Table 3 Physical photonic reservoir computer results of Mackey–Glass P-to-P 10-step-ahead forecasting

In the "random-timeshifting, $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$" column of Table 3, starting at N = 1 replica we report that optimal random-timeshifting leads to a 51% decrease in error compared to base training. Consistent with observations on the simulated reservoir system, Fig. 6c illustrates how training on N ≥ 2 multi-random-timeshifts state matrices $\boldsymbol{S}_{\boldsymbol{r}^{1}}^{\frown}\ldots^{\frown}\boldsymbol{S}_{\boldsymbol{r}^{N}}$ progressively improves predictive performance (teal line). We also observed that the optimal max-random-timeshift R (maroon dots) generally increases with replica number N, as was observed on the simulated reservoir computer.

Mackey–Glass P-to-P 1-step-ahead forecasting was also carried out (Suppl. Note 5). Our results for training on the base state matrix S without post-processing are consistent with previously published results for this task17.

Taken together, these consistencies suggest that post-processing methods established using simulated reservoirs can be translated to experimentally realised readout data generated from a physical reservoir system.

Conclusion

We established simple post-processing methods on our simulated reservoir computer for a variety of chaotic attractor prediction tasks. These methods introduced delayed versions of reservoir states for training, as inspired by delay embedding.

Improved performance of training on the optimal uniform-timeshift state matrix, which concatenates a uniformly-timeshifted state matrix to the base state matrix, was partly attributed to its increased feature dimension35. Uniform-timeshifting can thus be viewed as a computationally inexpensive post-processing alternative to increasing the reservoir node dimension for higher-dimensional mapping of the input36,37, while achieving similar results.

We expanded uniform-timeshifting into multi-uniform-timeshifting, which concatenates uniformly-timeshifted state matrices into a multi-uniform-timeshifts state matrix. The improved performance with increasing state matrix replicas is consistent with delay embedding playing a role. Additionally, the optimal uniform timeshifts are not evenly spaced, indicating that a non-uniform delay embedding is ideal.

Random-timeshifting of base state matrix features was established on our simulated computer. The performance improvement when training on the optimal random-timeshifts state matrix was consistent with previous work6,7. However, when scanning max-random-timeshifts at optimal and sub-optimal values, we did not observe that state matrices with larger covariance rank generally outperform state matrices with smaller covariance rank7,12,39,40. A larger covariance rank therefore does not fully explain why optimal random-timeshifting outperformed training on a base state matrix of equal feature dimension.
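The covariance-rank diagnostic referred to here can be computed in a few lines; the centring of the state matrix and the default rank tolerance in this sketch are assumptions rather than choices fixed by the text.

```python
import numpy as np

def covariance_rank(X, tol=None):
    """Numerical rank of the covariance matrix of a training state
    matrix X (rows = time steps, columns = features)."""
    Xc = X - X.mean(axis=0)            # centre each feature
    C = (Xc.T @ Xc) / (len(Xc) - 1)    # feature-by-feature covariance
    return np.linalg.matrix_rank(C, tol=tol)

# As noted above, a larger covariance rank did not always coincide
# with a lower prediction error in our timeshift scans.
```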

Under optimal timeshift delays, we found that recalling past states from existing nodes is superior to collecting extra node states from a larger reservoir. This finding inspired multi-random-timeshifting, which randomly recalls past states of a clamped reservoir in order to increase the number of training features. Multi-random-timeshifting performs just as well as multi-uniform-timeshifting when training on equivalent feature dimensions derived from equivalent reservoir sizes. However, because multi-random-timeshifting is much cheaper to optimise, we were able to reach better performance with this method at feature dimensions that are intractable to optimise for multi-uniform-timeshifting. Multi-random-timeshifting is thus our preferred post-processing method among those investigated in this paper.

We translated our established methods to an experimentally realised physical photonic reservoir system17, which is possible because our post-processing methods are applied to the training scheme after the reservoir states have been generated. We showed that our methods improved predictive performance. We believe that integrating multi-random-timeshifting into the standard training scheme of both simulated and physical reservoir computers can improve performance for minimal computational investment.
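A minimal sketch of how multi-random-timeshifting could slot into a standard ridge-regression training scheme is shown below; the ridge parameter, the stand-in data, and the way targets are aligned for 10-step-ahead forecasting are illustrative assumptions rather than the settings used in our experiments.

```python
import numpy as np

def train_readout(X, Y, ridge=1e-6):
    """Ridge (Tikhonov) regression for the output weights W_out,
    the only trained part of the reservoir computer: X @ W_out ≈ Y."""
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)

rng = np.random.default_rng(0)
S = rng.standard_normal((5000, 50))     # stand-in stored readout states
u = rng.standard_normal(5000)           # stand-in driving input series
R, N, steps_ahead = 20, 3, 10

# Multi-random-timeshifting: N replicas, one random delay per node each.
blocks = []
for _ in range(N):
    d = rng.integers(0, R + 1, size=S.shape[1])
    blocks.append(np.column_stack([S[R - dj: len(S) - dj, j]
                                   for j, dj in enumerate(d)]))
X = np.hstack(blocks)

Y = u[R + steps_ahead:]                 # 10-step-ahead targets
X = X[:len(Y)]                          # align features with targets
W_out = train_readout(X, Y[:, None])    # train exactly as without post-processing
print(W_out.shape)                      # (150, 1)
```

Because the post-processing only rearranges stored states before the regression, the training step itself is unchanged.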

Our implementation of the multi-random-timeshifting algorithm is applicable to autonomous prediction tasks, such as closed-loop prediction of the Lorenz attractor41. Aside from influencing training performance, the multi-random-timeshifting algorithm introduces delay terms into the trained autonomous system when the reservoir is operated in closed loop. The influence of these time delays on the solutions and stability of the trained autonomous system remains to be explored in future work.
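To make the appearance of delay terms concrete, the following sketch shows one closed-loop step in which predictions are formed from buffered past reservoir states. The reservoir_update function, the use of whole-state (per-replica) delays rather than per-node delays, and the single 1-D output readout are simplifying assumptions of this sketch.

```python
from collections import deque
import numpy as np

def closed_loop_step(reservoir_update, W_out, delays, buffer):
    """One step of the trained system run autonomously.

    buffer : deque of the last max(delays)+1 reservoir state vectors,
             so the closed-loop system explicitly depends on delayed states.
    W_out  : 1-D readout weight vector trained on timeshifted features.
    """
    x = np.concatenate([buffer[-1 - d] for d in delays])  # timeshifted features
    u_fb = float(x @ W_out)                  # prediction, fed back as next input
    s_next = reservoir_update(buffer[-1], u_fb)  # advance the reservoir (user-supplied)
    buffer.append(s_next)                    # maxlen deque drops the oldest state
    return u_fb, buffer

# The buffer must be warmed up with max(delays)+1 open-loop states, e.g.
# buffer = deque(warmup_states[-(max(delays) + 1):], maxlen=max(delays) + 1)
```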
