Annealing-inspired training of an optical neural network with ternary weights

Introduction

Artificial neural networks (ANNs) represent a fundamentally connectionist and distributed approach to computing, differing from classical von Neumann architectures. Over the past decade, ANNs have revolutionized computing1, disrupting fields ranging from natural language2,3 and image processing4 to self-driving vehicles5 and game playing6. The success of these systems is based on their flexibility, high performance in solving abstract tasks, and their fundamentally parallel approach to information processing, allowing them to distill knowledge from large amounts of data. Because they conceptually differ from classical computers, ANNs have to be emulated when running on conventional hardware. This has led, on the one hand, to the meteoric rise of more parallel von Neumann processors such as graphical processing units and application specific tensor processing units7. And, on the other hand, it has spurred significant research interest in developing new hardware to enable more efficient implementations of ANNs8, often leveraging the strengths of unconventional platforms to either realize co-processors or to build autonomous hardware that directly maps ANN topologies onto the physical substrate.

Although optical neural networks (ONNs) were already demonstrated decades ago9,10, photonics has risen again as a particularly promising platform11,12, offering scalability13,14,15, high speed16,17,18, energy efficiency19, and the capability for parallel information processing17,20. Recent advances in photonic hardware include on-chip integrated tensor cores21, and high dimensional optical pre-processors22,23. Among these, semiconductor lasers have emerged as major candidates to implement ONNs due to their ultra-fast modulation rates and complex dynamics17. Vertical-cavity surface-emitting lasers (VCSELs) are notable for their efficiency, speed, intrinsic nonlinearity, and mature complementary metal-oxide-semiconductor (CMOS) compatible fabrication process24,25.

Reservoir computing (RC)26,27 simplifies the use of recurrent neural networks (RNNs) by eliminating the need for intensive backpropagation through time training. In RC, input data is transformed through a high-dimensional network of fixed, interconnected nonlinear nodes (the reservoir), and only the output weights are trained. Thus, RC and its closely related extreme learning machine28 can be thought of as the lowest complexity architecture for ANNs, and as such have been implemented on a wide variety of physical substrates29 ranging from electronics30 and spintronics31, to optics for ultra-fast information processing using a time-multiplexed approach in refs. 17,30, a frequency-multiplexed approach in ref. 20, and spatial multiplexing in refs. 32,33, and mechanical substrates in ref. 34. Recently, there have also been efforts towards a quantum RC implementation to leverage the exponential scaling of the Hilbert space as a high dimensional resource for information processing35.

With the objective of leveraging unconventional ANN hardware to solve a particular task, there have been significant advances in hardware-compatible training algorithms36,37,38,39,40. These advances encompass a variety of approaches, including model-based techniques such as backpropagation using a digital twin36 and augmented direct feedback alignment39, as well as experimental methods for implementing in-situ backpropagation40,41. Additionally, model-free or black-box strategies have gained traction32,37,42,43. Yet, fully realized analog hardware implementations of these techniques remain scarce. For unconventional neural networks to be truly competitive, implementing in-situ training techniques is essential to overcome numerous bottlenecks and to reduce dependence on external high-performance computers, whose usage generally challenges the overall benefit of unconventional ANN computing substrates44.

In refs. 32,33, we demonstrated RC through spatial multiplexing of modes on a large area VCSEL (LA-VCSEL), creating a parallel and autonomous network that minimizes the need for an external computer via hardware learning rules with Boolean weights. Here, we crucially expand on our previous work by introducing a minimalist implementation of ternary weights that requires no physical changes to existing hardware and can be applied broadly to other systems relying on Boolean weights. Binary neural networks leveraging Boolean weights and activation functions have garnered significant attention as they are inherently simpler. Indeed, complex operations such as multiplication can be replaced with simpler bitwise operations such as XNOR and population count, which drastically reduce computational overhead while maintaining reasonable accuracy in certain tasks45,46,47. Despite these advantages, the simplicity of binary weights can limit the representational capacity of the network, which may impact performance on more complex tasks. For additional information, we refer to the reviews on binary neural networks in refs. 48,49. In the context of physically implemented neural networks, binary weights have additional appeal as they do not require high-complexity control. Indeed, using bistable systems such as ferro-electric devices offers a clear avenue towards non-volatile weights, with the associated substantial gains in energy efficiency and system simplicity. Ternary weights provide a richer representation while still maintaining low system complexity and energy costs. This added expressiveness significantly enhances the learning ability of the network, enabling improved accuracy compared to binary approaches without a proportional increase in computational overhead50,51,52. As a consequence, there has been growing interest in applications of ternary weights in ANNs for increased efficiency, for instance in the context of large language models53, as well as in the context of electronic hardware implementations54,55,56. Although binary and ternary ANNs have been extensively studied, comparisons in the context of analog unconventional physical ANNs remain scarce. We demonstrate a significant performance increase using ternary weights on the Modified National Institute of Standards and Technology (MNIST) dataset, gaining on average 7% classification accuracy with our fully hardware-implemented and parallel ONN comprising ~450 neurons. Additionally, we present an original annealing-like algorithm compatible with both Boolean and ternary weights, which enhances performance and improves convergence speed. Finally, we measure the inference stability of our ONN, demonstrating extremely stable performance over a duration of more than 10 h, and hence address one of the major concerns of full hardware implementation of semiconductor laser neural networks.

Methods

Optical experimental setup

Our photonic ONN implementation builds upon our previous implementations based on LA-VCSELs presented in refs. 32,33. As such, it follows the RC architecture, and is therefore comprised of three parts: the input layer, the reservoir, and the output layer. A simplified diagram in Fig. 1a shows the working principle of our ONN, illustrating the analogy between the machine learning concept and optical hardware, while the complete experimental setup schematic is shown in Fig. 1b. Supplementary Table 1 gives a detailed list of all the components used.

Fig. 1: Experimental setup and concept scheme.

a Simplified diagram explaining our experimental scheme. A digital micro-mirror device (DMDa) encodes the input information u, which is randomly mixed through the complex transfer matrix of a multimode fiber (MMF). The large-area vertical cavity surface-emitting laser (LA-VCSEL) serves as a reservoir layer, nonlinearly transforming the input information through its inherent physics and yielding the reservoir response x. A second DMD (DMDb) implements programmable Boolean or ternary readout weights Wout, while a detector (DET) captures the output yout. b Schematic of the experimental setup: optical signals are guided using several beam splitters (BS 50/50 and 90/10). Two-lens systems image the injection onto the LA-VCSEL and direct its emission toward DMDb and a camera used for characterization purposes. We measure emission spectra using an optical spectrum analyzer (OSA), while a computer (PC) controls the experiment by handling data acquisition and loading. Li lenses, MO microscope objective, SMF single mode fiber, ECL external cavity laser, CAM camera.


The input layer consists of three main components: a continuously tunable external cavity laser (ECL, Toptica CTL 950) that we use to carry information, a digital micro-mirror device (DMD, Vialux XGA 0.7” V4100), and a multimode fiber (MMF, THORLABS M42L01). First, the single-mode fiber-coupled ECL is collimated with an f1 = 6 mm focal distance lens (L1, THORLABS C110TMD-B), yielding a collimated beam diameter of ~1.4 mm that is used to illuminate DMDa, which through its pixels modulates the input beam that carries information to the LA-VCSEL. A DMD is a digitally controlled matrix of micro-mirrors, which can flip between two angles, here ±12°, allowing us to display Boolean images that constitute our input information u. We then use a simple two-lens imaging system, L2 (THORLABS AC508-150-B-ML, f2 = 150 mm) and L3 (THORLABS C110TMD-B, f3 = 6 mm), adjusting the MMF’s image size to approximately the size of the collimated ECL beam on DMDa. This ensures maximal coupling into the 50 μm diameter MMF, which through its transmission matrix passively implements the random input weights Wrand for RC.

Following this, the near-field output of the MMF, Wrandu, is imaged onto the LA-VCSEL. Here we use an LA-VCSEL with an oxide aperture of ~55 μm and a threshold current of Ith = 20 mA. A standard ground-signal-ground (GSG) probe (Microworld MWRF-40A-GSG-200-LP) is used to bias the LA-VCSEL, and the laser is set to ~28 °C with a thermal stability on the order of 10 mK. The imaging system we use here is also a two-lens system. The output of the multimode fiber is collimated using L4 (THORLABS AC127-20-B-ML, f4 = 20 mm) and imaged onto the large area VCSEL using a microscope objective (MO, OLYMPUS LMPLN10XIR, fMO = 18 mm). We achieve a magnification factor of m = fMO/f4 = 0.9, closely matching the MMF’s and the LA-VCSEL’s apertures. Furthermore, the MMF’s numerical aperture (NA) should be chosen such that, considering the effects of magnification m, it is similar to, but smaller than, that of the LA-VCSEL. In our design, the LA-VCSEL’s physical properties and dynamics form the reservoir’s components, including nonlinear nodes and their interconnections. These nodes, represented by specific areas on the LA-VCSEL’s surface, interact through inherent physical mechanisms within the device. Localized Gaussian-like coupling arises from carrier diffusion in the semiconductor quantum wells, while a more complex global coupling is created by the optical field’s diffraction within the laser cavity. The LA-VCSEL takes the input information Wrandu and transforms it in a complex nonlinear way as is generally the case for optical injection into a semiconductor laser57. This process produces the reservoir state x, and the complexity of the process can be appreciated from the perturbed LA-VCSEL mode profile under injection shown in Fig. 1a.

Finally, the output layer is realized via DMDb and a photodetector (DET, Thorlabs PM100A, S150C). By imaging the LA-VCSEL’s near field onto DMDb, we discretize the VCSEL’s continuous spatial nonlinear response into a discrete matrix, sampling its surface and thus allowing us to adjust the effective neuron-count for one and the same device through super-pixel size control.

Imaging of the LA-VCSEL onto DMDb is realized via a two-lens system, using L5 (THORLABS AC254-100-B-ML, f5 = 100 mm) and MO. The choice of L5 is motivated by the size of DMDb’s pixels and the characteristic size of features in the LA-VCSEL’s mode profile. Ideally, we want to oversample the mode profile with the DMD pixels, which we ensure via magnification m = f5/fMO = 5.55. The size of speckles on the LA-VCSEL’s surface is ~5.6 μm, while the size of one mirror on DMDb is ~13.7 μm, and taking into account the magnification, every speckle is imaged onto ~5 mirrors. The LA-VCSEL image on DMDb therefore has a diameter of 305 μm, corresponding to an area of 24 × 24 DMD mirrors. L6 (THORLABS AC254-150-B-ML, f6 = 150 mm) and L7 (THORLABS AC254-45-B-ML, f7 = 45 mm) are chosen to demagnify the LA-VCSEL’s image to ensure it fits on the DET.

The rectangular region of interest on DMDb comprises 24 × 24 = 576 mirrors in total. Since the LA-VCSEL is circular, its area corresponds to 576 × π/4 ≈ 452 mirrors. We hence achieve approximately 450 fully parallel neurons. The DMD’s pixels act as a spatial filter between the LA-VCSEL and the DET, realizing output weights Wout by directing the optical signal of each DMDb mirror into one of two angular configurations: in one configuration the light reaches DET, while light from mirrors in the opposite configuration is discarded. DMDb therefore implements a Boolean readout matrix. Through DMDb, we then tune the spatial positions of the LA-VCSEL that contribute to the optical power detected at DET, and by optimizing this mirror configuration, we train the output y of the reservoir. Although the DMD is an inherently Boolean, i.e., binary device, we will later in the “Ternary weights implementation” subsection in “Methods” explain how we use it to implement ternary weights.
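To make the neuron count and the Boolean readout concrete, a toy numerical model is sketched below: the detected power is taken to be the sum of near-field intensities over the mirrors switched toward DET. This is a simplification that ignores diffraction and detection nonlinearity; array sizes follow the 24 × 24 region of interest described above, and all data here are placeholders.

```python
import numpy as np

# Region of interest on DMDb: 24 x 24 mirrors, of which only the circular
# LA-VCSEL image contributes, i.e. roughly 576 * pi / 4 ~ 452 effective neurons.
roi = 24
n_mirrors = roi * roi                              # 576
n_neurons = int(round(n_mirrors * np.pi / 4))      # ~452

# Toy LA-VCSEL near-field intensity sampled on the DMDb grid (placeholder data).
rng = np.random.default_rng(0)
x = rng.random((roi, roi))

# Boolean readout: each mirror either directs its light to DET (1) or discards it (0).
w_out = rng.integers(0, 2, size=(roi, roi))

# Detected power modeled as the sum of intensities over the selected mirrors.
y_out = np.sum(x * w_out)
print(n_neurons, y_out)
```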

As shown in Fig. 1b, a digital computer controls our experiment, sending the input data to be written on DMDa, recording the output via an oscilloscope as well as controlling the output weights on DMDb. This computer only acts as a supervisor to our experiment and to implement learning, and as such does not partake in any information processing. It could therefore be replaced by a low-power alternative such as a single board computer like the Raspberry Pi.

The data acquisition loop or forward pass of our ONN is as follows. A batch of N images is loaded on DMDa’s onboard memory to allow for a fast frame rate of 15 kHz. The computer then triggers DMDa, which in turn hardware-triggers the oscilloscope to start the acquisition of the signal detected at DET. Each frame on DMDa is displayed for 66 μs, which is orders of magnitude slower than the intrinsic time scales of the LA-VCSEL (~ 1 ns), meaning that we operate the LA-VCSEL and hence our ONN in its steady state. Consequently, we do not utilize the memory capacity of the LA-VCSEL, making our ONN functionally comparable to an extreme learning machine (ELM). However, certain characteristics, such as the internal coupling between nodes that is also dependent on the injected information, distinguish further between our experimental ONN and conventional ELM and RC schemes.

The waveform acquired via the oscilloscope is digitally downsampled to N points, yielding the output of the ONN yout, allowing us to compute an error between this output and a set target ytarget, according to the normalized mean square error (NMSE) computed at each epoch k:

$${\mathrm{NMSE}}_{k}=\frac{1}{N\times \mathrm{std}({\mathbf{y}}_{k}^{\mathrm{out}})}\sum_{i=1}^{N}{\left({\mathbf{y}}_{k}^{\mathrm{out}}(i)-{\mathbf{y}}^{\mathrm{target}}(i)\right)}^{2}.$$
(1)
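For concreteness, a minimal sketch of the digital post-processing is given below: the detector trace is reduced to one value per displayed frame (a simple block average is assumed here; the exact downsampling used in the experiment is not specified), and the error of Eq. (1) is then evaluated on the resulting N outputs.

```python
import numpy as np

def downsample_trace(trace, n_frames):
    """Reduce the oscilloscope trace to one output value per displayed DMDa frame.
    A block average per frame is assumed here for illustration."""
    samples_per_frame = len(trace) // n_frames
    trace = np.asarray(trace[: n_frames * samples_per_frame], dtype=float)
    return trace.reshape(n_frames, samples_per_frame).mean(axis=1)

def nmse(y_out, y_target):
    """Normalized mean square error of Eq. (1)."""
    y_out = np.asarray(y_out, dtype=float)
    y_target = np.asarray(y_target, dtype=float)
    n = y_out.size
    return np.sum((y_out - y_target) ** 2) / (n * np.std(y_out))
```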

Following the detailed characterization provided in ref. 32 and to ensure optimal performance, we operate the LA-VCSEL at Ibias = 1.5Ithreshold ~ 30 mA, see Fig. 2a. Moreover, the injection wavelength λinj ~ 919 nm was optimized to yield optimal injection locking conditions as shown in Fig. 2b, resulting in maximal suppression of the LA-VCSEL’s numerous free-running modes, and the injected power is set to match the emission power of the LA-VCSEL, yielding an injection power ratio PR = Pinj/PVCSEL ~ 1. The free running and injection locked mode profiles of the LA-VCSEL are shown in Fig. 2c.

Fig. 2: Experimental characterization of the LA-VCSEL.

a L-I curve (static optical output power versus forward bias current) of the LA-VCSEL. b Spectra of the free running (blue) and injection locked (red) LA-VCSEL. c Mode profiles of the free running (blue) and injection locked (red) LA-VCSEL.


Ternary weights implementation

A simple, yet powerful change to our previous experimental setup consists in the implementation of ternary weights, i.e., Wout ∈ {−1, 0, +1}, with virtually no change to the experiment. Let Wout be our output matrix with ternary entries. We can then define two non-negative Boolean matrices, Wout+ and Wout−, as follows:

$${\mathbf{W}}_{ij}^{\mathrm{out}+}=\left\{\begin{array}{ll}1\quad &\mathrm{if}\ {\mathbf{W}}_{ij}^{\mathrm{out}}=1,\\ 0\quad &\mathrm{otherwise}.\end{array}\right.$$
$${\mathbf{W}}_{ij}^{\mathrm{out}-}=\left\{\begin{array}{ll}1\quad &\mathrm{if}\ {\mathbf{W}}_{ij}^{\mathrm{out}}=-1,\\ 0\quad &\mathrm{otherwise}.\end{array}\right.$$

By sequentially displaying Wout+ and Wout− and measuring their respective output signals, yout+ and yout−, the output of our ONN, yout, is computed by electronic subtraction yout = yout+ − yout−. Note that both Wout+ and Wout− are positive Boolean matrices that by definition can be displayed on the DMD. The negative weights result from the subtraction of their respective output signals. Figure 3 gives a diagram view of these operations.
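A minimal sketch of this decomposition, reusing the masked-sum toy model from above for the measurement step (the actual measurement is of course performed optically), is given below.

```python
import numpy as np

def split_ternary(w_out):
    """Split a ternary matrix (entries in {-1, 0, +1}) into two Boolean masks."""
    w_plus = (w_out == 1).astype(int)
    w_minus = (w_out == -1).astype(int)
    return w_plus, w_minus

def measure(x, w_boolean):
    # Placeholder for the physical measurement: power reaching DET
    # when the Boolean mask w_boolean is displayed on DMDb.
    return np.sum(x * w_boolean)

rng = np.random.default_rng(1)
x = rng.random((24, 24))                            # toy LA-VCSEL response
w_out = rng.integers(-1, 2, size=(24, 24))          # ternary weights in {-1, 0, +1}

w_plus, w_minus = split_ternary(w_out)
y_out = measure(x, w_plus) - measure(x, w_minus)    # electronic subtraction
assert np.isclose(y_out, np.sum(x * w_out))         # equivalent to a signed readout
```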

Fig. 3: Ternary weights implementation.

a Diagram showing how we use a single digital micromirror device (DMD) to achieve ternary weights by time multiplexing the Boolean output weights, displaying the positive Wout+ and negative parts Wout− in succession. The corresponding positive and negative outputs are then measured using the detector (DET) and electronically subtracted, yout = yout+ − yout−. b Schematic of a potential experimental setup that would allow for simultaneous display of positive and negative weights on two different areas of DMDb, allowing the system to operate at full bandwidth. First, a copy of the VCSEL is made by placing a second 50:50 beamsplitter (BS) in collimated space after the microscope objective (MO), creating two equal beams. These beams are then imaged onto two distinct areas of the output DMD via lenses (gray ellipses) and a mirror that directs the second beam to ensure the two do not overlap. Finally, these two VCSEL copies on the DMD are imaged onto two separate detectors measuring positive and negative signals, which are then electronically subtracted.


However, in its present form, our implementation of ternary output weights results in halving the inference bandwidth due to its sequential nature, requiring two measurements at each step. To remedy this, we could create an optical copy of the LA-VCSEL state. Each copy would be sent to a separate area of DMDb, where the corresponding weight matrices Wout+ and Wout− would be displayed simultaneously; the two outputs would be detected by separate detectors and subtracted electronically in real-time. This would allow us to maintain the same inference bandwidth as with Boolean weights, while still benefiting from the performance increase provided by ternary weights. A similar concept was already leveraged in the first experimental ONN demonstration58. More recent implementations of positive and negative weights in the context of ONNs include refs. 21,59, which leverage balanced photodetection. Other, more complex techniques involving DMDs proposed in refs. 60,61 allow for multi-level modulation and can leverage both amplitude and phase modulation, potentially yielding analog positive and negative weights.

In-situ learning

Our in-situ optimization algorithm builds upon, and significantly improves, the evolutionary algorithm designed for Boolean learning presented in refs. 42,62. Originally, a single randomly chosen Boolean weight is inverted at each epoch, and if this change results in a decreased error it is kept, otherwise we revert the Boolean weight to its former value and then select a different Boolean weight. Here, instead of simply inverting a single Boolean weight, i.e., flipping a mirror, we propose flipping several mirrors per epoch. Furthermore, we make the number of flipped mirrors, nmirrors, adaptive and proportional to the error according to

$${n}^{\mathrm{mirrors}}=\lfloor \alpha \cdot {\mathrm{NMSE}}_{k}\rfloor +1,$$
(2)

where α is a hyperparameter that can be likened to a learning rate, although it controls not a step size but rather the number of mutations from one epoch to the next. Setting α = 0 corresponds to the original algorithm42,62, while the floor function and the + 1 ensure that at least one mirror is flipped at each epoch. The pseudo-code for our algorithm and optimization loop is given in Algorithm 1.

When using ternary weights, at each epoch, instead of simply flipping the state of the selected weights as in the Boolean case, each selected weight is individually assigned a value of either −1, 0, or +1 with equal probability, making our simple algorithm inherently compatible with Boolean and ternary weights.

This strategy combines elements of random search and a well-known gradient-free optimization technique known as simulated annealing (SA)63,64,65. Indeed, SA iteratively optimizes a function by applying random fluctuations to the given parameters while preferentially accepting changes that reduce the error, in a process similar to a random walk. In order to prevent the algorithm from getting stuck in local extrema, worse parameter configurations can also be accepted with a probability set by a so-called temperature parameter; as this temperature decreases with time, the algorithm settles at a given position in the landscape. Therefore, our strategy can be viewed as a version of simulated annealing where, instead of arbitrarily relying on time to force convergence, we directly leverage the error to guide our search in the landscape. As such, high values of α tend to correspond to high-temperature states in SA, which help escape local extrema.

Algorithm 1

Ternary Weights adaptive optimization

1: Initialize Wout randomly

2: Wout_best ← Wout

3: MSE_best ← Forward_pass(Wout)

4: while not converged do

5:   nmirrors ← ⌊α × MSE_best⌋ + 1

6:   Wout_temp ← Wout

7:   for i = 1 to nmirrors do

8:     Randomly select a mirror mi

9:     Randomly set Wout_temp[mi] ∈ {−1, 0, +1}

10:     end for

11:     MSE_temp ← Forward_pass(Wout_temp)

12:     if MSE_temp < MSE_best then

13:   Wout ← Wout_temp

14:   Wout_best ← Wout_temp

15:   MSE_best ← MSE_temp

16:     else

17:   Revert to Wout

18:     end if

19: end while

20: return Wout_best
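For concreteness, a Python sketch of Algorithm 1 is given below. The forward_pass argument stands in for the experimental loop (display the weights on DMDb, acquire the output, return the NMSE against the target); the rest follows the pseudo-code directly.

```python
import numpy as np

def train_ternary(forward_pass, shape, alpha, n_epochs, seed=0):
    """Adaptive annealing-inspired optimization of ternary output weights.
    forward_pass(w_out) must return the NMSE of the (hardware) output for weights w_out."""
    rng = np.random.default_rng(seed)
    w_best = rng.integers(-1, 2, size=shape)          # random ternary initialization
    err_best = forward_pass(w_best)
    history = [err_best]

    for _ in range(n_epochs):
        # Number of mutated mirrors is proportional to the current error (Eq. 2).
        n_mirrors = int(np.floor(alpha * err_best)) + 1
        w_temp = w_best.copy()
        idx = rng.choice(w_temp.size, size=n_mirrors, replace=False)
        w_temp.flat[idx] = rng.integers(-1, 2, size=n_mirrors)   # re-draw in {-1, 0, +1}

        err_temp = forward_pass(w_temp)
        if err_temp < err_best:                        # keep only improving mutations
            w_best, err_best = w_temp, err_temp
        history.append(err_best)

    return w_best, history
```

With α = 0, nmirrors is always 1 and the procedure reduces to the original single-flip rule of refs. 42,62.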

Results

Benchmark tasks

In order to benchmark our ONN and the corresponding training algorithm we use two tasks. The first is binary header recognition, the second is the hand-written MNIST dataset. Due to the scalar output given by the single photodetector DET, we use “one-vs-all” classification.

In a “one-vs-all” setting, for a task with nclasses classes we train nclasses independent binary classifiers whose purpose is to distinguish one specific class from all the others. This makes it generally ill-suited to problems with many classes66. In contrast, each individual binary classification problem is comparatively easier to solve than a single classification problem with nclasses classes, yet “one-vs-all” still provides sufficient ground for the comparison of different algorithms and their convergence behavior. For each comparison, all alternative approaches were also trained according to a one-vs-all scheme. We chose this method of classification primarily because we wanted to stay as close to a full hardware implementation as possible. For our tests, input sequences u consisted of N = 1000 images with a 50–50 split between positive and negative examples for one-vs-all classification.
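As an illustration of how such balanced sequences can be assembled in software, the sketch below draws a 50–50 split of positive and negative examples for one target digit; the binary target values (+1 for the chosen class, 0 otherwise) are an assumption made for illustration, not a detail given in the text.

```python
import numpy as np

def one_vs_all_split(images, labels, target_class, n_total=1000, seed=0):
    """Balanced 50-50 sequence of positive (target_class) and negative examples."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == target_class)
    neg = np.flatnonzero(labels != target_class)
    idx = np.concatenate([rng.choice(pos, n_total // 2, replace=False),
                          rng.choice(neg, n_total // 2, replace=False)])
    rng.shuffle(idx)
    u = images[idx]
    # Assumed binary target: +1 for the chosen class, 0 for all others.
    y_target = (labels[idx] == target_class).astype(float)
    return u, y_target
```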

As a reference, for the first task the input data consists of binary pie-shaped headers. The Gaussian input beam is divided into nbits equal sections that encode the header. Although seemingly simple, these orthogonal patterns are quite convenient and allow us to reliably scale the dimensionality and complexity of our dataset and computational task. A more detailed description of the encoding used for the binary header can be found in refs. 32,33.

Comparison with conventional simulated annealing

We start by comparing our adaptive modified annealing algorithm to existing variants of simulated annealing, which we adapt to conform to hardware limitations. In order to simulate conditions similar to our experimental ONN, we train an extreme learning machine (ELM) of 100 neurons on the “one-vs-all” MNIST classification with ternary weights using different variants of SA. When applied in the context of deep neural network optimization as shown in ref. 67, SA is generally less efficient, reaching lower performance levels and taking longer to converge compared to backpropagation. However, it should be stressed that in this work, optimization constraints are different given that we have a discrete optimization problem, and search space dimensionality is significantly reduced to ~450 ternary or Boolean weights.

SA was originally designed for continuous optimization, as opposed to discrete problems such as our ternary-weight ONN. As such, we compare our algorithm with basic or “Vanilla” SA and a “Thermo-Adaptive” SA, whose pseudo-codes are given in Supplementary Algorithms 1 and 2, respectively. In contrast with continuous SA optimization, where mutations reflect Euclidean differences, our discrete configuration implies that changes between states resemble a Hamming distance, essentially corresponding to the number of discrete transformations (or “flips”) required to move from one configuration to another.

In its “Vanilla” version, a random weight matrix is generated from scratch at each epoch. This entails that nmirrors is fixed and always equal to the number of parameters in the weight matrix. Parameter configurations that yield reduced errors are always kept, while those that yield worse errors can be kept with a probability dependent on the temperature, which decreases in time with a fixed schedule according to a cool-down rate β < 1, i.e., Tk+1 = β Tk, where k is an index keeping track of optimization steps or epochs. As a consequence, the “Vanilla” SA algorithm has two hyperparameters, β and the initial temperature T0. The “Thermo-Adaptive” version builds upon the simpler SA algorithm by adapting nmirrors over time with a fixed schedule nmirrors = α(1 − k/nepochs) + 1, adding an additional hyperparameter α.
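For reference, a compact sketch of these two baselines as described above is given below. The supplementary pseudo-code is not reproduced here; in particular, the Metropolis acceptance rule exp(−ΔE/T) is an assumed form of the temperature-dependent acceptance probability.

```python
import numpy as np

def vanilla_sa(forward_pass, shape, t0, beta, n_epochs, seed=0):
    """'Vanilla' SA: a completely new random ternary matrix is drawn each epoch;
    worse configurations are accepted with probability exp(-dE/T) (assumed),
    and the temperature cools as T <- beta * T."""
    rng = np.random.default_rng(seed)
    w = rng.integers(-1, 2, size=shape)
    err, temperature = forward_pass(w), t0
    for _ in range(n_epochs):
        w_new = rng.integers(-1, 2, size=shape)        # all weights redrawn
        err_new = forward_pass(w_new)
        if err_new < err or rng.random() < np.exp(-(err_new - err) / temperature):
            w, err = w_new, err_new
        temperature *= beta
    return w, err

def thermo_adaptive_sa(forward_pass, shape, t0, beta, alpha, n_epochs, seed=0):
    """'Thermo-Adaptive' SA: only n_mirrors weights are mutated per epoch, with
    n_mirrors shrinking on the fixed schedule alpha * (1 - k / n_epochs) + 1."""
    rng = np.random.default_rng(seed)
    w = rng.integers(-1, 2, size=shape)
    err, temperature = forward_pass(w), t0
    for k in range(n_epochs):
        n_mirrors = int(alpha * (1 - k / n_epochs)) + 1
        w_new = w.copy()
        idx = rng.choice(w.size, size=n_mirrors, replace=False)
        w_new.flat[idx] = rng.integers(-1, 2, size=n_mirrors)
        err_new = forward_pass(w_new)
        if err_new < err or rng.random() < np.exp(-(err_new - err) / temperature):
            w, err = w_new, err_new
        temperature *= beta
    return w, err
```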

Figure 4a shows the convergence behavior in terms of NMSE averaged across all classes in our “one-vs-all” MNIST classification problem for an ELM of 100 neurons using ternary weights with three different versions of SA, namely “Vanilla” (green), “Thermo-Adaptive” (pink) and our SA algorithm (blue). As a reference for convergence speed, we also show the baseline of a backpropagation-trained ELM trained with full weight resolution. While “Vanilla” SA can initially improve the error, it is unable to reach low error levels, converging to an average NMSE of 0.76 because the changes in the weight configuration are too extreme, since all weights are changed at each epoch. The other two versions of SA are able to overcome this effect through an adaptation of nmirrors; as a result, both algorithms reach similar levels of performance at NMSE = 0.35. Crucially, our algorithm reaches the same performance as the “Thermo-Adaptive” SA with close to an order of magnitude improvement in convergence speed. Moreover, our algorithm only requires tuning a single hyperparameter, which makes it much easier to set up.

Fig. 4: Comparison between different simulated annealing (SA) algorithms.

a Convergence of different SA algorithms on a ternary output layer of an ELM with 100 neurons. The “Vanilla” (green), “Thermo-Adaptive” (pink), and our SA variant (blue) are shown and compared with a backpropagation baseline training of the ELM using full resolution weights (grey). b Mean “one-vs-all” accuracy averaged across all classes of the MNIST dataset for the different algorithms shown in (a).


Hardware learning

We can now proceed to apply our adaptive SA algorithm to optimize the experimental ONN, beginning with an investigation of the effect of α on convergence behavior and performance. Figure 5 shows the impact of α on the performance and convergence for a 4-bit header recognition task. The error decreases faster for an optimal value of α = 10, yet higher values of α lead to a more unstable learning trajectory. Indeed, flipping more mirrors is akin to bigger steps taken in the search space at each epoch; as such, α behaves much like a learning rate in the context of gradient descent. Moreover, the errors reached by optimal values of α are lower. With α = 10 we have NMSE = 0.25 compared to 0.5 for the original algorithm with α = 0. This shows that the adaptive strategy is a simple yet powerful modification to the original algorithm42. We should note that α is task dependent. Leveraging this improved strategy, we are able to increase performance and reach a symbol error rate of 1.5% for a 6-bit header recognition task.

Fig. 5: Impact of hyperparameter α on the learning performance for a 4-bit header recognition task.

a Heatmap of NMSE across epochs and learning rates, illustrating how the choice of α influences convergence dynamics. b Final NMSE, averaged over the last 100 epochs, as a function of α, highlighting an optimal value around α = 10. c NMSE convergence comparison for α = 0 and α = 10, demonstrating that optimal α values lead to improved performance and faster convergence.


Compared to the synthetic binary header recognition benchmark, the MNIST task provides a more challenging and genuine image classification problem. Here we study the influence of α on convergence using one-vs-all classification of digits 8 and 9, the hardest in the dataset. Generally speaking, the same trend appears, with optimal α values in the range 5 ≤ α ≤ 20. Choosing optimal α values results in faster convergence and better performance, allowing the ONN to reach ~85% accuracy for α = 20 instead of ~81% for α = 2, while converging in ~200 epochs as opposed to ~600.

Our adaptive search strategy offers significant gains both in terms of performance and convergence speed, with minimal computational overhead and no changes to the existing hardware, while being compatible with both Boolean and ternary weights. Finally, in the context of RC or ELM, training of the linear output weights is often realized via a “single-shot” matrix-inversion approach as follows:

$${\mathbf{W}}^{\mathrm{out}}={\mathbf{X}}^{\dagger }\,{\mathbf{y}}^{\mathrm{train}},$$
(3)

where † denotes the Moore-Penrose pseudo-inverse, X is the so-called state collection matrix that contains the response of all nodes or neurons to each input example, and ytrain is the training set target for each corresponding input example. It is therefore interesting in the context of hardware learning to compare the computational cost of matrix-inversion training with that of iterative methods such as the one presented in this work. While it is true that training via matrix inversion is potentially realized in a single step, the physical nature of a scheme such as the one presented here requires measuring the response of each node individually to obtain all information to construct the state collection matrix. The associated measurement process scales with the number of neurons, nneurons = 450 in our case. From results presented in Fig. 6 we can see that our SA-inspired algorithm converges in ~200 epochs; moreover, convergence scales with network size as previously shown in ref. 42. As a consequence, training via matrix inversion and via our iterative method have comparable computational overheads in principle, since both scale linearly with network size. However, offline weight computation using matrix inversion is very likely not going to correspond with high accuracy to the physically detected output signal. For that, various optical effects such as diffraction at the DMD, aberrations by the imaging optics, and detection nonlinearity have to be included. The impact of these effects will differ for each Wout, rendering matrix inversion for ‘single-shot’ training of physical hardware likely inapplicable.
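For completeness, the “single-shot” training of Eq. (3) amounts to one pseudo-inverse (or, with regularization, ridge-regression) step on a state collection matrix X with one row per input example and one column per neuron; a minimal sketch follows.

```python
import numpy as np

def train_pinv(X, y_train, ridge=0.0):
    """Offline output weights via Eq. (3). X: (n_examples, n_neurons) state collection
    matrix, y_train: (n_examples,) targets. ridge > 0 gives Tikhonov-regularized weights."""
    if ridge == 0.0:
        return np.linalg.pinv(X) @ y_train
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(n), X.T @ y_train)
```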

Fig. 6: Impact of tuning hyperparameter α on the learning performance for classification of digit 8 in the MNIST dataset.

a Heatmap of NMSE across epochs and learning rates, illustrating the effect of α on convergence and performance. b Final classification accuracy [%], averaged over the last 100 epochs, as a function of α, showing a peak performance near α = 10. c Comparison of NMSE convergence for α = 2 and α = 20, highlighting that appropriate tuning of α results in better accuracy and faster error reduction.


The merit of our algorithm is that, just like other model-free methods, it allows optimization based solely on measurements of the output data y and hence offers the potentially lowest implementation complexity. In addition, relying only on simple random mutations does not involve heavy computations while being specifically designed to handle discrete problems. Furthermore, we found it to be more efficient than traditional SA methods in terms of performance and convergence speed. Unlike methods such as those described in refs. 36,39, where a digital twin or augmented direct feedback alignment are employed to estimate gradients or error signals, our approach does not require access to, or knowledge of, internal variables of the physical system such as neuron activations, which can be prohibitive, as LA-VCSELs are notoriously complex to probe and simulate68,69. This highlights that in numerous physical ANN implementations, model-free optimization or physical backpropagation may be the only realistically available training options.

Benefits of ternary weights

We can experimentally quantify the net gain following the implementation of ternary weights in our ONN. Table 1 and Supplementary Fig. 2 summarize the classification performance for one-vs-all classification on the MNIST dataset across various systems. We compare Boolean and ternary weights in our ONN, alongside a digital linear classifier and an ELM with 400 neurons trained via ridge regression on the same dataset.

Table 1 Comparison of mean test accuracy across methods

As a reminder, we used training and testing sequences u of length N = 1000 images. Crucially, Boolean weights perform poorly, achieving on average  ~ 83.5% classification accuracy on the testing dataset. In contrast, with no physical changes to our experimental setup, we can achieve 90.4% on average with ternary weights, approaching the digital linear limit of  ~ 91.8%.

To contextualize these results, the performance of an ELM with 400 neurons was also evaluated. When trained with ridge regression using full-precision weights via the pseudo-inverse method (pinv), the ELM achieves the highest accuracy of 93.2%. Interestingly, when constrained to ternary weights, the ELM’s performance remains nearly identical at 93.18%, indicating that increased weight resolution is not needed when using the 400 neuron ELM for this present task. This could be because the ELM compensates for low resolution with its high neuron-count. A similar effect is reported in ref. 52, where software ternary neural networks are made “wider”, i.e., more neurons per layer are used to counteract the limited resolution.

The performance of our ONN with the LA-VCSEL switched off was also measured with ternary weights as a reference and reaches ~87.2%. This corresponds to a more linear hardware system comprised only of the MMF as a passive linear mixing element and the absolute-square nonlinearity of optical detection. Yet, as shown in our present measurement, this nonlinearity is not sufficient and cannot result in performance close to a digital linear classifier. Interestingly, our ternary-weight ONN with the LA-VCSEL switched off performs better than when the LA-VCSEL is on and using Boolean weights, which indicates that at this very low end of resolution, weight resolution is a severely limiting factor. Conceptually, because they are positive, Boolean weights cannot exploit nodes or neurons whose responses are anti-correlated with the desired target. The respective weights for these nodes are set to 0 when using Boolean weights, which effectively restricts the dimensionality of our ONN. In addition, we performed simulations comparing an ELM of 100 neurons trained with our annealing-inspired algorithm in three different output weight configurations: Boolean 0/1; binary −1/+1; and ternary −1/0/+1, as shown in Supplementary Fig. 1. The possibility of having 0 even when negative weights are available clearly improves performance in certain architectures. Boolean weights achieve on average 88.6%, binary positive and negative weights, i.e., −1/+1, achieve 90.2%, while ternary weights achieve 91.2%, showing that there is a significant improvement when simply adding a “0” level.

This comparison highlights how crucial weight resolution and weights with a sign can be. Although the performance improvements when using ternary weights are substantial, our ONN still falls short and cannot outperform a digital linear classifier. It therefore follows that our next step should be to further improve our scheme in order to overcome the digital linear limit. A rather non-invasive way of improving our current experiment would be to increase the imaging magnification of the LA-VCSEL on DMDb by changing L5. Doing so would effectively provide higher-resolution weights since every LA-VCSEL speckle would be significantly oversampled, allowing for the creation of more fractional weights using super-pixels. Another way would be to replace DMDb with a liquid crystal on silicon spatial light modulator, which usually provides at least 8-bit resolution. Furthermore, the development and exploration of more advanced hardware-compatible learning algorithms, whether in the form of model-based or model-free black-box algorithms, could unlock the potential of this enhanced weight resolution and requires further investigation.
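To illustrate the super-pixel idea mentioned above: grouping k × k DMD mirrors into one logical weight yields k² + 1 evenly spaced non-negative levels (and, combined with the ternary scheme, signed levels from −1 to +1 in steps of 1/k²). The snippet below is purely conceptual.

```python
import numpy as np

def superpixel_levels(k, ternary=False):
    """Weight levels reachable by a k x k super-pixel of binary mirrors.
    With the ternary scheme each mirror contributes -1, 0 or +1 instead of 0 or 1."""
    n = k * k
    if ternary:
        return np.arange(-n, n + 1) / n      # 2*k**2 + 1 signed levels
    return np.arange(n + 1) / n              # k**2 + 1 non-negative levels

print(superpixel_levels(2))                  # [0. 0.25 0.5 0.75 1.]
```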

Long-term stability characterization

The ternary-weight ONN described above performs all processing steps online, reducing the need for additional offline resources. At the same time, due to its full hardware implementation, our system is prone to drifts. This makes long-term stability and robustness against drift crucial in our setup. These drifts are significantly mitigated through standard proportional-integral-derivative (PID) temperature control of the LA-VCSEL and mechanically securing the injection MMF in place. In our long-term characterization, we find that, after learning has converged, the system maintains stable performance for several hours, exhibiting only gradual performance degradation rather than abrupt drops. This stability suggests that continuous online learning can effectively counteract these slow drifts, maintaining consistent system performance over time.

In order to quantify the impact of physical drift on our ONN’s performance, we first conducted an optimization loop for 100 epochs, training the ONN to classify digit 0 of MNIST. After convergence to a low error, we kept the output weights fixed and continuously monitored the output of the system every 10 s over a period of 10 h. We were therefore able to measure long-term physical drifts and their impact on computing performance in our system, see Fig. 7a. Surprisingly, the error remains effectively constant over 10 h with no noticeable drift in terms of performance. Finally, Fig. 7b shows the raw output of our setup during these 10 h: the dark blue line shows the first response at the beginning of the 10 h, and the lighter blue lines show the subsequent responses over the duration of the experiment. The correlation between responses, i.e., consistency70, after learning is ~99.3%, showcasing how even with our simple control schemes we can greatly minimize instabilities in our system.

Fig. 7: Long-term stability characterization after learning.

a NMSE learning curve; after convergence, output weights are fixed and the output is measured every 10 s over a period of 10 h. b Output time traces over 10 h, the first output is shown in bold while following outputs are in a lighter shade. The correlation between responses is ~99.3%.


It should be noted that, because the inference speed is orders of magnitude slower than the dynamics of the LA-VCSEL, we do not measure the dynamical stability of our device on the timescale of its inherent dynamics. We rather measure a time-averaged or steady-state output of the LA-VCSEL, which together with injection locking further increases stability. In addition, we use a short MMF with a length of 1 m, which is inherently less susceptible to variations in its output speckle pattern compared to longer fibers. Indeed, longer MMFs are more vulnerable to external perturbations such as vibrations, temperature fluctuations, and mechanical stress. Moreover, the use of a DMD, whose pixels have only two possible mechanical positions, is inherently less prone to drifts than analog-valued weights as would be implemented with a liquid crystal on silicon spatial light modulator. Finally, the entire experimental apparatus is built around a cage system, preventing individual components from drifting independently.

Conclusion

We significantly improved upon our previous LA-VCSEL based ONN by demonstrating an efficient and low-complexity method to implement ternary weights that is broadly compatible with Boolean-weight-based RC. Crucially, we report significant improvements in performance without physical changes to our experimental setup.

In addition, we introduced an improved in-situ optimization algorithm that is compatible with both Boolean and ternary weights. We provided a detailed hyperparameter study of this algorithm for two different tasks, and experimentally verified its benefits in terms of convergence speed and performance gain, resulting in an increase of 7% on average when going from Boolean to ternary weights. The scalability of our algorithm to a potentially trainable input layer remains an open question, and we are actively exploring this direction for future work. We also confirmed that weight resolution is the main limiting factor in our relatively low neuron-count ONN, offering a clear avenue for future improvements. Finally, we experimentally characterized the long-term inference stability of our ONN and found that it was extremely stable, with a consistency above 99% over a period of more than 10 h. Our work is of high relevance in the context of in-situ learning, particularly under restricted hardware resources.
