Annealing-inspired training of an optical neural network with ternary weights

Introduction

Artificial neural networks (ANNs) represent a fundamentally connectionist and distributed approach to computing, differing from classical von Neumann architectures. Over the past decade, ANNs have revolutionized computing1, disrupting fields ranging from natural language2,3 and image processing4 to self-driving vehicles5 and game playing6. The success of these systems is based on their flexibility, high performance in solving abstract tasks, and their fundamentally parallel approach to information processing, allowing them to distill knowledge from large amounts of data. Because they conceptually differ from classical computers, ANNs have to be emulated when running on conventional hardware. This has led, on the one hand, to the meteoric rise of more parallel von Neumann processors such as graphical processing units and application specific tensor processing units7. And, on the other hand, it has spurred significant research interest in developing new hardware to enable more efficient implementations of ANNs8, often leveraging the strengths of unconventional platforms to either realize co-processors or to build autonomous hardware that directly maps ANN topologies onto the physical substrate.

Although optical neural networks (ONNs) were already demonstrated decades ago9,10, photonics has risen again as a particularly promising platform11,12, offering scalability13,14,15, high speed16,17,18, energy efficiency19, and the capability for parallel information processing17,20. Recent advances in photonic hardware include on-chip integrated tensor cores21, and high dimensional optical pre-processors22,23. Among these, semiconductor lasers have emerged as major candidates to implement ONNs due to their ultra-fast modulation rates and complex dynamics17. Vertical-cavity surface-emitting lasers (VCSELs) are notable for their efficiency, speed, intrinsic nonlinearity, and mature complementary metal-oxide-semiconductor (CMOS) compatible fabrication process24,25.

Reservoir computing (RC)26,27 simplifies the use of recurrent neural networks (RNNs) by eliminating the need for intensive backpropagation through time training. In RC, input data is transformed through a high-dimensional network of fixed, interconnected nonlinear nodes (the reservoir), and only the output weights are trained. Thus, RC and its closely related extreme learning machine28 can be thought of as the lowest complexity architecture for ANNs, and as such have been implemented on a wide variety of physical substrates29 ranging from electronics30 and spintronics31, to optics for ultra-fast information processing using a time-multiplexed approach in refs. 17,30, a frequency-multiplexed approach in ref. 20, and spatial multiplexing in refs. 32,33, and mechanical substrates in ref. 34. Recently, there have also been efforts towards a quantum RC implementation to leverage the exponential scaling of the Hilbert space as a high dimensional resource for information processing35.

With the objective of leveraging unconventional ANN hardware to solve a particular task, there have been significant advances in hardware-compatible training algorithms36,37,38,39,40. These advances encompass a variety of approaches, including model-based techniques such as backpropagation using a digital twin36 and augmented direct feedback alignment39, as well as experimental methods for implementing in-situ backpropagation40,41. Additionally, model-free or black-box strategies have gained traction32,37,42,43. Yet, fully realized analog hardware implementations of these techniques remain scarce. For unconventional neural networks to be truly competitive, implementing in-situ training techniques is essential to overcome numerous bottlenecks and to reduce dependence on external high-performance computers, whose usage generally challenges the overall benefit of unconventional ANN computing substrates44.

In refs. 32,33, we demonstrated RC through spatial multiplexing of modes on a large area VCSEL (LA-VCSEL), creating a parallel and autonomous network that minimizes the need for an external computer via hardware learning rules with Boolean weights. Here, we crucially expand on our previous work by introducing a minimalist implementation of ternary weights that requires no physical changes to existing hardware and can be applied broadly to other systems relying on Boolean weights. Binary neural networks leveraging Boolean weights and activation functions have garnered significant attention as they are inherently simpler. Indeed, complex operations such as multiplication can be replaced with simpler bitwise operations such as XNOR and population count, which drastically reduce computational overhead while maintaining reasonable accuracy in certain tasks45,46,47. Despite these advantages, the simplicity of binary weights can limit the representational capacity of the network, which may impact performance on more complex tasks. For additional information, we refer to the reviews on binary neural networks in refs. 48,49. In the context of physically implemented neural networks, binary weights have additional appeal as they do not require high-complexity control. Indeed, using bistable systems such as ferro-electric devices offers a clear avenue towards non-volatile weights, with the associated substantial gains in energy efficiency and system simplicity. Ternary weights provide a richer representation while still maintaining low system complexity and energy costs. This added expressiveness significantly enhances the learning ability of the network, enabling improved accuracy compared to binary approaches without a proportional increase in computational overhead50,51,52. As a consequence, there has been growing interest in applications of ternary weights in ANNs for increased efficiency, for instance in the context of large language models53, as well as in the context of electronic hardware implementations54,55,56. Although binary and ternary ANNs have been extensively studied, comparisons in the context of analog unconventional physical ANNs remain scarce. We demonstrate a significant performance increase using ternary weights on the Modified National Institute of Standards and Technology (MNIST) dataset, gaining on average 7% classification accuracy with our fully hardware-implemented and parallel ONN comprising ~450 neurons. Additionally, we present an original annealing-like algorithm compatible with both Boolean and ternary weights, which enhances performance and improves convergence speed. Finally, we measure the inference stability of our ONN, demonstrating extremely stable performance over a duration of more than 10 h, and hence address one of the major concerns of full hardware implementation of semiconductor laser neural networks.

Methods

Optical experimental setup

Our photonic ONN implementation builds upon our previous implementations based on LA-VCSELs presented in refs. 32,33. As such, it follows the RC architecture, and is therefore comprised of three parts: the input layer, the reservoir, and the output layer. A simplified diagram in Fig. 1a shows the working principle of our ONN, illustrating the analogy between the machine learning concept and optical hardware, while the complete experimental setup schematic is shown in Fig. 1b. Supplementary Table 1 gives a detailed list of all the components used.

Fig. 1: Experimental setup and concept scheme.

a Simplified diagram explaining our experimental scheme. A digital micro-mirror device (DMDa) encodes the input information u, which is randomly mixed through the complex transfer matrix of a multimode fiber (MMF). The large-area vertical cavity surface-emitting laser (LA-VCSEL) serves as a reservoir layer, nonlinearly transforming the input information through its inherent physics and yielding the reservoir response x. A second DMD (DMDb) implements programmable Boolean or ternary readout weights Wout, while a detector (DET) captures the output yout. b Schematic of the experimental setup: optical signals are guided using several beam splitters (BS 50/50 and 90/10). Two-lens systems image the injection onto the LA-VCSEL and direct its emission toward DMDb and a camera used for characterization purposes. We measure emission spectra using an optical spectrum analyzer (OSA), while a computer (PC) controls the experiment by handling data acquisition and loading. Li lenses, MO microscope objective, SMF single mode fiber, ECL external cavity laser, CAM camera.


The input layer consists of three main components: a continuously tunable external cavity laser (ECL, Toptica CTL 950) that we use to carry information, a digital micro-mirror device (DMD, Vialux XGA 0.7” V4100), and a multimode fiber (MMF, THORLABS M42L01). First, the single-mode fiber-coupled ECL is collimated with an f1 = 6 mm focal distance lens (L1, THORLABS C110TMD-B), yielding a collimated beam diameter of ~1.4 mm that is used to illuminate DMDa, which through its pixels modulates the input beam that carries information to the LA-VCSEL. A DMD is a digitally controlled matrix of micro-mirrors, which can flip between two angles, here ±12°, allowing us to display Boolean images that constitute our input information u. We then use a simple two-lens imaging system, L2 (THORLABS AC508-150-B-ML, f2 = 150 mm) and L3 (THORLABS C110TMD-B, f3 = 6 mm), adjusting the MMF’s image size to approximately the size of the collimated ECL beam on DMDa. This ensures maximal coupling into the 50 μm diameter MMF, which through its transmission matrix passively implements the random input weights Wrand for RC.

Following this, the near-field output of the MMF, Wrandu, is imaged onto the LA-VCSEL. Here we use an LA-VCSEL with an oxide aperture of ~55 μm and a threshold current of Ith = 20 mA. A standard ground-signal-ground (GSG) probe (Microworld MWRF-40A-GSG-200-LP) is used to bias the LA-VCSEL, and the laser is set to ~28 °C with a thermal stability on the order of 10 mK. The imaging system we use here is also a two-lens system. The output of the multimode fiber is collimated using L4 (THORLABS AC127-20-B-ML, f4 = 20 mm) and imaged onto the large area VCSEL using a microscope objective (MO, OLYMPUS LMPLN10XIR, fMO = 18 mm). We achieve a magnification factor of m = fMO/f4 = 0.9, closely matching the MMF’s and the LA-VCSEL’s apertures. Furthermore, the MMF’s numerical aperture (NA) should be chosen such that, considering the effects of magnification m, it is similar to, but smaller than, that of the LA-VCSEL. In our design, the LA-VCSEL’s physical properties and dynamics form the reservoir’s components, including nonlinear nodes and their interconnections. These nodes, represented by specific areas on the LA-VCSEL’s surface, interact through inherent physical mechanisms within the device. Localized Gaussian-like coupling arises from carrier diffusion in the semiconductor quantum wells, while a more complex global coupling is created by the optical field’s diffraction within the laser cavity. The LA-VCSEL takes the input information Wrandu and transforms it in a complex nonlinear way as is generally the case for optical injection into a semiconductor laser57. This process produces the reservoir state x, and the complexity of the process can be appreciated from the perturbed LA-VCSEL mode profile under injection shown in Fig. 1a.

Finally, the output layer is realized via DMDb and a photodetector (DET, Thorlabs PM100A, S150C). By imaging the LA-VCSEL’s near field onto DMDb, we discretize the VCSEL’s continuous spatial nonlinear response into a discrete matrix, sampling its surface and thus allowing us to adjust the effective neuron-count for one and the same device through super-pixel size control.

Imaging of the LA-VCSEL onto DMDb is realized via a two-lens system, using L5 (THORLABS AC254-100-B-ML, f5 = 100 mm) and MO. The choice of L5 is motivated by the size of DMDb’s pixels and the characteristic size of features in the LA-VCSEL’s mode profile. Ideally, we want to oversample the mode profile with the DMD pixels, which we ensure via magnification m = f5/fMO = 5.55. The size of speckles on the LA-VCSEL’s surface is ~5.6 μm, while the size of one mirror on DMDb is ~13.7 μm, and taking into account the magnification, every speckle is imaged onto ~5 mirrors. The LA-VCSEL image on DMDb therefore has a diameter of 305 μm, corresponding to an area of 24 × 24 DMD mirrors. L6 (THORLABS AC254-150-B-ML, f6 = 150 mm) and L7 (THORLABS AC254-45-B-ML, f7 = 45 mm) are chosen to demagnify the LA-VCSEL’s image to ensure it fits on the DET.

The rectangular region of interest on DMDb comprises 24 × 24 = 576 mirrors in total. Since the LA-VCSEL is circular, its area corresponds to 576 × π/4 ≈ 452 mirrors. We hence achieve approximately 450 fully parallel neurons. The DMD’s pixels act as a spatial filter between the LA-VCSEL and the DET, realizing output weights Wout by directing the optical signal of each DMDb mirror into one of two angular configurations: in one configuration the light reaches DET, while light from mirrors in the opposite configuration is discarded. DMDb therefore implements a Boolean readout matrix. Through DMDb, we then tune the spatial positions of the LA-VCSEL that contribute to the optical power detected at DET, and by optimizing this mirror configuration, we train the output y of the reservoir. Although the DMD is an inherently Boolean, i.e., binary device, we will later in the “Ternary weights implementation” subsection in “Methods” explain how we use it to implement ternary weights.
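To make the neuron count and the Boolean readout concrete, a toy numerical model is sketched below: the detected power is taken to be the sum of near-field intensities over the mirrors switched toward DET. This is a simplification that ignores diffraction and detection nonlinearity; array sizes follow the 24 × 24 region of interest described above, and all data here are placeholders.

```python
import numpy as np

# Region of interest on DMDb: 24 x 24 mirrors, of which only the circular
# LA-VCSEL image contributes, i.e. roughly 576 * pi / 4 ~ 452 effective neurons.
roi = 24
n_mirrors = roi * roi                              # 576
n_neurons = int(round(n_mirrors * np.pi / 4))      # ~452

# Toy LA-VCSEL near-field intensity sampled on the DMDb grid (placeholder data).
rng = np.random.default_rng(0)
x = rng.random((roi, roi))

# Boolean readout: each mirror either directs its light to DET (1) or discards it (0).
w_out = rng.integers(0, 2, size=(roi, roi))

# Detected power modeled as the sum of intensities over the selected mirrors.
y_out = np.sum(x * w_out)
print(n_neurons, y_out)
```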

As shown in Fig. 1b, a digital computer controls our experiment, sending the input data to be written on DMDa, recording the output via an oscilloscope as well as controlling the output weights on DMDb. This computer only acts as a supervisor to our experiment and to implement learning, and as such does not partake in any information processing. It could therefore be replaced by a low-power alternative such as a single board computer like the Raspberry Pi.

The data acquisition loop or forward pass of our ONN is as follows. A batch of N images is loaded on DMDa’s onboard memory to allow for a fast frame rate of 15 kHz. The computer then triggers DMDa, which in turn hardware-triggers the oscilloscope to start the acquisition of the signal detected at DET. Each frame on DMDa is displayed for 66 μs, which is orders of magnitude slower than the intrinsic time scales of the LA-VCSEL (~ 1 ns), meaning that we operate the LA-VCSEL and hence our ONN in its steady state. Consequently, we do not utilize the memory capacity of the LA-VCSEL, making our ONN functionally comparable to an extreme learning machine (ELM). However, certain characteristics, such as the internal coupling between nodes that is also dependent on the injected information, distinguish further between our experimental ONN and conventional ELM and RC schemes.

The waveform acquired via the oscilloscope is digitally downsampled to N points, yielding the output of the ONN yout, allowing us to compute an error between this output and a set target ytarget, according to the normalized mean square error (NMSE) computed at each epoch k:

$${\mathrm{NMSE}}_{k}=\frac{1}{N\times \mathrm{std}({\mathbf{y}}_{k}^{\mathrm{out}})}\sum_{i=1}^{N}{\left({\mathbf{y}}_{k}^{\mathrm{out}}(i)-{\mathbf{y}}^{\mathrm{target}}(i)\right)}^{2}.$$
(1)
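For concreteness, a minimal sketch of the digital post-processing is given below: the detector trace is reduced to one value per displayed frame (a simple block average is assumed here; the exact downsampling used in the experiment is not specified), and the error of Eq. (1) is then evaluated on the resulting N outputs.

```python
import numpy as np

def downsample_trace(trace, n_frames):
    """Reduce the oscilloscope trace to one output value per displayed DMDa frame.
    A block average per frame is assumed here for illustration."""
    samples_per_frame = len(trace) // n_frames
    trace = np.asarray(trace[: n_frames * samples_per_frame], dtype=float)
    return trace.reshape(n_frames, samples_per_frame).mean(axis=1)

def nmse(y_out, y_target):
    """Normalized mean square error of Eq. (1)."""
    y_out = np.asarray(y_out, dtype=float)
    y_target = np.asarray(y_target, dtype=float)
    n = y_out.size
    return np.sum((y_out - y_target) ** 2) / (n * np.std(y_out))
```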

Following the detailed characterization provided in ref. 32 and to ensure optimal performance, we operate the LA-VCSEL at Ibias = 1.5Ithreshold ~ 30 mA, see Fig. 2a. Moreover, the injection wavelength λinj ~ 919 nm was optimized to yield optimal injection locking conditions as shown in Fig. 2b, resulting in maximal suppression of the LA-VCSEL’s numerous free-running modes, and the injected power is set to match the emission power of the LA-VCSEL, yielding an injection power ratio PR = Pinj/PVCSEL ~ 1. The free running and injection locked mode profiles of the LA-VCSEL are shown in Fig. 2c.

Fig. 2: Experimental characterization of the LA-VCSEL.

a L-I curve (static optical output power versus forward bias current) of the LA-VCSEL. b Spectra of the free running (blue) and injection locked (red) LA-VCSEL. c Mode profiles of the free running (blue) and injection locked (red) LA-VCSEL.


Ternary weights implementation

A simple, yet powerful change to our previous experimental setup consists in the implementation of ternary weights, i.e., Wout ∈ {−1, 0, +1}, with virtually no change to the experiment. Let Wout be our output matrix with ternary entries. We can then define two non-negative Boolean matrices, Wout+ and Wout−, as follows:

$${\mathbf{W}}_{ij}^{\mathrm{out}+}=\left\{\begin{array}{ll}1\quad &\mathrm{if}\ {\mathbf{W}}_{ij}^{\mathrm{out}}=1,\\ 0\quad &\mathrm{otherwise}.\end{array}\right.$$
$${\mathbf{W}}_{ij}^{\mathrm{out}-}=\left\{\begin{array}{ll}1\quad &\mathrm{if}\ {\mathbf{W}}_{ij}^{\mathrm{out}}=-1,\\ 0\quad &\mathrm{otherwise}.\end{array}\right.$$

By sequentially displaying Wout+ and Wout− and measuring their respective output signals, yout+ and yout−, the output of our ONN, yout, is computed by electronic subtraction yout = yout+ − yout−. Note that both Wout+ and Wout− are positive Boolean matrices that by definition can be displayed on the DMD. The negative weights result from the subtraction of their respective output signals. Figure 3 gives a diagram view of these operations.
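A minimal sketch of this decomposition, reusing the masked-sum toy model from above for the measurement step (the actual measurement is of course performed optically), is given below.

```python
import numpy as np

def split_ternary(w_out):
    """Split a ternary matrix (entries in {-1, 0, +1}) into two Boolean masks."""
    w_plus = (w_out == 1).astype(int)
    w_minus = (w_out == -1).astype(int)
    return w_plus, w_minus

def measure(x, w_boolean):
    # Placeholder for the physical measurement: power reaching DET
    # when the Boolean mask w_boolean is displayed on DMDb.
    return np.sum(x * w_boolean)

rng = np.random.default_rng(1)
x = rng.random((24, 24))                            # toy LA-VCSEL response
w_out = rng.integers(-1, 2, size=(24, 24))          # ternary weights in {-1, 0, +1}

w_plus, w_minus = split_ternary(w_out)
y_out = measure(x, w_plus) - measure(x, w_minus)    # electronic subtraction
assert np.isclose(y_out, np.sum(x * w_out))         # equivalent to a signed readout
```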

Fig. 3: Ternary weights implementation.

a Diagram showing how we use a single digital micromirror device (DMD) to achieve ternary weights by time multiplexing the Boolean output weights, displaying the positive Wout+ and negative parts Wout− in succession. The corresponding positive and negative outputs are then measured using the detector (DET) and electronically subtracted, yout = yout+ − yout−. b Schematic of a potential experimental setup that would allow for simultaneous display of positive and negative weights on two different areas of DMDb, allowing the system to operate at full bandwidth. First, a copy of the VCSEL is made by placing a second 50:50 beamsplitter (BS) in collimated space after the microscope objective (MO), creating two equal beams. These beams are then imaged onto two distinct areas of the output DMD via lenses (gray ellipses) and a mirror that directs the second beam to ensure the two do not overlap. Finally, these two VCSEL copies on the DMD are imaged onto two separate detectors measuring positive and negative signals, which are then electronically subtracted.


However, in its present form, our implementation of ternary output weights results in halving the inference bandwidth due to its sequential nature, requiring two measurements at each step. To remedy this, we could create an optical copy of the LA-VCSEL state. Each copy would be sent to a separate area of DMDb, where the corresponding weight matrices Wout+ and Wout− would be displayed simultaneously; the two outputs would be detected by separate detectors and subtracted electronically in real-time. This would allow us to maintain the same inference bandwidth as with Boolean weights, while still benefiting from the performance increase provided by ternary weights. A similar concept was already leveraged in the first experimental ONN demonstration58. More recent implementations of positive and negative weights in the context of ONNs include refs. 21,59, which leverage balanced photodetection. Other, more complex techniques involving DMDs proposed in refs. 60,61 allow for multi-level modulation and can leverage both amplitude and phase modulation, potentially yielding analog positive and negative weights.

In-situ learning

Our in-situ optimization algorithm builds upon, and significantly improves, the evolutionary algorithm designed for Boolean learning presented in refs. 42,62. Originally, a single randomly chosen Boolean weight is inverted at each epoch, and if this change results in a decreased error it is kept, otherwise we revert the Boolean weight to its former value and then select a different Boolean weight. Here, instead of simply inverting a single Boolean weight, i.e., flipping a mirror, we propose flipping several mirrors per epoch. Furthermore, we make the number of flipped mirrors, nmirrors, adaptive and proportional to the error according to

$${n}^{\mathrm{mirrors}}=\lfloor \alpha \cdot {\mathrm{NMSE}}_{k}\rfloor +1,$$
(2)

where α is a hyperparameter that can be likened to a learning rate, although it controls not a step size but rather the number of mutations from one epoch to the next. Setting α = 0 corresponds to the original algorithm42,62, while the floor function and the + 1 ensure that at least one mirror is flipped at each epoch. The pseudo-code for our algorithm and optimization loop is given in Algorithm 1.

When using ternary weights, at each epoch, instead of simply flipping the state of the selected weights as in the Boolean case, each selected weight is individually assigned a value of either −1, 0, or +1 with equal probability, making our simple algorithm inherently compatible with Boolean and ternary weights.

This strategy combines elements of random search and a well-known gradient-free optimization technique known as simulated annealing (SA)63,64,65. Indeed, SA iteratively optimizes a function by applying random fluctuations to the given parameters while preferentially accepting changes that reduce the error, in a process similar to a random walk. In order to prevent the algorithm from getting stuck in local extrema, worse parameter configurations can also be accepted with a probability set by a so-called temperature parameter; as this temperature decreases with time, the algorithm settles at a given position in the landscape. Therefore, our strategy can be viewed as a version of simulated annealing where, instead of arbitrarily relying on time to force convergence, we directly leverage the error to guide our search in the landscape. As such, high values of α tend to correspond to high-temperature states in SA, which help escape local extrema.

Algorithm 1

Ternary Weights adaptive optimization

1: Initialize Wout randomly

2: Wout_best ← Wout

3: MSE_best ← Forward_pass(Wout)

4: while not converged do

5:   nmirrors ← ⌊α × MSE_best⌋ + 1

6:   Wout_temp ← Wout

7:   for i = 1 to nmirrors do

8:     Randomly select a mirror mi

9:     Randomly set Wout_temp[mi] ∈ {−1, 0, +1}

10:     end for

11:     MSE_temp ← Forward_pass(Wout_temp)

12:     if MSE_temp < MSE_best then

13:   Wout ← Wout_temp

14:   Wout_best ← Wout_temp

15:   MSE_best ← MSE_temp

16:     else

17:   Revert to Wout

18:     end if

19: end while

20: return Wout_best
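For concreteness, a Python sketch of Algorithm 1 is given below. The forward_pass argument stands in for the experimental loop (display the weights on DMDb, acquire the output, return the NMSE against the target); the rest follows the pseudo-code directly.

```python
import numpy as np

def train_ternary(forward_pass, shape, alpha, n_epochs, seed=0):
    """Adaptive annealing-inspired optimization of ternary output weights.
    forward_pass(w_out) must return the NMSE of the (hardware) output for weights w_out."""
    rng = np.random.default_rng(seed)
    w_best = rng.integers(-1, 2, size=shape)          # random ternary initialization
    err_best = forward_pass(w_best)
    history = [err_best]

    for _ in range(n_epochs):
        # Number of mutated mirrors is proportional to the current error (Eq. 2).
        n_mirrors = int(np.floor(alpha * err_best)) + 1
        w_temp = w_best.copy()
        idx = rng.choice(w_temp.size, size=n_mirrors, replace=False)
        w_temp.flat[idx] = rng.integers(-1, 2, size=n_mirrors)   # re-draw in {-1, 0, +1}

        err_temp = forward_pass(w_temp)
        if err_temp < err_best:                        # keep only improving mutations
            w_best, err_best = w_temp, err_temp
        history.append(err_best)

    return w_best, history
```

With α = 0, nmirrors is always 1 and the procedure reduces to the original single-flip rule of refs. 42,62.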

Results

Benchmark tasks

In order to benchmark our ONN and the corresponding training algorithm we use two tasks. The first is binary header recognition, the second is the hand-written MNIST dataset. Due to the scalar output given by the single photodetector DET, we use “one-vs-all” classification.

In a “one-vs-all” setting, for a task with nclasses classes we train nclasses independent binary classifiers whose purpose is to distinguish one specific class from all the others. This makes it generally ill-suited to problems with many classes66. In contrast, each individual binary classification problem is comparatively easier to solve than a single classification problem with nclasses classes, yet “one-vs-all” still provides sufficient ground for the comparison of different algorithms and their convergence behavior. For each comparison, all alternative approaches were also trained according to a one-vs-all scheme. We chose this method of classification primarily because we wanted to stay as close to a full hardware implementation as possible. For our tests, input sequences u consisted of N = 1000 images with a 50–50 split between positive and negative examples for one-vs-all classification.
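As an illustration of how such balanced sequences can be assembled in software, the sketch below draws a 50–50 split of positive and negative examples for one target digit; the binary target values (+1 for the chosen class, 0 otherwise) are an assumption made for illustration, not a detail given in the text.

```python
import numpy as np

def one_vs_all_split(images, labels, target_class, n_total=1000, seed=0):
    """Balanced 50-50 sequence of positive (target_class) and negative examples."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == target_class)
    neg = np.flatnonzero(labels != target_class)
    idx = np.concatenate([rng.choice(pos, n_total // 2, replace=False),
                          rng.choice(neg, n_total // 2, replace=False)])
    rng.shuffle(idx)
    u = images[idx]
    # Assumed binary target: +1 for the chosen class, 0 for all others.
    y_target = (labels[idx] == target_class).astype(float)
    return u, y_target
```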

As a reference, for the first task the input data consists of binary pie-shaped headers. The Gaussian input beam is divided into nbits equal sections that encode the header. Although seemingly simple, these orthogonal patterns are quite convenient and allow us to reliably scale the dimensionality and complexity of our dataset and computational task. A more detailed description of the encoding used for the binary header can be found in refs. 32,33.

Comparison with conventional simulated annealing

We start by comparing our adaptive modified annealing algorithm to existing variants of simulated annealing, which we adapt to conform to hardware limitations. In order to simulate conditions similar to our experimental ONN, we train an extreme learning machine (ELM) of 100 neurons on the “one-vs-all” MNIST classification with ternary weights using different variants of SA. When applied in the context of deep neural network optimization as shown in ref. 67, SA is generally less efficient, reaching lower performance levels and taking longer to converge compared to backpropagation. However, it should be stressed that in this work, optimization constraints are different given that we have a discrete optimization problem, and search space dimensionality is significantly reduced to ~450 ternary or Boolean weights.

SA was originally designed for continuous optimization, as opposed to discrete problems such as our ternary-weight ONN. As such, we compare our algorithm with basic or “Vanilla” SA and a “Thermo-Adaptive” SA, whose pseudo-codes are given in Supplementary Algorithms 1 and 2, respectively. In contrast with continuous SA optimization, where mutations reflect Euclidean differences, our discrete configuration implies that changes between states resemble a Hamming distance, essentially corresponding to the number of discrete transformations (or “flips”) required to move from one configuration to another.

In its “Vanilla” version, a random weight matrix is generated from scratch at each epoch. This entails that nmirrors is fixed and always equal to the number of parameters in the weight matrix. Parameter configurations that yield reduced errors are always kept, while those that yield worse errors can be kept with a probability dependent on the temperature, which decreases in time with a fixed schedule according to a cool-down rate β < 1, i.e., Tk+1 = β Tk, where k is an index keeping track of optimization steps or epochs. As a consequence, the “Vanilla” SA algorithm has two hyperparameters, β and the initial temperature T0. The “Thermo-Adaptive” version builds upon the simpler SA algorithm by adapting nmirrors over time with a fixed schedule nmirrors = α(1 − k/nepochs) + 1, adding an additional hyperparameter α.
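For reference, a compact sketch of these two baselines as described above is given below. The supplementary pseudo-code is not reproduced here; in particular, the Metropolis acceptance rule exp(−ΔE/T) is an assumed form of the temperature-dependent acceptance probability.

```python
import numpy as np

def vanilla_sa(forward_pass, shape, t0, beta, n_epochs, seed=0):
    """'Vanilla' SA: a completely new random ternary matrix is drawn each epoch;
    worse configurations are accepted with probability exp(-dE/T) (assumed),
    and the temperature cools as T <- beta * T."""
    rng = np.random.default_rng(seed)
    w = rng.integers(-1, 2, size=shape)
    err, temperature = forward_pass(w), t0
    for _ in range(n_epochs):
        w_new = rng.integers(-1, 2, size=shape)        # all weights redrawn
        err_new = forward_pass(w_new)
        if err_new < err or rng.random() < np.exp(-(err_new - err) / temperature):
            w, err = w_new, err_new
        temperature *= beta
    return w, err

def thermo_adaptive_sa(forward_pass, shape, t0, beta, alpha, n_epochs, seed=0):
    """'Thermo-Adaptive' SA: only n_mirrors weights are mutated per epoch, with
    n_mirrors shrinking on the fixed schedule alpha * (1 - k / n_epochs) + 1."""
    rng = np.random.default_rng(seed)
    w = rng.integers(-1, 2, size=shape)
    err, temperature = forward_pass(w), t0
    for k in range(n_epochs):
        n_mirrors = int(alpha * (1 - k / n_epochs)) + 1
        w_new = w.copy()
        idx = rng.choice(w.size, size=n_mirrors, replace=False)
        w_new.flat[idx] = rng.integers(-1, 2, size=n_mirrors)
        err_new = forward_pass(w_new)
        if err_new < err or rng.random() < np.exp(-(err_new - err) / temperature):
            w, err = w_new, err_new
        temperature *= beta
    return w, err
```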

Figure 4a shows the convergence behavior in terms of NMSE averaged across all classes in our “one-vs-all” MNIST classification problem for an ELM of 100 neurons using ternary weights with three different versions of SA, namely “Vanilla” (green), “Thermo-Adaptive” (pink) and our SA algorithm (blue). As a reference for convergence speed, we also show the baseline of a backpropagation-trained ELM trained with full weight resolution. While “Vanilla” SA can initially improve the error, it is unable to reach low error levels, converging to an average NMSE of 0.76 because the changes in the weight configuration are too extreme, since all weights are changed at each epoch. The other two versions of SA are able to overcome this effect through an adaptation of nmirrors; as a result, both algorithms reach similar levels of performance at NMSE = 0.35. Crucially, our algorithm reaches the same performance as the “Thermo-Adaptive” SA with close to an order of magnitude improvement in convergence speed. Moreover, our algorithm only requires tuning a single hyperparameter, which makes it much easier to set up.

Fig. 4: Comparison between different simulated annealing (SA) algorithms.

a Convergence of different SA algorithms on a ternary output layer of an ELM with 100 neurons. The “Vanilla” (green), “Thermo-Adaptive” (pink), and our SA variant (blue) are shown and compared with a backpropagation baseline training of the ELM using full resolution weights (grey). b Mean “one-vs-all” accuracy averaged across all classes of the MNIST dataset for the different algorithms shown in (a).


Hardware learning

We can now proceed to apply our adaptive SA algorithm to optimize the experimental ONN, beginning with an investigation of the effect of α on convergence behavior and performance. Figure 5 shows the impact of α on the performance and convergence for a 4-bit header recognition task. The error decreases faster for an optimal value of α = 10, yet higher values of α lead to a more unstable learning trajectory. Indeed, flipping more mirrors is akin to bigger steps taken in the search space at each epoch; as such, α behaves much like a learning rate in the context of gradient descent. Moreover, the errors reached by optimal values of α are lower. With α = 10 we have NMSE = 0.25 compared to 0.5 for the original algorithm with α = 0. This shows that the adaptive strategy is a simple yet powerful modification to the original algorithm42. We should note that α is task dependent. Leveraging this improved strategy, we are able to increase performance and reach a symbol error rate of 1.5% for a 6-bit header recognition task.

Fig. 5: Impact of hyperparameter α on the learning performance for a 4-bit header recognition task.

a Heatmap of NMSE across epochs and learning rates, illustrating how the choice of α influences convergence dynamics. b Final NMSE, averaged over the last 100 epochs, as a function of α, highlighting an optimal value around α = 10. c NMSE convergence comparison for α = 0 and α = 10, demonstrating that optimal α values lead to improved performance and faster convergence.


Compared to the synthetic binary header recognition benchmark, the MNIST task provides a more challenging and genuine image classification problem. Here we study the influence of α on convergence using one-vs-all classification of digits 8 and 9, the hardest in the dataset. Generally speaking, the same trend appears, with optimal α values in the range 5 ≤ α ≤ 20. Choosing optimal α values results in faster convergence and better performance, allowing the ONN to reach ~85% accuracy for α = 20 instead of ~81% for α = 2, while converging in ~200 epochs as opposed to ~600.

Our adaptive search strategy offers significant gains both in terms of performance and convergence speed, with minimal computational overhead and no changes to the existing hardware, while being compatible with both Boolean and ternary weights. Finally, in the context of RC or ELM, training of the linear output weights is often realized via a “single-shot” matrix-inversion approach as follows:

$${\mathbf{W}}^{\mathrm{out}}={\mathbf{X}}^{\dagger }\,{\mathbf{y}}^{\mathrm{train}},$$
(3)

where † denotes the Moore-Penrose pseudo-inverse, X is the so-called state collection matrix that contains the response of all nodes or neurons to each input example, and ytrain is the training set target for each corresponding input example. It is therefore interesting in the context of hardware learning to compare the computational cost of matrix-inversion training with that of iterative methods such as the one presented in this work. While it is true that training via matrix inversion is potentially realized in a single step, the physical nature of a scheme such as the one presented here requires measuring the response of each node individually to obtain all information to construct the state collection matrix. The associated measurement process scales with the number of neurons, nneurons = 450 in our case. From results presented in Fig. 6 we can see that our SA-inspired algorithm converges in ~200 epochs; moreover, convergence scales with network size as previously shown in ref. 42. As a consequence, training via matrix inversion and via our iterative method have comparable computational overheads in principle, since both scale linearly with network size. However, offline weight computation using matrix inversion is very likely not going to correspond with high accuracy to the physically detected output signal. For that, various optical effects such as diffraction at the DMD, aberrations by the imaging optics, and detection nonlinearity have to be included. The impact of these effects will differ for each Wout, rendering matrix inversion for ‘single-shot’ training of physical hardware likely inapplicable.
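For completeness, the “single-shot” training of Eq. (3) amounts to one pseudo-inverse (or, with regularization, ridge-regression) step on a state collection matrix X with one row per input example and one column per neuron; a minimal sketch follows.

```python
import numpy as np

def train_pinv(X, y_train, ridge=0.0):
    """Offline output weights via Eq. (3). X: (n_examples, n_neurons) state collection
    matrix, y_train: (n_examples,) targets. ridge > 0 gives Tikhonov-regularized weights."""
    if ridge == 0.0:
        return np.linalg.pinv(X) @ y_train
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(n), X.T @ y_train)
```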

Fig. 6: Impact of tuning hyperparameter α on the learning performance for classification of digit 8 in the MNIST dataset.

a Heatmap of NMSE across epochs and learning rates, illustrating the effect of α on convergence and performance. b Final classification accuracy [%], averaged over the last 100 epochs, as a function of α, showing a peak performance near α = 10. c Comparison of NMSE convergence for α = 2 and α = 20, highlighting that appropriate tuning of α results in better accuracy and faster error reduction.


The merit of our algorithm is that, just like other model-free methods, it allows optimization based solely on measurements of the output data y and hence offers the potentially lowest implementation complexity. In addition, relying only on simple random mutations does not involve heavy computations while being specifically designed to handle discrete problems. Furthermore, we found it to be more efficient than traditional SA methods in terms of performance and convergence speed. Unlike methods such as those described in refs. 36,39, where a digital twin or augmented direct feedback alignment are employed to estimate gradients or error signals, our approach does not require access to, or knowledge of, internal variables of the physical system such as neuron activations, which can be prohibitive, as LA-VCSELs are notoriously complex to probe and simulate68,69. This highlights that in numerous physical ANN implementations, model-free optimization or physical backpropagation may be the only realistically available training options.

Benefits of ternary weights

We can experimentally quantify the net gain following the implementation of ternary weights in our ONN. Table 1 and Supplementary Fig. 2 summarize the classification performance for one-vs-all classification on the MNIST dataset across various systems. We compare Boolean and ternary weights in our ONN, alongside a digital linear classifier and an ELM with 400 neurons trained via ridge regression on the same dataset.

Table 1 Comparison of mean test accuracy across methods

As a reminder, we used training and testing sequences u of length N = 1000 images. Crucially, Boolean weights perform poorly, achieving on average  ~ 83.5% classification accuracy on the testing dataset. In contrast, with no physical changes to our experimental setup, we can achieve 90.4% on average with ternary weights, approaching the digital linear limit of  ~ 91.8%.

To contextualize these results, the performance of an ELM with 400 neurons was also evaluated. When trained with ridge regression using full-precision weights via the pseudo-inverse method (pinv), the ELM achieves the highest accuracy of 93.2%. Interestingly, when constrained to ternary weights, the ELM’s performance remains nearly identical at 93.18%, indicating that increased weight resolution is not needed when using the 400 neuron ELM for this present task. This could be because the ELM compensates for low resolution with its high neuron-count. A similar effect is reported in ref. 52, where software ternary neural networks are made “wider”, i.e., more neurons per layer are used to counteract the limited resolution.

The performance of our ONN with the LA-VCSEL switched off was also measured with ternary weights as a reference and reaches ~87.2%. This corresponds to a more linear hardware system comprised only of the MMF as a passive linear mixing element and the absolute-square nonlinearity of optical detection. Yet, as shown in our present measurement, this nonlinearity is not sufficient and cannot result in performance close to a digital linear classifier. Interestingly, our ternary-weight ONN with the LA-VCSEL switched off performs better than when the LA-VCSEL is on and using Boolean weights, which indicates that at this very low end of resolution, weight resolution is a severely limiting factor. Conceptually, because they are positive, Boolean weights cannot exploit nodes or neurons whose responses are anti-correlated with the desired target. The respective weights for these nodes are set to 0 when using Boolean weights, which effectively restricts the dimensionality of our ONN. In addition, we performed simulations comparing an ELM of 100 neurons trained with our annealing-inspired algorithm in three different output weight configurations: Boolean 0/1; binary −1/+1; and ternary −1/0/+1, as shown in Supplementary Fig. 1. The possibility of having 0 even when negative weights are available clearly improves performance in certain architectures. Boolean weights achieve on average 88.6%, binary positive and negative weights, i.e., −1/+1, achieve 90.2%, while ternary weights achieve 91.2%, showing that there is a significant improvement when simply adding a “0” level.

This comparison highlights how crucial weight resolution and weights with a sign can be. Although the performance improvements when using ternary weights are substantial, our ONN still falls short and cannot outperform a digital linear classifier. It therefore follows that our next step should be to further improve our scheme in order to overcome the digital linear limit. A rather non-invasive way of improving our current experiment would be to increase the imaging magnification of the LA-VCSEL on DMDb by changing L5. Doing so would effectively provide higher-resolution weights since every LA-VCSEL speckle would be significantly oversampled, allowing for the creation of more fractional weights using super-pixels. Another way would be to replace DMDb with a liquid crystal on silicon spatial light modulator, which usually provides at least 8-bit resolution. Furthermore, the development and exploration of more advanced hardware-compatible learning algorithms, whether in the form of model-based or model-free black-box algorithms, could unlock the potential of this enhanced weight resolution and requires further investigation.
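To illustrate the super-pixel idea mentioned above: grouping k × k DMD mirrors into one logical weight yields k² + 1 evenly spaced non-negative levels (and, combined with the ternary scheme, signed levels from −1 to +1 in steps of 1/k²). The snippet below is purely conceptual.

```python
import numpy as np

def superpixel_levels(k, ternary=False):
    """Weight levels reachable by a k x k super-pixel of binary mirrors.
    With the ternary scheme each mirror contributes -1, 0 or +1 instead of 0 or 1."""
    n = k * k
    if ternary:
        return np.arange(-n, n + 1) / n      # 2*k**2 + 1 signed levels
    return np.arange(n + 1) / n              # k**2 + 1 non-negative levels

print(superpixel_levels(2))                  # [0. 0.25 0.5 0.75 1.]
```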

Long-term stability characterization

The ternary-weight ONN described above performs all processing steps online, reducing the need for additional offline resources. At the same time, due to its full hardware implementation, our system is prone to drifts. This makes long-term stability and robustness against drift crucial in our setup. These drifts are significantly mitigated through standard proportional-integral-derivative (PID) temperature control of the LA-VCSEL and mechanically securing the injection MMF in place. In our long-term characterization, we find that, after learning has converged, the system maintains stable performance for several hours, exhibiting only gradual performance degradation rather than abrupt drops. This stability suggests that continuous online learning can effectively counteract these slow drifts, maintaining consistent system performance over time.

In order to quantify the impact of physical drift on our ONN’s performance, we first conducted an optimization loop for 100 epochs, training the ONN to classify digit 0 of MNIST. After convergence to a low error, we kept the output weights fixed and continuously monitored the output of the system every 10 s over a period of 10 h. We were therefore able to measure long-term physical drifts and their impact on computing performance in our system, see Fig. 7a. Surprisingly, the error remains effectively constant over 10 h with no noticeable drift in terms of performance. Finally, Fig. 7b shows the raw output of our setup during these 10 h: the dark blue line shows the first response at the beginning of the 10 h, and the lighter blue lines show the subsequent responses over the duration of the experiment. The correlation between responses, i.e., consistency70, after learning is ~99.3%, showcasing how even with our simple control schemes we can greatly minimize instabilities in our system.

Fig. 7: Long-term stability characterization after learning.

a NMSE learning curve; after convergence, output weights are fixed and the output is measured every 10 s over a period of 10 h. b Output time traces over 10 h, the first output is shown in bold while following outputs are in a lighter shade. The correlation between responses is ~99.3%.


It should be noted that, because the inference speed is orders of magnitude slower than the dynamics of the LA-VCSEL, we do not measure the dynamical stability of our device on the timescale of its inherent dynamics. We rather measure a time-averaged or steady-state output of the LA-VCSEL, which together with injection locking further increases stability. In addition, we use a short MMF with a length of 1 m, which is inherently less susceptible to variations in its output speckle pattern compared to longer fibers. Indeed, longer MMFs are more vulnerable to external perturbations such as vibrations, temperature fluctuations, and mechanical stress. Moreover, the use of a DMD, whose pixels have only two possible mechanical positions, is inherently less prone to drifts than analog-valued weights as would be implemented with a liquid crystal on silicon spatial light modulator. Finally, the entire experimental apparatus is built around a cage system, preventing individual components from drifting independently.

Conclusion

We significantly improved upon our previous LA-VCSEL based ONN by demonstrating an efficient and low-complexity method to implement ternary weights that is broadly compatible with Boolean-weight-based RC. Crucially, we report significant improvements in performance without physical changes to our experimental setup.

In addition, we introduced an improved in-situ optimization algorithm that is compatible with both Boolean and ternary weights. We provided a detailed hyperparameter study of this algorithm for two different tasks, and experimentally verified its benefits in terms of convergence speed and performance gain, resulting in an increase of 7% on average when going from Boolean to ternary weights. The scalability of our algorithm to a potentially trainable input layer remains an open question, and we are actively exploring this direction for future work. We also confirmed that weight resolution is the main limiting factor in our relatively low neuron-count ONN, offering a clear avenue for future improvements. Finally, we experimentally characterized the long-term inference stability of our ONN and found that it was extremely stable, with a consistency above 99% over a period of more than 10 h. Our work is of high relevance in the context of in-situ learning, particularly under restricted hardware resources.
