A neural network-based synthetic diagnostic of laser-accelerated proton energy spectra

Introduction

Particle accelerators enable researchers to explore the fundamental structure of matter and develop new technologies for medicine, industry and defense. Laser-plasma-based accelerators generate ultrashort pulses of high-energy particles over short distances, driven by ultra-high field gradients^1,2. This enables the development of compact particle accelerators capable of delivering ultra-high dose rates. Proton and ion beams produced by laser-driven accelerators have potential applications in medical imaging and FLASH radiotherapy^3,4,5. Realising these applications requires new approaches to optimization and control of the acceleration process and beam properties.

The application of machine learning (ML) in laser-plasma accelerator development is an emerging and dynamic area, providing innovative solutions to critical challenges in optimization and control. Early demonstrations include Bayesian optimization approaches in electron and ion acceleration experiments^6,7,8 and modeling^9,10, in addition to deep neural networks (DNNs) for rapid deconvolution of laser-plasma diagnostics^{11,12,13,14,15}, diagnostics of intense laser focal parameters¹⁶, modeling laser wakefield acceleration of electrons^17,18, laser-solid interactions¹⁹ and ion acceleration^20,21,22,23. While DNNs have previously been used to analyze experimental laser-driven ion diagnostics, their application to predicting ion beam properties has thus far been restricted to models trained solely on simulation data. Despite advancements in laser-plasma ion source development, challenges such as optimization, stabilization and real-time diagnosis remain, where ML can play a transformative role. The growing availability of experimental data from high-repetition rate systems is accelerating greater integration of DNNs in this field.

In the realm of laser-plasma-based ion acceleration, Target-Normal Sheath Acceleration (TNSA) is the most reproducible mechanism, producing short ion pulses with energies reaching tens-of-MeV/nucleon^3,4. This method involves irradiating a thin foil target with relativistic laser intensities ( > 10¹⁸ Wcm⁻² for ~ 1 μm light) to produce fast electrons, generating a strong electric field (TVm⁻¹) for ion acceleration. Stabilizing these ion sources and consistently reproducing desired ion beam properties is inherently challenging due to their sensitivity to shot-to-shot variations in laser pulse parameters. Correlating output ion beam properties with input parameters is further complicated by the interplay of numerous interrelated factors. Direct experimental measurement of beam parameters is typically possible only when the ion beam is not being used for irradiation applications or when only a small portion of the beam is needed for that purpose, as direct characterization can disrupt the beam properties. Key applications in radiobiology^24,25 and radiation damage studies often require capturing and focusing the entire ion beam to deliver the required dose. This requirement complicates simultaneous measurement and application, highlighting the need for reliable modeling approaches. While high-fidelity particle-in-cell (PIC) simulations can model these interactions, they are time-consuming, necessitating simplified empirical models.

Recent advancements in high-power laser repetition rates and rapidly replaceable targets—such as tape-driven foils, droplets, clusters and cryogenic or liquid sheets—enable thousands of laser shots in short experimental campaigns²⁶. This capability facilitates extensive parameter scanning, generating large datasets for developing empirical models and training DNNs, which are particularly adept at modeling complex, nonlinear problems.

Here, we introduce a DNN-based synthetic diagnostic of the TNSA-ion energy spectrum, trained and validated with experimental data. A dense feed-forward network, combined with a β-variational autoencoder²⁷ (β-VAE) predicts the ion beam spectrum and its uncertainty using the laser input parameters and secondary laser-plasma interaction diagnostics. The β-VAE excels in dimensionality reduction to a smooth, interpolatable latent space, enhancing prediction accuracy even with limited, high-dimensional training data. This synthetic diagnostic approach offers real-time insights into ion beam properties without disrupting the beam, facilitating improved operational control.

Results

Model input data

The experimental data was obtained from 662 laser-target interactions using the Gemini high power laser²⁸, with 80% of the dataset used for hyperparameter tuning and model training, and 20% reserved as a holdout test set for performance estimation. The experiment, depicted schematically in Fig. 1, is described in detail in Methods.

**Fig. 1: Experimental schematic and model process.**

The objective of the DNN is to predict the energy spectrum of the beam of protons accelerated during the laser-target interaction. Therefore, the energy spectrum was measured for each interaction. The DNN inputs include the on-shot laser energy (E_L), pulse duration (τ_L), focal spot radius (r_L), the percentage of laser energy within the focal area (E_L), and the target distance from the focal point (Δz). In addition, the temporal offset of a ‘preheater’ laser pulse (Δt) is included. This preheater laser pulse generates a preplasma at the target front surface, which can significantly influence the evolution of the subsequent main pulse interaction^29,30.

As an in-situ diagnostic of the interaction, we measured the on-shot backreflected light from the laser-plasma interaction using a spectrometer, along with cameras filtered for both the laser frequency (ω) and the second harmonic (2ω). This reflected light inversely correlates with the energy absorbed during the interaction and is expected to encode information about the proton energy spectrum³¹.

The processed backreflection images and spectra for each shot consist of over 300 × 300 and 1700 pixels, respectively. The high dimensionality of this input data presents a significant challenge for a DNN, known as the curse of dimensionality³², especially given the limited training data available. To address this, we reduce the dimensionality by calculating the mean (M), standard deviation (S) and skewness (α) of the pixel intensity distributions in the images and spectra, which removes explicit spatial information. Both higher-order moments and image moments—where the images (or spectra) are treated as distributions to preserve spatial or spectral information³³—were tested during hyperparameter tuning. These methods led to poorer model performance, suggesting that these details do not contribute effectively to the model’s predictive capability.

The nine backreflection moments—comprising the mean, variance, and skewness of the backreflection spectra, laser-frequency-filtered images, and second-harmonic-filtered images—along with the six laser-target parameters, serve as the lower-dimensional inputs to the network. Since the proton spectra are high-dimensional arrays of size 2057, both dimensionality reduction and an inverse reconstruction process are required; these are discussed in the next section.

Model pipeline

Following the methodology in references^18,34,35,36, we employed a β-VAE for the dimensionality reduction of proton spectra. Autoencoders are composed of an encoder and a decoder. The encoder maps the input data to a lower-dimensional ‘latent space’ representation (LSR), while the decoder reconstructs the input data from this encoded representation. The β-VAE enhances this framework by having the encoder generate parameters of a Gaussian probability distribution instead of deterministic values, as illustrated in Fig. 1. By sampling from this distribution and incorporating a loss term that constrains it to approximate a standard normal distribution, a smooth latent space is learned³⁷. This process is regulated by the hyperparameter β, which weights the importance of latent space smoothness to the reconstruction error (see the β-VAE architecture section in Methods).

The β-VAE’s encoder reduces the dimensionality of the proton spectra by generating a mean LSR (μ) for each spectrum in the training dataset. A feed-forward DNN, or ‘predictor’, is then trained to predict the LSR from the laser parameters and backreflection moments. The decoder subsequently converts these predicted encodings into proton spectra, as illustrated in Fig. 1. The smoothness of the latent space imposed by the β-VAE architecture allows for reasonable reconstructions, even for slightly misaligned latent space predictions. This predictor-decoder pair therefore forms a synthetic diagnostic for laser-driven protons based purely on input laser parameters and backreflection measurement.

Uncertainty quantification

Quantifying uncertainties is an essential aspect of applying ML models to practical applications³⁸. ‘Aleatoric’ uncertainty stems from inherent randomness or unaccounted variations in the input and output data, such as uncorrected drifts in laser parameters during operation. In contrast, ‘epistemic’ uncertainty originates from limitations in the model design or the size and diversity of the training dataset³⁹. Aleatoric uncertainty of predicted spectra is estimated by the predictor network under two assumptions. First, the LSR uncertainty is a multivariate normal distribution. Second, each latent space dimension is statistically independent. The use of the β-VAE architecture supports these assumptions, and they are validated by the model’s performance in the Uncertainty prediction accuracy section. The predictor network outputs a mean value and the natural logarithm of the standard deviation (to ensure positive values) for each latent space dimension. The network is trained using a negative log-likelihood loss function, optimizing the mean and standard deviation prediction simultaneously⁴⁰. Sampling multiple times from the predicted distribution and decoding with the β-VAE decoder generates an empirical distribution of spectra, representing an estimate of aleatoric uncertainty.

For the ‘average’ prediction of a predictor-decoder pair, the predicted mean LSR ((hat{mu }) in Fig. 1) can be decoded directly or multiple samples from the predicted latent space distribution can be decoded and averaged. While the latter approach offers slightly better prediction accuracy, decoding the predicted mean directly is faster and thus more suitable for the high repetition rates of modern laser-driven ion sources. Therefore, the average proton spectrum is obtained by directly decoding the predicted mean LSR.

A widely used technique for estimating epistemic uncertainty and enhancing prediction accuracy is ensembling⁴¹. This involves training multiple neural networks and combining their outputs, ideally with diversity introduced during training, e.g., via varied hyperparameters or training data. During hyperparameter tuning, we found that β-VAE performance was not a limiting factor, so ensembling efforts focused on the predictor networks.

An ensemble of 100 predictor-decoder pairs was trained, consisting of 4 β-VAE decoders, each paired with 25 predictors. Diversity was introduced by training each member on random subsets of the training data, as detailed in the Ensembling section in Methods.

This ensemble, hereinafter referred to as the synthetic diagnostic ensemble (SDE), generates an uncertainty distribution for a predicted proton spectrum, incorporating both aleatoric and epistemic uncertainty. Multiple samples are taken from each predictor’s latent space distribution and decoded with the paired decoder, producing a distribution of spectra from which statistical information, such as confidence intervals, can be derived.

We use the median of the average predicted spectra from all ensemble members as the SDE prediction. This approach is computationally efficient, with only a 0.1% higher error than the best performing method, enabling rapid generation of average spectra while retaining uncertainty estimation through sampling.

Occasionally, the SDE generates spectra which contain non-physical negative values, particularly in low signal spectra. These values are set to zero, as they likely stem from poorly trained regions of the latent space, leading to unphysical output spectra after decoding. When producing the SDE average prediction, it was found that eliminating negative values either before or after calculating the median made negligible difference. In the implemented code, this clipping is performed after the median calculation.

Proton spectrum and total flux prediction accuracy

Since the ion spectrometer used in this experiment was not calibrated for absolute proton number, absolute error metrics are not particularly insightful for assessing the SDE performance. Instead, we use the percentage of unexplained variance (PUV), a relative error metric, calculated as the ratio of the mean squared error of the model predictions to the variance of the dataset, indicating the percentage of variance not explained by predictions (see the Metrics section in Methods). A naive model that always predicts the mean spectrum of the test dataset, irrespective of the inputs, would have a PUV of 100%. The SDE predictions have an average PUV of 13.5%, indicating high predictive power.

We also evaluated the reconstruction performance of the 4 β-VAEs in the SDE, finding an average PUV of 1.9%. The smaller reconstruction error compared to prediction errors indicates that most inaccuracies arise from the predictors’ ability to estimate the LSRs, rather than from the β-VAE reconstruction process.

To visualize model performance, Fig. 2a shows the measured spectra, the mean reconstructions by the β-VAEs and the SDE predictions for the test dataset (20% of the total dataset). These plots demonstrate that while β-VAE reconstructions are closer to the actual spectra, SDE predictions are accurate across most of the dataset.

Figure 2 b displays individual measured spectra, SDE predictions, and 1000 random samples from the predicted SDE distributions for the test dataset. This distribution estimates both aleatoric and epistemic uncertainty.

Interpreting the PUV of spectra is challenging due to nonlinear separation of proton energies in the ion spectrometer, which causes resolution to decrease as energy increases. Consequently, the mean squared error, and by extension, the PUV, is biased towards performance at low proton energies. To address this, we also evaluate the performance of the model in predicting the total flux of the spectra, Φ_P (the integral over energy), which appropriately weights energy bins based on their width. The PUV of these predictions is 16.4%, indicating low error across the spectral energy range.

For each spectrum, we sampled 1000 times from the SDE’s predicted uncertainty distribution to generate a Φ_P uncertainty distribution. The 68% confidence interval (16th to 84th percentile) is shown in Fig. 3a, plotted as a function of the measured Φ_P.

Figures 2 and 3a show good agreement between the average predictions and the actual values, within the predicted uncertainty range. The next section will explore the validity and practical value of these uncertainty predictions.

Uncertainty prediction accuracy

Whilst the uncertainty predictions of the SDE are plausible, their validity needs assessment. Calibration is a crucial aspect of uncertainty prediction, testing whether predicted confidence intervals (CIs) align with observed data. To evaluate this, we use a calibration curve, plotting the percentage of the test dataset (i.e. all energy bins of all spectra or all Φ_P values) that fall within a given predicted CI, against the corresponding CI size. Perfect calibration is represented by the line y=x, with values below the ideal line indicating overconfidence and values above indicating underconfidence^42,43.

We computed the percentage of test data within predicted CIs, centred on the median, from 1% to 99% in 1% steps. This is shown for both spectra and Φ_P predictions in Fig. 3b, displaying only small deviations from perfect calibration. Calibration error (CE), defined in the Metrics subsection in Methods, quantifies calibration quality, measured as the root mean square (RMS) deviation between the values of the CIs and the observed data within them. Our model achieves a CE of 3.35% for spectra and 5.8% for Φ_P predictions. We also calculate signed CE^44,45, which is positive for predominantly underconfident predictions. The signed CE is 2.1% for spectra and 1.8% for Φ_P predictions, suggesting that incorrect uncertainty estimates are generally conservative.

Additionally, we measure the sharpness of predicted distributions, defined in the Metrics subsection in Methods as the average RMS width of the predicted distributions. Lower values indicate narrower distributions, which are more insightful, assuming comparable calibration⁴⁶. The sharpness across predicted spectra distributions is 0.065, much narrower than the dataset’s average standard deviation of 0.19. Similarly, the predicted Φ_P distributions have a sharpness of 0.4 compared to a standard deviation of 1.2.

The high degree of calibration and sharpness of the predicted distributions highlights the power of this model architecture to be utilized as a synthetic diagnostic, capable of generating accurate and precise uncertainty distributions of on-shot spectra with training on a small dataset.

Training set size performance scaling

The SDE demonstrates promising performance with fewer than 700 training samples. DNNs typically improve with larger datasets, following a power law trend^47,48,49. This suggests potential error reduction with larger datasets, as planned for future experiments.

To extrapolate future performance, we retrained the SDE on random sub-samples of the full training dataset of varying sizes, repeating the process ten times to ensure robustness against data distribution changes and training variability. The spectra and Φ_P PUV, sharpness, and CE of the SDE predictions were evaluated after each training session.

The results, presented in Fig. 4a, show a substantial reduction in error as more data is added, following a power law trend. For example, with 3600 shots of training data, achievable in about an hour with a Hz-rate laser system, the PUV for predicted spectra is projected to reach ~6.5%.

**Fig. 4: Effect of training set size on various metrics.**

The PUV of the Φ_P predictions, shown in Fig. 4b, also follows a power-law trend but improves less significantly, reaching about 12.7% after 3600 training shots.

The sharpness of predicted spectra distributions decreases slightly with more data, from 0.07 with < 160 training samples to 0.065 with the full dataset, while the predicted Φ_P distribution sharpness remains almost unchanged, as shown in Fig. 4c. This indicates that adding more training data does not significantly sharpen the distributions. However, as previously noted, the distributions are already considerably sharper than the overall variation within the dataset.

The gradual improvement in the SDE Φ_P prediction error and negligible change of the Φ_P distribution sharpness can be attributed to a small number of, predominantly high-energy, high-flux, spectra where the model underperforms. In contrast, greater improvement is observed in overall metrics due to the bias towards low energy regions of the spectra, where predictions improve significantly. This is potentially driven by the low density of high-energy spectra within the dataset (as can be observed from Figs. 2a, 3a), limiting learning. Alternatively, it could indicate that the backreflection diagnostics do not encode enough information for precise prediction of these high-energy spectra, which may exhibit larger shot-to-shot fluctuations.

These findings indicate significant performance gains are possible with larger datasets and highlight steps for improved performance across a wide parameter range, particularly the need for a balanced dataset or additional diagnostics, supporting advancements in laser-driven proton experiments. Importantly, these results indicate that the scale of the datasets required for significant improvements are within reach on current and next generation Hz repetition rate laser systems.

Surrogate model

We also performed a dataset size scaling test on a full surrogate model (SM), which is analogous to the SDE but without on-shot backreflection, or other secondary diagnostic, measurements. This SM would enable both on-line, real-time output ion beam parameter predictions based on measured experimental input laser and plasma parameters, and off-line input parameter space exploration, facilitating rapid optimization.

As shown in Fig. 4a, b, the performance of the SM plateaus much more quickly than that of the SDE with the current model architecture. However, a more sophisticated model could potentially achieve enhanced performance with additional data.

Feature importance

In addition to the SDE’s ability to provide accurate, on-shot predictions of the proton spectra, we can also gain insights into the underlying physics through the interpretation of its outputs. While the black-box nature of DNNs poses significant challenges to interpretability^50,51,52, Permutation Feature Importance (PFI) provides a way to gauge the relative importance of specific inputs to the model performance. Originally developed for Random Forests⁵³, PFI is now widely used for neural networks^54,55,56. This method involves shuffling an input in the test set and measuring the change in prediction error, with large increases indicating strong dependency.

We applied PFI to the SDE ten times to account for random shuffling variation, as shown in Fig. 5, along with the standard deviation normalized to the range of each input parameter. The results highlight that laser pulse energy is the most critical feature for determining proton spectra, consistent with experimental studies⁵⁷. Related to this, encircled energy is also important, reflecting its role in determining the total energy transferred to protons⁵⁸.

**Fig. 5: Influence of input parameters.**

The preheater offset time emerges as the second most important factor. It determines the density profile of the expanding preplasma, which in turn influences the interaction between the main laser pulse and the plasma, and specifically the energy coupling to fast electrons²⁹. While existing analytical models of laser-driven ion acceleration do not fully account for the impact of the plasma density profile, our findings show it to be important, underscoring the advantages of this approach over more limited models. Focal spot size and Δz have some effect, though less significant, while pulse duration exhibits minimal influence within the tested range. As shown, these parameters are less critical compared to energy and preheater offset time, despite being varied similarly during the experiment over their tested ranges.

Applying PFI to the SDE motivates an approach to defining system performance requirements and designing datasets for future ML-based laser-driven ion acceleration experiments. These findings highlight the substantial benefits of ensuring detailed sampling and accurate measurement of laser energy compared to other parameters, such as pulse duration. Additionally, they highlight the necessity of developing a more precise diagnostic of the front surface plasma density profile, given its relative importance as indicated by the model.

Discussion

We have developed a DNN-based model that accurately predicts proton spectra from high-intensity laser-solid interactions while providing well-calibrated uncertainty estimates. The model employs an ensemble of predictor networks that use laser conditions and secondary backreflection measurements to predict reduced-dimensional spectra representations, which are subsequently decoded using β-VAE networks. By predicting Gaussian distribution parameters, the model generates robust uncertainty estimates that enhance prediction accuracy. The ensemble approach improves both uncertainty estimation and prediction error by combining diverse predictions from multiple predictor-decoder pairs, each trained on distinct subsets of the dataset.

This study presents a reliable framework for non-destructively determining the properties of laser-accelerated proton beams, enabling real-time diagnostics and optimization of future experiments. Traditional laser-plasma setups are constrained by bulky in-vacuum diagnostic systems, requiring large chambers and complex controls. Synthetic diagnostics offer a solution, significantly reducing system size and unlocking new applications (e.g., non-destructive testing and inspection), where source mobility and live diagnostics are essential⁵⁹.

Additionally, our methodology, incorporating techniques such as PFI, enhances our understanding of laser-driven ion acceleration and laser-solid interactions, providing valuable insights for guiding future experimental design and parameter optimization. The performance of the SDE demonstrates favorable scaling with increasing dataset size; however, the gradual improvement observed in the SDE Φ_P predictions highlights the importance of achieving a balanced distribution of spectra in future experiments to maximize performance with limited data. This observation also suggests the potential need for different diagnostics that capture a broader range of interaction information, enabling more precise predictions.

The occasional negative output values in SDE predictions have a minimal impact on model performance. However, this could be addressed more rigorously in future models by applying the rectified linear unit (ReLU) activation function to the decoder output instead of the current linear function. Implementing this change would necessitate retuning the model hyperparameters due to the significant change in the architecture.

The lack of increased prediction accuracy when applying alternative dimensionality reduction techniques to the backreflection data, including the preservation of spatial and spectral information through the use of image moments (as discussed in the Model input data section), may reflect unique characteristics of the backreflection diagnostic itself. A diagnostic more strongly correlated with the proton spectrum could potentially benefit from a dimensionality reduction approach that better preserves such correlations. Furthermore, potential enhancements to the model architecture could include adopting a non-parametric approach to predicting aleatoric uncertainty, removing the assumption that latent space representations must follow a Gaussian uncertainty distribution. This could result in improvements in calibration and sharpness⁶⁰.

Although the SM exhibits limited improvement with additional data, this may stem from model capacity constraints. Re-tuning hyperparameters or modifying the model architecture could enhance the predictor DNN’s accuracy, offering a method for efficiently and accurately scanning large parameter spaces and designing future experiments.

Since the model inputs consist of in-situ backreflection measurements from the interaction, the model is unlikely to generalize immediately to datasets from other experimental setups or laser facilities. However, transfer learning presents a promising avenue to address this limitation. By using a pre-trained SDE from one system, it may be possible to learn the complex correlations of a new system with reduced training time and smaller dataset requirements, enabling rapid adaption for future applications^61,62. This approach holds significant potential for current and emerging high-power laser facilities operating at 1 Hz repetition rate or greater, where an equivalent dataset size to that used in this work could be generated within tens of minutes. Consequently, training an SDE or surrogate model to diagnose and guide experiments could become a routine part of standard operating procedures.

In conclusion, the DNN architecture presented here provides accurate and insightful predictions of the spectral properties of laser-accelerated protons. This approach represents a significant advancement in non-destructive ion beam measurements and highlights the effectiveness of applying DNN models to laser-solid interactions. The technique has broad potential applications, ranging from diagnosing other radiation types (e.g., electrons, neutrons and x-rays) in high-power laser-plasma interactions to predicting outcomes and quantifying uncertainty in fields like nuclear fusion research and astrophysics, where high dimensionality and low data volumes present substantial challenges.

Methods

Experimental details

As illustrated in Fig. 1, pulses from the Gemini high power laser were focused to intensities up to 4.4 × 10²⁰ Wcm⁻² onto a 15 μm-thick copper tape-driven target. Prior to focusing, the pulses passed through a double plasma mirror system (not shown in the schematic). The timing of the arrival of a preheater laser pulse (focused to 1 × 10¹⁴ Wcm⁻²) prior to the main laser pulse, Δt, controlled the degree of plasma expansion. The parameter Δt was varied within the range [0.4, 3.68] ns (noting that when the preheater is not used, it is labelled as 0 ns in the dataset) using a motorized timing slide. Meanwhile, Δz, the spatial offset between the laser focus and the target front surface, was adjusted using motorized drives for the parabola, varying within the range of [−100, 100] μm. The energy of the preheater beam could not be measured for every shot during the experiment. Instead, it was characterized daily when the system was brought online. The energy variability is anticipated to be comparable to that of the main drive laser pulse, as both are derived from the same source. Since the preheater beam is not tightly focused, any potential wavefront changes are expected to have a minimal impact on the interaction. The wide range over which the preheater offset time was varied is likely the most significant parameter influencing changes in the preplasma.

Measurements of the laser pulse energy and resulting proton and optical radiation were made using a suite of diagnostic tools. The on-target laser energy, measured every shot using a calibrated calorimeter sampling energy from a laser pick-off, was within the range [0.65, 4.8] J. The laser focal spot was periodically imaged with the laser in low-power mode at 10 Hz. The focal spot radius (defined as the half-width at half-maximum) and the encircled energy within the focal radius were calculated for each image and found to be within the range [1.35, 2.11] μm and [21.5, 25.7]%, respectively (see the Data processing section). The pulse duration was adjusted within the range [36, 95] fs via an Acousto-Optic Programmable Dispersive Filter (Dazzler by Fastlite). Daily pulse duration measurements were performed over five shots using a GRENOUILLE⁶³, enabling both optimization and characterization of the pulse. The use of a double plasma mirror system required operations to be paused every 50 shots for plasma mirror replacement. Following each replacement, the laser focal spot was reoptimized and imaged to ensure consistency. This process effectively minimized the effects of drift in the laser wavefront. However, any unaccounted-for variations in the laser focus or spectrum could introduce noise into the focal spot or pulse duration input parameters.

The energy spectrum of the beam of accelerated protons was measured using a Thomson parabola ion spectrometer^64,65,66,67, which directed the ions onto a multi-channel plate with phosphor screen. The resulting light output was imaged with an Andor Neo CMOS camera. Backreflected laser light from the target front surface was measured by directing it onto a diffuse scatter screen. The scattered light was imaged at both the laser frequency (ω) and the second harmonic (2ω), in addition to the spectrum being measured across a range from 175 to 1169 nm.

Data processing

To extract the radii and encircled energies from the focal spot images, background reference images were subtracted and Gaussian curves fit in both x and y dimensions, centered on the image centroid. The focal spot radius was defined as the average half-width at half-maximum (HWHM) of these Gaussian distributions, and the encircled energy was calculated as the ratio of pixel intensity inside the area defined by the HWHM radius to the total intensity across the focal spot.

The accelerated ions are dispersed by the parallel electric and magnetic fields of the Thomson parabola spectrometer in parabolic traces, depending on their charge-to-mass ratio and energy. To reduce X-ray-induced noise, the images of the phosphor light were median-filtered, and custom code was used to extract the proton spectra from the ion traces by performing background subtraction and modeling the energy spread induced by the fields.

Images of the backreflected light were median-filtered, and a background level was estimated from regions outside the near-field, which was then subtracted; any negative pixel values set to zero. The images were cropped to match the near-field beam diameter and masked to eliminate pixels outside this area. The processed backreflection ω and 2ω images were saved as 320 × 328 and 326 × 334 arrays, respectively.

Backreflection spectra, stored as 1714-element arrays, were also median-filtered. Each spectrum had the mean of the lowest 10% subtracted to remove background noise and was clipped to ensure non-negative values.

Metrics

Several metrics are used to evaluate the performance of the model including the PUV, CE, and sharpness. These are defined as follows:

$${rm{PUV}}= , 100left(1-{R}^{2}right)\ = , 100frac{mathop{sum }nolimits_{i = 1}^{N}mathop{sum }nolimits_{j = 1}^{L}{({y}_{ij}-{hat{y}}_{ij})}^{2}}{mathop{sum }nolimits_{i = 1}^{N}mathop{sum }nolimits_{j = 1}^{L}{({y}_{ij}-{bar{y}}_{j})}^{2}}$$

(1)

where R² is the coefficient of determination and N is the number of spectra in the test dataset, each with length L. y_ij and ({bar{y}}_{j}) denote the jth element of the ith spectrum and the mean spectrum (defined as the mean value for each energy bin over all spectra in the test dataset), respectively. ({hat{y}}_{ij}) indicates the corresponding predicted value. The PUV, a commonly used performance metric in regression problems, is the ratio of the mean squared error of the model to the variance of the dataset, representing the percentage of the variance in the data not explained by the model. Alternatively, it may be interpreted as the error of the model relative to the error of a naive model that always predicts the dataset mean^{68,69,70,71,72}. For Φ_P predictions, the same formula applies, with L = 1 and ({bar{y}}_{j}) replaced with the mean Φ_P of the test dataset.

$${rm{CE}}=sqrt{frac{1}{99}mathop{sum }limits_{P=1}^{99}{({O}_{P}-P)}^{2}}$$

(2)

$${rm{Signed,CE}}=frac{1}{99}mathop{sum }limits_{P=1}^{99}({O}_{P}-P)$$

(3)

are the CE and signed CE respectively, where O_P is the percentage of the measured data (i.e., all measured energy bins of all spectra, or all Φ_P, in the test set) observed to be within a P% confidence interval^42,44,45. We have included an additional square root operation on the CE from the definition in ref. ⁴² to retain the units.

For a predicted empirical distribution with S samples, where each sample is given by ({widetilde{y}}_{ijs}), representing the uncertainty in the jth element of the ith predicted spectrum, the sharpness can be defined as

$${rm{sharpness}}=sqrt{frac{1}{NL}mathop{sum }limits_{i=1}^{N}mathop{sum }limits_{j=1}^{L}frac{1}{S}mathop{sum }limits_{s=1}^{S}{(, {widetilde{y}}_{ijs}-{bar{y}}_{ij})}^{2}}$$

(4)

Here, ({bar{y}}_{ij}) is the mean over s of ({widetilde{y}}_{ijs})^42,73.

During hyperparameter tuning, we used the Continuous Ranked Probability Score (CRPS) to evaluate model performance. CRPS integrates both calibration and sharpness into a single metric, making it effective for comparing the overall accuracy of predicted distributions^46,74. We estimated CRPS by calculating the average of the ‘pinball loss’⁷⁵ from the 1st to 99th quantiles of the predicted distributions⁷⁴.

Data splitting

To accurately evaluate the final model performance, a random 20% of the total data was set aside as a holdout test set, which was not used for data normalization, hyperparameter tuning, or model training. This approach ensures that the final performance metrics are not overly optimistic due to overfitting.

Each DNN model was trained on 90% of the remaining data, with the remaining 10% reserved as a validation set. This configuration enabled the use of the Keras EarlyStopping callback, which halts training if the validation loss does not improve after a specified number of epochs (the ‘patience’ hyperparameter), thereby preventing unnecessary training and overfitting⁷⁶. In addition, we used the ‘restore_best_weights’ parameter of EarlyStopping, restoring the model weights to those found on the training epoch with the lowest validation loss.

Data normalization

Data normalization, which has been shown to enhance model performance, especially in data-limited scenarios^77,78, was applied by mapping the training dataset inputs and outputs to the [0, 1] interval. The validation and test sets were scaled using the parameters obtained from scaling the training set, avoiding data leakage. Each β-VAE and predictor in the ensemble was trained on a different random sub-sample of the training data, necessitating unique scaling parameters for each ensemble member’s inputs and outputs. Predicted data was then scaled back to the original range to ensure consistent performance evaluation.

β-VAE architecture

The encoder and decoder of the β-VAEs are fully connected dense neural networks. All layers use rectified linear unit (ReLU) activation functions, except for the input and output layers, which are linear. During hyperparameter tuning, the hidden layers of both the encoder and decoder are constrained to be symmetric, with a maximum of 2 hidden layers each.

The β-VAE loss function for each proton spectrum during training is defined as follows⁷⁹:

$${{mathcal{L}}}_{beta -VAE}=frac{1}{2}mathop{sum }limits_{i=1}^{L}{({y}_{i}-{hat{y}}_{i})}^{2}+frac{beta }{2}mathop{sum }limits_{j=1}^{{L}_{{rm{z}}}}left({mu }_{j}^{2}+{sigma }_{j}^{2}-log ({sigma }_{j}^{2})-1right)$$

(5)

where L denotes the ‘length’ of the proton spectrum, y_i is the ith element of the ground truth spectrum, and ({hat{y}}_{i}) is the corresponding element of the reconstructed spectrum. L_z represents the length of the latent space, while μ_j and σ_j is the mean and standard deviation of the latent space generated by the encoder, respectively.

The first term on the right-hand side of equation (5) captures the reconstruction error from the decoder, whereas the second term represents the Kullback-Leibler (KL) divergence between the latent space distribution and the standard normal distribution.

The parameter β is a hyperparameter that regulates the trade-off between the reconstruction error and the KL divergence. Increasing β results in a smoother and more regularized latent space but may also increase the reconstruction error. During hyperparameter tuning, β was permitted to be set to 0, effectively disabling the KL divergence term. In this scenario, the encoder outputs only the mean LSR, reverting the β-VAE to a standard autoencoder.

Predictor architecture

The predictor network is a fully connected dense network that uses ReLU activation functions across all hidden layers, with the architecture allowing for up to three hidden layers during hyperparameter tuning.

As discussed in the Uncertainty quantification section, the predictor DNNs use a negative log-likelihood loss function to estimate the parameters of an independent multivariate normal distribution, reflecting the assumed distribution of latent space variables due to experimental noise. The loss function for each encoded spectrum is given by⁴⁰:

$${{mathcal{L}}}_{{rm{NLL}}}=frac{1}{2}mathop{sum }limits_{j=1}^{{L}_{{rm{z}}}}left(log ({hat{sigma }}_{j}^{2})+frac{{({mu }_{j}-{hat{mu }}_{j})}^{2}}{{hat{sigma }}_{j}^{2}}right)$$

(6)

where μ_j is the jth element of the encoded ground truth spectrum generated by a β-VAE encoder, and ({hat{mu }}_{j}) and ({hat{sigma }}_{j}) are the corresponding predicted mean and standard deviation of the predictor DNN, respectively.

Hyperparameter tuning

The combined β-VAE-predictor model has numerous hyperparameters that need to be tuned for optimal performance. Given the computational expense of training each DNN, we used Bayesian optimization via the scikit-optimize gp_minimize package^80,81,82.

Initially, we performed Bayesian optimization over broad parameter ranges for 221 iterations to identify regions of high performance. This was followed by a more focused optimization over narrower ranges for an additional 123 iterations. Hyperparameter tuning was conducted using k-fold cross-validation (k=10)^83,84,85, with nine folds used for training and one used for performance evaluation on every permutation of folds. The nine training folds incurred a further 90/10 random split into training and validation datasets for EarlyStopping. The β-VAE and predictor networks were trained for a maximum of 15,000 epochs, with EarlyStopping patience parameters tuned independently.

During tuning, the batch sizes for training the β-VAEs and predictors were set to 16, 32 or 64. To ensure uniform batch sizes, the training data for each fold was truncated to a multiple of 64. Additionally, the maximum moment order (MMO) of the backreflection diagnostics was treated as a hyperparameter, recognizing that not all moments may be essential for predictions. Including too many unimportant inputs could potentially reduce performance.

The full list of tuned hyperparameters along with their allowed ranges and optimal values found are shown in Table 1.

Table 1 Optimal DNN hyperparameters and tuning ranges

Full size table

We selected stochastic gradient descent as the optimizer for the networks during training because of its proven ability to achieve superior generalization performance compared to adaptive algorithms such as ADAM^86,87,88.

Given that the model predicts an uncertainty distribution rather than a single average spectrum, we needed a metric that evaluates the error across the entire distribution, not just the average. To achieve this, we generate 1000 samples from the predicted distribution for each ground truth spectrum and use an approximation of the CRPS. Minimizing this approximation enables us to fine-tune the model hyperparameters effectively, ensuring the model accurately captures the overall uncertainty distribution of the data.

Ensembling

During hyperparameter tuning, we found that the prediction performance of the predictor-decoder pairs was not constrained by the reconstruction error of the β-VAEs, as detailed in the section Proton spectrum and total flux prediction accuracy. Consequently, ensembling the β-VAEs was not essential, and we trained only four β-VAEs, each using a different 10% subset of the training data for validation and EarlyStopping.

For the predictor ensemble, each member is again trained on a random 90% subset of the training data, with the remaining 10% used as validation for EarlyStopping. In addition, a scan was performed to test the CE, sharpness and PUV of the ensemble predictions as a function of different techniques to induce variation between ensemble members. The two methods involved either randomly truncating or padding the training data for each member network to multiples of the batch size (after the validation split) and sampling these resized datasets with or without replacement. When sampling without replacement, padding was achieved by randomly appending samples from the training data, also in multiples of the batch size. This step was unnecessary when sampling with replacement, as it allows direct sampling from a dataset larger than the original. Consequently, both the fraction of the data used for training each network and its distribution were varied.

The optimal performance, in terms of prediction error, CE and sharpness, was achieved when each member was trained with the full training dataset, padded to a multiple of the batch size, and sampled without replacement. It remains unclear which method of padding or truncating is most effective in edge cases where the training data (after the validation split) is already a multiple of the batch size, as no padding or truncating would be applied in this scenario with the current design.

Furthermore, various definitions of the ‘average’ prediction were tested, including the mean and median of all ensemble members, as well as the mean and median of an empirical distribution generated from the ensemble. It was found that the median of the predicted distribution yielded the highest accuracy by a small margin. However, as noted in the Uncertainty quantification section, the median of the average prediction from each ensemble member is computationally more efficient while maintaining a similar error rate.