Machine learning empowered coherent Raman imaging and analysis for biomedical applications
Introduction
Biomolecules are the fundamental unit of molecular machinery, of which the distribution, concentration, dynamics, and interaction determine physiological processes. Tracking biomolecules in situ with a high spatio-temporal resolution has become the key to deciphering the mechanisms of life. Although conventional biochemical assays can measure most biomolecules, ensemble measurements result in limited temporal and spatial information. Besides, these methods require the collection of a population of cells, missing out on critical information hidden in a rare population or at a single-cell level. By tackling these challenges, imaging techniques promise to serve as an enabling approach for in situ measurement.
Molecular imaging techniques, such as magnetic resonance spectroscopy and positron emission tomography, allow the tracing of biomolecules with submillimeter-scale spatial resolution1,2. Meanwhile, optical imaging, especially fluorescence imaging, has been widely used throughout the life science and biomedical fields, as fluorophores or fluorescent proteins can be designed to target specific molecules and structures. Even so, for small molecule imaging, large and bulky fluorophores often interfere with the functions of the target molecules. In addition, due to photobleaching and cross-emission effects, multiplexed imaging to quantitatively track multiple targets is challenging3. Therefore, label-free or micro-tag-labelled single-cell imaging techniques provide a unique opportunity for biomedical research.
Vibrational spectroscopic imaging is one of the efforts to provide a non-destructive, label-free approach to tracking biomolecules with high spatial and temporal resolutions4. The label-free nature of this technique has made it attractive to clinical applications. As one of the spectroscopic modalities, Raman spectroscopy characterizes molecular vibrations by their inelastic photon scattering, thus providing information on the concentration and composition of chemical bonds in situ. The signal level is greatly improved through the development of coherent Raman scattering (CRS) microscopy5, allowing high-speed, selective visualization of biomolecules based on their vibrational spectroscopic features in living cells, tissues, and organisms with submicron spatial resolution6,7. However, the images obtained by CRS microscopy are often high-dimensional data such as hyperspectral, time-lapse, or volumetric datasets. Therefore, a major challenge arises from analyzing and processing these massive datasets into interpretable and actionable information.
Machine learning approaches, particularly deep learning, hold significant promise in addressing this challenge. The capability of deep learning to process complex datasets has been demonstrated extensively in cellular imaging8,9. Here, we will provide a review of how machine learning has been employed with different architectures for CRS imaging, from cellular to tissue level and from data acquisition and analysis to clinical diagnosis.
Basics of coherent Raman imaging and machine learning
CRS microscopy
The Raman scattering phenomenon, discovered by C.V. Raman in 1928 (ref. 10), is the inelastic scattering of photons by matter, in which photons exchange energy with molecular vibrations. Photons that lose energy to molecular vibrations are shifted to longer wavelengths, known as the Stokes shift, while photons that gain vibrational energy are shifted to shorter wavelengths, known as the anti-Stokes shift. The vibrational transitions of chemical bonds generate narrow Raman scattering peaks, which can be used as fingerprints to identify chemical species without labels. However, the main technical difficulty for biomedical applications comes from the weak signal level and the correspondingly long signal integration time.
CRS uses two synchronized laser pulses to enhance the Raman scattering signal coherently, boosting the signal by millions of times. A resonance occurs when the frequency difference between the pump beam (ωp) and the Stokes beam (ωS), namely ωp−ωS, is tuned to match the frequency of a Raman-active molecular vibration, Ω. The two most widely used CRS processes for microscopy are coherent anti-Stokes Raman scattering (CARS) and stimulated Raman scattering (SRS). First observed in 1965 (ref. 11), CARS measures the anti-Stokes signal (ωaS) generated from the interaction of the pump and Stokes beams with molecules in the specimen. SRS, on the other hand, is the signal extracted from the energy transfer between the pump and Stokes beams, a phenomenon first observed in 1962 (ref. 12). CARS imaging of living cells was demonstrated with femtosecond pulses and a collinear beam geometry in 1999 (ref. 13). SRS bio-imaging was shown in 2008 by using a high-repetition-rate picosecond laser source and a high-frequency modulation transfer detection scheme6,14. In addition to single-frequency measurement, hyperspectral CRS imaging, either by scanning the excitation wavelength or by multiplex detection over a broad spectral bandwidth, further resolves overlapping Raman bands in complicated biological specimens.
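The resonance condition ωp−ωS = Ω can be made concrete with a short calculation. The sketch below converts between laser wavelengths and the Raman shift in wavenumbers, and uses the standard CARS relation ωaS = 2ωp − ωS to locate the anti-Stokes signal. The 1040 nm Stokes wavelength is an assumed, illustrative value; the 2920 cm−1 target is the band mentioned in the figure captions of this review.

```python
def raman_shift_cm1(pump_nm: float, stokes_nm: float) -> float:
    """Raman shift (cm^-1) addressed by a pump/Stokes pair: 1e7/lambda_p - 1e7/lambda_S."""
    return 1e7 / pump_nm - 1e7 / stokes_nm


def pump_for_shift_nm(shift_cm1: float, stokes_nm: float) -> float:
    """Pump wavelength (nm) that tunes omega_p - omega_S to the requested Raman shift."""
    return 1e7 / (shift_cm1 + 1e7 / stokes_nm)


def anti_stokes_nm(pump_nm: float, stokes_nm: float) -> float:
    """CARS signal wavelength (nm) from omega_aS = 2*omega_p - omega_S."""
    return 1.0 / (2.0 / pump_nm - 1.0 / stokes_nm)


# Illustrative values only: a fixed 1040 nm Stokes beam (assumption) and the
# 2920 cm^-1 band used as the target vibration.
stokes = 1040.0
pump = pump_for_shift_nm(2920.0, stokes)
print(f"pump = {pump:.1f} nm, anti-Stokes = {anti_stokes_nm(pump, stokes):.1f} nm")
```

The anti-Stokes signal appears at a shorter wavelength than both excitation beams, which is why CARS can be separated from the excitation light with a simple optical filter, whereas SRS is detected as a small intensity change on one of the excitation beams.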
With the development of multi-color and hyperspectral CRS microscopy15,16,17,18, CRS imaging meets a series of challenges. First, the image acquisition speed is relatively slow, especially if hyperspectral measurement is needed with a high signal-to-noise ratio (SNR). Second, CRS imaging usually generates massive high-dimensional data such as 5D datasets (X, Y, Z, time, and vibrational energy dimensions), substantially expanding the complexity of the image analysis and visualization of the results. Third, rapid and precise diagnosis based on hyperspectral CRS images has been sought for clinical purposes. In summary, there is an urgent need for a robust CRS analysis method to produce easy-to-interpret outcomes.
Machine Learning towards imaging and diagnosis
Machine learning approaches have made breakthroughs for image interpretation and image analysis in biomedical applications19,20,21,22. Machine learning is the core of artificial intelligence, which refers to computers possessing intelligence imitating human beings to perform functions such as recognition, cognition, analysis, and decision-making. There are three main types of machine learning: unsupervised, supervised, and reinforcement learning23. The main difference between them is that the algorithms in supervised learning learn from well-labeled datasets to predict the output, whereas the algorithms in unsupervised learning look for hidden features from unlabeled or partially-labeled datasets and cluster them as the output24. Unlike the above, the data is not predefined in reinforcement learning, and the algorithms learn to react to an environment independently, following a trial-and-error method22,25. Widely used supervised methods include support vector machines (SVM)26, linear discriminant analysis (LDA)27, logistic regression (LR), linear classifiers28, k-nearest neighbors (kNNs)29, and decision trees30. Representatives of unsupervised learning methods include principal component analysis (PCA)31, autoencoders32, expectation-maximization33, self-organizing maps (SOM)34, k-means35, and density-based clustering36. Central algorithms in reinforcement learning include Q-learning37, Sarsa, actor-critic (AC), QV-learning, and AC learning automaton38.
Deep learning39 is a rising subset of machine learning methods based on artificial neural networks inspired by how humans think and learn. A convolutional neural network (CNN) is a deep learning model which operates through the convolution operation. Essential features in the input (such as images) are preserved in the feature maps generated from each convolution layer, forming a multi-layered feed-forward neural network by stacking many hidden layers in sequence. This type of architecture is highly sensitive to patterns, making it widely adopted in computer vision and image classification. U-Net40 is a type of CNN with a semantic segmentation model architecture. The architecture resembles the letter “U”, consisting of a contracting path (encoder) to capture context and a symmetric expanding path (decoder) that enables precise localization. U-Net extracts features at different levels (edge, shape, etc.) through multiple downsampling steps, which can provide the multi-scale feature information required by the target task. Then, the decoder fuses features layer by layer for decoding. Compared with the direct decoding of high-level feature information, the decoding process of U-Net integrates multilayer features and therefore has stronger feature expression ability. In addition, U-Net requires only a small amount of training data to achieve good performance. More recently, algorithms based on ResNet41 and DenseNet42 have been developed for image segmentation, cell counting, and denoising tasks, demonstrating the potential for broad application in biomedical image processing. By introducing a residual module, ResNet allows the network to learn the residual between input and output instead of directly learning the mapping relationship. This design mitigates the vanishing gradient problem in deep networks by bypassing some layers through “skip connections”, making it possible to train very deep networks.
This innovation of ResNet has greatly improved the ability of deep neural networks to handle complex tasks, especially in image recognition and classification. The core of DenseNet is that each layer is directly connected to all previous layers. Unlike the skip connections of ResNet, each layer of DenseNet receives the feature maps of all previous layers as input and outputs its own feature map to all subsequent layers of the network. This property makes DenseNet very powerful in feature extraction, especially in applications requiring fine-grained features, such as image segmentation and classification.
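The difference between the two connectivity patterns can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions: the `layer` function is a hypothetical stand-in (a linear map followed by ReLU) for a real convolution layer, and the weights are random rather than trained.

```python
import numpy as np


def layer(x, w):
    """Hypothetical stand-in for a convolution layer: linear map followed by ReLU."""
    return np.maximum(x @ w, 0.0)


def residual_block(x, w):
    """ResNet-style block: the layer only has to learn the residual F(x); output is x + F(x)."""
    return x + layer(x, w)


def dense_block(x, weights):
    """DenseNet-style block: every layer receives the concatenation of all earlier feature maps."""
    features = [x]
    for w in weights:
        features.append(layer(np.concatenate(features, axis=-1), w))
    return np.concatenate(features, axis=-1)


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))                      # 2 samples, 4 input features
identity_out = residual_block(x, np.zeros((4, 4)))  # zero weights: block reduces to identity
growth = 3                                       # features added per dense layer (assumed value)
dense_weights = [rng.normal(size=(4, growth)), rng.normal(size=(4 + growth, growth))]
dense_out = dense_block(x, dense_weights)
print(dense_out.shape)
```

With all-zero weights the residual block returns its input unchanged, which is the property that lets very deep stacks start from an identity mapping. In the dense block the feature width grows by a fixed amount per layer because earlier feature maps are reused rather than recomputed.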
Machine learning is a promising method for mediating fast and accurate medical diagnosis. In recent years, it has contributed to identifying tumors and neurodegenerative diseases from different radiological imaging modalities such as x-rays, computed tomography, magnetic resonance imaging, and positron emission tomography imaging43. Besides, owing to its ability to reduce the annotations required from experts and to handle noisy labels at the same time, machine learning has paved the way to optimize and improve the effectiveness of data usage in histopathology image analysis44. Overall, the combination of machine learning and medical imaging holds great promise for clinical applications.
The development of machine learning techniques has resulted in a broader biomedical application of vibrational imaging. Various machine-learning models were designed and optimized for extracting features from spectroscopic data, allowing accurate and rapid classification of target cells based on the spectroscopic signatures in biological systems45,46,47,48. These studies demonstrate the potential of machine learning for fast, high-sensitivity spectroscopic imaging for biomedical applications. Combined with imaging capability, machine learning is becoming more critical in CRS microscopy by improving image acquisition and multi-dimensional image analysis. These advances have led to the development of critical clinical applications. The machine learning algorithms for various purposes in CRS imaging are summarized in Table 1.
Machine learning-mediated data acquisition and processing
Denoising via machine learning to improve SNR for CRS imaging
The trade-off between SNR and imaging speed has been a fundamental limitation for microscopic techniques in general. Many endeavours have been made to prevent or reduce such a trade-off. The situation gets more complicated for CRS microscopy as there are more factors to account for, including SNR, speed, and spectral bandwidth, making it an even more challenging task. Although CRS imaging is capable of video-rate imaging of single Raman bands, it suffers from limitations in the speed of laser tuning rates, where SNR often determines the data acquisition time. It is anticipated that with machine learning, the physical limits restraining the relationship between these factors can be extended49.
As Raman scattering is an inefficient process, enhancing SNR has been one of the major tasks in technical development, and machine learning has made a significant contribution in this regard. Table 2 summarizes these methods for improving the SNR of CRS imaging. Low SNR situations often occur when the number of detected photons drops in deep-tissue imaging or when the imaging speed increases, limiting the pixel dwell time. The spectral total variation (STV) denoising algorithm reduces noise in spectral images by minimizing an objective function that combines data fidelity and image smoothness, considering both the spatial and spectral dimensions of the image to preserve image edges and spectral characteristics. As a result, the SNR of spectral images of diluted dimethyl sulfoxide solution was improved by 57 times, and that of living Caenorhabditis elegans (C.elegans) under SRS microscopy was improved by 15 times (Fig. 1a)50.
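The idea of an objective that balances data fidelity against smoothness in both the spatial and spectral dimensions can be sketched as follows. This is a generic smoothed total-variation denoiser solved by plain gradient descent, not the STV formulation of ref. 50; the weights, the smoothing constant `eps`, and the synthetic test stack are all assumptions chosen for illustration.

```python
import numpy as np


def tv_term(u, axis, eps):
    """Smoothed total variation along one axis, sum(sqrt(d^2 + eps)), and its gradient."""
    d = np.diff(u, axis=axis)
    mag = np.sqrt(d * d + eps)
    g = d / mag
    grad = np.zeros_like(u)
    lo = [slice(None)] * u.ndim
    hi = [slice(None)] * u.ndim
    lo[axis], hi[axis] = slice(0, -1), slice(1, None)
    grad[tuple(hi)] += g   # d[i] = u[i+1] - u[i], so u[i+1] receives +g[i]
    grad[tuple(lo)] -= g   # and u[i] receives -g[i]
    return mag.sum(), grad


def objective(u, f, weights, eps):
    """0.5*||u - f||^2 (data fidelity) + sum over axes of weight * smoothed TV."""
    value = 0.5 * np.sum((u - f) ** 2)
    grad = u - f
    for axis, w in enumerate(weights):
        tv, g = tv_term(u, axis, eps)
        value += w * tv
        grad = grad + w * g
    return value, grad


def tv_denoise(f, weights=(0.1, 0.1, 0.1), eps=1e-2, n_iter=200):
    """Gradient descent with a step below 1/L, where L bounds the gradient's Lipschitz constant."""
    step = 1.0 / (1.0 + 4.0 * sum(weights) / np.sqrt(eps))
    u = f.copy()
    for _ in range(n_iter):
        _, grad = objective(u, f, weights, eps)
        u = u - step * grad
    return u


# Synthetic stack with axes (spectral, y, x): one bright block plus Gaussian noise.
rng = np.random.default_rng(0)
clean = np.zeros((8, 16, 16))
clean[2:6, 4:12, 4:12] = 1.0
noisy = clean + rng.normal(0.0, 0.3, clean.shape)
denoised = tv_denoise(noisy)
rmse_noisy = float(np.sqrt(np.mean((noisy - clean) ** 2)))
rmse_denoised = float(np.sqrt(np.mean((denoised - clean) ** 2)))
print(rmse_noisy, rmse_denoised)
```

The first weight acts on the spectral axis and the other two on the spatial axes, so spectral and spatial smoothness can be tuned separately. Larger weights remove more noise at the cost of reduced contrast at edges and peaks.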

a Denoising algorithm (left) and denoised stimulated Raman scattering (SRS) results (right) by spectral total variation (STV) algorithm of diluted dimethyl sulfoxide solution and Caenorhabditis elegans. Scale bar, 10 μm. Reproduced with permission from ref. 50. b (Left) SRS images at 2920 cm−1 of fixed HeLa cells. (Right) Two-color (lipids-green, proteins-blue) SRS images of coronal mouse brain slice at a depth of 175 μm. VST variance stabilization transform, DL deep learning. Reproduced with permission from ref. 51. c Three denoising models (left) and denoised coherent anti-Stokes Raman scattering results of nerves (right). W5 WIN5R, DN DenoiseNet, N2N Noise2Noise. Reproduced with permission from ref. 52.
Apart from the traditional machine learning approach, deep learning algorithms based on U-Net were developed to improve SNR in SRS imaging49,51,52. With minor optimizations of the U-Net architecture developed previously by Ounkomol et al.53, a CNN was applied to denoise SRS data and improve the quality of biological images acquired under low SNR conditions, such as low laser power (Fig. 1b)51. Low and high SNR images were generated by SRS imaging of cells using low and high excitation powers, respectively. A small set of 40 low SNR images was used with a randomized 10/30 test/train split, using the high SNR images as the ground truth. The resulting peak signal-to-noise ratio (PSNR), root mean squared error (RMSE), and correlation coefficient (CC) values significantly outperformed those of other denoising algorithms, indicating high image fidelity and strong correlation between the prediction and the ground truth.
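For reference, the three image-quality metrics named above can be computed as in the sketch below. The definitions are the standard ones; the choice of `data_range` (the ground-truth dynamic range by default) is an assumption, since the cited study's exact normalization is not stated here.

```python
import numpy as np


def rmse(pred, truth):
    """Root mean squared error between a prediction and the ground truth."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))


def psnr(pred, truth, data_range=None):
    """Peak signal-to-noise ratio in dB: 20*log10(data_range / RMSE)."""
    if data_range is None:
        data_range = float(truth.max() - truth.min())  # assumed convention
    err = rmse(pred, truth)
    return float("inf") if err == 0 else float(20.0 * np.log10(data_range / err))


def correlation_coefficient(pred, truth):
    """Pearson correlation between the flattened prediction and ground truth."""
    return float(np.corrcoef(pred.ravel(), truth.ravel())[0, 1])


# Toy check: a prediction that differs from the truth by a constant offset of 0.1.
truth = np.linspace(0.0, 1.0, 16).reshape(4, 4)
pred = truth + 0.1
print(rmse(pred, truth), psnr(pred, truth, data_range=1.0), correlation_coefficient(pred, truth))
```

The toy case shows why the metrics are reported together: a constant offset leaves the correlation coefficient at 1 while RMSE and PSNR both register the error.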
Taking into account that the noise features remain the same throughout the depth of the tissue in SRS imaging, the application of the U-Net CNN to deep-tissue imaging was demonstrated by denoising SRS images of an ex vivo brain slice at imaging depths up to 175 μm (Fig. 1b)51. Denoising by deep learning models also enabled recovery of low SNR images obtained from high-speed imaging, which is especially critical for endoscopy. Among the deep learning models tested for denoising the CARS endoscopic images, Noise2Noise (N2N) showed the highest performance by improving the CARS endoscopic imaging of peripheral nerves by 5 times (Fig. 1c)52. N2N is a U-Net-based architecture that contains encoder (downsampling) and decoder (upsampling) parts, as well as skip connections between corresponding layers. A key advantage of N2N is that it does not require clean ground-truth images for training because it uses pairs of noisy images. The underlying signal in these pairs remains consistent, while the noise varies randomly. By minimizing the loss function between the noisy input and the noisy target image, the model learns to separate noise from the true signal. This approach allows N2N to achieve effective denoising while significantly reducing the need for clean, noise-free ground-truth datasets, which are often difficult to obtain, particularly in medical/clinical imaging. In conclusion, machine learning offers bright prospects for pushing data acquisition and processing in CRS imaging beyond their physical limits.
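Why training against noisy targets works can be seen in a toy example. The sketch below replaces the U-Net with a linear least-squares denoiser, an assumption made purely to keep the example small: each sample consists of four neighbouring noisy pixels that share one underlying value, and the same model is fitted once against clean targets and once against independently noisy targets.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_pixels = 20000, 4

signal = rng.normal(0.0, 1.0, n_samples)  # hidden clean value per sample
# Input: four noisy observations of the same underlying value.
noisy_input = signal[:, None] + rng.normal(0.0, 1.0, (n_samples, n_pixels))
# Target for the noisy-target fit: the same signal with independent zero-mean noise.
noisy_target = signal + rng.normal(0.0, 0.5, n_samples)

# Linear denoiser fitted against clean targets (normally unavailable in practice).
w_clean, *_ = np.linalg.lstsq(noisy_input, signal, rcond=None)
# Same denoiser fitted against noisy targets only.
w_n2n, *_ = np.linalg.lstsq(noisy_input, noisy_target, rcond=None)

max_weight_gap = float(np.max(np.abs(w_clean - w_n2n)))
mse_raw = float(np.mean((noisy_input[:, 0] - signal) ** 2))
mse_n2n = float(np.mean((noisy_input @ w_n2n - signal) ** 2))
print(max_weight_gap, mse_raw, mse_n2n)
```

Because the target noise is zero-mean and independent of the input, it averages out of the squared-error loss, so both fits converge to nearly the same weights. The same argument is what allows a deep network to be trained on noisy image pairs.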
Machine learning to improve hyperspectral imaging speed
Speed and molecular specificity often are two conflicting factors in CRS microscopy. Spectral information offered by Raman spectroscopy or frame-by-frame hyperspectral CRS resolves molecules from the fingerprint or overlapping Raman bands. However, this capability comes with a price of long acquisition time, from several minutes to several hours per stack. The multiplexed scheme54,55,56,57 improves the spectral imaging speed, reaching a few seconds per image stack but with a lower signal level due to the reduced number of photons at a pixel. Theoretically, the nature of the spectral data is that the signature peaks take up only a small portion of the spectrum. Indeed, it was previously demonstrated that carefully selecting the spectral channels allows high-speed multi-spectral SRS imaging58. Besides, in hyperspectral Raman or CRS imaging, each pixel contains spectral information to form highly structured data with spatial and spectral correlations hidden within. Machine learning has been proven to be a capable approach to exploit these properties for addressing the speed and molecular specificity dilemma.
Breaking the speed limit of spontaneous Raman imaging bounded by signal integration time, a framework composed of two neural networks called DeepeR was developed59. Specifically, DeepeR sequentially applies 1D ResUNet denoising and hyperspectral residual channel attention network (variants of residual channel attention network) to fully consider the high degree of molecular compositional correlation between pixels, demonstrating high-speed, high-resolution molecular imaging by Raman microscopy with up to 160 times improvement in the speed. 1D ResUNet is a deep learning model based on U-Net architecture, which uses one-dimensional convolution and is specially used to process spectral data. This is different from the U-Net network using 2D convolutions for denoising. The hyperspectral residual channel attention network is designed based on the residual channel attention network architecture, which mainly includes the residual learning module, channel attention mechanism, and long and short-term skip connections. Through special design, it makes full use of the rich spectral information in the Raman spectrum to achieve more accurate reconstruction results.
For hyperspectral CRS, spectral tuning time mainly determines the imaging speed. Fingerprint hyperspectral SRS imaging with microsecond spectral acquisition time was demonstrated by developing an ultrafast spectral tuning approach and a spatial-spectral residual learning network (SS-ResNet)49. A 55-kHz polygon scanner and a Littrow-configured reflective grating are used for delay-line tuning in the spectral focusing scheme. This design allows the acquisition of an SRS spectrum within 20 μs with 10 cm−1 spectral resolution but significantly reduces SNR. This challenge is effectively addressed by SS-ResNet, which is specifically designed for Raman spectroscopic data with small training sets. SS-ResNet is an architecture based on U-Net, including the encoder and decoder structure. Unlike the commonly used 3D CNN filters, SS-ResNet uses two parallel filters, a 1 × 3 × 3 spatial convolution filter and a 3 × 1 × 1 spectral convolution filter, to maintain spectral continuity between adjacent frames, allowing SNR recovery comparable to 100 times averaging. In addition, SS-ResNet applies a residual learning scheme to better train deep networks. With SS-ResNet, fingerprint SRS imaging of biomolecules in cancer cells, whole mouse brain tissue slices, and bacteria was demonstrated (Fig. 2a)49.
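The parallel spatial and spectral filters can be sketched in NumPy as below. This is only an illustration of the filter shapes: how SS-ResNet actually merges the two branches is not specified here, so the summation followed by ReLU and the residual addition are assumptions, and simple averaging kernels stand in for learned weights.

```python
import numpy as np


def correlate3d_same(x, kernel):
    """'Same'-size 3D cross-correlation with edge padding; x has axes (spectral, y, x)."""
    kz, ky, kx = kernel.shape
    padded = np.pad(x, ((kz // 2,) * 2, (ky // 2,) * 2, (kx // 2,) * 2), mode="edge")
    nz, ny, nx = x.shape
    out = np.zeros(x.shape, dtype=float)
    for i in range(kz):
        for j in range(ky):
            for k in range(kx):
                out += kernel[i, j, k] * padded[i:i + nz, j:j + ny, k:k + nx]
    return out


def spatial_spectral_block(x, spatial_kernel, spectral_kernel):
    """Two parallel branches (assumed to be summed), ReLU, then a residual connection."""
    branch = correlate3d_same(x, spatial_kernel) + correlate3d_same(x, spectral_kernel)
    return x + np.maximum(branch, 0.0)


rng = np.random.default_rng(0)
stack = rng.random((6, 8, 8))                    # (spectral, y, x) toy hyperspectral stack
spatial_kernel = np.full((1, 3, 3), 1.0 / 9.0)   # 1 x 3 x 3: acts within each frame
spectral_kernel = np.full((3, 1, 1), 1.0 / 3.0)  # 3 x 1 x 1: acts across adjacent frames
out = spatial_spectral_block(stack, spatial_kernel, spectral_kernel)

n_separable = spatial_kernel.size + spectral_kernel.size  # weights per filter pair
n_full = 3 * 3 * 3                                        # weights in one 3 x 3 x 3 filter
print(out.shape, n_separable, n_full)
```

Splitting a 3 × 3 × 3 filter into a spatial and a spectral part reduces the weights per filter from 27 to 12, which is one plausible reason such a design suits small training sets.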

a Fingerprint spectroscopic SRS imaging of fixed Mia PaCa-2 cells (scale bar, 20 μm), mouse brain and Escherichia coli (scale bar, 10 μm) by single raw acquisition, spatial-spectral residual learning network (SS-ResNet) recovery, and 100 images averaging ground truth (GT). Reproduced with permission from ref. 49. b Comparison of data acquisition schemes between conventional frame-by-frame and sparse sampling schemes (left). Sparsely sampled and raster-scanned SRS images for living Candida albicans (right). Scale bar, 10 μm. Reproduced with permission from ref. 63. c DeepChem architecture (left) and the predicted subcellular organelle maps using DeepChem (right). Spectrally summed hyperspectral SRS (hSRS) image was used as input, and semi-manually segmented results from hSRS image were used as the ground truth. Bkg background, LD lipid droplet, ER endoplasmic reticulum, Cyto cytoplasm. Scale bar, 10 μm. Reproduced with permission from ref. 66.
Besides the denoising-based approaches to boost the hyperspectral imaging speed, reducing the sampling number (in the spatial or spectral domain) is another potential way to save data acquisition time. Inspired by matrix completion, a theory of recovering a low-rank matrix from a sparse subset of random observations60,61, it was demonstrated that the spectroscopic image stack could be reconstructed from a randomly sampled small portion of pixels62,63. Specifically, a three-dimensional sparse sampling approach was designed by measuring ~20% of pixels throughout the hyperspectral SRS image stack, significantly enhancing the acquisition speed. Then, a regularized non-negative matrix factorization (NMF) algorithm was developed to reconstruct the randomly sub-sampled image stack into concentration maps of biomolecules, achieving high-speed metabolic imaging of living specimens without sacrificing the spatial and spectral resolutions of the system (Fig. 2b)63.
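The reconstruction step can be sketched as a masked NMF, in which only the sampled entries contribute to the loss. The sketch uses standard multiplicative updates for weighted NMF and omits the regularization term of the published algorithm; the synthetic spectra, the rank, and the matrix sizes are assumptions for illustration.

```python
import numpy as np


def masked_nmf(X, mask, rank, n_iter=300, seed=0, eps=1e-9):
    """Factor X ~ C @ S with C, S >= 0, fitting only entries where mask == 1."""
    rng = np.random.default_rng(seed)
    n_pixels, n_channels = X.shape
    C = rng.random((n_pixels, rank)) + 0.1     # concentration maps (flattened pixels)
    S = rng.random((rank, n_channels)) + 0.1   # component spectra
    MX = mask * X
    losses = [0.5 * np.sum((mask * (X - C @ S)) ** 2)]  # loss before any update
    for _ in range(n_iter):
        C *= (MX @ S.T) / ((mask * (C @ S)) @ S.T + eps)
        S *= (C.T @ MX) / (C.T @ (mask * (C @ S)) + eps)
        losses.append(0.5 * np.sum((mask * (X - C @ S)) ** 2))
    return C, S, losses


# Synthetic hyperspectral stack: 400 pixels x 30 spectral channels, two components.
rng = np.random.default_rng(1)
channels = np.arange(30)
spectra = np.stack([np.exp(-((channels - 8) ** 2) / 18.0),
                    np.exp(-((channels - 20) ** 2) / 18.0)])
concentrations = rng.random((400, 2))
X = concentrations @ spectra

mask = (rng.random(X.shape) < 0.2).astype(float)   # ~20% of entries are measured
C, S, losses = masked_nmf(X, mask, rank=2)
unseen = mask == 0
rel_err_unseen = float(np.linalg.norm((X - C @ S)[unseen]) / np.linalg.norm(X[unseen]))
print(losses[0], losses[-1], rel_err_unseen)
```

Because every pixel shares the same small set of component spectra, the few sampled entries per pixel are enough to estimate its concentrations, and the product `C @ S` fills in the entries that were never measured.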
Conversely, by focusing the energy at a single frequency, the imaging speed of CRS microscopy can reach up to the video rate64, although the molecular specificity becomes somewhat limited. Especially for the SRS imaging using femtosecond pulse excitation, although integration over a broad spectral bandwidth generates 10-fold higher SNR than picosecond pulse excitation65, it results in a poor spectral resolution, unable to resolve overlapping Raman bands. By learning the correlation between spectral and spatial features, Zhang et al.66 developed DeepChem, a DenseNet-based neural network, to reconstruct chemical maps from high-speed femtosecond SRS images. Unlike conventional CNN, each layer in the DenseNet is connected with every other layer in a feed-forward fashion67. Such a network promotes feature propagation and reuse, significantly reducing the number of parameters. After training with the spectrally summed hyperspectral SRS images and subcellular organelle maps generated by spectroscopic information provided by hyperspectral SRS imaging as the ground truth, DeepChem has successfully predicted the organelle maps of lipid droplets, endoplasmic reticulum, nuclei, and cytoplasm, with high fidelity (Fig. 2c)66. We summarize the comparison of these methods in Table 3. Overall, the synergistic combination of machine learning and hyperspectral imaging acquisition has pushed the imaging speed to the next level for broad biological applications.
Machine learning empowered multi-dimensional image data analysis
The CRS signal is known to be highly complex and can contain contributions from multiple chemical species, making it challenging to extract meaningful information. Traditional machine learning approaches, such as multivariate analysis tools, have been employed to analyze spectral data and generate compositional maps. One commonly used approach is PCA, a dimension-reduction technique that decomposes the original spectral data into several principal components. PCA is particularly adept at detecting subtle spectral differences within a dataset by identifying the directions of greatest variance in the data, thereby revealing hidden patterns. Another key technique for analyzing CRS data is multivariate curve resolution (MCR)68, a method that separates the contributions of different components in a mixture and retrieves the pure spectrum and concentration profile of each component. Zhang et al. utilized MCR to resolve overlapped Raman bands for quantitative mapping of biomolecules69, demonstrating the potential of hyperspectral stimulated Raman loss microscopy for quantitative mapping of chemical components in complex biological systems, such as breast cancer cells and fat tissues. Spectral unmixing algorithms, such as vertex component analysis (VCA) and the least absolute shrinkage and selection operator (LASSO), have also been applied to analyze hyperspectral SRS imaging data49,70. The core purpose of VCA is to extract extreme components (end members) from multivariate data, which represent pure spectra or other basic building blocks of a dataset. LASSO performs unmixing by introducing a penalty term in the loss function that limits the absolute size of the model coefficients, shrinking some coefficients exactly to zero and thus yielding a sparse set of components.
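How the LASSO penalty drives coefficients exactly to zero can be shown with a small unmixing example. The sketch minimizes 0.5·||y − A c||² + λ·||c||₁ for a single pixel spectrum y using iterative soft-thresholding (ISTA), a standard solver chosen here for brevity. The four Gaussian reference spectra, the value of λ, and the mixing coefficients are assumptions, not data from the cited studies.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink towards zero, clipping small values to 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_unmix(y, A, lam, n_iter=500):
    """ISTA for min_c 0.5*||y - A c||^2 + lam*||c||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    c = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ c - y)
        c = soft_threshold(c - step * grad, step * lam)
    return c


# Reference library: four synthetic Raman bands over 50 spectral channels (columns of A).
channels = np.arange(50)
centers = np.array([10.0, 20.0, 30.0, 40.0])
A = np.exp(-((channels[:, None] - centers[None, :]) ** 2) / (2.0 * 2.0 ** 2))

true_c = np.array([1.0, 0.0, 0.5, 0.0])   # the pixel contains only components 0 and 2
y = A @ true_c
c_hat = lasso_unmix(y, A, lam=0.05)
print(c_hat)
```

The components absent from the pixel receive coefficients of exactly zero, while the coefficients of the present components are slightly shrunk by the penalty. That trade-off between sparsity and bias is controlled by λ.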
Besides these individual models, ensemble machine-learning algorithms have been shown to provide higher accuracy, robustness, and stability. The core idea is to combine multiple learning algorithms that integrate the prediction results of multiple algorithms to improve prediction performance over any single learning algorithm. For example, a machine learning framework was constructed based on VCA, in combination with k-means clustering analysis (KMCA) and random forests classifier to analyze expressed human meibum, revealing spectral features that potentially correlate with meibum health quality (Fig. 3a)71. Others have constructed and utilized an ensemble machine-learning analysis pipeline using a combination of clustering validation and VCA protocols to quantitatively analyze lipid particles in C.elegans, identifying subgroups based on their compositions (Fig. 3b)72. Furthermore, cell segmentation and classification based on the metabolic signatures of each cell obtained from hyperspectral SRS images was achieved by an SVM-based machine learning pipeline (Fig. 3c)73, validating the approach for high-throughput, semi-automatic analysis for basic research and clinical applications.

a Vertex component analysis (VCA) described the spectra in terms of three end members, coarse k-means clustering analysis (KMCA) determined the background and the protein spectra, and fine KMCA discerned various levels of lipid-protein mixtures in meibum samples. Image size is 235.5 × 235.5 μm. Reproduced with permission from ref. 71. b Coherent anti-Stokes Raman scattering (2845 cm−1) image and the result after applying the ensemble machine-learning method. PLP lipoprotein particle. Scale bar, 10 μm. Reproduced with permission from ref. 72. c Machine learning-based method for single-cell classification in hyperspectral SRS image dataset. Scale bar, 20 μm. Reproduced with permission from ref. 73.
Compared to manual or semi-automatic methods, deep neural networks offer an automated, robust framework for learning complex representations from data. U-Net74 is a widely used deep learning architecture for image segmentation in CRS image analysis. It has been trained on multi-spectral SRS image data that includes lipid, protein, and target drug molecules, allowing it to identify cell features and provide a quantitative platform for in vivo pharmacokinetic studies at the cellular level (Fig. 4a)75. U-Net has also been used to segment single cells in human brain tumor specimens (Fig. 4b)76 and single bacterium for clinical applications77. The U-Net architecture utilized in these studies bears a strong resemblance to the U-Net framework deployed for denoising, with the primary variation being the adjustment in the number of convolutions. This highlights the versatility of the U-Net method when applied to a spectrum of imaging challenges. However, for distinct research endeavors, it is imperative to refine and tailor the approach based on specific requirements to ensure optimal performance outcomes.

a Representative SRS lipid images (leftmost column), the output of the U-Net analysis (second column), and overlays over the original image (third column) from stratum corneum (SC), sebaceous gland (SG) layers. (Right) pharmacokinetics profile patterns for lipid- and water-rich areas of the images. Scale bar, 20 μm. Reproduced with permission from ref. 75. b An overview of the cell counting framework that can provide clinical support for image-guided brain tumor surgery in the operating room. Reproduced with permission from ref. 76. c Input SRS images, ground-truth fluorescence results and predicted fluorescence results based on U-within-U-Net (UwU-Net) are shown for nuclei, mitochondria and endoplasmic reticuli. Scale bar, 25 μm. Reproduced with permission from ref. 78.
To better handle hyperspectral image datasets, U-Net has been amended by adding a separate U-structure for spectral channel information, generating a new architecture named U-within-U-Net (UwU-Net), capable of performing segmentation and classification from various types of hyperspectral image datasets (Fig. 4c)78. Specific examples of the application of CNN for CRS imaging in clinical settings are discussed in the section below. Overall, these methods and approaches have been successfully demonstrated in various CRS imaging of biological systems, highlighting the potential for advancing our understanding of spatial-temporal dynamics of biomolecules and enabling clinical applications.
Machine learning toward clinical applications
Machine learning has swiftly expanded within clinical applications, showing particular promise in stimulated Raman histology (SRH)79,80,81. SRH, a label-free chemical imaging method, harnesses SRS signals to create contrast, producing virtual H&E images familiar to clinicians. However, interpreting histopathologic images can be time-intensive, and observer discrepancies can arise. The integration of CRS imaging with machine learning has emerged as a pivotal strategy for achieving rapid and accurate clinical diagnoses, which extend to diverse clinical scenarios, from non-invasive diagnosis of cancers18,76,80,82,83,84, nerve segmentation necessary for nerve-sparing surgeries utilizing endoscopes85 to drug uptake imaging and tracking in the skin75. We summarize these methods for clinical application in Table 4.
One notable application of such a combination is intraoperative diagnosis, where timely and precise assessments are vital for safe and effective surgical procedures. Orringer et al. introduced a clinically compatible SRS system using a fiber laser for label-free tissue imaging in the operating room (Fig. 5a)18. This system generates SRH images from lipids and proteins/nucleic acids signals, which are processed and analyzed in real-time during surgery. A supervised machine learning algorithm, specifically a multilayer perceptron (MLP), was developed to interpret SRH images, assigning probabilities to four diagnostic categories (non-lesional, low-grade glial, high-grade glial or non-glial tumor), which are crucial for decision-making during brain tumor surgeries18. Although this approach achieved an impressive prediction accuracy of 90%, its reliance on manual feature engineering associated with traditional machine learning renders it somewhat less robust in clinical contexts.

a (Above) SRS microscope in the operating room. (Below) Probability heatmaps overlaid on the stimulated Raman histology (SRH) mosaic images indicate the multilayer perceptron (MLP)-determined probability of class membership for each field-of-view across the mosaic image for the four diagnostic categories. Reproduced with permission from ref. 18. b (Left) SRH mosaic of a specimen collected at the brain tumor interface of a patient diagnosed with glioblastoma. (Right) Three-channel RGB convolutional neural network-prediction transparency overlaid on stimulated Raman histology image using convolutional neural network. Scale bar, 50 μm. Reproduced with permission from ref. 80. c U-Net-based femto-SRS imaging with recovered chemical resolution (above) and automated diagnosis with convolutional neural networks on femto-SRH images (below). Scale bar, 50 μm. Reproduced with permission from ref. 82. d (Left) Input feature matrix (PC1, PC2, cell number of Cluster 1) of peritoneal metastasis (PM) positive and negative specimens by the K‐means cell clustering and principal component analysis algorithm (K‐PCA) method. (Right) Confusion matrix, positive probability with cross-validation, and receiver operating characteristic curve by the K-PCA method with a threshold to get the best sensitivity and specificity. CY conventional cytology. Reproduced with permission from ref. 88.
The combination of CNNs with SRH has been explored in pursuit of greater robustness, leading to a parallel workflow for near real-time diagnostic predictions (Fig. 5b)80. SRH images were processed through a dense sliding window algorithm, generating high-resolution patches for CNN training and inference. In the prediction stage, these individual patches pass through a trained Inception-ResNet-v2 network, a deep CNN architecture that integrates inception modules (which apply convolution kernels and pooling layers of different sizes in parallel to capture image information) and residual connections86, producing semantic segmentation for brain tumor predictions within 150 seconds, significantly faster than conventional techniques. To achieve rapid diagnosis of azoospermia, Huang et al. designed a lightweight CNN, LiteBlendNet. Using a multimodal platform that collects both the SRS signal for chemical information and the second harmonic generation signal for structural information, they achieved 100% sample-level accuracy and 96.2% patch-level accuracy in classifying azoospermia, surpassing conventional imaging modalities87. Recent advancements have further increased imaging speed through the U-Net-based recovery of dual-channel SRS images from single-shot femtosecond SRS images containing the integrated spectral intensity within the C−H stretching region (Fig. 5c)82, enabling imaging of fresh gastroscopic biopsies within 60 seconds. The U-Net model used here follows the typical U-Net design. Coupled with a diagnostic CNN (Inception-ResNet-v2) tailored for tissue segmentation, this work demonstrates the potential for immediate diagnosis and automated assessment of resection margins in intraoperative histopathological diagnosis.
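The dense sliding window step can be sketched as follows. This is a simplified illustration, not the workflow of ref. 80: the patch size, the stride, the rule of averaging overlapping patch predictions into a per-pixel map, and the `dummy_predict` function standing in for a trained CNN are all assumptions.

```python
import numpy as np


def sliding_window_patches(image, patch_size, stride):
    """Yield (top, left, patch) for every full patch that fits in the image."""
    height, width = image.shape[:2]
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            yield top, left, image[top:top + patch_size, left:left + patch_size]


def patch_probability_map(image, patch_size, stride, predict_patch, n_classes):
    """Average overlapping patch-level class probabilities into a per-pixel map.

    Pixels not covered by any patch keep a probability of zero.
    """
    height, width = image.shape[:2]
    prob_sum = np.zeros((height, width, n_classes))
    counts = np.zeros((height, width, 1))
    for top, left, patch in sliding_window_patches(image, patch_size, stride):
        window = (slice(top, top + patch_size), slice(left, left + patch_size))
        prob_sum[window] += predict_patch(patch)  # broadcast over the window
        counts[window] += 1
    return prob_sum / np.maximum(counts, 1)


def dummy_predict(patch):
    """Hypothetical stand-in for a trained CNN; NOT a real classifier."""
    p = float(patch.mean())
    return np.array([p, 1.0 - p])


# Synthetic two-channel image with values in [0, 1) (illustration only).
rng = np.random.default_rng(0)
image = rng.random((64, 64, 2))

patches = list(sliding_window_patches(image, patch_size=32, stride=16))
heatmap = patch_probability_map(image, 32, 16, dummy_predict, n_classes=2)
print(len(patches), heatmap.shape)
```

A stride smaller than the patch size makes neighbouring patches overlap, which gives a denser prediction map at the cost of more forward passes per image.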
Stimulated Raman molecular cytology (SRMC) is a newly established, intelligent cytology method that employs SRS microscopy in conjunction with deep learning algorithms. This approach significantly enhances the detection of peritoneal metastasis (PM) in gastric cancer by analyzing exfoliated cells collected from the ascites fluid of patients (Fig. 5d)88. SRMC uses deep learning-based segmentation algorithms, such as StarDist89 (a segmentation model based on U-Net), to identify and analyze individual cells, and extracts from each cell a set of features for evaluating its characteristics. Techniques such as PCA and K-means clustering are then used to reduce and simplify the complex, high-dimensional data. Following this preprocessing, various machine learning models, including SVM, LDA, and LR, are trained to predict the presence of metastasis from the cell-derived features. The combined use of these imaging and machine learning techniques enables SRMC to offer rapid and accurate diagnostics.
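The feature-reduction stage can be illustrated with a small NumPy sketch that combines PCA and K-means and then summarizes each specimen as (mean PC1, mean PC2, number of cells in Cluster 1), mirroring the feature matrix described for Fig. 5d. The synthetic cell features, the pooling of cells across specimens, and the aggregation rule in `specimen_feature_matrix` are assumptions for illustration; the exact K-PCA procedure of ref. 88 is not reproduced here.

```python
import numpy as np


def pca_project(x, n_components=2):
    """Project rows of x onto the top principal components (via SVD)."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def assign_clusters(x, centroids):
    """Index of the nearest centroid for each row of x."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign_clusters(x, centroids)
        new_centroids = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign_clusters(x, centroids), centroids


def specimen_feature_matrix(scores, labels, specimen_ids, cluster=1):
    """One row per specimen: mean PC1, mean PC2, cell count in `cluster`."""
    rows = []
    for sid in np.unique(specimen_ids):
        mask = specimen_ids == sid
        rows.append([scores[mask, 0].mean(),
                     scores[mask, 1].mean(),
                     np.sum(labels[mask] == cluster)])
    return np.array(rows, dtype=float)


# Synthetic data: 3 specimens x 20 cells x 6 per-cell features.
# Half of the cells in the third specimen come from a shifted population.
rng = np.random.default_rng(1)
cells = rng.normal(0.0, 1.0, size=(60, 6))
cells[50:] += 5.0
specimen_ids = np.repeat([0, 1, 2], 20)

scores = pca_project(cells, n_components=2)
labels, _ = kmeans(scores, k=2)
features = specimen_feature_matrix(scores, labels, specimen_ids)
print(features)
```

The resulting specimen-level matrix is the kind of low-dimensional input on which a classifier such as SVM, LDA, or LR could then be trained. Note that K-means cluster indices are arbitrary, so which group ends up as "Cluster 1" depends on the initialization.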
Outlook
Machine learning has emerged as a promising tool to address the intricate challenges in spectroscopic imaging, enabling efficient data acquisition, processing, and advanced information extraction. By bridging the gap between speed, image quality, and information extraction, machine learning transforms spectroscopic imaging into a robust and versatile technique for both research and clinical applications. At present, certain techniques built on the U-Net, ResNet, and DenseNet architectures, such as Noise2Noise, SS-ResNet, and DeepChem, have shown remarkable performance in denoising and rapid imaging of biomedical images. A derivative network of U-Net, named UwU-Net, has demonstrated a strong capability in segmenting and classifying various types of hyperspectral image datasets. These U-Net-based methods have also proven effective in clinical diagnosis, benefiting from the powerful feature representation achieved through progressive feature fusion layers. The emergence of new foundational frameworks is highly anticipated and could lead to an entirely new wave of technological updates. This article concludes with an outlook on the future advancements and integration of machine learning in spectroscopic imaging, highlighting its potential to promote the field further.
Nonetheless, further improvement in machine learning techniques remains crucial to the development of spectroscopic imaging. Specifically, while deep learning models have demonstrated exceptional performance in providing structural insights, their effectiveness is often limited to the specific context in which they were trained and depends on the size and quality of the training data. Consequently, an algorithm developed for one spectroscopic imaging system may not be readily applicable to another, which limits the versatility and capability of such a technique. This challenge is particularly pronounced in CRS imaging, which encompasses diverse modalities such as femtosecond/picosecond pulse excitations and CARS/SRS modalities, generating datasets with relatively high variances4. The wide range of spectral coverage, from the high-wavenumber C−H stretching region through the cell-silent region to the fingerprint region, presents additional complexity for machine learning algorithms90, which must effectively capture and interpret the diverse spectroscopic features within the data. Effective transfer learning may help mitigate such challenges.
Unlike modern machine learning techniques that benefit from the large datasets generated in the recent digital age, spectroscopic imaging faces another crucial obstacle: the scarcity of accessible datasets for algorithm development, compounded by the time-intensive nature of procuring suitable training datasets for CRS microscopy applications. The lack of publicly available datasets restricts the ability of researchers to compare different algorithms and hampers the collaborative development of robust and generalizable spectroscopic imaging approaches. Addressing these challenges therefore calls for coherent and concerted efforts from the scientific community, in which researchers prioritize the establishment of standardized benchmark datasets that cover diverse imaging modalities and spectral ranges. Open-source initiatives that facilitate data sharing and collaboration among researchers will greatly contribute to the development of more universal and powerful machine learning algorithms for spectroscopic imaging.
In conclusion, the fusion of spectroscopic imaging and machine learning holds great promise. Applying machine learning to spectroscopic imaging techniques is envisioned to extend beyond CRS microscopy, encompassing other spectroscopic modalities such as advanced IR spectroscopic imaging91,92 and nonlinear microscopy such as transient absorption and harmonic generation imaging and analysis93. A particularly valuable development would be a generalized framework that can be trained with the limited data available. In particular, effective training from minimal data would be greatly preferred for specific downstream tasks such as segmentation and denoising. Moreover, as the interpretability of deep learning improves, it should become possible to obtain deeper technical explanations of model behavior in applications such as segmentation and classification, and thereby to deepen the understanding of the mechanisms underlying processes such as disease and metabolism. By collectively addressing these challenges, the scientific community can push the boundaries of machine learning in spectroscopic imaging, enabling enhanced data analysis, interpretation, and ultimately, the broad application of this empowering technology.
Responses