TransPeakNet for solvent-aware 2D NMR prediction via multi-task pre-training and unsupervised learning

Introduction

Nuclear magnetic resonance (NMR) spectroscopy has emerged as a versatile tool with widespread applications across diverse scientific domains, including chemistry, environmental science, food science, material science, and drug discovery by unraveling molecular dynamics and structures^1,2,3,4. The primary information of an NMR spectrum arises from the chemical shift, which is determined by the local environment of a nucleus and influenced by interactions through chemical bonds and space. This mechanism yields unique “fingerprints” corresponding to diverse functional groups or molecular motifs, thereby facilitating the streamlined deduction of atomic connectivity and arrangement.

Interpreting NMR spectra requires following essential guidelines, often referred to as “rules of thumb”, where specific chemical shifts are associated with distinctive functional groups⁵. The determination of molecular structures from varying chemical shifts on NMR spectra generally requires the expertise of experienced organic chemists. To facilitate the interpretation of NMR spectra, significant efforts have been directed towards computational simulation of NMR spectra⁶. Early computational approaches, like the Hierarchically Ordered Spherical Environment (HOSE) codes⁷, aim to encapsulate atom neighborhoods in concentric spheres, utilizing a nearest-neighbor approach to predict NMR shift values. A recent HOSE approach⁸ yields Mean Absolute Errors (MAEs) of 3.52 ppm for ¹³C NMR and 0.29 ppm for ¹H NMR on the nmrshiftdb2⁹ dataset. Concurrently, significant efforts have been devoted to the ab initio calculation of NMR properties.^10,11 Density Functional Theory (DFT)-based methods were developed for certain small organic molecules, achieving MAEs of 2.9 ppm for ¹³C NMR and 0.23 ppm for ¹H NMR¹². However, the accuracy of these DFT-based methods relies heavily on the choice of the basis functions, which often require meticulous case-by-case manual tuning for each molecule. Moreover, the time-intensive nature of DFT calculations limits their applications to comprehensive and large datasets. Recently, the rise of Graph Neural Networks (GNN) and their successes in predicting molecular properties^{13,14,15,16,17} has prompted initiatives to employ GNNs for predicting peaks in NMR spectra^5,18,19. The application of GNN to molecules is intuitive, as a molecular structure can be naturally represented as a graph, with each atom as a node and its chemical bonds as edges. On 1D NMR data, a GNN-based model achieves MAEs of 1.355 ppm for ¹³C NMR and 0.224 ppm for ¹H NMR on the nmrshiftdb2 dataset¹⁸. 
While considerable efforts have been made in developing predictive models for 1D NMR, the prediction of 2D NMR remains underexplored.

Heteronuclear Single Quantum Coherence (HSQC) spectroscopy²⁰, a sophisticated 2D NMR technique, is an important tool for elucidating atomic connectivity within complex molecules where conventional 1D NMR may prove insufficient^21,22. By correlating the chemical shifts of hydrogen nuclei with those of heteronuclear nuclei, typically carbon or nitrogen, via scalar coupling interactions, HSQC facilitates the comprehensive mapping of interatomic connections within a molecule. This mapping yields crucial insights into chemical bonding, molecular conformation, and intramolecular interactions. As molecular structures become more complex, 1D NMR spectra tend to display increasingly overlapping peaks, making 2D NMR techniques such as HSQC essential for elucidating local structures. A notable stride in this domain utilizes the ML approach to establish correlations between DFT-simulated HSQC spectra and empirical data to identify molecules²³. However, the accurate prediction of HSQC spectra using ML techniques remains elusive, primarily due to the scarcity of large-scale, high-quality datasets, as well as the labor-intensive and time-consuming peak annotation process. While numerous annotated 1D spectra are available for training ML models, combining these results to reliably generate 2D NMR data is far from straightforward. As an example, a recent study introduced a method to integrate the state-of-the-art predictions of proton and carbon 1D spectra into HSQC spectra²⁴, achieving MAEs of 0.16 ppm for ¹H and 2.64 ppm for ¹³C. This study highlights the inherent difficulties in accurately predicting 2D NMR chemical shifts, as even though the selected 1D NMR models and methods achieve low error individually, such success cannot be transferred to HSQC cross-peak prediction. 
Moreover, beyond the challenge of prediction, associating each HSQC peak with its corresponding local molecular structure is an inevitable task that requires expert domain knowledge and is often time-consuming. Proper calibration is typically necessary to ensure precise alignment between 1D and HSQC data before interpreting the HSQC spectra.

In light of the aforementioned challenges and opportunities in interpreting HSQC spectra, we propose TransPeakNet: a Transfer learning-based Peak prediction and assignment with unsupervised learning, illustrated in Fig. 1. This framework enables end-to-end training and testing on experimental data. Once trained, the TransPeakNet model generates cross-peak predictions directly from a SMILES representation and the solvent environment used in the experiment. Additionally, it simultaneously associates each cross-peak signal with its corresponding carbon-proton pairs. Alongside a Graph Neural Network (GNN) module capturing structural nuances, the model incorporates a solvent encoder to effectively account for the impact of solvent environments on chemical shifts, which is essential for delivering accurate cross-peak prediction and peak assignment of HSQC spectra. To tackle the lack of annotated HSQC data, we designed a two-step transfer learning process. First, the model is pre-trained on a labeled 1D NMR dataset via Multi-Task pre-Training (MTT), enabling it to learn a wide range of C–H interactions. Then, we implement an unsupervised learning strategy that uses the unlabeled HSQC dataset to refine the model’s ability to accurately discern and label HSQC cross peaks.

**Fig. 1: Illustration of the model design and training strategy of TransPeakNet.**

The model is pre-trained on ~24,000 annotated 1D NMR spectra from NMRShiftDB2⁹ and fine-tuned on ~19,000 experimental HSQC spectra from HMDB²⁵ and CH-NMR-NP²⁶. The model is thoroughly evaluated and compared to traditional tools such as ChemDraw²⁷ and Mestrenova²⁸ using an expert-annotated test dataset. On the test dataset, the model achieves MAEs of 2.05 ppm and 0.165 ppm for ¹³C and ¹H shifts, respectively. We also demonstrate that our model effectively accounts for the impact of the solvent on chemical shifts when making predictions. Compared with the traditional tools, our model shows promising improvements, especially as molecular size increases.

Results

Performance on HSQC cross peak prediction and assignment

Figure 2 summarizes the performance of our model on the tasks of HSQC cross-peak prediction and peak assignment, using an expert-annotated test dataset of 500 molecules with an average molecular weight of 398.98 Da and an average of 56.32 atoms. The annotation process involved three experienced experts with extensive knowledge of organic chemistry. For each molecule, two experts independently linked the observed cross-peaks from experiments to C–H bonds. If they agreed, the annotation was finalized. In cases of disagreement, the third expert reviewed and validated the annotations. Samples with poor quality, such as those with insufficient experimental resolution, were excluded from the test dataset for model evaluation, resulting in 479 high-quality annotations. For chemical shift prediction, our model achieved MAEs of 2.05 ppm for ¹³C shifts and 0.165 ppm for ¹H shifts. In terms of annotation accuracy, our model correctly annotated all peaks in 456 out of 479 molecules (95.21%). An example of the model prediction and annotation of a small molecule is visualized in Fig. 2C for simplicity. Additional examples for medium and large molecules are included in Supplementary Information Section 1, Supplementary Figs. 1–4. For the 23 molecules whose algorithmic annotations do not fully agree with those of the experts, 81.56% of the peak annotations still align. Examples of partially correct annotations are visualized in Supplementary Information Section 2, Supplementary Figs. 5–9. Notably, the peak assignment algorithm operates without a shift discrepancy threshold, allowing it to align all ground truth peaks with predictions. Given the low MAEs of the predicted shifts (2.05 ppm for ¹³C and 0.165 ppm for ¹H), the model provides a solid foundation for accurate assignment. Even in rare cases where a few atoms in a molecule have relatively large prediction errors, the annotation remains reliable, highlighting its robustness.

**Fig. 2: Model prediction and alignment accuracy.**

To evaluate our model’s ability to capture solvent effects, we tested different solvent encoders for carbon and hydrogen atoms. Empirical results indicate that a solvent encoder with a dimension of 32 is most effective at capturing proton shifts, while adding embeddings for carbon does not yield significant benefits. This partially aligns with expert expectations, as protons are known to be more sensitive to their solvent environment due to their high exposure to the surrounding molecular environment. By contrast, carbon atoms are less directly influenced by solvent interactions, as they are more deeply embedded within the molecular structure and shielded by surrounding electron clouds. Additionally, carbon atoms are not directly involved in hydrogen bonding, and their larger mass and lower sensitivity to external magnetic fields make their shifts less responsive to subtle solvent changes. Furthermore, since the 1D NMR data used for pre-training does not include solvent information, and some solvent groups (e.g., benzene and acids) in the HSQC data are underrepresented, as shown in Fig. 3A, the influence of solvent on carbon shifts may be too subtle to capture at this stage. This remains an area of ongoing investigation. To assess the effect of solvents on proton shifts, we report model performance using the true experimental solvent, a random solvent condition, and the “unknown” solvent condition, with the comparison shown in Fig. 3B. The results demonstrate that using the correct solvent input yields the lowest prediction error, aligning most closely with experimental observations. This improvement in prediction is particularly significant for the trichloromethane (CHCl₃), DMSO, and methanol solvent classes, likely due to their prominent presence in the dataset. The effect of water, on the other hand, can be effectively captured by the solvent encoder even though water constitutes only 1.02% of the data. This could be explained by water’s similarity to methanol in its behavior as a hydrogen-bond donor or acceptor. Both solvents can strongly deshield protons in solutes, shifting their NMR peaks to higher ppm values. The predictability of these hydrogen-bonding interactions may enable the model to generalize well, even with relatively few training examples for water. These findings underscore the promise of incorporating solvent environments into peak prediction models and highlight the need for more high-quality data with solvent information.

**Fig. 3: Solvent analysis and impact on predictions.**

Comparison with traditional tools

In organic chemistry, simulating HSQC spectra is crucial for analyzing experimental HSQC spectra, as it assists researchers in assigning the observed cross peaks to the C–H bonds in target molecules. Traditional software tools such as ChemDraw²⁷ and Mestrenova²⁸ have long served as the primary resources for this task. We compared our model with ChemDraw and Mestrenova; the results, shown in Fig. 4, indicate that our model achieves lower prediction errors than both tools. Aligning predictions from these traditional tools with the ground truth is labor-intensive. Moreover, these tools can yield unstable predictions as molecular size and complexity increase. Hence, we randomly selected 150 peaks across different molecular weight categories for comparison. Because HSQC prediction is not natively implemented in ChemDraw, the cross peaks were constructed and assigned from 1D predictions (¹³C and ¹H chemical shifts) based on bond connectivity. We also provide two examples with different molecular sizes to visualize the comparison.
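The construction of cross peaks from 1D predictions and bond connectivity can be illustrated with a short Python sketch. It is not the procedure used with ChemDraw in this work; the function name `build_cross_peaks`, the data layout, and the ethanol-like shift values are assumptions chosen for illustration only.

```python
def build_cross_peaks(c_shifts, h_shifts, bonds):
    """Pair each carbon shift with the shifts of its directly bonded hydrogens.

    c_shifts: {carbon atom index: predicted 13C shift in ppm}
    h_shifts: {hydrogen atom index: predicted 1H shift in ppm}
    bonds:    iterable of (atom index, atom index) pairs
    """
    peaks = set()
    for a, b in bonds:
        # Orient the bond so that c is the carbon and h is the hydrogen.
        if a in c_shifts and b in h_shifts:
            c, h = a, b
        elif b in c_shifts and a in h_shifts:
            c, h = b, a
        else:
            continue  # not a C-H bond, so it produces no HSQC cross peak
        # A set collapses equivalent hydrogens that share the same shift.
        peaks.add((c, c_shifts[c], h_shifts[h]))
    return sorted(peaks)


# Illustrative (made-up) 1D predictions for an ethanol-like molecule:
# atoms 0 and 1 are carbons, atom 2 is oxygen, atoms 3-8 are hydrogens.
c_shifts = {0: 18.1, 1: 57.8}
h_shifts = {3: 1.2, 4: 1.2, 5: 1.2, 6: 3.7, 7: 3.7, 8: 2.6}
bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]
cross_peaks = build_cross_peaks(c_shifts, h_shifts, bonds)
```

The hydroxyl proton (atom 8) is bonded to oxygen and therefore contributes no cross peak, which mirrors why HSQC reports only directly bonded C–H pairs.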

**Fig. 4: Performance comparison between TransPeakNet and traditional methods.**

Performance by segmentation

To comprehensively evaluate the robustness and generalizability of our model, we conducted a segmentation analysis in the test dataset, assessing its performance across various molecular subcategories. Segmentation is crucial in this context because it allows us to determine whether the model performs consistently well across diverse molecular characteristics, rather than excelling in only a specific subset. We selected categories based on molecular weight and the presence of saccharides. These categories were chosen because NMR prediction gets increasingly challenging as the molecule gets larger, while saccharides represent a distinct chemical group with unique structural features. By examining the model’s performance within these defined segments, we aim to demonstrate its universal applicability and robustness across a wide range of molecular types.

Molecular Weight (MW) is a general indicator of a molecule’s complexity, encompassing varied geometries, bonding patterns, and the presence of isomers. These factors contribute to increased intramolecular interactions, resulting in spectral complexity such as closely spaced peaks or overlapping signals. Additionally, the increased number of spin-spin interactions within larger molecules necessitates more advanced NMR techniques to achieve sufficient resolution²⁹. Furthermore, solubility issues can lead to weak signals, further complicating spectral analysis. Consequently, interpreting HSQC spectra for medium and large molecules is challenging. Therefore, there is a pressing need for a model that can effectively predict and analyze HSQC spectra for these complex molecules.

Figure 5A shows the stratified performance by molecular weight, where the test molecules are grouped into three categories: small (MW < 500 Da), medium (500 Da ≤ MW < 1000 Da), and large (MW ≥ 1000 Da). On the task of predicting ¹H shifts of HSQC cross peaks, the model performs comparably across all groups, achieving MAEs of 0.16–0.19 ppm. On the task of predicting ¹³C shifts, the model achieves an MAE of 1.93 ppm for medium-sized molecules. Our model also generalizes well to large molecules, achieving an MAE of 2 ppm for ¹³C shifts in this category even though large molecules make up only a small proportion (~2%) of the training data.
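The stratified evaluation described above can be sketched as follows. The thresholds come from the text; the function names and the record layout are assumptions made for this example, and the numbers in the usage lines are illustrative rather than results from the paper.

```python
from collections import defaultdict


def mw_category(mw):
    """Map a molecular weight in Da to the size group used in the text."""
    if mw < 500:
        return "small"
    if mw < 1000:
        return "medium"
    return "large"


def mae_by_group(records):
    """Compute the mean absolute error per molecular-weight group.

    records: iterable of (molecular weight, predicted shift, observed shift)
    """
    errors = defaultdict(list)
    for mw, predicted, observed in records:
        errors[mw_category(mw)].append(abs(predicted - observed))
    return {group: sum(e) / len(e) for group, e in errors.items()}


# Illustrative records only: (MW in Da, predicted ppm, observed ppm).
records = [(180.0, 1.5, 1.0), (320.0, 2.0, 2.5), (750.0, 4.0, 3.0), (1200.0, 7.0, 7.25)]
group_mae = mae_by_group(records)
```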

**Fig. 5: Model performance comparison on different segmented categories.**

Saccharides, or carbohydrates, play critical roles in various biological processes involved in energy source and storage, cell signaling, cell adhesion, cell recognition, structural integrity of cells and tissues, as well as cognitive functions and metabolic regulation^30,31,32,33. Despite their importance, elucidating the structures of saccharides is challenging due to their inherent structural complexity and diversity. This complexity arises from the diverse arrangements of monosaccharide units, varied anomeric configurations, and variable glycosidic linkages. Additionally, carbohydrates often lack the crystallinity required for high-resolution X-ray diffraction, unlike the well-defined crystalline structures of small molecules or proteins. Consequently, NMR spectroscopy, particularly through techniques such as HSQC, has emerged as an indispensable tool in unraveling the detailed structures of carbohydrates^34,35. Forecasting HSQC cross peaks and aligning them with experimental data can assist in comprehending saccharide connectivity and stereochemistry, thus aiding in structural determination.

Our model performs well in predicting HSQC cross peaks for saccharide molecules, yielding MAEs of 1.78 ppm for ¹³C shifts and 0.16 ppm for ¹H shifts (see Fig. 5B). This level of accuracy is consistent with the overall model performance, which demonstrates the model’s robustness in handling complex saccharide structures. Figure 6 shows the performance of our model on a few exemplar saccharides. These saccharides feature multiple ring structures and numerous stereogenic centers, contributing to the intricate nature of their HSQC spectra. Despite these inherent complexities, our model predicts the HSQC cross peaks for these molecules with high accuracy, underscoring its capacity to handle the structural complexity of saccharides.

**Fig. 6: Exemplary demonstration of our model’s performance on saccharides.**

Effects of pre-training and fine-tuning

After pre-training via MTT on the 1D NMR dataset, the model achieved validation MAEs of 0.210 ppm for ¹H NMR prediction and 2.228 ppm for ¹³C NMR prediction. This success can be attributed to MTT, which allows the model to effectively learn atomic latent features as well as local structural information by simultaneously performing ¹H and ¹³C NMR shift predictions. This helps us overcome the problem of limited annotated HSQC data. However, when the pre-trained model is deployed directly on the HSQC test dataset, the MAEs increase to 1.397 ppm and 2.822 ppm for ¹H and ¹³C shifts, respectively. These relatively large MAEs are expected, as the data distribution of the HSQC dataset (76.34% small molecules and 90.33% non-saccharides) differs significantly from that of the 1D NMR dataset (98.80% small molecules and 99.95% non-saccharides). In addition, the HSQC cross peaks involve interactions beyond simple pairings of 1D ¹³C and ¹H shifts, requiring a deeper understanding of interactions between atoms. Finally, the absence of solvent labels in the 1D NMR dataset prevents the model from learning solvent effects.

Nevertheless, the pre-training via MTT offers a robust foundation for fine-tuning the model via unsupervised transfer learning. With each iteration, we observed a reduction in model errors. The performance improvement is more pronounced during the initial iterations and gradually diminishes. By the fifth iteration, the improvement became marginal, indicating the convergence of fine-tuning. Finally, the fine-tuned model achieves MAEs of 0.165 ppm and 2.05 ppm for ¹H and ¹³C shifts, respectively. Throughout the transfer learning process, the model was trained to gain a more profound understanding of solvent effects and complex C–H interactions due to intricate molecular structures.

Discussion

In this study, we introduce a framework to develop machine learning techniques for predicting C–H cross peaks in HSQC spectra. The framework enables us to tackle two major challenges in this area. The first challenge is the scarcity of annotated HSQC data for training machine learning models. The second challenge is that collecting large volumes of annotated HSQC data is labor-intensive and requires highly trained personnel. In implementing our framework, we developed a model combining a GNN with a solvent encoder. The GNN is trained to generate atomic embeddings that encapsulate both the local and global chemical environments of each atom, which is crucial for accurate chemical shift predictions. The atomic embeddings are combined with the solvent embedding produced by the solvent encoder, which allows our model to learn the influence of solvent on chemical shifts. The combined embeddings are mapped by the Multi-Layer Perceptron (MLP) modules to HSQC chemical shifts. Our framework employs a two-stage transductive strategy to train the model while addressing the aforementioned challenges. In the first stage, we use a large amount of annotated 1D NMR data to pre-train the model via Multi-Task learning. This enables the model to learn the relationship between atomic interactions and NMR signals, laying a robust foundation for the subsequent stage. Next, the model is refined on a set of unlabeled HSQC spectra via Iterative Unsupervised Learning, enhancing the model’s capability in predicting and interpreting HSQC spectra. Our final model achieves MAEs of 0.165 ppm and 2.05 ppm for ¹H and ¹³C shifts, respectively, while accurately assigning cross peaks. It performs consistently across molecular weight and saccharide categories, outperforms the traditional methods, and generalizes to samples that are underrepresented in the training dataset. 
In the future, we plan to refine our model by developing 3D-GNN models that are able to consider 3D structural information such as spatial orientation and conformational flexibility. This enhancement should enable us to handle other 2D NMR spectra, such as Correlation Spectroscopy and Nuclear Overhauser Effect Spectroscopy, thus broadening its applicability and making a more substantial contribution to the field of chemical analysis.

Methods

In this section, we explain the components of our model and the training strategy in detail.

Data

The pre-training dataset used in the MTT process is a 1D NMR dataset from NMRShiftDB2⁹, which contains ~24,000 annotated NMR spectra collected from 22,663 distinct molecules. The datasets used in the unsupervised transfer learning process consist of a training dataset containing ~19,000 experimental HSQC spectra and a validation dataset containing ~5000 HSQC spectra, collected from HMDB²⁵ and CH-NMR-NP²⁶. All data was accessed in September 2023, with no evidence of potential bias. To prevent data leakage in the validation dataset, duplicated spectra and molecules were removed. The RDKit package in Python was used to sanity-check all SMILES strings and ensure that each yields a valid molecular graph. To quantitatively evaluate our model, we built a test dataset by randomly selecting 500 spectra and manually annotating them to establish the ground truth (see Section “Performance on HSQC cross peak prediction and assignment” for the annotation process). Additionally, to compare our model with two conventional chemistry tools (ChemDraw and Mestrenova), we randomly selected several molecules from this test dataset, comprising ~150 cross-peaks. Since it is labor-intensive to derive HSQC shifts from molecular formulas using these established tools, stratified sampling was used to select these samples, ensuring coverage of the different molecular weight groups (0–499 Da, 500–999 Da, and 1000+ Da). The comparison results are presented in Section “Comparison with traditional tools”.

2D NMR prediction model

As illustrated in Fig. 1A, our model contains a GNN component for encoding molecular features and a solvent encoder component for embedding solvent information. The GNN component learns atomic embeddings that capture both the local and global chemical environments of each atom, which are essential for understanding the observed NMR chemical shifts. The learnt atom representations are extended with the solvent embedding and then mapped to ¹³C and ¹H cross peaks by an MLP component.

GNN

A molecule can be represented by a graph G = (V, E), where V is the node set representing atoms and E is the edge set representing chemical bonds. Three features are provided for each node: atomic type, chirality, and hybridization. Two features are considered for each edge: bond type and bond direction. Bond types include Single, Double, Triple, and Aromatic, each reflecting a distinct configuration of electron sharing between atoms. Bond direction includes None, EndUpRight, and EndDownRight, primarily representing stereochemistry in double bonds. Each atom’s feature vector is embedded into a representation vector by a learnable encoder. Similarly, each edge’s feature vector is embedded into a representation vector of the same length by another learnable encoder. Then, a GNN model^36,37,38,39,40 uses the message passing mechanism to iteratively refine the representation of each node based on information from its neighbors and connected edges. This mechanism allows the learnt node representation to effectively capture structural context, reflecting the foundational principles of atomic interactions. Our implementation of the message passing mechanism is illustrated in Fig. 7. It iterates for a predefined number of layers L, propagating information throughout the graph. Consequently, each node gradually accumulates information from a wider neighborhood across successive layers. This allows the final representation of each node to capture both local and global structural information. Our model features 5 GNN layers, with an atomic embedding dimension of 512.
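The way successive layers widen each node's receptive field can be seen in a deliberately simplified Python sketch. It is an assumption-laden toy, not the layer shown in Fig. 7: it uses plain sum aggregation with no learnable weights or nonlinearity, and the function name `message_passing_layer` is hypothetical.

```python
def message_passing_layer(node_feats, edges, edge_feats):
    """One simplified message passing round on an undirected graph.

    Each node receives, from every neighbor, the neighbor's feature vector
    plus the feature vector of the connecting edge, sums these messages,
    and adds the sum to its own features.
    """
    dim = len(next(iter(node_feats.values())))
    aggregated = {v: [0.0] * dim for v in node_feats}
    for (u, v), e in zip(edges, edge_feats):
        for src, dst in ((u, v), (v, u)):  # bonds carry messages both ways
            for k in range(dim):
                aggregated[dst][k] += node_feats[src][k] + e[k]
    return {
        v: [node_feats[v][k] + aggregated[v][k] for k in range(dim)]
        for v in node_feats
    }


# Three atoms in a chain 0-1-2; only atom 0 starts with a non-zero feature.
feats = {0: [1.0], 1: [0.0], 2: [0.0]}
edges = [(0, 1), (1, 2)]
edge_feats = [[0.0], [0.0]]

after_one = message_passing_layer(feats, edges, edge_feats)
after_two = message_passing_layer(after_one, edges, edge_feats)
```

After one layer, atom 2 has not yet received any signal from atom 0, because the two are two bonds apart; after two layers it has. With 5 layers, as in the model described above, each atom's representation can draw on atoms up to five bonds away.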

**Fig. 7: Illustration of message passing and node representation updates in a GNN layer.**

Solvent encoder

Since the solvent has a profound impact on NMR chemical shifts, we incorporated a trainable solvent encoder component into our model to capture this influence. We identified 9 principal solvent groups based on their prevalence in our dataset and domain-specific understanding of their distinct impacts on NMR shifts. These groups are trichloromethane, dimethyl sulfoxide, acetone, acids, benzene, methanol, pyridine, water, and an additional category that covers any unspecified solvents in our dataset (termed “unknown”). The solvent encoder transforms each discrete solvent group i into a unique, dense feature vector \(\mathbf{S}_i^d\), where d is the embedding dimension. These learnable vectors are optimized alongside other model parameters during training, resulting in representations that reflect the impact of each solvent class. Given the different sensitivities of carbon (C) and hydrogen (H) nuclei to solvent environments, different embedding dimensions d can be chosen to tailor the solvent effect modeling for each type of nucleus. A larger embedding dimension d allows the embedding to capture the nuanced influence of solvents on NMR shifts more effectively. In our implementation, the solvent embedding dimension for hydrogen (H) is set to 32.
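A solvent encoder of this kind is essentially a lookup table with one trainable vector per solvent group. The sketch below shows only the lookup, with random initial values standing in for learned ones; the function names and the choice to route unlisted solvents to the “unknown” group are assumptions for illustration.

```python
import random

# The nine solvent groups named in the text.
SOLVENT_GROUPS = [
    "trichloromethane", "dimethyl sulfoxide", "acetone", "acids", "benzene",
    "methanol", "pyridine", "water", "unknown",
]


def init_solvent_embeddings(dim, seed=0):
    """Create one dense vector of length dim per solvent group.

    In a trained model these values would be learned; here they are
    random placeholders.
    """
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 1.0) for _ in range(dim)] for name in SOLVENT_GROUPS}


def embed_solvent(table, solvent):
    """Look up a solvent's embedding, falling back to the 'unknown' group."""
    return table.get(solvent, table["unknown"])


table = init_solvent_embeddings(dim=32)  # 32 is the dimension used for hydrogen
water_vec = embed_solvent(table, "water")
other_vec = embed_solvent(table, "toluene")  # not a listed group
```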

Atomic NMR shift prediction

Finally, the embedding of each atom \(\mathbf{h}_v^{(L)}\) and the solvent embedding \(\mathbf{S}_i^d\) for each solvent class i are concatenated to produce a holistic representation of the atom within the context of its molecular structure and the given solvent. This combined representation is subsequently processed by an MLP network to predict the NMR shifts for the atom:

$$y_v = \mathrm{MLP}\left(\mathbf{h}_v^{(L)} \oplus \mathbf{S}_i^d\right) \tag{1}$$

where \(y_v\) is the predicted chemical shift of atom v, \(\mathbf{h}_v^{(L)}\) is the atom-level embedding produced by the GNN, \(\mathbf{S}_i^d\) is the solvent embedding of solvent class i, and ⊕ is the concatenation operation. By integrating solvent embedding and atomic embedding, the model effectively combines intrinsic molecular properties and solvent effects, enhancing its ability to predict atomic NMR shifts accurately.
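Eq. (1) amounts to joining the two vectors end to end and passing the result through a small feed-forward network. The following pure-Python sketch uses the dimensions quoted in this paper (512 for the atom embedding, 32 for the solvent embedding, hidden sizes 128 and 64). The ReLU activation, the random weights, the single scalar output, and all function names are assumptions of this example.

```python
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, bias) for a dense layer with random placeholder weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply a dense layer: one dot product plus bias per output unit."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def predict_shift(h_v, s_i, layers):
    """Compute y_v = MLP(h_v concatenated with s_i)."""
    x = list(h_v) + list(s_i)  # list concatenation plays the role of the ⊕ operator
    if len(x) != len(layers[0][0][0]):
        raise ValueError("concatenated input does not match the first layer")
    for layer in layers[:-1]:
        x = [max(0.0, v) for v in linear(x, layer)]  # ReLU (assumed activation)
    return linear(x, layers[-1])[0]


rng = random.Random(0)
layers = [init_layer(512 + 32, 128, rng), init_layer(128, 64, rng), init_layer(64, 1, rng)]
h_v = [rng.gauss(0.0, 1.0) for _ in range(512)]  # stand-in atom embedding
s_i = [rng.gauss(0.0, 1.0) for _ in range(32)]   # stand-in solvent embedding
y_v = predict_shift(h_v, s_i, layers)
```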

Two separate MLP modules are used for predicting the ¹³C and ¹H shifts in the cross peak predictions, respectively. Each C atom can bond to up to 4 H atoms. When bonded to one, three, or four H atoms, a C atom typically shows only one cross peak in an experimental spectrum. However, when a C atom is connected to two H atoms, up to two cross peaks may be observed, depending on the chiral center. Consequently, a C atom can exhibit at most two ¹³C–¹H cross peaks. In light of this observation, one MLP module is dedicated to predicting the ¹³C shifts and another MLP module to the corresponding ¹H shifts. For cross peak predictions, the ¹³C shifts are predicted using the embeddings of C atoms. The corresponding ¹H shift predictions for each C atom incorporate aggregates of embeddings from all bonded H atoms, resulting in two predictions that are typically very similar when only one cross peak is theoretically possible. This design improves the accuracy of ¹H shift predictions by leveraging the C atom-centered aggregation of the H atom context, which yields a more detailed mapping of hydrogen environments in complex HSQC spectra. In our implementation, we used 2 MLP layers with hidden dimensions of 128 and 64, respectively. The dropout mechanism, which randomly deactivates a subset of neurons during the forward pass, is employed during training to prevent overfitting. By reducing the model’s dependence on specific neurons, dropout encourages the model to learn more robust and generalized patterns. During inference, dropout is typically disabled to ensure deterministic predictions. However, when enabled during inference, dropout can be leveraged to estimate uncertainty by calculating the standard deviation across multiple predictions¹⁸. 
An example of this uncertainty estimation is provided in Supplementary Information, Section 4, Supplementary Table 1.
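The dropout-based uncertainty estimate can be sketched with a toy predictor: dropout stays active at inference, the prediction is repeated many times, and the spread of the results is reported. The linear predictor, the dropout rate, the sample count, and the function names below are assumptions for illustration and do not reproduce the model's actual settings.

```python
import random
import statistics


def dropout(x, p, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]


def mc_dropout_predict(features, weights, p=0.1, n_samples=100, seed=0):
    """Repeat a stochastic forward pass and return (mean, standard deviation).

    The 'model' here is a single dot product, standing in for a full network.
    """
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_samples):
        dropped = dropout(features, p, rng)
        predictions.append(sum(w * v for w, v in zip(weights, dropped)))
    return statistics.mean(predictions), statistics.stdev(predictions)


features, weights = [1.0, 2.0], [0.5, 0.25]
mean_off, std_off = mc_dropout_predict(features, weights, p=0.0)
mean_on, std_on = mc_dropout_predict(features, weights, p=0.5, n_samples=200)
```

With dropout switched off (`p=0.0`) every pass returns the same value and the standard deviation is zero; with dropout on, the standard deviation becomes a rough indicator of how sensitive the prediction is to individual units.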

Training strategy

The cross peaks are notably sparse in an HSQC spectrum, where typical resolutions for ¹³C and ¹H shifts are 0.1 and 0.01 ppm, respectively. A typical HSQC spectrum can include 20,000 readings, covering ¹³C shifts from 0 to 200 ppm and ¹H shifts from 0 to 10 ppm. However, almost all of these readings are zeros, with only a small fraction representing the potential cross peaks of C–H bonds, crucial for molecular structure analysis. Moreover, the scarcity of annotated HSQC data, particularly the labor-intensive annotations that link cross peaks to C–H bonds, makes model training difficult. To deal with this issue, we deployed MTT to pre-train the model using an extensive annotated 1D NMR dataset (Fig. 1B). This step acclimates the model with a broad range of molecular structures and their chemical shifts, and enables it to capture the intricate interplay between molecular structures and their NMR characteristics. Subsequently, we utilize an unsupervised strategy to refine the model iteratively on the HSQC dataset (Fig. 1C). Through iterative cycles of prediction, annotation, and re-training, the model progressively enhances its understanding of the complex relationships and patterns within the HSQC spectra, thus improving its predictive accuracy and providing precise cross peak alignments. By combining the MTT and unsupervised transfer learning, we extend our annotation capabilities from 1D to 2D data, thereby enhancing the model’s predictive power and utility as a robust tool for NMR spectra analysis.

Pre-training on 1D NMR data

In the pre-training phase, we utilized approximately 24,000 annotated 1D NMR spectra. Among these, around 22,000 samples exclusively feature ¹³C shifts, approximately 400 samples solely exhibit ¹H shifts, and roughly 1600 samples contain both ¹H and ¹³C shifts. To train the model effectively for predicting both ¹H and ¹³C shifts, we adopt the MTT approach, which enables simultaneous training on multiple related tasks. When the input data contains ¹³C shifts, the model predicts only carbon shifts and assesses the errors between the predicted and actual values. Conversely, when the data sample contains ¹H shifts, the hydrogen shift prediction module is activated. In both scenarios, the embeddings of C and H atoms in the GNN module are updated simultaneously, benefiting from the message passing mechanism. Therefore, the learnt representations implicitly contain a basic understanding of C–H relationships, which is essential for the interpretation of HSQC data. However, the relative scarcity of ¹H shift data, due to the difficulties in accurately obtaining and extracting ¹H peaks from experimental data, complicates the training process, as focusing extensively on one type of shift could compromise the model’s ability to accurately predict the other. To handle this problem in the MTT training, we over-sampled the data that contain both ¹H and ¹³C shifts and the data that contain only ¹H shifts. The C–H relationships learned in this way streamline the transition to HSQC cross peak predictions, thereby enhancing the model’s accuracy and efficiency in analyzing HSQC spectra.
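The two ingredients described above, a loss that only scores the shift types a sample actually has labels for and over-sampling of samples with ¹H labels, can be sketched as follows. The use of an absolute-error loss, the summing of the two task losses, the over-sampling factor, and all names are assumptions of this example; the text does not specify them.

```python
def multitask_l1_loss(pred_c, pred_h, true_c=None, true_h=None):
    """Sum of per-task mean absolute errors, skipping tasks without labels."""
    loss, has_label = 0.0, False
    for pred, true in ((pred_c, true_c), (pred_h, true_h)):
        if true is not None:
            loss += sum(abs(p - t) for p, t in zip(pred, true)) / len(true)
            has_label = True
    if not has_label:
        raise ValueError("sample has neither 13C nor 1H labels")
    return loss


def oversample(samples, factor):
    """Repeat every sample that carries 1H labels `factor` times."""
    out = []
    for sample in samples:
        repeats = factor if sample.get("true_h") is not None else 1
        out.extend([sample] * repeats)
    return out


# Illustrative values only.
carbon_only = multitask_l1_loss([10.0, 20.0], [1.0], true_c=[11.0, 18.0])
both_tasks = multitask_l1_loss([10.0, 20.0], [1.0], true_c=[11.0, 18.0], true_h=[1.5])
dataset = [{"true_c": [11.0], "true_h": None}, {"true_c": None, "true_h": [1.5]}]
balanced = oversample(dataset, factor=3)
```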

Unsupervised fine-tuning on HSQC data

The model pre-trained on the 1D NMR dataset has limited ability to predict HSQC cross peaks from molecular structures due to differences in data pre-processing and data distribution. First, in the 1D NMR data from nmrshiftdb2, the chemical shifts of non-singlet peaks are averaged as ground truth. For example, whether the group is methine (-CH), methylene (-CH₂), or methyl (-CH₃), the proton shifts may be averaged into a single value. In contrast, HSQC data captures C–H bonds and typically displays two cross peaks for prochiral methylene groups due to the different environments of the two hydrogens. Additionally, the proton chemical shifts of HSQC or HMQC cross peaks represent ¹³C-bound protons, whereas the signals in the 1D ¹H spectrum mainly represent ¹²C-bound protons, potentially leading to subtle differences in chemical shift values⁴¹. Second, the molecule distributions in our 1D NMR data and HSQC data differ significantly (see Supplementary Information, Section 3, Supplementary Fig. 10, for data distribution plots). The HSQC dataset comprises 76.34% small molecules and 90.33% non-saccharides, whereas the 1D NMR dataset contains 98.80% small molecules and 99.95% non-saccharides. Lastly, solvent information is not available in the 1D NMR dataset and is recorded as “unknown” in the modeling framework, whereas most molecules in the HSQC dataset are associated with known solvent environments. This makes the fine-tuning step essential for the success of our solvent-aware framework. However, the HSQC dataset is not annotated. In response, we implement an unsupervised training strategy (Fig. 1C), which iterates between (a) aligning the model’s cross peak predictions with the experimental observations to annotate the HSQC data and (b) using the newly acquired annotations to fine-tune the NMR prediction model, until convergence.
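The alternation between steps (a) and (b) can be sketched as a generic loop. This is only a schematic under stated assumptions: `predict`, `align`, and `fine_tune` are hypothetical callables supplied by the caller, and the stopping rule (pseudo-labels unchanged between rounds, or a round limit) is an assumed convergence criterion, since the text does not define one.

```python
def self_training(model, molecules, observed, predict, align, fine_tune, max_rounds=10):
    """Alternate pseudo-annotation and fine-tuning until the pseudo-labels stop changing.

    predict(model, molecule)          -> predicted cross peaks for one molecule
    align(predicted, peaks)           -> pseudo-labels for one molecule
    fine_tune(model, molecules, labs) -> updated model
    All three callables are placeholders for illustration.
    """
    previous = None
    labels = None
    for _ in range(max_rounds):
        # (a) align the current predictions with the experimental peaks
        labels = [align(predict(model, mol), peaks)
                  for mol, peaks in zip(molecules, observed)]
        if labels == previous:
            break  # assumed convergence test: annotations are stable
        # (b) fine-tune on the newly acquired pseudo-labels
        model = fine_tune(model, molecules, labels)
        previous = labels
    return model, labels
```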

Pseudo-annotation of HSQC

At the end of each round in the unsupervised learning process, the model’s predicted signals are aligned with the experimental observations to create pseudo-labels. In straightforward cases where the number of C–H bonds in a molecular graph matches the observed HSQC cross peaks, the Hungarian algorithm^42,43 is used. This classic optimization technique solves assignment problems by minimizing the cost of matching a set of predictions to a set of observations. In the context of NMR analysis, the “cost” is defined as the discrepancy between the predicted chemical shifts and the actual shifts observed experimentally. By systematically reducing these differences, the Hungarian algorithm achieves an optimal one-to-one correspondence between predicted shift pairs and experimental signals, even in complex scenarios with potential signal overlap.
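For the equal-count case described above, a minimal sketch of the one-to-one alignment can be written with SciPy's `linear_sum_assignment`, which solves the same assignment problem that the Hungarian algorithm addresses. The cost used here, a scaled sum of absolute ¹³C and ¹H differences, and the scale factors `c_scale` and `h_scale` are assumptions; the text only states that the cost is the discrepancy between predicted and observed shifts.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_align(pred, obs, c_scale=10.0, h_scale=1.0):
    """Match predicted (C, H) shift pairs one-to-one with observed cross peaks.

    Returns a dict mapping each predicted index to an observed index.
    The scaled L1 cost and the scale values are illustrative assumptions.
    """
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    # Pairwise shift differences, scaled so both axes contribute comparably.
    d_c = np.abs(pred[:, None, 0] - obs[None, :, 0]) / c_scale
    d_h = np.abs(pred[:, None, 1] - obs[None, :, 1]) / h_scale
    cost = d_c + d_h
    # Minimum-cost one-to-one assignment between rows and columns.
    rows, cols = linear_sum_assignment(cost)
    return dict(zip(rows.tolist(), cols.tolist()))
```

As a usage example, `hungarian_align([(128.0, 7.3), (30.0, 1.5), (55.0, 3.8)], [(54.6, 3.75), (29.5, 1.6), (128.4, 7.26)])` pairs each predicted shift with its nearby observed peak regardless of the order in which the peaks are listed.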

However, in most cases, the number of C–H bonds within a molecule exceeds the number of signals recorded, making peak alignment more difficult. This mismatch arises from several factors: firstly, rotational equivalence can reduce the number of signals, with a single peak representing all three C–H bonds of a methyl group; secondly, symmetrical molecular structures can result in a single detectable signal for multiple symmetric C–H bonds, as in benzene, where only one peak represents all six C–H bonds; lastly, in highly complex molecules, overlapping signals obscure some peaks, reducing the detectability of individual C–H bonds in experiments.

To overcome this issue, we utilize the graduated assignment algorithm^16,44, which facilitates matching between graphs of different node counts, making it particularly suitable for this scenario. In this algorithm, our model’s predicted C–H shifts {(C^i, H^i)}_{i=0}^{N} and the observed C–H signals {(C^j, H^j)}_{j=0}^{M} of each molecule are conceptualized as points on a 2D plane, where N and M are the numbers of predicted and observed C–H shifts, respectively. These points are then treated as vertices in two fully connected graphs, G₁ for predicted shifts and G₂ for observed signals. The similarity between nodes is defined as the inverse of the difference between a predicted chemical shift (node in G₁) and an observed shift (node in G₂). Specifically, for each predicted shift, we compute its difference with every observed shift, where a smaller difference indicates a higher similarity. To derive the assignment matrix A, where each element A_uv ∈ {0, 1} indicates whether node u in G₁ matches node v in G₂, the algorithm first finds a soft matching matrix that relaxes the binary constraint A_uv ∈ {0, 1} to the continuous range [0, 1], then converts it into a hard assignment in a greedy way, enabling one-to-many matching.
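The soft-then-hard matching idea can be sketched in a heavily simplified form. This sketch is an assumption-laden illustration rather than the graduated assignment algorithm itself: it uses node similarities only, omits the annealing schedule and edge-compatibility terms of the full method, obtains the soft matrix by row normalization, and uses a two-pass greedy rule chosen here for clarity. The scale factors and `eps` are hypothetical.

```python
def soft_matching(pred, obs, c_scale=10.0, h_scale=1.0, eps=1e-6):
    """Soft matrix with entries in [0, 1]: row-normalized inverse shift differences."""
    soft = []
    for pc, ph in pred:
        sims = [1.0 / (abs(pc - oc) / c_scale + abs(ph - oh) / h_scale + eps)
                for oc, oh in obs]
        total = sum(sims)
        soft.append([s / total for s in sims])
    return soft

def greedy_hard_assignment(soft):
    """Convert the soft matrix to a hard, possibly many-to-one, assignment.

    Pass 1 walks entries from highest to lowest score and gives each observed
    peak at most one predicted bond. Pass 2 sends every remaining predicted
    bond to its best-scoring peak, so one peak may explain several bonds.
    """
    n, m = len(soft), len(soft[0])
    entries = sorted(((soft[u][v], u, v) for u in range(n) for v in range(m)),
                     key=lambda e: -e[0])
    match = [None] * n
    used = set()
    for _, u, v in entries:
        if match[u] is None and v not in used:
            match[u] = v
            used.add(v)
    for u in range(n):
        if match[u] is None:
            match[u] = max(range(m), key=lambda v: soft[u][v])
    return match
```

For a methyl group plus one aromatic C–H, for instance, the three methyl predictions would all be assigned to the single observed methyl peak, while the aromatic prediction keeps its own peak.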