An active representation learning method for reaction yield prediction with small-scale data

Introduction

The optimization of chemical reactions is a fundamental task with numerous important applications in synthetic chemistry1,2. For example, we often need to introduce new structures into classic reactions, such as the Buchwald–Hartwig coupling, to discover novel molecules with new functions3. However, the actual performance of these new structures is usually unreported and not easy to predict. In the traditional optimization process, chemists usually need to consider a “reaction space” that contains a set of reaction combinations over several critical conditions, such as different catalysts, ligands, additives, solvents and other components (Fig. 1A). In the well-known Suzuki coupling dataset provided by Pfizer’s group4, for example, the size of the reaction space is 5760 = 15 × 12 × 8 × 4. As an important optimization objective, reaction yield has been widely studied for evaluating experimental performance5,6,7,8,9,10, since it reflects the quality of a reaction and can reveal underlying principles in chemistry. Therefore, when studying a new reaction system, there is a pressing need to understand the patterns of reaction yield and to explore high-yield reaction combinations. However, this process often takes a large amount of experimental time, and its outcome heavily relies on the expertise of the experimenter. As a consequence, some potentially viable reaction conditions are very likely to be overlooked. For example, Buchwald’s group recommended only a limited range of conditions for the Buchwald–Hartwig coupling in their early research11,12, but they discovered a series of important new combinations in a follow-up study13.

Fig. 1: Reaction yield prediction framework for small-scale data.

A Traditional yield optimization. B The overall framework of our model with RS-coreset. The yield prediction is achieved via an iterative procedure, where each iteration includes 3 steps. Step 1 (“yield evaluation”): chemists evaluate the yields of the selected reactions. Step 2 (“representation learning”): the representation learning model is updated with the newly added experimental data. Step 3 (“data selection”): the most informative reactions in the reaction space are selected, guided by our coreset method. After several iterations, the representation of the reaction space becomes stable, and we build the final yield prediction model upon it.


The emerging high-throughput experimentation (HTE) technology, which can run a large number of reactions in parallel14,15, has attracted much attention for accelerating the traditional reaction optimization process16,17. HTE can significantly improve experimental efficiency and reduce the workload of exploring a new reaction system. Its favorable parallelization can also generate a sufficiently large amount of experimental data to support reaction optimization, and thus greatly decreases the chance of missing high-yield reaction combinations.

To effectively utilize the experimental data generated by HTE, several machine learning (ML) based methodologies have been proposed. Machine learning is an active research subarea of artificial intelligence18. In general, its goal is to learn an effective model from observed data, where the model can then be applied to various tasks, e.g., classification and regression. Machine learning techniques have been successfully applied to many fields of scientific research19,20,21,22,23. For example, machine learning-assisted methods can address a variety of chemical tasks such as molecular design24,25, reaction prediction26,27,28, retrosynthetic analysis29,30, reaction condition optimization31,32, and selectivity prediction33,34. In the area of yield prediction, Doyle’s group provided an open-source chemical reaction optimization tool based on Bayesian methods, validated through experiments involving palladium-catalyzed C–N coupling, a C–H functionalization system, the Mitsunobu reaction, and the deoxyfluorination of alcohols6. Recently, Denmark’s group designed a machine learning tool to predict substrate-adaptive conditions for palladium-catalyzed C–N couplings, which can be used to optimize reaction combinations within a large reaction space of 450,000 possible reactions17.

Most current studies on reaction yield prediction rely on large amounts of data provided by HTE equipment. While powerful, HTE is costly and thus unaffordable for most research laboratories. Consequently, limited data and budget unfortunately hinder many chemists from leveraging ML methods to predict reaction yields or guide reaction screening. Our major task in this paper is therefore to design a method that provides guidance for yield prediction and optimization with small-scale experimental data. Through this method, we aim to interpret the entire reaction space and obtain an overview of the yields across the space, which could help us discover potential reaction pathways that might otherwise be overlooked.

Our idea is to design an effective sampling strategy to approximate the reaction space. Moreover, since our focus is on scenarios with limited experimental resources and time budget, the sampling strategy should be carefully crafted and economical in sample size. Several existing machine learning techniques are built upon small-size data, e.g., active learning35,36 and few-shot learning37,38. The recent method “DeepReac+”39 combines representation learning with active learning to reduce the number of experiments on three public HTE datasets of Buchwald–Hartwig C–N coupling, Suzuki–Miyaura C–C coupling and asymmetric N,S-acetal formation reactions4,5,34, but it still requires about 30–50% of the reaction combinations for training. Few-shot learning usually requires a pre-trained meta model to support learning a new task with small-size data. Nevertheless, a universal meta-training dataset for chemical reaction prediction is not easy to obtain. The dataset extracted from US Patent Office (USPTO) records is the largest chemical reaction dataset and is commonly used for reaction prediction40. But due to its imperfect data quality and weak correlations among reactions, recent research found it hard to build an effective universal machine learning model on it7,41. Moreover, even within a system of similar reactions, it is still challenging to transfer domain knowledge from one reaction to another, because the representations of the compounds involved can vary significantly under current molecular descriptors. These variations may lead to different predictions by ML models, making it difficult to apply knowledge gained from one reaction directly to another, even if they appear similar at a high level42.

Our initial idea comes from an algorithmic technique called the “coreset”, which is widely used for approximating large-scale datasets in machine learning43,44,45. Roughly speaking, a coreset is a small representative subset sampled from the original data; we can use this coreset to approximate the whole dataset, and thus the total computational cost can be greatly reduced. Since a coreset can compress large-scale data, a natural question is whether this approach can also be applied to approximate a large reaction space in chemistry. To answer this question, we propose a coreset-based reaction space approximation technique called the “RS-coreset”, as shown in Fig. 1B. From a high-level perspective, our framework follows the traditional active learning manner: in each iteration, we query a small sample from the reaction space to evaluate their yields, update our prediction model, and then proceed to the next iteration; after several such iterations, the model becomes stable and outputs the final prediction result. A key difference from most previous active learning approaches is that we utilize the RS-coreset technique to guide the query, while the queried yield information effectively helps us learn a multi-source representation for the reaction space. We argue that this representation can significantly enhance the yield prediction performance and reduce the overall query complexity (e.g., in our experiments, querying only 5% of the instances of the reaction space achieves promising prediction results).

To shed some light, we illustrate our result on the public Buchwald–Hartwig (B-H) coupling dataset5 as an example in Fig. 2. We achieve promising prediction results by constructing the RS-coreset for the large B-H reaction space of 3955 reaction combinations. For example, over 60% of the predictions have absolute errors below 10%, while we only query the yields of 5% of the reaction combinations. We then extend the method to the Suzuki–Miyaura (S-M) reaction system and also achieve relatively accurate predictions for the yields of 5760 reaction combinations. A new B-H dataset provided by Denmark’s group17 is also used as an out-of-sample test for our model. To further validate our method, we apply it in a realistic experimental scenario of Lewis base-boryl radicals enabled dechlorinative coupling reactions. The method helps us effectively predict the yields and even discover several feasible reaction combinations that were previously perceived as non-reactive46.

Fig. 2: Prediction results on B-H dataset by RS-Coreset.

A The initial representation of the reaction space by the traditional fingerprint “MorganFP”. We use the visualization tool t-SNE, where the color of each instance represents the observed yield recorded in the public dataset. High-yield and low-yield combinations are mixed in this initial representation space. B The new representation of the reaction space generated by our RS-coreset, where the instances of the coreset are marked by red stars (the coreset accounts for only 5% of the full dataset). In this new representation, most of the high-yield combinations are located on the right and most of the low-yield combinations on the left; that is, there is a strong correlation between yield and spatial location. C The predicted yields of our model trained on the new representation space of (B). The x and y axes are the dimensions produced by 2D t-SNE on the reaction space (the same as in (A) and (B)); the z axis is the corresponding predicted reaction yield. D Absolute errors of the prediction (i.e., the absolute value of the difference between the predicted and observed yields). The percentages of predictions with absolute error below 5%, 10%, and 15% are 38.1%, 62.0%, and 77.4%, respectively.


Results and discussion

Model development

Chemists often choose only a small subset of reaction combinations to conduct experimental campaigns, so it is challenging to fully understand the entire reaction space. In this work, we construct a small-size RS-coreset as an approximation of the reaction space (“RS” is short for “Reaction Space”). In this way, the prediction model only needs to be trained with the yield information of this RS-coreset; in other words, the experimenter just needs to evaluate a small set of yields rather than explore the full reaction space.

From the algorithmic perspective, constructing the RS-coreset is an iterative procedure, and each iteration needs to consider two problems: which combinations to pick, and how to represent them in an appropriate metric space. These two problems are closely related and interact with each other: a high-quality selection of combinations can improve the representation of the reaction space, and in turn, an accurate representation can effectively guide the selection of combinations in the next iteration. After several iterations, we complete the construction of the RS-coreset and obtain the yield prediction model for the whole reaction space.

The complete framework is depicted in Fig. 1B. When considering a new type of reaction, we first predefine the scopes of reactants, products, additives, catalysts, and so forth. The reaction space is then constructed, and we initially select a small set of reaction combinations, uniformly at random or based on prior knowledge (e.g., the literature on similar reactions or the expertise of the experimenter), for which the experimenter evaluates the corresponding yields. Subsequently, our framework iteratively conducts representation learning and data selection. In each iteration, the framework updates its prediction of the yield distribution and guides us to select a set of new reaction combinations for evaluation. Specifically, each iteration of the framework contains the following three steps:

(i) Yield evaluation: The chemist performs the experiments on the selected reaction combinations and records their yields.

(ii) Representation learning: The model updates the representation space using the yield information obtained from the experiments conducted in step (i).

(iii) Data selection: Based on a max coverage algorithm, the model selects a set of new reaction combinations that are the most instructive, and returns them to step (i) for the next iteration.

After running the above procedure for several rounds, a random forest regression model can be trained on the learned representation space.
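To make the loop concrete, here is a minimal Python sketch of the procedure under the budgets used later in the paper (a 2.5% initial sample followed by four selection rounds of 0.625% each). The helpers evaluate_yields, update_representation, and select_by_max_coverage stand in for steps (i)–(iii); they are hypothetical names for illustration, not the authors' released API.

```python
import numpy as np

def build_rs_coreset(space, evaluate_yields, update_representation,
                     select_by_max_coverage, init_frac=0.025,
                     iter_frac=0.00625, n_iters=4, seed=0):
    """Iterative RS-coreset construction following steps (i)-(iii) above."""
    rng = np.random.default_rng(seed)
    n = len(space)
    # Initialization: a small uniform sample (or expert-chosen reactions).
    q_idx = list(rng.choice(n, size=int(init_frac * n), replace=False))
    yields = dict(zip(q_idx, evaluate_yields(q_idx)))              # step (i)
    repr_space = space
    for _ in range(n_iters):
        repr_space = update_representation(space, yields)          # step (ii)
        new_idx = select_by_max_coverage(repr_space, q_idx,
                                         budget=int(iter_frac * n))  # step (iii)
        yields.update(zip(new_idx, evaluate_yields(new_idx)))      # step (i) again
        q_idx += list(new_idx)
    return q_idx, yields, repr_space
```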

To illustrate our workflow, we take Pd-catalyzed Buchwald–Hartwig (B-H) couplings as the test example. B-H couplings are among the most important C–N bond forming reactions, but searching for high-yield combinations across the full reaction space in routine synthesis campaigns is challenging47,48. In this reaction space, the number of combinations of available substrates, catalysts, solvents, etc., can reach several thousand. Moreover, B-H couplings are sensitive to reactant structure49, which further increases the challenge of determining high-yield combinations. In 2018, Doyle et al. used the HTE technique to develop a B-H dataset5, which comprises 3955 reaction records and provides high-throughput yield validation for the combinations of 15 aryl and heteroaryl halides, 4 ligands, 3 bases, and 23 isoxazole additives. As mentioned above, we consider the scenario in which HTE is unavailable, so only a small portion of the B-H dataset is used as our training data (set to 5% in this test). Yield prediction across the entire reaction space can be regarded as a regression task, so we use the classical regression criterion R2 to measure the predictive performance (the higher the better). Below, we elaborate on the development of our proposed method.

We first discuss how to use existing molecular fingerprints to build the representation. Several vectorized chemical descriptors and molecular fingerprints have been proposed in the past, such as the well-known MorganFP50, Mordred51, and AvalonFP52. Recently, fingerprints harnessing the capabilities of deep learning models have also been proposed53,54. Most previous machine learning based approaches choose only one particular descriptor or fingerprint, but this choice can seriously influence the yield prediction performance. Figure 3A visualizes the two representation spaces of the B-H dataset obtained from MorganFP and Mordred, respectively. Each representation reveals a certain degree of clustering (the instances form a number of small dense clusters), which suggests possible hidden structures in the B-H reaction space. In addition, because they are generated from different sources of information, the two representations look quite different in the figure. These observations inspire us to integrate different representations to form a comprehensive model. We believe this approach is more effective for extracting multi-source physicochemical and structural information, which in turn provides important guidance for designing our sampling algorithm in the reaction space.

Fig. 3: The representation space of the B-H dataset visualized by t-SNE.

The color of each point indicates the yield of the corresponding reaction combination. A The initial representations of Mordred (top) and Morgan (bottom). B The two new representations produced by the fusion method with the yield information. The gray dashed line indicates the evolution of the representations during training. Through the fusion, the top and bottom representations become quite similar, and the low-yield and high-yield combinations are clearly distinguished in both representation spaces. C The fusion performed without taking the yield information into account. Similar to (A), the low-yield and high-yield combinations remain mixed within small clusters.


Our representation learning approach is partly inspired by the soft clustering method of Xie et al.55. In a nutshell, their approach computes a soft clustering distribution based on the given features and a specially designed target distribution, and minimizes the Kullback–Leibler (KL) divergence between these two distributions by training a new representation space in a deep neural network. We generalize this approach to the case of two representations, where we optimize the KL divergence between their soft clustering distributions. In our experiments, we find that integrating two different representations, say one derived from structure and the other from physical chemistry, often yields better performance. When the two representations are trained simultaneously, their difference gradually decreases, and a balanced new representation is eventually obtained.

However, different representations may be quite diverse, and simply minimizing their KL divergence cannot guarantee a satisfying balanced representation. To illustrate this issue, we conduct an experiment on the B-H dataset with MorganFP and Mordred: we randomly sample 5% of the reactions for training and perform random forest regression on the new representation space to predict the yields of the remaining 95% of reactions. The experiment is repeated 10 times. The R2 scores of using only MorganFP or Mordred are 0.633 and 0.598, respectively (“Morgan” and “Mordred” in Fig. 4). In contrast, the direct integration of the two representations yields an even lower score of 0.408 (“Simple Fusion” in Fig. 4). This phenomenon motivates us to design a more sophisticated integration method.

Fig. 4: The R2 values on the prediction for unseen reactions by different methods.

“Mordred” and “Morgan” apply the corresponding original representations for regression. “Simple Fusion” directly integrates Mordred and MorganFP without using the yield information. “Yield Guided” and “RS-coreset” both apply the yield information to guide the representation fusion.


Our first attempt: yield guided fusion

Revisiting Fig. 3A, we can see that many reactions with inconsistent yields are mixed within the clusters; that is, there is no clear correlation between spatial location and yield in the representation spaces. Therefore, our first attempt is to take advantage of the yield information to guide the representation fusion. Suppose we already have a small set Q of combinations with known yields; we then add a penalty term to the optimization objective such that the combinations of Q with similar yields are encouraged to be located close to each other in the representation space. The guidance applied to Q also implicitly influences the representation of the other reaction combinations through the minimization of the overall KL divergence. Figure 3B illustrates the two representation spaces trained with yield-information guidance (for comparison, Fig. 3C shows the two representations trained without yield information). The gray dashed lines in Fig. 3 further exhibit the evolution of the representation spaces during training. With this strategy, the integrated representation reveals a significant correlation between yield and spatial location. We apply this training strategy to the aforementioned experiment on the B-H dataset with MorganFP and Mordred; the average R2 value of the predictive model grows to 0.734 (“Yield Guided” in Fig. 4), a substantial improvement over the previous results.

From yield guided fusion to RS-coreset

Now, we consider constructing our RS-coreset based on the integrated representation learning method developed above. In the previous tests, we simply use uniform random sampling to extract the set Q for yield evaluation (i.e., the experimenter conducts the reaction combinations of Q to test their real yields). But the experimental performance is not quite stable, as shown in Fig. 4 (“Yield Guided”), where the R2 score ranges from 0.689 to 0.773. To enhance the stability, a straightforward idea is to enlarge the sample size. Nonetheless, because we consider the scenario in which HTE is unavailable, the affordable number of reaction combinations for sampling should be relatively small.

To resolve this issue, we develop an iterative sampling method for generating Q (i.e., our proposed RS-coreset construction method). Initially, we construct the set Q by uniformly sampling a small set from the reaction space. We then augment Q in an economical way (because we cannot add too many samples to Q due to the lack of HTE). As shown in Fig. 3B, the learned representation reveals a strong relationship with the yields: combinations located close to each other usually have similar yields. Therefore, we can approximately predict the yield of a reaction combination based on its neighbors in the space. Inspired by this geometric observation, we augment Q by using covering balls in the reaction space (Fig. 5). Suppose we have a sampling budget $s > 0$ and a fixed radius $r > 0$; we then try to cover as much of the representation space as possible by using $s + |Q|$ balls of radius $r$, where the ball centers comprise the combinations of Q and the $s$ newly selected combinations. We apply the max coverage algorithm56 to implement this idea, with the details given in the “coreset construction algorithm” part of the Methods section. Intuitively, these selected $s$ ball centers can be viewed as the most informative subset to supplement Q (this step corresponds to “data selection” in Fig. 1B). After the selection, we return these $s$ combinations to the experimenter for yield evaluation (the “yield evaluation” step in Fig. 1B), add them to Q, and use the obtained yields to update our representation (the “representation learning” step in Fig. 1B). We repeat this strategy for several iterations to augment the set Q and eventually obtain our RS-coreset.

Fig. 5: An illustration for different sampling methods.

The red dots indicate the set Q, i.e., the reaction combinations with known yields. The gray and blue dots indicate the remaining instances in the reaction space, where the blue dots denote the reaction combinations selected by the sampling method (i.e., the selected $s$ combinations). The number in the bottom left corner indicates the ratio of instances covered by the red and blue dots. A The result of the max coverage method. B The result of uniform random sampling.


In the initial stage of our simulation, we randomly select 2.5% of the combinations in the whole reaction space of the B-H dataset. Then we alternately perform the representation learning and data selection for 4 iterations, with a sampling budget of 0.625% per iteration. The final coreset size is thus 5% of the reaction space, and we use it to predict the yields of the remaining 95% of the instances. As shown in Fig. 4, both the prediction performance and its stability are improved, as the average R2 increases to 0.772 with a narrower range (0.751–0.792) and lower variance (0.00014).

In the preliminary experiment designed above, the experimenter just needs to evaluate the yields of the reaction combinations contained in the RS-coreset, which is only 5% of the full reaction space. Together with the random forest regression prediction model, we can learn an approximate mapping from the reaction space to the yield domain, where the experimental load and time are significantly reduced.

Experiments of Buchwald–Hartwig coupling dataset

In the previous section, we conducted preliminary tests on the B-H dataset to validate our RS-coreset method. To further demonstrate the effectiveness of our approach, we compare the performance of RS-coreset with several recently published models trained on HTE datasets. Moreover, we also implement our approach in a realistic experiment on Lewis base-boryl radicals enabled dechlorinative coupling reactions. In addition to R2, we consider the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) as evaluation metrics (the lower the better). The results of MAE and RMSE are quite similar, so we only show the MAE results in this section and place the RMSE results in the extended data (Tables S1 to S3).

We present more experiments on the B-H dataset in this section. We try two different thresholds, 2.5% and 5%, for the coreset size, i.e., the budget for yield evaluations. For each setting, the experiment is repeated 10 times. We compare the results with several recently proposed yield prediction methods. YieldBERT7 is based on the transformer deep learning model “Bidirectional Encoder Representations from Transformers (BERT)”57; it treats the SMILES of reactions as strings and encodes them with BERT. YieldBERT-DA58 is an extension of YieldBERT with data augmentation, using the same representation method. Another baseline, Uncertainty-Aware, adapts a graph neural network (GNN)59 to encode molecules and exploits uncertainty-aware learning to make predictions60. For a fair comparison, we use the same budgets of 2.5% and 5% for these baselines. As shown in Fig. 6A, our model achieves better results than the baselines on both metrics (except for the average MAE at the 2.5% budget, which is slightly higher than the best baseline but has a smaller standard deviation).

Fig. 6: Performance on two public datasets.

A, B are the results on the Buchwald–Hartwig coupling HTE dataset and the Suzuki–Miyaura coupling HTE dataset, respectively. Higher R2 is better; lower MAE is better. “2.5/97.5” and “5/95” indicate training budgets of 2.5% and 5%, respectively. All data are presented as average ± standard deviation.


Experiments of Suzuki–Miyaura cross-coupling reaction

Suzuki coupling is a significant cross-coupling reaction in organic synthesis, which has been extensively studied since the first Suzuki-type cross-coupling reaction was published by Suzuki and Miyaura in 198161. In 2018, a Pfizer team developed a fully automated system for HTE screening and provided a Suzuki–Miyaura (S-M) HTE dataset consisting of 5760 reaction yields covering 15 pairings of electrophiles and nucleophiles, 12 ligands, 8 bases, and 4 solvents4. We conduct the experiment on S-M to study the transferability of our proposed method to a reaction system other than B-H.

As in the previous experiment, we set the training data size to 2.5% and 5% of all 5760 reactions (i.e., 144 and 288 reactions, respectively), and perform random forest regression to predict the yields of the remaining reactions. Initially, the model randomly selects half of the total training budget (72 reactions for 2.5%, and 144 reactions for 5%), and then samples the remaining reactions through our RS-coreset strategy over 4 iterations. For each setting, the experiment is repeated 10 times. It is worth noting that we replace the Mordred descriptor with AvalonFP because Mordred does not support some reaction compounds in the S-M dataset. We compare the performance with YieldBERT, YieldBERT-DA, Uncertainty-Aware, and DeepReac+39.

As seen in Fig. 6B, our model achieves more promising performance on both metrics, which suggests the potential transferability of our approach to different reaction systems. Compared with the results in Fig. 6A, all models (RS-coreset, YieldBERT, YieldBERT-DA and Uncertainty-Aware) perform slightly worse; one possible reason is that the S-M dataset is more complicated than the B-H dataset.

Experiments of B-H dataset provided by Denmark’s group

Recently, Denmark’s group released a machine-learning tool to predict substrate-adaptive conditions for Pd-catalyzed C–N couplings, building a diverse dataset of 3359 experiments (24 conditions and 121 reactant pairs) selected from 450,000 possible reactions (180 conditions and 2500 reactant pairs)17. They trained their machine learning models on these 3359 experiments. To evaluate the models, an out-of-sample test was conducted on a given set of 187 reaction yields. All of these 187 reactions involve one or both reactants that do not appear in the 3359 experiments; they are divided into 13 groups according to their products, labeled by the letters “a” to “m”.

We follow their setting to conduct our out-of-sample experiment. Since the training set is already given in this test (i.e., the 3359 experiments), we do not need to construct our RS-coreset, and only apply our representation learning method based on MorganFP and AvalonFP. It is worth emphasizing that an experimenter may not have the expertise to design useful molecular descriptors when confronting a new reaction prediction task. So, unlike the method proposed by Denmark’s group17, we did not add any specifically designed molecular descriptor to our training process on this new B-H dataset. Our goal here is to test the generalization of our prediction method when confronted with new molecular substructures. The results are shown in Fig. 7. Our model achieves results comparable to theirs for most of the products, and the average performance is also slightly better. The groups “i” to “m” are called “challenging extrapolations” (regarded as difficult to predict in their paper), as they contain new types of structures that may have different reactivity patterns from those in the training dataset. For example, group “l” represents the coupling of tert-butyl pyroglutamate with 2-(3-bromophenyl)-1,3-dioxolane; both reactants are out-of-sample, and the pyroglutamate (which contains an ester group) is substantially different from the molecules in the dataset. Our model achieves better performance on these groups (except “j”), indicating that our representation learning approach can enhance out-of-sample prediction capability to a certain degree.

Fig. 7: Performance on the B-H dataset provided by Denmark’s Group.

“a” to “m” are the testing groups, and “Average” indicates the average performance over all reactions.


A realistic experiment on Lewis base-boryl radicals enabled coupling reaction

In this part, we further validate the performance in a real-world yield optimization process for the Lewis base-boryl radicals enabled coupling reaction. We recently studied the Lewis base-boryl radicals promoted dechlorinative coupling reaction of activated trichloromethyl groups and alkenes46. That study outlines a straightforward three-step process utilizing easily accessible activated trichloromethyl groups as carbon sources, selectively functionalizing three C−Cl bonds to introduce three alkyl chains. Simple derivatization then affords diversely substituted quaternary carbon center products. The incorporation of various olefin traps in each step facilitates the convenient assembly of extensive product libraries with potential applications in drug synthesis.

In our experiment on the Lewis base-boryl radicals enabled coupling reaction, we consider a reaction space of 1920 possible reaction combinations, including 4 Lewis base-boranes, 20 pairs of reactants, 2 additives, 4 solvents, and 3 radical initiators (Fig. 8). We then construct an RS-coreset containing 93 reaction combinations (about 5% of the 1920 possible combinations). In each iteration of the coreset construction procedure, we select 10–15 reactions and evaluate their yields in our lab, where the selection is based on the guidance from our model and the expertise of the experimenter.

Fig. 8: Components of the reaction space of Lewis base-boryl radicals enabled coupling reaction.

There are 4 Lewis base-boranes, 20 pairs of reactants, 2 additives (with or without water), 4 solvents, and 3 radical initiators.


To validate the performance of the RS-coreset, we select 12 reaction combinations with high predicted yields (>50%) from the remaining 1827 reactions. We compare our predictions with the observed yields (Fig. 9A) and find that our model achieves impressive performance (MAE below 7.3%). Moreover, 10 of the 12 recommended reactions exhibit high observed yields in the chemical experiment. We also select 10 reaction combinations with low predicted yields (<50%) for further validation (Fig. 9B). In this test, all 10 reactions exhibit low yields, and the MAE is only 4.82%.

Fig. 9: Results of the 22 testing instances.

A 12 reactions predicted with high yields. B 10 reactions predicted with low yields. Reactants and conditions are presented on the vertical axes.


Interestingly, our method also discovers several feasible reaction combinations that were overlooked in previous studies. Based on the results reported in past research46, DMF was perceived as a non-effective solvent, since the reaction with a trichloroacetate reactant conducted in DMF gave no product (entry “1” in Table 1). Following the common strategy in synthetic chemistry experiments, other combinations containing DMF would not be subsequently attempted. However, we observe that this empirical prediction is not correct. After building our RS-coreset of 93 combinations, we run the max coverage algorithm again, and it returns 6 combinations that use DMF as the solvent, 5 of which are predicted to be feasible by our model. We evaluate their yields and present the results in Table 1 (entries “2” to “7”). Somewhat surprisingly, all 5 combinations are indeed feasible in practice, and 2 of them even have yields above 40%. This discovery strengthens our confidence in the approach for yield prediction and implies its potential for assisting yield optimization in synthetic chemistry.

Table 1 The overlooked reaction combinations using DMF

Conclusion

Reaction yield is an important optimization objective for evaluating experimental performance, because it reflects the quality of a reaction and can reveal underlying principles in chemistry. However, understanding the patterns of reaction yield and discovering high-yield reaction combinations often take a long experimental time. The process also heavily relies on the expertise of the experimenter, so some potentially viable reaction conditions are very likely to be overlooked.

In this article, we present an effective framework to predict yields across a reaction space, which is particularly useful in the scenario where HTE is unavailable, i.e., only a small experimental budget is affordable. Inspired by the coreset approach in machine learning, we address this problem by approximating the reaction space. Our proposed method is named “RS-coreset”; its construction algorithm is an iterative procedure consisting of deep representation learning and data selection from the reaction space. We utilize the RS-coreset technique to guide the query, while the queried yield information effectively helps us learn a representation for the reaction space. This representation can significantly enhance the yield prediction performance and reduce the overall query complexity.

To demonstrate its capability for yield prediction, we conducted preliminary tests on the B-H dataset and then extended our method to the S-M dataset for further validation. Following Denmark’s work, we also verified the out-of-sample prediction capability of our approach. Moreover, we exploited this framework to aid a real-world study of Lewis base-boryl radicals enabled dechlorinative coupling reactions, and the results demonstrate that our model can provide useful guidance for uncovering overlooked reaction conditions. As future work, we plan to validate the effectiveness of our framework on more complex reaction systems, guiding a broader range of synthesis work. Additionally, it is worth studying how to extend our method to predict other chemical properties that may be helpful in fields like the pharmaceutical industry.

Methods

Encoding for reaction

In organic reactions, there are typically various components involved, such as reactants, solvents, catalysts, ligands, and other elements. To encode a specific combination, we first calculate the molecular fingerprint of each component and subsequently concatenate them. More specifically, we express the encoding of a given experiment as $x_1 \oplus x_2 \oplus \cdots \oplus x_m$, where $x_i$ denotes the fingerprint of the i-th component and $\oplus$ denotes the concatenation operation. In our experiments, we employ several descriptors and fingerprints, including Mordred, MorganFP and AvalonFP. These descriptors and fingerprints can be generated from SMILES notations via Python libraries such as RDKit62.
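As a concrete illustration, the following sketch builds such a concatenated encoding with RDKit Morgan fingerprints; the component SMILES and the bit size are illustrative assumptions, and Mordred or AvalonFP encodings would be concatenated in the same way.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def encode_reaction(component_smiles, n_bits=1024):
    """Concatenate per-component Morgan fingerprints: x_1 (+) x_2 (+) ... (+) x_m."""
    parts = []
    for smi in component_smiles:
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
        parts.append(np.asarray(fp, dtype=np.float32))
    return np.concatenate(parts)

# e.g., a hypothetical combination of (aryl halide, amine, base, solvent)
x = encode_reaction(["Brc1ccccc1", "c1ccncc1", "CC(C)(C)[O-].[Na+]", "C1CCOC1"])
```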

Representation learning

As discussed above, we would like to mix two different molecular fingerprints or chemical descriptors in a learning manner to obtain a mapping from the original representation space to an improved representation space (the “representation learning” step in Fig. 1). For ease of presentation, we use AvalonFP and MorganFP as examples. Given a reaction combination x, let $x^{(A)}$ and $x^{(M)}$ denote its encodings under AvalonFP and MorganFP, respectively. To mix these two representations, we adopt two separate networks that map them to the same newly learned representation space. The network architecture is the classical feedforward neural network, and the outputs are denoted by $f_A(x^{(A)})$ and $f_M(x^{(M)})$.

The network training procedure incorporates ideas from DEC55. Initially, we perform k-means clustering on the yields and select k representative reaction combinations (i.e., the centroids) from the training set, denoted by μ1, …, μk. To obtain an appropriate value of k, we first select an initial value based on empirical knowledge (e.g., the square root of the number of combinations with known yields). Then, based on the clustering result, we automatically adjust k until the intra-cluster difference of the yields falls in a suitable interval (in our experiments, we set the range to [3, 5]).
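A possible implementation of this adjustment heuristic is sketched below; the exact definition of the intra-cluster yield difference is not specified in the text, so the mean absolute deviation used here is our assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_k(known_yields, lo=3.0, hi=5.0, max_tries=20):
    """Adjust k until the average intra-cluster yield spread falls in [lo, hi]."""
    y = np.asarray(known_yields, dtype=float).reshape(-1, 1)
    k = max(2, int(np.sqrt(len(y))))            # empirical initial value
    for _ in range(max_tries):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(y)
        # mean absolute deviation of yields within each cluster (our assumption)
        spread = np.mean([np.abs(y[labels == c] - y[labels == c].mean()).mean()
                          for c in range(k)])
        if spread > hi:
            k += 1                               # clusters too coarse
        elif spread < lo and k > 2:
            k -= 1                               # clusters too fine
        else:
            break
    return k
```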

Then, using the spatial information of the newly learned representation space, we calculate the soft clustering distribution of each reaction x, where each dimension indicates its similarity to centroid μj. Since AvalonFP and MorganFP correspond to two separate networks, resulting in two different representation spaces, we calculate the soft clustering distributions in these two spaces individually. We apply the Student’s t-distribution as the kernel to measure the similarity:

$$p_j^{(s)}(x)=\frac{\left(1+\big\lVert f_s(x^{(s)})-f_s(\mu_j^{(s)})\big\rVert^2\right)^{-1}}{\sum_{j'=1}^{k}\left(1+\big\lVert f_s(x^{(s)})-f_s(\mu_{j'}^{(s)})\big\rVert^2\right)^{-1}}\,,\quad s=\mathrm{A},\mathrm{M}.$$
(1)

$p_j^{(A)}(x)$ and $p_j^{(M)}(x)$ are the similarities in the representation spaces learned from AvalonFP and MorganFP, respectively. Thus, $P^{(s)}(x)=\big(p_1^{(s)}(x),\ldots,p_k^{(s)}(x)\big)$, $s=\mathrm{A},\mathrm{M}$, are the soft clustering distributions of x induced by the spatial information of the new representations. In the fusion, we want the outputs of both networks to be consistent for the same reaction combination. Therefore, we adopt the sum of the KL divergences between $P^{(A)}(x)$ and $P^{(M)}(x)$ over all reaction combinations x as part of the loss function, i.e.,

$$L_f^{(A)}=\sum_{x\in X}\mathrm{KL}\left(P^{(A)}(x)\,\middle\Vert\,P^{(M)}(x)\right)$$
(2)

and

$$L_f^{(M)}=\sum_{x\in X}\mathrm{KL}\left(P^{(M)}(x)\,\middle\Vert\,P^{(A)}(x)\right),$$
(3)

where X denotes the entire reaction space.
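For concreteness, the soft assignment of Eq. (1) and the fusion terms of Eqs. (2) and (3) can be computed as in the following PyTorch sketch, where the encoder outputs z = f_s(x^(s)) are assumed to come from ordinary feedforward networks (omitted here); this is a minimal sketch, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def soft_assign(z, centroids):
    """Student's t-kernel soft clustering distribution P^(s), Eq. (1).
    z: (n, d) embeddings f_s(x^(s)); centroids: (k, d) embedded centroids."""
    q = 1.0 / (1.0 + torch.cdist(z, centroids) ** 2)   # (n, k)
    return q / q.sum(dim=1, keepdim=True)

def fusion_losses(p_a, p_m):
    """L_f^(A) and L_f^(M), Eqs. (2)-(3).
    F.kl_div(log_q, p) computes KL(p || q), hence the argument order."""
    l_f_a = F.kl_div(p_m.log(), p_a, reduction="sum")  # KL(P^(A) || P^(M))
    l_f_m = F.kl_div(p_a.log(), p_m, reduction="sum")  # KL(P^(M) || P^(A))
    return l_f_a, l_f_m
```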

Moreover, we utilize the yield information of the training set to guide the fusion. For a reaction combination x, let $y(x)$ denote its observed yield if it has been evaluated. Given a centroid μj, the likelihood that instance x and μj belong to the same cluster is determined by the difference of their yields: the larger the difference, the lower the likelihood. We define this likelihood as

$$w_j(x)=\max\left\{2-\frac{1}{100}\big(y(x)-y(\mu_j)\big)^2,\ 0.0001\right\}.$$
(4)

Thus $w(x)=\big(w_1(x),\ldots,w_k(x)\big)$ can be viewed as the soft clustering distribution of x generated from the yield information. Note that $w(x)$ can only be computed for the combinations with known yields. To enhance the correlation between the yield distribution and the spatial distribution, we adopt the sum of the KL divergences between $P^{(s)}(x)$ and $w(x)$ over all reaction combinations x with known yields as the other part of the loss function, i.e.,

$$L_y^{(s)}=\sum_{x\in S}\mathrm{KL}\left(P^{(s)}(x)\,\middle\Vert\,w(x)\right),\quad s=\mathrm{A},\mathrm{M},$$
(5)

where S is the set of reaction combinations with known yields.

During the training process, the two networks corresponding to AvalonFP and MorganFP are trained separately and simultaneously. Both loss functions consist of two components:

$$L^{(s)}=L_f^{(s)}+L_y^{(s)},\quad s=\mathrm{A},\mathrm{M},$$
(6)

where Lf denotes the KL divergence of the clustering results between the two representations (the representation fusion term), and Ly denotes the KL divergence between the clustering results of a representation and the yields (the yield guidance term). In each iteration, we use backpropagation to simultaneously optimize both the representation space and the centroids.
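Continuing the sketch above, Eqs. (4)-(6) can be assembled as follows. Note that Eq. (4) alone does not define a normalized distribution, so renormalizing w(x) before taking the KL divergence is our assumption.

```python
import torch
import torch.nn.functional as F

def yield_distribution(y, y_centroids):
    """Yield-induced soft clustering distribution w(x), Eq. (4).
    y: (m,) observed yields of the combinations in S; y_centroids: (k,)."""
    w = torch.clamp(2.0 - (y[:, None] - y_centroids[None, :]) ** 2 / 100.0,
                    min=1e-4)
    return w / w.sum(dim=1, keepdim=True)   # normalization: our assumption

def total_loss(p_s, p_other, w, known):
    """L^(s) = L_f^(s) + L_y^(s), Eqs. (5)-(6), for one network s.
    `known` is a boolean mask marking the combinations in S."""
    l_f = F.kl_div(p_other.log(), p_s, reduction="sum")    # Eq. (2) or (3)
    l_y = F.kl_div(w.log(), p_s[known], reduction="sum")   # Eq. (5)
    return l_f + l_y
```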

Coreset construction algorithm

As shown in Fig. 5, the sampling method is crucial in the coreset construction. We run the max coverage algorithm56 in the new representation space to select the representative combinations. Let $B_\delta(x)$ denote the ball centered at x with radius δ in the representation space. The idea of the max coverage algorithm is to find $b > 0$ representative combinations from the dataset that cover as many instances as possible. These b reactions are then added to the coreset and evaluated by the experimenter. The selection problem can be formalized as follows:

$$\mathop{\mathrm{argmax}}_{L\subset X,\ |L|=b}\ P\left(\bigcup_{x\in L}B_\delta(x)\right).$$
(7)

The selection is implemented by a greedy process on the representation space56. First, specify a radius and construct a directed graph based on the containment relationship: if an instance x covers another instance $x'$, then connect a directed edge from x to $x'$. Then select the reaction combination with the highest out-degree, add it to the RS-coreset, and delete it from the graph together with the combinations it covers. There are some implementation details worth explaining here. First, we alternately execute the representation learning and coreset construction in each iteration. In this way, the selected reactions can capture the features of the new representation space and help further improve the representation in the next round. Second, we also need to pay attention to the choice of the radius δ in Eq. 7. We set this radius to the average intra-cluster distance obtained by k-means clustering, as it approximately reflects the range covered by each yield class. The whole coreset construction algorithm is implemented in Python.
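Under these choices, the greedy selection admits a short numpy sketch; counting the out-degree only over still-uncovered instances, as below, is equivalent to the greedy max coverage rule described above. This is a minimal sketch, not the released implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def select_by_max_coverage(embedding, known_idx, budget, delta):
    """Greedily pick `budget` centers whose delta-balls, together with the
    balls around known_idx, cover as many instances as possible (Eq. 7)."""
    covers = cdist(embedding, embedding) <= delta   # covers[i, j]: x_i covers x_j
    covered = covers[known_idx].any(axis=0)         # instances already covered by Q
    chosen = []
    for _ in range(budget):
        gain = covers[:, ~covered].sum(axis=1)      # out-degree over uncovered nodes
        gain[list(known_idx) + chosen] = -1         # never re-pick a center
        best = int(gain.argmax())
        chosen.append(best)
        covered |= covers[best]
    return chosen
```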

Prediction method

After constructing the RS-coreset, we can predict the yields for the entire reaction space. Since the coreset closely approximates the reaction space, we can use its yield information to train a regression model on the new representation, and the model then predicts the yields of the remaining combinations. In our simulation experiments on the B-H dataset, we compared various regression methods (implemented with the sklearn library63), including linear regression, Ridge, LASSO, k-nearest neighbors, and random forest; we finally chose random forest regression as the prediction method. A random forest is an ensemble of decision trees, where the features and samples used to construct each tree are randomly selected, and the regression result is the average of the trees’ predictions. The random forest method not only captures complex nonlinear relationships but also shows good robustness to missing values and outliers.
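A minimal sketch of this final step with sklearn is given below, assuming embedding is the learned representation of the whole space and q_idx, q_yields come from the coreset construction (all names are illustrative; hyperparameters are not from the paper).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(embedding[q_idx], q_yields)               # train on the RS-coreset only
rest = np.setdiff1d(np.arange(len(embedding)), q_idx)
predicted_yields = model.predict(embedding[rest])   # yields for the remaining ~95%
```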
