Modelling and design of transcriptional enhancers

Introduction
The control logic underlying the transcriptome of a cell is encoded in its genomic sequence, that is, in non-coding regions called enhancers and promoters (see ref. 1 for a detailed review and refs. 2,3,4 for historical reviews). Gene expression can be regulated by multiple enhancers in cis, located upstream or downstream of the gene, or in introns. Enhancers regulate gene expression by forming a nucleoprotein complex with transcription factors (TFs) and co-factors (Fig. 1a). In this paradigm, multiple TFs bind at clusters of transcription factor-binding sites (TFBSs; the TP53 responsive elements are a counterexample5). Subsequently, through interaction with the basal transcriptional machinery (BTM) at the transcription start site (TSS), gene expression is induced. In this Review, we distinguish between core promoters and enhancers: the former overlap with the TSS, interact with the BTM and usually act in a manner that is not cell-type specific6. By contrast, enhancers are cell-type specific and depend on the interaction with a core promoter to influence target gene expression.

a, Representation of a genomic locus containing a core-promoter bound by RNA polymerase II and an enhancer bound by multiple transcription factors. The transcription factors interact with co-factors that, in turn, interact with RNA polymerase II at the promoter to regulate gene expression. The underlying mechanism of this interaction is beyond this Review’s scope; hence, the relevant proteins are represented in grey scale. Note that both the enhancer and promoter are nucleosome depleted. b, Representation of an enhancer with nucleotides scaled based on their importance to cell-type-specific enhancer activity. Groups of nucleotides with positive importance coincide with transcription factor-binding sites (TFBSs) of transcriptional activators (dotted lines). Groups of nucleotides with negative importance coincide with TFBSs of transcriptional repressors. The other nucleotides are spacer or background sequences with little importance to enhancer activity. Additional features (enhancer grammar) with potential importance for enhancer activity are highlighted using bars. These include the combination of TFBSs for a specific combination of transcription factors (both activators and repressors), the number of such binding sites, their affinity, the distance between individual binding sites, and whether the binding site is at the flank of the enhancer (near the nucleosomes) or not. An enhancer has the size of a single nucleosome-depleted region (l; ~200 bp) and the distance between individual, non-overlapping, binding sites (d) can range from 3 bp (ref. 80) up to the length of the enhancer. c, Illustration of a plasmid with enhancer, minimal core-promoter (fluorescent) reporter gene and barcode that can be used for enhancer reporter assays. Top: illustration of massively parallel reporter assay where the number of barcodes over the number of plasmids is quantified through next-generation sequencing (NGS). Bottom: illustration of in vivo enhancer reporter assay where enhancer activity is measured as the level of fluorescence using microscopy.
Once the TFBSs of an enhancer are determined (Fig. 1b), their necessity and sufficiency can be examined by generating synthetic enhancers and testing them using reporter assays. For instance, homotypic clusters of binding sites for TFs downstream of signalling pathways are sufficient to build reporters for JAK, cAMP and WNT signalling7,8,9. In addition, five copies of the heterotypic cluster consisting of TFBSs for CREB, MEF2, SRF and TCF are sufficient to drive expression in response to neuronal activity10.
From these early synthetic enhancer examples, it is tempting to state that deciphering the enhancer code ‘simply’ boils down to identifying the combination of TFs that co-bind enhancers. However, given that neither enhancers nor TF binding are discrete biological entities and that both the expression of TFs and their interactions are cell-type specific, genome-wide identification and interpretation of enhancers are technically challenging (see ref. 11 for an excellent comparison of deciphering the protein and cis-regulatory code). Indeed, the activity of synthetic WNT reporters is specific to certain biological contexts only, requiring additional cell-type-specific TFBSs to drive activity in others9. Furthermore, certain grammar rules, such as distance constraints, presence of weak TFBSs and absence of repressor sites, might apply to form functional enhancers (Fig. 1b).
To address the intricacies of enhancers, massively parallel reporter assays (MPRAs)12 (Fig. 1c) that test a variety of rules using brute force have been used to determine the features underlying enhancer activity13,14,15,16,17,18,19,20,21,22,23. Complementary assays that measure proxies of enhancer activity, such as chromatin accessibility and histone modifications, have provided a wealth of information in a variety of cell types24,25,26,27,28,29,30. To link DNA sequence features (Fig. 1b) to (proxies of) enhancer activity, computational models such as deep neural networks (DNNs) are used. We refer to these models as sequence-to-function models as they predict high-throughput genomic assays in the cell type of interest using the DNA sequence as input. Accurate sequence-to-function models can serve as ‘oracles’ for biology, partially obviating the need for additional high-throughput screens to test and generate new hypotheses. Naturally occurring or fully synthetic DNA sequences can be optimized according to the output of the oracle, facilitating the development of synthetic enhancers with tailored functional properties.
In this Review, we explore the design of synthetic enhancers employing DNNs as predictive oracles. The first section focuses on sequence-to-function models, comparing conventional machine learning techniques with deep learning approaches. Subsequently, we provide an overview of methods used for synthetic enhancer design. Finally, we discuss future perspectives and potential applications of synthetic enhancers in various biological and biomedical contexts.
Sequence-to-function models
Conventional machine learning models
Enhancers were first discovered in the early 1980s as a viral genomic fragment enhancing transcription of a nearby gene31,32,33,34. Soon after, equivalent eukaryotic elements were discovered35,36,37,38,39,40,41,42,43 and it was shown that enhancers consist of TFBSs44,45,46,47,48,49,50,51 and have cell-type-specific activity48,52,53,54,55. A well-described example of an enhancer is the even-skipped (eve) stripe 2 enhancer56. During Drosophila development, eve is expressed in seven transverse stripes along the embryonic anterior–posterior axis. The eve stripe 2 enhancer regulates the second stripe and has been studied in detail using site-directed mutagenesis and TF mis-expression experiments56,57,58.
Attempts have been made to reconstruct the eve stripe 2 enhancer based on identified TFBSs59,60. For example, a computational model was trained on orthologous enhancer sequences of eight Drosophila species60. Subsequently, this model was used to design synthetic enhancers with varying degrees of similarity to the naturally occurring sequence60. Sequences close to the wild-type enhancer produced expression in the correct segment whereas sequences with more edits lost their specificity. The model’s failure to predict correct enhancer activity of more divergent sequences might be caused by the limited amount of training data.
To increase the amount of training data and build better models, rather than focusing on a single enhancer, information from multiple enhancers active in the same cell type can be used. In this regard, a logistic regression model was fit to classify human skeletal-muscle enhancers61. The model used position weight matrix (PWM) log-likelihood scores (representing the likelihood that a given sequence is generated by the PWM model over a background model) of known muscle TFs. With a training set of 29 skeletal-muscle enhancers and ~2,000 randomly sampled negatives, it reached 60% sensitivity at a 4% false positive rate61. In another study, over 700 features were used to classify human heart enhancers62. These included known PWM log-likelihood scores from TFs from a variety of tissues, log-likelihood scores of de novo PWMs and Markov model scores with orders up to six. A Lasso regression model was fit to classify enhancers and, using a training set of 77 validated enhancers and 1,000 randomly sampled negatives, it reached a sensitivity of 70% at a 50% false positive rate62. Furthermore, making use of the model’s weights, PWMs of important heart-specific TFs could be identified. Finally, the model was used to scan the genome, identifying novel heart-specific enhancers with an accuracy of 74%62. Together, these studies highlight that, even with a small sample size, accurate sequence-to-function models can be built.
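To make this featurization concrete, the sketch below scores toy sequences with a single PWM log-odds feature and fits a logistic regression on it. The PWM, the sequences and the 'AGCT' motif are invented for illustration and do not correspond to the published muscle or heart models.

```python
# Minimal sketch: a PWM log-odds feature feeding a logistic-regression
# enhancer classifier (toy data; not the published models).
import numpy as np
from sklearn.linear_model import LogisticRegression

BASES = "ACGT"

def pwm_log_odds(seq, pwm, background=0.25):
    """Best log-odds score of `pwm` (motif length x 4) over all windows of `seq`."""
    w = pwm.shape[0]
    best = -np.inf
    for i in range(len(seq) - w + 1):
        s = sum(np.log2(pwm[j, BASES.index(b)] / background)
                for j, b in enumerate(seq[i:i + w]))
        best = max(best, s)
    return best

# Toy 4-bp PWM (rows: positions; columns: A, C, G, T) with consensus 'AGCT'.
pwm = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.1, 0.1, 0.7, 0.1],
                [0.1, 0.7, 0.1, 0.1],
                [0.1, 0.1, 0.1, 0.7]])

rng = np.random.default_rng(0)
random_seq = lambda n: "".join(rng.choice(list(BASES), n))

# Positives carry the consensus motif; negatives are random background.
pos = [random_seq(20) + "AGCT" + random_seq(20) for _ in range(50)]
neg = [random_seq(44) for _ in range(50)]
X = np.array([[pwm_log_odds(s, pwm)] for s in pos + neg])
y = np.array([1] * 50 + [0] * 50)

clf = LogisticRegression().fit(X, y)
print("training accuracy:", clf.score(X, y))
```

A real model would use one such feature per TF-specific PWM, yielding a feature vector per candidate enhancer rather than the single column used here.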
Mechanistic insights into gene regulation can be gained through thermodynamic models. Thermodynamic models use the DNA sequence and TF concentration to model gene expression. A prominent example is Gemstat63, which predicts gene expression in a two-step procedure, modelling both the interaction of TFs with DNA and the interaction of bound TFs with the BTM. By selecting the model that best fits the data, mechanistic insights can be acquired. For instance, Gemstat was fit on 44 enhancers with varying anterior–posterior expression profiles in the Drosophila blastoderm along with the TF protein levels. By selecting the model that best explained the data, a case for synergistic interactions between TFs and short-range repression was made63.
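A minimal numerical sketch of the thermodynamic idea follows, assuming a simple single-site Boltzmann occupancy and a logistic mapping from occupancy to expression; the constants and the occupancy-to-expression link are illustrative and far simpler than Gemstat's two-step formulation.

```python
# Hedged sketch of a thermodynamic model: occupancy of each site follows a
# Boltzmann/Michaelis-like form, and expression is modelled from the
# weighted occupancies of activators and repressors. All constants are
# illustrative, not fitted values.
import numpy as np

def occupancy(tf_concentration, relative_affinity):
    # P(site bound) = [TF]*K / (1 + [TF]*K) for a single site in isolation.
    x = tf_concentration * relative_affinity
    return x / (1.0 + x)

def expression(activator_occ, repressor_occ, w_act=3.0, w_rep=4.0):
    # Logistic link from weighted activator/repressor occupancy to expression.
    drive = w_act * np.sum(activator_occ) - w_rep * np.sum(repressor_occ)
    return 1.0 / (1.0 + np.exp(-drive))

# Two activator sites (strong and weak affinity) and one repressor site.
act = occupancy(np.array([1.0, 1.0]), np.array([2.0, 0.3]))
rep = occupancy(np.array([0.5]), np.array([1.5]))
print(f"predicted expression: {expression(act, rep):.2f}")
```

Fitting such a model amounts to finding the affinities and interaction weights that best reproduce the measured expression across many enhancers, which is how mechanistic hypotheses (for example, short-range repression) can be compared.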
Both the thermodynamic and classic regression models rely on features based on pre-existing knowledge (for example, a given set of PWMs). However, features can also be learned de novo directly from the DNA sequences. This requires a much larger sample size; therefore, genome-wide assays that identify hundreds to thousands of candidate enhancers have been of tremendous importance. In this regard, more than 30,000 chromatin accessibility peaks were used to train a gapped k-mer support vector machine (gkm-SVM) classifier64, distinguishing chromatin accessibility peaks from randomly sampled genomic background sequences. The gkm-SVM model uses vectors of all possible 10-mers with 6 informative bases (that is, the remaining 4 positions within each 10-mer are treated as gaps) as features. It reached a sensitivity of ~70% at a false positive rate of ~5% using MPRA data as ground truth64.
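The sketch below illustrates gapped k-mer counting with a tiny l = 4 and k = 2 so the feature space stays small (the actual gkm-SVM uses l = 10, k = 6); the resulting count vectors are what would be fed to the SVM.

```python
# Illustrative gapped k-mer featurization in the spirit of gkm-SVM.
# Gapped (non-informative) positions are written as '.'.
from itertools import combinations
from collections import Counter

def gapped_kmers(seq, l=4, k=2):
    counts = Counter()
    for i in range(len(seq) - l + 1):
        window = seq[i:i + l]
        # Every choice of k informative positions within the l-mer.
        for informative in combinations(range(l), k):
            key = "".join(window[j] if j in informative else "."
                          for j in range(l))
            counts[key] += 1
    return counts

features = gapped_kmers("GATTACAGATTACA")
print(features.most_common(5))
```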
In conclusion, sequence-to-function models leveraging information that is shared between enhancers active in the same cell type can reach high accuracies to classify functional enhancers.
Deep neural networks
Since the mid-2010s, there has been a boom in sequence-to-function models employing DNNs (Table 1). Technically, these models take candidate regulatory sequences as input and predict biological properties in either a regression or classification task. For instance, a model might predict ‘whether’ a region is active as an enhancer (classification) or ‘to what extent’ it is active (regression). A single input can have multiple output labels; a single genomic region may be active in several cell types (Fig. 2a–c). The main architectures used for DNN sequence-to-function models are convolutional neural networks (CNNs), recurrent neural networks (RNNs) and transformers65,66,67 (Table 1 and Fig. 2b).

a, A dataset of genomic regions is split into training, validation and test sets. Each genomic region has a label. This label can be binary (for example, a region is accessible or not, represented by 0/1) or can be a scalar value (for example, the level of chromatin accessibility, represented by natural numbers). A label can be an array of values, representing a genomics measurement at multiple genomic bins over the input sequence (such as the chromatin accessibility profile). Finally, each sequence can have labels for multiple classes, representing different cell states. This is indicated by stacked boxes. The type of input data used by various genomics sequence-to-function models is indicated. b, Labelled data is used to train deep neural networks. In genomics, supervised sequence-to-function models make use of convolution optionally combined with dilation, long short-term memory (LSTM) or attention architectures. c, A loss function needs to be defined to map the output of the model to a number representing the error of the model; this error is minimized during the training step. An illustration of binary cross entropy (BCE) with the log probability of data labelled as 1 (green) and 0 (red) and cosine similarity is shown. The performance of the model is evaluated using various metrics; Pearson correlation coefficient (PCC) and precision–recall (PR) curves are illustrated. d, After the training step, predictions can be explained using explainable artificial intelligence techniques. Top, representation of attribution scores, where the scale of each nucleotide is proportional to its importance for a particular prediction of the model. Bottom, representation of in silico saturation mutagenesis (ISM), where each nucleotide is mutated to each of the three other possible nucleotides and the effect on model output is plotted.
CNNs use convolutional filters that are tuneable during training. These filters are scanned over the one-hot encoded input (a numerical representation in which each nucleotide is transformed into a binary vector with four positions, where one position is ‘1’, indicating the presence of a specific nucleotide, and the other positions are ‘0’), producing a local weighted sum of the input features66. For the first convolutional layer, this is analogous to scanning a PWM over a DNA sequence; in the first layer, each nucleotide is seen in the context of other nucleotides, the number of which is determined by the filter size. However, filter sizes are generally small to match local patterns such as TFBSs. Therefore, many convolutional layers need to be stacked together to reach a large context size, which increases the number of trainable parameters and consequently makes the model more prone to overfitting with limited training data. To overcome these limitations, dilated convolutions68 or pooling layers can be adopted, increasing the context size by either skipping or aggregating consecutive input features.
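The following numpy sketch makes the PWM analogy explicit: a one-hot encoded sequence is scanned by a single, here untrained and randomly initialized, convolutional filter. Trained CNNs apply hundreds of such filters, followed by non-linearities and further layers.

```python
# One-hot encoding plus a single convolutional filter scan, the PWM-like
# operation of a CNN's first layer (illustrative numpy version).
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x

rng = np.random.default_rng(1)
seq = "TTGACGTCATT"
x = one_hot(seq)                  # shape (sequence length, 4)
filt = rng.normal(size=(4, 4))    # one untrained filter of width 4

# Valid cross-correlation: a local weighted sum at every offset.
w = filt.shape[0]
activations = np.array([np.sum(x[i:i + w] * filt)
                        for i in range(len(seq) - w + 1)])
print(np.round(activations, 2))
```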
RNNs are networks specifically designed for sequential data66. In genomics, RNNs often make use of long short-term memory (LSTM). The LSTM block maintains an internal state, which the model updates sequentially as it processes outputs from the previous layer. At each step (representing a window of the input sequence), the internal state is updated based on the current input, effectively capturing the dependencies between successive steps66. LSTM blocks can be preceded by a convolutional layer to first capture local patterns and later model the dependencies of these patterns25,27,28,69,70. Owing to the sequential nature of RNNs, the LSTM step cannot be parallelized within a single sample, limiting efficiency. This mainly concerns the applicability of RNNs to longer input sequences, where memory constraints limit batching across samples71.
The transformer architecture is more efficient than RNNs and has a greater capacity to represent large sequence lengths71. In transformers, each input segment is mapped into an embedding space, for example, using a block of convolutional layers72,73. These embeddings are dynamically adjusted based on the contextual relationships formed across the entire input sequence. The degree to which one segment influences another is determined by the attention mechanism, a learnable component that assigns varying levels of importance to different segments during training71, enabling the modelling of a much larger sequence context (Table 1).
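A stripped-down illustration of the attention mechanism is shown below, omitting the learned query, key and value projections, multiple heads and positional information of real transformers: each segment embedding is re-weighted by its similarity to all other segments.

```python
# Minimal scaled dot-product self-attention over sequence embeddings
# (conceptual sketch; real models add learned projections and many layers).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """X: (segments, d) embeddings; returns context-mixed embeddings."""
    d = X.shape[1]
    weights = softmax(X @ X.T / np.sqrt(d))  # how much each segment attends to the others
    return weights @ X

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 8))      # 6 input segments, embedding dimension 8
print(self_attention(X).shape)   # (6, 8)
```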
Explainable artificial intelligence can be used to obtain gene regulatory insights from DNNs74 (Fig. 2d). CNNs can capture short sequence motifs relevant for regulation within their first convolutional layer27,74. Additionally, probing DNNs through both forward and backward propagation enables inference of the effect of each nucleotide on the prediction74. One effective method is in silico saturation mutagenesis, where each nucleotide in a region is systematically mutated to observe changes in the model predictions. Additionally, the model’s gradient with respect to an input can be evaluated to assess the effect on the prediction of infinitesimally small input changes. Explanations can be visualized using plots, where the height of each nucleotide is adjusted based on its contribution to the model’s prediction. This visualization offers an intuitive representation of how individual nucleotides influence the model’s output, potentially identifying important TFBSs (Fig. 2d).
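In silico saturation mutagenesis is straightforward to express in code; in the sketch below, a toy motif-counting 'model' stands in for a trained DNN.

```python
# In silico saturation mutagenesis: mutate every position to every other
# base and record the change in model output. `model_predict` is a
# stand-in for any trained sequence-to-function model.
import numpy as np

BASES = "ACGT"

def ism(seq, model_predict):
    ref = model_predict(seq)
    effects = np.zeros((len(seq), 4))
    for i in range(len(seq)):
        for j, b in enumerate(BASES):
            if b == seq[i]:
                continue
            mutant = seq[:i] + b + seq[i + 1:]
            effects[i, j] = model_predict(mutant) - ref
    return effects  # (length, 4) matrix of mutation effects, as in Fig. 2d

# Toy oracle: counts occurrences of a 4-bp 'motif'.
toy_model = lambda s: float(s.count("GATA"))
print(ism("TTGATATT", toy_model))
```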
From a biological standpoint, sequence-to-function models employing DNNs can be subdivided into three categories: those that predict a single modality in a single cell type, those that predict a single modality across multiple cell types, and large models that predict many modalities across many biological conditions.
Modelling a single modality in a single cell type
DeepBind75, one of the first genome-based CNNs, takes 14–101-bp DNA (or RNA) sequences as input and predicts TF binding affinity based on protein binding microarray data. Inspired by DeepBind, one of the first ‘niche’ DNNs was successfully trained on chromatin immunoprecipitation with high-throughput sequencing (ChIP–seq) data to predict functional TP53 enhancers5.
Similarly, BPNet68 predicts TF binding but its output is more sophisticated than that of DeepBind; it takes 1-kb DNA sequences as input and separately predicts ChIP–nexus binding profile shape and coverage at single base pair resolution. Like DeepBind, BPNet is a CNN but makes use of dilated convolution to increase its receptive field. BPNet can predict the binding profiles of SOX2, OCT4 (also known as POU5F1), KLF4 and NANOG in embryonic stem cells68. A distance-dependent bias towards cooperative binding could be revealed by inserting TFBSs at varying relative distances in random background sequences and examining model predictions68. Indeed, the use of simulations provides a powerful way to learn cis-regulatory rules and is conceptually close to enhancer design.
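A hedged sketch of this kind of simulation follows: two placeholder motifs are implanted at varying spacing into random backgrounds, and the predictions of a toy model are averaged across backgrounds to isolate the distance effect, conceptually as in the BPNet analysis (the motifs, positions and oracle here are all invented).

```python
# Motif-spacing simulation: implant two motifs at varying distances in
# random backgrounds and read out the (here toy) model's prediction.
import numpy as np

rng = np.random.default_rng(3)

def random_background(n):
    return "".join(rng.choice(list("ACGT"), n))

def implant(background, motif, pos):
    return background[:pos] + motif + background[pos + len(motif):]

def distance_scan(model_predict, motif_a, motif_b, length=200, n_backgrounds=32):
    scores = {}
    for d in range(5, 100, 5):                      # spacing between the motifs
        vals = []
        for _ in range(n_backgrounds):
            s = random_background(length)
            s = implant(s, motif_a, 50)
            s = implant(s, motif_b, 50 + len(motif_a) + d)
            vals.append(model_predict(s))
        scores[d] = float(np.mean(vals))            # average out background effects
    return scores

def toy_oracle(s):
    # Toy model rewarding the two motifs only when they are close together.
    i, j = s.find("TTTAT"), s.find("CATTGT")
    return float(i >= 0 and j >= 0 and abs(j - i) < 40)

print(distance_scan(toy_oracle, "TTTAT", "CATTGT"))
```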
Finally, DeepSTARR76 is a CNN that takes 249-bp sequences as input and predicts developmental and housekeeping enhancer activity as measured by genome-wide STARR sequencing in Drosophila S2 cells. In this work, the authors selected ~200 sequences with variable prediction scores and measured their activity in vitro. The activity of these sequences was consistent with predictions by the model, highlighting the power of DNNs to make accurate predictions even on left-out or random data76.
Modelling a single modality across multiple cell types
A second category consists of models that predict chromatin accessibility (or another epigenomic track) across multiple cell types. Owing to the multi-class, multi-label nature of these models, they can learn features important for cell-type-specific chromatin accessibility in the context of other cell types.
DeepMel27, DeepMel270, DeepFlyBrain25, DeepLiver28 and DeepBrain77 fit in this category (we will refer to these models as ‘DeepTopic’ models). These models take 500-bp DNA sequences as input and perform binary multi-label classification of cell-type-specific chromatin accessibility peaks using either a combination of convolutional and LSTM layers25,27,28,70 or only convolution (in the case of DeepBrain77). DeepLiver introduces a further variation: after training on accessibility data, it uses transfer learning on MPRA data to discern additional features related to enhancer activity28. Regarding the training data, DeepTopic models do not learn directly from raw single-cell ATAC sequencing data but rather from a latent representation that is learned beforehand using topic modelling78,79. Each topic represents both a combination of regions that are co-accessible across cells and a combination of cells with a similar chromatin accessibility profile78,79. These models can accurately classify chromatin accessibility peaks across cell types of human melanoma states27,70, the Drosophila brain25, the mouse liver28, the mouse and human cortex77, and the chicken telencephalon77. Even though these models were not directly trained on enhancer activity, they can learn features relevant to enhancer activity80.
Basset81 is a binary classifier of bulk chromatin accessibility measured across hundreds of cell lines using 600-bp DNA sequences as input. The follow-up model scBasset82 is a binary classifier at the single-cell level with 1,344-bp DNA sequences as input. scBasset has a bottleneck layer to generate a lower dimensional representation of the cells and can infer cell-type-specific importance of individual TFBSs, denoise single-cell ATAC sequencing data and perform batch effect correction.
In contrast to the DeepTopic and (sc)Basset classification models, AI-TAC83 is a regression model that takes 251-bp DNA sequences as input and predicts the level of chromatin accessibility across 81 mouse immune cell types83. Using this model, a combinatorial set of motifs underlying immune cell states was identified83.
Modelling many modalities across various biological contexts
The third category consists of models with a larger number of parameters (for example, 30–250 million) trained on multiple modalities across many different biological contexts. In practice, these models are trained on large publicly available data repositories such as ENCODE84,85,86,87, Roadmap Epigenomics88 and Cistrome89,90. DeepSEA91, DanQ69 and Sei92 are multi-label binary classifiers predicting whether a region (of 1–4 kb) is bound by a TF, has a certain histone mark on its surrounding nucleosomes or is accessible. Basenji93, Enformer72 and Borzoi73 are sequence-to-function models with progressively larger input sizes of 131 kb, 200 kb and 523 kb, respectively. Similar to DeepSEA, DanQ and Sei, they predict multiple functional modalities from DNA sequences, including TF binding, histone modifications and chromatin accessibility. However, Basenji, Enformer and Borzoi stand out by predicting the levels of these modalities through regression rather than classification tasks. Moreover, these models predict gene expression levels as either cap analysis of gene expression (CAGE) sequencing at the TSS72,93 or RNA sequencing across the gene body73. To achieve the large receptive field crucial for accurately modelling gene expression, Basenji employs dilated convolution techniques93. By contrast, Enformer and Borzoi leverage the transformer architecture72,73; notably, analysing the attention weights learned by the transformer models provides insights into candidate enhancers that might be regulating gene expression72,73. Based on Enformer, Epiformer29 was trained to specifically predict cell-type-specific chromatin accessibility in the human brain29.
DNA language models
DNA language models are gaining traction in genomics94,95,96,97,98,99. These models are trained on vast amounts of unlabelled data in a self-supervised fashion. Specifically, masked genomic language models are trained directly on the DNA sequences by predicting masked segments, or ‘tokens’, from the surrounding context. This self-supervised approach enables the models to learn and reconstruct intricate dependencies within the genome99. These learned dependencies are encoded in high-dimensional embeddings, which can serve as rich features for efficient training of supervised models on specific tasks (fine-tuning)100. For example, SegmentNT100 fine-tuned the Nucleotide Transformer95 to segment the genome into several regulatory categories, including the prediction of enhancers and promoters99,101.
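The sketch below shows how masked training examples for such a model could be constructed, assuming a simple non-overlapping 3-mer tokenization; tokenization schemes vary between published models, so this is illustrative only.

```python
# Constructing masked-token training data for a DNA language model:
# 3-mer tokenization followed by random masking. A transformer would
# then be trained to recover the masked tokens from context.
import random

def tokenize_3mers(seq):
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    random.seed(seed)
    masked, targets = [], {}
    for i, t in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = t       # the model must predict these from context
        else:
            masked.append(t)
    return masked, targets

tokens = tokenize_3mers("ATGCGATTACAGATTACAGGG")
masked, targets = mask_tokens(tokens)
print(masked, targets)
```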
Designing enhancers and promoters
Sequence-to-function models as oracles
DNN sequence-to-function models can generalize to unseen data
Typically, cross-validation is used to evaluate the performance of a model on left-out data. Moreover, sequence-to-function models can generalize to data outside the genome. Enformer, DeepMel, DeepLiver and gkm-SVM accurately predict the effect sizes of mutations from in vitro saturation mutagenesis assays of enhancers27,28,64,72. The predictive accuracy is particularly high at locations of TFBSs, where changes can substantially alter gene expression. Similarly, DeepSTARR can accurately predict the activity of hundreds of random DNA sequences76. Furthermore, DeepMel was used to identify orthologous enhancers across species27; by analysing evolutionary changes, DeepMel provided insights into which specific alterations contribute to differences in regulatory activity27. Similarly, DeepBrain77 could map orthologous cell types between human and chicken, AI-TAC83 could predict chromatin accessibility across human and mouse immune cells, and using an SVM, binding of Twist across fly species could be classified102.
Gene expression variation is another task on which models can be evaluated for their ability to predict the effect of unseen data. Along the same lines, Enformer72 and Borzoi73 can predict the influence of genetic variants on gene expression to a certain extent; these models begin to distinguish expression quantitative trait loci from negative controls72,73 but still require improvement to fully model cross-individual variation103,104. Together, these experiments show that high-fidelity sequence-to-function models, capable of generalizing to unseen DNA sequences, can function as powerful biological ‘oracles’. By formulating a cost function based on predictions by the oracle, sequences can thus be optimized towards achieving target enhancer activity (Fig. 3). Of note, enhancers likely represent sequences close to local optima of the landscape spanned by all possible DNA sequences. Below, we provide an overview of recent research where DNNs are used to design synthetic enhancers or promoters.

An identical seed sequence is optimized towards two different cell types (left versus right). For each cell type, a different cost function is formulated, representing separate cell-type-specific optimization landscapes with enhancers represented by circles close to local minima. Nucleotide importance scores for enhancer activity in the two cell types are illustrated along the optimization process together with modifications compared to the starting seed (*). At the start, the seed sequence has binding sites for a repressor unique to each cell type (represented by negative importance scores, that is, letters facing downwards). The repressor binding sites are removed using a single nucleotide change. Activator-binding sites (positive importance scores, that is, letters facing upwards) are created by additional changes. Eventually, the activity of the synthetic enhancer is illustrated as a green signal in an embryo.
Using the oracle output to design promoters and enhancers
The type of training data used to fit the oracle model is critical for designing synthetic cis-regulatory elements as it must be closely related to the desired output (Supplementary Table 1). In simpler model systems like yeast, this alignment between training data and the target biological outcome is often straightforward. The unicellular nature of yeast allows for direct measurement of promoter activity; for example, a CNN was trained to predict reporter gene expression based on the expression induced by random 80-mers105. Subsequently, the vast landscape of all possible 80-mers was navigated by a genetic algorithm using the output of the trained model as a fitness function, resulting in synthetic promoters exhibiting substantially higher reporter expression levels compared to natural promoters105.
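A toy version of this optimization loop is sketched below, with a trivial stand-in fitness function in place of the trained CNN; selection, crossover and mutation follow the standard genetic-algorithm recipe rather than the exact published hyperparameters.

```python
# Genetic-algorithm search over 80-mers with a model as the fitness
# function (toy sketch; the real study used a CNN trained on millions
# of random 80-mers).
import random

BASES = "ACGT"
random.seed(4)

def fitness(seq):
    # Stand-in for the trained oracle: reward TATA content (toy).
    return seq.count("TATA")

def mutate(seq, rate=0.05):
    return "".join(random.choice(BASES) if random.random() < rate else b
                   for b in seq)

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

# Random starting population of 80-mers.
pop = ["".join(random.choice(BASES) for _ in range(80)) for _ in range(100)]
for generation in range(50):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:20]                                   # selection
    offspring = [mutate(crossover(random.choice(parents),
                                  random.choice(parents)))
                 for _ in range(80)]                     # recombination + mutation
    pop = parents + offspring

best = max(pop, key=fitness)
print("best fitness:", fitness(best))
```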
In mammalian or Drosophila cell lines, although more complex than yeast, it is often still possible to measure enhancer activity directly. For example, randomly generated sequences that score high with the DeepSTARR model were experimentally validated as active enhancers, being as strong as the strongest genomic enhancer in Drosophila S2 cells106. In mammalian cell lines, cell-line-specific enhancers were designed for K562, HepG2 or SK-N-SH cells107,108; for this, a multi-task regression CNN was trained on MPRA data and used in a cost function to maximize enhancer activity in the cell line of interest while minimizing off-target activity107,108. Through several search space optimization methods (AdaLead109, Simulated Annealing110, Fast SeqProp111 and Deep Exploration Networks112), thousands of candidate enhancers were designed and tested using another round of MPRAs. The optimized sequences were both stronger and more cell-line specific than genomic sequences. To improve the enhancers, the CNN was further trained on the additional data acquired from the synthetic enhancers. Using this iterative approach, the resulting enhancers had an even larger average difference in activity across the cell lines108.
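A minimal sketch of such a specificity-aware cost function follows, assuming the multi-task model outputs one predicted activity per cell line; the weighting and the use of the maximum off-target activity are illustrative choices, not the published formulation.

```python
# Specificity-aware cost: maximize predicted activity in the target cell
# line while penalizing activity in the others. `predictions` stands in
# for the multi-task CNN output.
import numpy as np

def specificity_cost(predictions, target_idx, off_target_weight=1.0):
    """predictions: array of predicted activities, one per cell line."""
    on = predictions[target_idx]
    off = np.delete(predictions, target_idx)
    return -(on - off_target_weight * off.max())   # lower cost = better design

preds = np.array([2.1, 0.3, 0.4])    # e.g. K562, HepG2, SK-N-SH (illustrative values)
print(specificity_cost(preds, target_idx=0))
```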
In more complex multicellular organisms, where it is often more challenging to directly measure enhancer activity, chromatin accessibility, either alone or in combination with a limited amount of enhancer activity data, is sufficient for the design of cell-type-specific enhancers. For example, separate CNNs were first trained, one for each of five Drosophila melanogaster embryonic tissues (including the central nervous system, epidermis, gut and muscle), to predict pseudobulk chromatin accessibility levels across the fly genome. Next, using transfer learning, a second set of CNNs was trained that classified genomic regions as ‘active’ or ‘inactive’ enhancers within the same tissues106. Subsequently, the output of these sequence-to-enhancer-activity models served as a cost function for a ‘shotgun’ optimization approach. Here, three billion random DNA sequences were generated and evaluated and, for each tissue, 8 out of the top 3,000 sequences with the highest predicted enhancer activity were selected based on visual inspection of motif content and were tested using in vivo enhancer reporter assays. Notably, around 70% drove cell-type-specific activity in the target tissue106.
Enhancer models trained on cell-type-specific chromatin accessibility data alone were also sufficient to decipher and design enhancers80 using DeepFlyBrain25. Two greedy search algorithms were employed for this purpose: in silico evolution and motif embedding80. Both start from randomly generated seed DNA sequences matching the local guanine–cytosine content of genomic candidate enhancers. In silico evolution optimizes DNA sequences by iteratively making the single nucleotide change that increases the prediction score the most, whereas motif embedding entails implanting essential TFBSs for the desired cell type one by one, each at the specific location that is predicted to cause the greatest increase in prediction score. These approaches successfully generated enhancers for Kenyon cells and perineurial glial cells in the fly brain80. Only 10 to a maximum of 20 mutations were sufficient to achieve cell-type-specific enhancer activity80. Similarly, using DeepMel270, enhancers specific for the melanocytic melanoma state in human were designed80.
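In silico evolution reduces to a short greedy loop, as sketched below; here, a toy scoring function that rewards matches to an E-box-like motif stands in for DeepFlyBrain or DeepMel2.

```python
# Greedy in silico evolution: at each iteration, try all single-nucleotide
# changes and keep the one that most increases the oracle's prediction.
BASES = "ACGT"
TARGET = "CAGCTG"   # toy E-box-like motif

def toy_oracle(seq):
    # Best partial match to the motif anywhere in the sequence.
    return max(sum(a == b for a, b in zip(seq[i:i + len(TARGET)], TARGET))
               for i in range(len(seq) - len(TARGET) + 1))

def in_silico_evolution(seq, model_predict, n_iterations=15):
    for _ in range(n_iterations):
        best_seq, best_score = seq, model_predict(seq)
        for i in range(len(seq)):                 # try every single-base change
            for base in BASES:
                if base == seq[i]:
                    continue
                candidate = seq[:i] + base + seq[i + 1:]
                score = model_predict(candidate)
                if score > best_score:
                    best_seq, best_score = candidate, score
        if best_seq == seq:                       # converged: no change helps
            break
        seq = best_seq
    return seq

print(in_silico_evolution("A" * 30, toy_oracle))
```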
In the genome, enhancers do not function alone113,114,115; therefore, to consider the potential effect of surrounding regulatory elements, models with a larger receptive field are needed and designed elements must be tested in their genomic context. Such models are still limited in performance103,104,116; yet, as a proof of principle, genomic edits were designed for the locus of PPIF to either maximize or minimize the expression of this gene or cause differential expression between two cell lines (THP-1 and Jurkat cells)117. Either the THP-1 or Jurkat CAGE head of Enformer72 was used, and a total of 185 edits ranging from 4 to 10 bp were tested and integrated into the genome of either of the two cell lines, 20–100 bp upstream of the gene’s TSS. The expression of the gene could be substantially altered by these changes: specifically, edits creating ATF4-binding sites increased expression by up to 200%, whereas edits introducing ZEB-binding sites or FOX-binding sites resulted in a decrease of 80% and 100%, respectively117. Notably, edits creating binding sites for CEBPA, a TF specific to THP-1 cells, led to a twofold difference in PPIF expression in THP-1 cells compared to Jurkat cells117.
Fundamental insights gained through enhancer design experiments
Besides testing whether sequence-to-function models have the necessary and sufficient knowledge to generate synthetic enhancers, fundamental insights into enhancer function can be gained from enhancer design experiments. In particular, such experiments confirm that all information for cell-type-specific enhancer activity is encoded by particular combinations and arrangements of TFBSs. Furthermore, they highlight the importance of having a thorough understanding of the ‘background’ DNA sequence used to design enhancers.
The notion that all information needed for cell-type-specific enhancer activity is encoded by combinations of TFBSs and, more precisely, their identity, copy number, arrangement and affinity is validated by several findings. First, enhancers that were specifically active in either Kenyon cells or perineurial glia in the fly brain were designed starting from identical seed sequences with few mutations that only generated TFBSs of activator TFs, or destroyed TFBSs of repressor TFs, specifically expressed in the cell type of interest80. Second, embedding of essential TFBSs of the cell type of interest is sufficient to generate an active enhancer, although the distance between TFBSs and their individual affinity are important76,80. Third, based on the learned distance rules, a minimal enhancer was generated consisting solely of a Mef2-binding site positioned 5 bp upstream of an Eyeless-binding site and a Onecut-binding site placed 3 bp downstream, and this enhancer fully recapitulated Kenyon cell enhancer activity80. Similarly, designed enhancers minimal in length (down to 50 bp) were shown to have activity equivalent to longer enhancers (145 bp)108. Fourth, attempts to optimize already functional synthetic enhancers revealed that this is only possible by generating additional TFBSs108. Moreover, efforts to destroy enhancer activity by making sequence edits to functional synthetic enhancers without touching TFBSs showed that this was only possible by generating TFBSs for repressors specifically expressed in the cell type of interest80. Fifth, enhancers active in two cell types were generated by supplementing enhancers active in one of the cell types with TFBSs of TFs specifically expressed in the other80. Finally, enhancers tend to need more than one TFBS, except for certain cases where a single TFBS is sufficient, for example, enhancers consisting of a single TP53-binding site5,108. Thus, these studies illustrate how synthetic design can be leveraged to understand enhancer rules.
Another insight is that a critical challenge in enhancer design lies in the selection of the starting seed sequence. Random DNA sequences often contain unintended binding sites for repressor TFs expressed in the target cell type. For instance, randomly generated sequences for melanoma enhancers almost always harbour ZEB2-binding sites80 (a 5-bp consensus site, expected to occur by chance once every 1,024 bp). Therefore, it is crucial to have a thorough understanding of the background sequence, using models that consider both activator and repressor sites for accurate predictions.
If the right combination of TFBSs is all it takes to generate an enhancer, one might ask why artificial intelligence is needed in the first place. In this context, it has been argued that enhancer activity is dependent on the combinatorial binding of TFs; therefore, to decipher the enhancer code, TFBSs should be modelled in their cis-regulatory context (that is, the context of other TFBSs)118. Furthermore, even though TFBS arrangement (for example, order, spacing and orientation), also referred to as the motif syntax, might be important for enhancer activity, this is often a ‘soft syntax’; that is, there is a specific preference for certain TFBS arrangements, but the arrangement is not exact. For these reasons, artificial intelligence is well suited for modelling enhancers because it considers entire enhancer sequences and has the potential to capture soft syntax rules118, which are otherwise difficult to discover in the first place. Nevertheless, we anticipate that the relative ease of discovering enhancer logic through deep learning models and enhancer design will lead to improved mechanistic (that is, ‘white box’) models in the future.
Using the gradient of the oracle to design enhancers
The enhancer design techniques described so far all make use of the oracle output. Given that DNNs are inherently differentiable, their gradients with respect to the input can also be harnessed. This allows for the computation of the direction in which the input should be adjusted to maximize the model’s output — known as gradient ascent. However, gradient ascent cannot be applied directly to DNA sequences because they are represented by an array of categorical variables with classes ‘A’, ‘C’, ‘T’ and ‘G’. Thus, categorical DNA sequences must be approximated by a continuous distribution (for example, the Gumbel–Softmax distribution119, whereby the values for each nucleotide are referred to as logits). The continuous approximation can then be directly optimized using gradient ascent and a discrete DNA sequence can be regenerated from this approximation (often by simply taking the nucleotide with the highest value for each position, that is, argmax operation). Based on this principle, a method named Ledidi120 was developed to design a small set of edits on a sequence to change the model’s output. Using this technique, edits were proposed that eliminate the binding of JunD in a cell-line-specific manner. For this purpose, Basenji93 was used as an oracle and a loss function was formulated that minimizes the number of sequence edits while minimizing the predicted binding of JunD120.
However, two problems have been identified with the approaches that directly use a continuous approximation of the input for gradient ascent111,121. The first problem is that sequence-to-function models are only trained on discrete inputs; therefore, there is no guarantee that they will also function on a continuous approximation. Furthermore, an optimized continuous input does not necessarily relate to an optimal discrete input. To account for this discrepancy, Ledidi updates the sequence using the continuous approximation and selects the iteration at which the discrete DNA sequence is most fit120. A second problem lies in the use of the Softmax function to approximate discrete DNA sequences, which can lead to vanishing gradients when the difference between the nucleotide logits becomes large (that is, the model becomes more ‘certain’ about the nucleotide at a certain position).
To address the challenge of continuous versus discrete inputs, a straight-through estimator method named SeqProp was proposed121. In SeqProp, instead of directly using the continuous approximation as input to the model, a discrete sequence is first sampled from this approximation. This sampled sequence serves as the model’s input for loss function computation, whereas the continuous approximation is updated based on the gradient of the loss121. To address the problem of vanishing gradients, a normalization strategy was implemented in a method named Fast SeqProp111. Using these two adaptations, fitter local optima could be reached in fewer iterations compared to using the continuous input directly, either with or without normalization, or using sampling without normalization111.
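A PyTorch sketch of the straight-through idea follows: the forward pass uses a discretized sequence while the gradient flows through the softmax probabilities. For brevity, sampling is replaced by an argmax and the oracle is a trivial differentiable stand-in, so this is a conceptual illustration rather than a reimplementation of (Fast) SeqProp.

```python
# Straight-through gradient ascent on sequence logits: forward pass on a
# discrete one-hot sequence, backward pass through the soft probabilities.
import torch

L = 20
logits = torch.randn(L, 4, requires_grad=True)

def oracle(one_hot):
    # Toy differentiable oracle: prefers 'G' at every position.
    return one_hot[:, 2].sum()

opt = torch.optim.Adam([logits], lr=0.1)
for step in range(100):
    probs = torch.softmax(logits, dim=-1)
    idx = torch.argmax(probs, dim=-1)                     # discretize
    hard = torch.nn.functional.one_hot(idx, 4).float()
    # Straight-through: forward value equals `hard`, gradient flows via `probs`.
    x = hard + probs - probs.detach()
    loss = -oracle(x)                                     # gradient ascent on the oracle
    opt.zero_grad()
    loss.backward()
    opt.step()

designed = "".join("ACGT"[int(i)] for i in torch.argmax(logits, dim=-1))
print("designed sequence:", designed)
```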
Employing a variety of search space optimization methods or the model’s gradient with respect to the input helps DNN sequence-to-function models serve as accurate oracles that can be used to design functional enhancers. Another strategy is generative artificial intelligence.
Generative artificial intelligence for enhancer design
Designing synthetic enhancers using cost functions from oracles still has limitations. First, a greedy search approach, such as in silico evolution, can generate sequences that fall outside the natural distribution of functional biological sequences (Supplementary Table 1). This drawback could yield unexpected results, where designed enhancers do not function when tested experimentally even though the oracle predicts their functionality. Second, traditional optimization techniques often focus on local improvement, potentially missing better solutions that require a broader exploration of the sequence space. Further investigation through large-scale enhancer reporter assays is needed to test the extent to which these are concerns for enhancer design. Tackling these potential limitations, generative artificial intelligence approaches are emerging as promising methods for enhancer design. State-of-the-art generative models make use of a variety of architectures, including generative adversarial networks (GANs)122, diffusion models123,124, flow matching125 and autoregressive language models126,127,128 (Fig. 4a–c).

a, Representation of training and synthetic design of cis-regulatory elements (CREs) using a generative adversarial network. During training, both genomic sequences (examples of enhancers active in the cell type of interest) and synthetically generated sequences by a generator network are used as input into a discriminator trained to discriminate genomic sequences from synthetic ones. Both generator and discriminator are trained at the same time to produce a generator that can produce sequences indistinguishable from genomic examples, generating synthetic CREs. b, Representation of training and synthetic design of CREs using a diffusion model for discrete DNA sequences as a probability distribution on the standard 3-simplex. During training, noise is gradually added to the example CRE sequences while a model is trained to remove this noise. At inference, the denoising network generates novel synthetic CREs. c, Representation of training and synthetic design of CREs using a DNA language model. During training, sequences are tokenized (for example, 3-mer tokenization), part of the sequence is masked and (optionally) a prompt, representing biological features of the sequence, is appended. A model, often using the transformer architecture, is trained to predict the masked tokens. At inference, a synthetic CRE is iteratively generated by the model optionally starting from a prompt.
Generative adversarial networks
GANs122 consist of two neural networks, a generator and a discriminator (Fig. 4a). The generator creates sequences whereas the discriminator evaluates them, with the aim of producing realistic sequences that are indistinguishable from genomic ones. For example, training a GAN on a dataset of 14,098 experimentally verified promoters enabled the design of 50-bp synthetic bacterial promoters, of which up to 70% were functional when tested experimentally129. Additionally, a GAN was used to design realistic background sequences for promoters containing user-defined TFBSs130. Extending to yeast, ExpressionGAN131, a network capable of generating regulatory DNA with prespecified target mRNA levels, was introduced: a generator of cis-regulatory elements was coupled to a predictor trained to model gene expression using both regulatory and coding regions as input131. Synthetic constructs spanning three orders of magnitude of predicted gene expression levels were designed and showed a high correlation (Spearman’s rho of 0.7) with measured levels of expression in yeast. Moving to more complex model organisms, GANs were used to generate synthetic enhancers in the Drosophila brain and melanoma cell lines80. The GAN-generated sequences were subsequently evaluated using DeepFlyBrain25 and DeepMel2, respectively. Only the synthetic enhancer sequences that received high scores from these computational ‘oracles’ demonstrated substantial cell-type-specific activity.
Although successful, GANs are still limited: they need a reliable oracle (discriminator) for training, frequently fail to converge and can suffer from mode collapse, an issue where the generator produces a limited variety of outputs132,133. These problems pose limitations in synthetic sequence design, resulting in the replacement of GANs by diffusion models132.
Diffusion models
Diffusion models123,124 are a class of generative models that learn to produce data samples through a series of incremental transformations (Fig. 4b). The training of diffusion models includes a forward and a reverse process. The forward process involves a sequential addition of noise to the input (diffusion) whereas the reverse process aims to undo the diffusion process to reproduce the original input. When applied to a dataset of candidate enhancer sequences, the model can learn to reverse the noise addition steps, recovering the original enhancers. Once trained, the diffusion model generates new enhancer sequences by starting with random noise and iteratively applying the learned reverse process. However, diffusion models cannot directly be applied to discrete data such as DNA. Possible solutions include modifying the input or modelling space134 or mapping the discrete input into a continuous latent space135,136,137.
Modifying the input or modelling space is exemplified by the Dirichlet diffusion score model (DDSM)134. DDSM represents individual nucleotides as the vertices of the standard 3-simplex (that is, a regular tetrahedron with unit edge lengths), with its interior representing all possible probability distributions over the four nucleotides modelled by the Dirichlet distribution134 (Fig. 4b). DDSM enables diffusion in this probability simplex space134. One example involves generating 1,024-bp human promoter sequences, conditioned on the transcription initiation signal profile obtained from CAGE data. The promoters were then validated in silico based on the H3K4me3 level predicted by an independent model, Sei92.
BitDiffusion135 maps discrete DNA sequences into a continuous representation by a transformation that casts binary bits as real numbers (‘analogue bits’). These analogue bits are used as the input and the output of the diffusion model, and a simple thresholding operation is performed to regenerate discrete data. Using BitDiffusion, DNA-Diffusion137 generated cell-type-specific regulatory sequences based on chromatin accessibility data across three cell lines (GM12878, K562 and HepG2)137. For this purpose, the model was trained on a DNase I hypersensitive site index dataset30. During training, DNase I hypersensitive site peaks were provided with the corresponding cell-type labels to condition the model to generate cell-type-specific enhancers. Once trained, 100,000 candidate enhancers were generated per cell type. The synthetic sequences were evaluated across various genomic characteristics, including TFBS composition, predicted cell-type-specific chromatin accessibility and cell-type-specific enhancer activity, using state-of-the-art sequence-to-function models (that is, ChromBPNet138 and Enformer72). These in silico validations revealed that the generated sequences recapitulate endogenous properties of genomic enhancers. DiscDiff136 makes use of a variational autoencoder, upstream of the diffusion operation, to map the discrete input into a continuous latent representation136. In this way, 50,000 candidate promoter sequences were generated across 15 species with properties of genomic promoters, as illustrated by their motif composition, the distance of their latent distribution to genomic examples and the chromatin profiles predicted by Sei92.
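The analogue-bit encoding itself is easy to sketch: each nucleotide maps to two real-valued bits in {-1, +1}, noise is handled in this continuous space, and thresholding recovers a discrete sequence. The particular bit assignment below is an arbitrary illustrative choice.

```python
# 'Analogue bit' encoding in the spirit of BitDiffusion: nucleotides as
# pairs of real-valued bits, with thresholding to regenerate discrete DNA.
import numpy as np

BITS = {"A": (-1, -1), "C": (-1, 1), "G": (1, -1), "T": (1, 1)}
INV = {v: k for k, v in BITS.items()}

def to_analog_bits(seq):
    return np.array([BITS[b] for b in seq], dtype=float)

def from_analog_bits(x):
    hard = np.where(x > 0, 1, -1)               # simple thresholding
    return "".join(INV[tuple(row)] for row in hard)

x = to_analog_bits("GATTACA")
noisy = x + np.random.default_rng(5).normal(scale=0.4, size=x.shape)
print(from_analog_bits(noisy))   # usually recovers 'GATTACA' at this noise level
```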
Extending the in silico validation, a comprehensive suite of evaluation metrics was introduced to assess the functional similarity, sequence similarity and regulatory composition of generated sequences139. Eventually, DNA discrete diffusion (the implementation of discrete diffusion140 for genomic sequences) was benchmarked on three high-quality functional genomics datasets spanning human promoters and fly enhancers139. The authors demonstrated that DNA discrete diffusion outperforms existing methods (such as BitDiffusion and DDSM) in capturing the diversity of cis-regulatory grammars and generating sequences that more accurately reflect the properties of genomic regulatory DNA139.
Flow matching
Generative models based on flow matching125 have emerged as another approach for DNA sequence design with promising potential141. These models are conceptually similar to diffusion models; by leveraging normalizing flows142, they model the underlying unknown probability density of the data by learning a set of transformations (‘flow’) that map a simple and known distribution (for example, a Gaussian) onto one that fits the data. After training, realistic samples can be generated by sampling from the starting distribution and applying the set of learned transformations. Like diffusion, typical implementations of flow matching cannot be directly applied to discrete data. To address this limitation, the Dirichlet flow matching143 approach was introduced; like DDSM, this approach relaxes the input data by representing it as a mixture of probability distributions on the simplex, resulting in an improvement over DDSM on the promoter design task134 (in terms of generating synthetic promoters that recapitulate H3K4me3 profiles of genomic promoters as predicted by Sei92; mean squared error of 0.0269 versus 0.0334 for DDSM). In addition, effective class-conditional generation was demonstrated via guided Dirichlet flow matching to design cell-type-specific enhancers, evaluated in silico by measuring the distance in embedding space between the generated samples and enhancer sequences from the genome25,70.
Autoregressive language models
Autoregressive language models learn the conditional probability of the next token based on previous ones (Fig. 4c). For example, regLM144 is a framework to design synthetic promoters and enhancers with desired properties, such as high, low or cell-type-specific activity, using a DNA language model144. The authors label-encoded biological activity using ‘prompt tokens’ that were prefixed to the DNA sequence and trained a HyenaDNA model97 to perform next-token prediction starting from the biological prompts. Once trained, the model can generate promoters or enhancers with the prompted level of activity. However, a sequence-to-function model, trained on MPRA data, was still used as an oracle to select promising candidates from the generated sequences.
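Conceptually, prompt-conditioned generation is a simple sampling loop, as sketched below; a toy conditional distribution stands in for the trained language model, and the '<HIGH>' prompt token is a hypothetical label rather than regLM's actual vocabulary.

```python
# Prompt-conditioned autoregressive generation: a label token is prefixed
# and bases are sampled one at a time from the model's conditional
# distribution. `next_token_probs` is a placeholder for a trained model.
import numpy as np

rng = np.random.default_rng(6)
BASES = "ACGT"

def next_token_probs(context):
    # Toy conditional model: the hypothetical '<HIGH>' prompt biases towards G/C.
    if context.startswith("<HIGH>"):
        return np.array([0.15, 0.35, 0.35, 0.15])
    return np.array([0.25, 0.25, 0.25, 0.25])

def generate(prompt, length=50):
    seq = ""
    for _ in range(length):
        p = next_token_probs(prompt + seq)
        seq += rng.choice(list(BASES), p=p)
    return seq

print(generate("<HIGH>"))
```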
Generative artificial intelligence provides a promising avenue for the generation of novel synthetic promoters and enhancers. Nevertheless, employing the most advanced models for synthetic design does not compensate for the need for experimental validation of synthetic DNA sequences to fully appreciate the performance of the model.
Evaluation of synthetic enhancer activity
Two aspects are important when evaluating synthetic enhancers: specificity and strength. The designed enhancer must be specifically active only in the target cell types. Additionally, the level of transcription the enhancer induces must be strong enough to drive sufficient gene expression without being so high as to disrupt normal cellular processes. The latter is especially important in the context of gene therapy, where a therapeutic product potentially needs to be expressed at a specific level. Currently, there is a trade-off between both aspects: either synthetic enhancers are tested using fluorescent reporter assays, where the activity of the enhancer is visualized through fluorescence microscopy80,106, or synthetic enhancers are tested using MPRAs107,108. On the one hand, using microscopy, the cell type (and/or developmental time) specificity of the enhancer is easily evaluated, especially when the enhancer is integrated into the organism’s genome at an early developmental stage, avoiding mosaicism. On the other hand, these assays are limited in quantifying enhancer strength, as fluorescence intensity may not linearly correlate with transcriptional output, and it is difficult to control for enhancer copy number. MPRAs provide a quantitative measurement of enhancer strength, controlling for copy number, but are typically performed in bulk populations of cells and mostly in a handful of cell types at a time (limiting their ability to measure enhancer specificity). Furthermore, not all cell types are easily accessible to MPRAs. A challenge remains in combining the strengths of these two approaches. Future developments in the field may involve the use of single-cell MPRAs145,146 and the use of primary cells to fully quantify both the specificity and strength of synthetic enhancers.
Some studies evaluate synthetic enhancers only computationally134,137,139,143,147, mostly based on sequence diversity, motif content and similarity to genomic enhancers. In some cases144, these relatively simple metrics are combined with prediction scores from independent sequence-to-function models. Despite being valuable (for example, as a first quality filter), such metrics should be used in concert with experimental validation.
Outlook
Classical genetics studies and high-throughput genomic assays combined with computational modelling have resulted in a comprehensive understanding of the rules underlying enhancer activity. From these studies, the location, orientation, affinity, action (activating versus repressing), number and identity of TFBSs, and the distance between them are the most important features. The necessity and sufficiency of these rules can be assessed through synthetic enhancer design, using a model as a guiding oracle and testing cell-type-specific activity experimentally. The workflow of design and evaluation using sequence-to-function models is an important tool in the toolbox of genomics researchers, facilitated through easy-to-use software packages (Box 1).
Genomic models have already shown many successes in enhancer interpretation and design, including for fly genomes80,106, where data are more limited owing to the smaller genome size (compared with mammals); this limitation can be marginally alleviated using phylogenetic augmentation148. However, De Boer and Taipale11 have argued that the performance of models trained on genomic sequences is generally overestimated owing to data leakage caused by sequence homology11. In addition, genomic sequence diversity is insufficient to explore all possible features underlying enhancer activity. Using randomly synthesized DNA sequences as training data instead, after testing their activity, poses a possible solution11. This provides interesting prospects, not only owing to the increased number of diverse training samples but also by designing specific sequences to further improve model performance in an active learning paradigm. Although interesting, this is limited to cell types in which enhancer activity can be measured in a high-throughput manner (that is, mostly cell lines that are easily transfectable, proliferative enough to produce a high number of cells and homogeneous enough to measure activity in bulk); therefore, further optimization of single-cell and in vivo MPRAs is needed145,146,149.
In line with this reasoning, it seems intuitive that large genomic datasets comprising as many different cell types and states as possible would result in better models for cell-type-specific enhancer design. Nevertheless, ‘niche’ models trained on a specific system28,77 or even a single cell type68 already provide sequence interpretations with deep insight. The transition of training data from bulk to single cells is beginning to provide fine-grained models in this regard. A short-term challenge is to train foundation models that cover a larger universe of cell types while maintaining high-resolution interpretations for each individual cell type.
Besides training data, there are important technical aspects to consider when designing a sequence-to-function model for enhancer design, for instance, the choice between regression and classification. Classification models are more prone to learning discrete sets of features that optimize the decision boundaries between classes, rather than a continuous non-linear function that maps input to output variability, as in regression. In practice, this might result in the exploitation of TFBSs that are not part of the cell type’s enhancer logic by attributing a negative feature importance to them, as they decrease the likelihood of the positive class.
In this Review, we focused on the interpretation and design of single enhancers; however, genes are usually regulated by multiple enhancers that interact either additively113 or multiplicatively114,115, potentially conferring robustness150. Furthermore, expression is dependent on intrinsic promoter activity151, enhancer–promoter interactions151,152 (which are mostly non-linear113,152,153), enhancer–promoter distance115,153,154,155 and enhancer–enhancer distance115,155. Therefore, modelling the expression profile of a target gene is more challenging than modelling individual enhancer activity. Specialized architectures, for example, the combination of the transformer architecture and U-net in Borzoi73 or Scooby156, are needed to reflect both local features (individual TFBSs) and global features (individual enhancers). Next to cis-regulatory interactions, the steady-state transcriptome is further influenced by multiple biological aspects, including promoter-proximal pausing, RNA processing and degradation, thus adding complexity to modelling gene expression157. To address part of this multifactorial process, transcriptional initiation was modelled based on precision nuclear run-on cap (PRO-cap) data, revealing the rules of transcription initiation and, separately, the features important for the strand-specific location of the gene’s TSS and for total initiation activity157. We envision that sequence-to-expression models can be used to generate artificial genomic loci driving specific levels of gene expression, potentially leading to the design of an entire novel cell state with a designed transcriptome. However, the use of sequence-to-expression models for locus design is currently still limited, both because of the technical challenges related to producing and testing large DNA sequences and because these models do not always generalize well103,104 and often miss important distal regulatory elements116.
Sequence-to-function models are now also widely established for protein modelling158. Models that can accurately predict protein–protein and protein–DNA interactions, such as AlphaFold 3 (ref. 159), will provide further insights into genomics. For example, such models could predict the effect of non-coding and coding mutations on TF binding, cooperative binding of multiple TFs and how TF binding is affected by co-factors. These models could be included in a fitness function when designing synthetic DNA sequences or even be used to jointly design enhancer sequences with synthetic TF complements.
Next to the fundamental insights that can be gained from modelling and designing enhancers, synthetically designed enhancers have practical applications. Being programmed to be active in a specific cell type or state, synthetic enhancers are invaluable for gene therapy applications (Box 2). In the context of biomedical research, enhancers can be designed to serve as markers for specific cell types by driving a fluorescent reporter gene, enabling the visualization and tracking of cell types in vivo. In line with this, enhancers could be designed to only drive gene expression after a certain cell-state change, for example, neuronal activity, cell–cell signalling, hypoxia or malignant cell-state switching.
Thus, we anticipate that improvements in data (for example, single-cell MPRAs potentially on large libraries of random DNA sequences), improvements in models (such as sequence-to-gene expression models that generalize well and that can be applied to single-cell datasets156) and improvements in benchmarking studies (for example, to evaluate whether generative artificial intelligence produces higher quality synthetic enhancers compared to oracle-based optimization methods) will provide further advances in the field of enhancer design.