Modelling and design of transcriptional enhancers

Introduction
The control logic underlying the transcriptome of a cell is encoded in its genomic sequence, that is, in non-coding regions called enhancers and promoters (see ref. 1 for a detailed review and refs. 2,3,4 for historical reviews). Gene expression can be regulated by multiple enhancers in cis, located upstream or downstream of the gene, or in introns. Enhancers regulate gene expression by forming a nucleoprotein complex with transcription factors (TFs) and co-factors (Fig. 1a). In this paradigm, multiple TFs bind at clusters of transcription factor-binding sites (TFBSs; the TP53 responsive elements are a counterexample5). Subsequently, through interaction with the basal transcriptional machinery (BTM) at the transcription start site (TSS), gene expression is induced. In this Review, we distinguish between core promoters and enhancers: the former overlap with the TSS, interact with the BTM and usually act in a manner that is not cell-type specific6. By contrast, enhancers are cell-type specific and depend on the interaction with a core promoter to influence target gene expression.

a, Representation of a genomic locus containing a core-promoter bound by RNA polymerase II and an enhancer bound by multiple transcription factors. The transcription factors interact with co-factors that, in turn, interact with RNA polymerase II at the promoter to regulate gene expression. The underlying mechanism of this interaction is beyond this Review’s scope; hence, the relevant proteins are represented in grey scale. Note that both the enhancer and promoter are nucleosome depleted. b, Representation of an enhancer with nucleotides scaled based on their importance to cell-type-specific enhancer activity. Groups of nucleotides with positive importance coincide with transcription factor-binding sites (TFBSs) of transcriptional activators (dotted lines). Groups of nucleotides with negative importance coincide with TFBSs of transcriptional repressors. The other nucleotides are spacer or background sequences with little importance to enhancer activity. Additional features (enhancer grammar) with potential importance for enhancer activity are highlighted using bars. These include the combination of TFBSs for a specific combination of transcription factors (both activators and repressors), the number of such binding sites, their affinity, the distance between individual binding sites, and whether the binding site is at the flank of the enhancer (near the nucleosomes) or not. An enhancer has the size of a single nucleosome-depleted region (l; ~200 bp) and the distance between individual, non-overlapping, binding sites (d) can range from 3 bp (ref. 80) up to the length of the enhancer. c, Illustration of a plasmid with enhancer, minimal core-promoter (fluorescent) reporter gene and barcode that can be used for enhancer reporter assays. Top: illustration of massively parallel reporter assay where the number of barcodes over the number of plasmids is quantified through next-generation sequencing (NGS). Bottom: illustration of in vivo enhancer reporter assay where enhancer activity is measured as the level of fluorescence using microscopy.
Once the TFBSs of an enhancer are determined (Fig. 1b), their necessity and sufficiency can be examined by generating synthetic enhancers and testing them using reporter assays. For instance, homotypic clusters of binding sites for TFs downstream of signalling pathways are sufficient to build reporters for JAK, cAMP and WNT signalling7,8,9. In addition, five copies of the heterotypic cluster consisting of TFBSs for CREB, MEF2, SRF and TCF are sufficient to drive expression in response to neuronal activity10.
From these early synthetic enhancer examples, it is tempting to state that deciphering the enhancer code ‘simply’ boils down to identifying the combination of TFs that co-bind enhancers. However, given that neither enhancers nor TF binding are discrete biological entities and that both the expression of TFs and their interactions are cell-type specific, genome-wide identification and interpretation of enhancers are technically challenging (see ref. 11 for an excellent comparison of deciphering the protein and cis-regulatory code). Indeed, the activity of synthetic WNT reporters is specific to certain biological contexts only, requiring additional cell-type-specific TFBSs to drive activity in others9. Furthermore, certain grammar rules, such as distance constraints, presence of weak TFBSs and absence of repressor sites, might apply to form functional enhancers (Fig. 1b).
To address the intricacies of enhancers, massively parallel reporter assays (MPRAs)12 (Fig. 1c) that test a variety of rules using brute force have been used to determine the features underlying enhancer activity13,14,15,16,17,18,19,20,21,22,23. Complementary assays that measure proxies of enhancer activity, such as chromatin accessibility and histone modifications, have provided a wealth of information in a variety of cell types24,25,26,27,28,29,30. To link DNA sequence features (Fig. 1b) to (proxies of) enhancer activity, computational models such as deep neural networks (DNNs) are used. We refer to these models as sequence-to-function models as they predict high-throughput genomic assays in the cell type of interest using the DNA sequence as input. Accurate sequence-to-function models can serve as ‘oracles’ for biology, partially obviating the need for additional high-throughput screens to test and generate new hypotheses. Naturally occurring or fully synthetic DNA sequences can be optimized according to the output of the oracle, facilitating the development of synthetic enhancers with tailored functional properties.
In this Review, we explore the design of synthetic enhancers employing DNNs as predictive oracles. The first section focuses on sequence-to-function models, comparing conventional machine learning techniques with deep learning approaches. Subsequently, we provide an overview of methods used for synthetic enhancer design. Finally, we discuss future perspectives and potential applications of synthetic enhancers in various biological and biomedical contexts.
Sequence-to-function models
Conventional machine learning models
Enhancers were first discovered in the early 1980s as a viral genomic fragment enhancing transcription of a nearby gene31,32,33,34. Soon after, equivalent eukaryotic elements were discovered35,36,37,38,39,40,41,42,43 and it was shown that enhancers consist of TFBSs44,45,46,47,48,49,50,51 and have cell-type-specific activity48,52,53,54,55. A well-described example of an enhancer is the even-skipped (eve) stripe 2 enhancer56. During Drosophila development, eve is expressed in seven transverse stripes along the embryonic anterior–posterior axis. The eve stripe 2 enhancer regulates the second stripe and has been studied in detail using site-directed mutagenesis and TF mis-expression experiments56,57,58.
Attempts have been made to reconstruct the eve stripe 2 enhancer based on identified TFBSs59,60. For example, a computational model was trained on orthologous enhancer sequences of eight Drosophila species60. Subsequently, this model was used to design synthetic enhancers with varying degrees of similarity to the naturally occurring sequence60. Sequences close to the wild-type enhancer produced expression in the correct segment whereas sequences with more edits lost their specificity. The model’s failure to predict correct enhancer activity of more divergent sequences might be caused by the limited amount of training data.
To increase the amount of training data and build better models, rather than focusing on a single enhancer, information from multiple enhancers active in the same cell type can be used. In this regard, a logistic regression model was fit to classify human skeletal-muscle enhancers61. The model used position weight matrix (PWM) log-likelihood scores (representing the likelihood that a given sequence is generated by the PWM model over a background model) of known muscle TFs. With a training set of 29 skeletal-muscle enhancers and ~2,000 randomly sampled negatives, it reached 60% sensitivity at a 4% false positive rate61. In another study, over 700 features were used to classify human heart enhancers62. These included known PWM log-likelihood scores from TFs from a variety of tissues, log-likelihood scores of de novo PWMs and Markov model scores with orders up to six. A Lasso regression model was fit to classify enhancers and, using a training set of 77 validated enhancers and 1,000 randomly sampled negatives, it reached a sensitivity of 70% at a 50% false positive rate62. Furthermore, making use of the model’s weights, PWMs of important heart-specific TFs could be identified. Finally, the model was used to scan the genome, identifying novel heart-specific enhancers with an accuracy of 74%62. Together, these studies highlight that, even with a small sample size, accurate sequence-to-function models can be built.
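To make this featurization concrete, the sketch below scores toy sequences with a single PWM log-odds feature and fits a logistic regression on it. The PWM, the sequences and the 'AGCT' motif are invented for illustration and do not correspond to the published muscle or heart models.

```python
# Minimal sketch: a PWM log-odds feature feeding a logistic-regression
# enhancer classifier (toy data; not the published models).
import numpy as np
from sklearn.linear_model import LogisticRegression

BASES = "ACGT"

def pwm_log_odds(seq, pwm, background=0.25):
    """Best log-odds score of `pwm` (motif length x 4) over all windows of `seq`."""
    w = pwm.shape[0]
    best = -np.inf
    for i in range(len(seq) - w + 1):
        s = sum(np.log2(pwm[j, BASES.index(b)] / background)
                for j, b in enumerate(seq[i:i + w]))
        best = max(best, s)
    return best

# Toy 4-bp PWM (rows: positions; columns: A, C, G, T) with consensus 'AGCT'.
pwm = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.1, 0.1, 0.7, 0.1],
                [0.1, 0.7, 0.1, 0.1],
                [0.1, 0.1, 0.1, 0.7]])

rng = np.random.default_rng(0)
random_seq = lambda n: "".join(rng.choice(list(BASES), n))

# Positives carry the consensus motif; negatives are random background.
pos = [random_seq(20) + "AGCT" + random_seq(20) for _ in range(50)]
neg = [random_seq(44) for _ in range(50)]
X = np.array([[pwm_log_odds(s, pwm)] for s in pos + neg])
y = np.array([1] * 50 + [0] * 50)

clf = LogisticRegression().fit(X, y)
print("training accuracy:", clf.score(X, y))
```

A real model would use one such feature per TF-specific PWM, yielding a feature vector per candidate enhancer rather than the single column used here.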
Mechanistic insights into gene regulation can be gained through thermodynamic models. Thermodynamic models use the DNA sequence and TF concentration to model gene expression. A prominent example is Gemstat63, which predicts gene expression in a two-step procedure, modelling both the interaction of TFs with DNA and the interaction of bound TFs with the BTM. By selecting the model that best fits the data, mechanistic insights can be acquired. For instance, Gemstat was fit on 44 enhancers with varying anterior–posterior expression profiles in the Drosophila blastoderm along with the TF protein levels. By selecting the model that best explained the data, a case for synergistic interactions between TFs and short-range repression was made63.
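A minimal numerical sketch of the thermodynamic idea follows, assuming a simple single-site Boltzmann occupancy and a logistic mapping from occupancy to expression; the constants and the occupancy-to-expression link are illustrative and far simpler than Gemstat's two-step formulation.

```python
# Hedged sketch of a thermodynamic model: occupancy of each site follows a
# Boltzmann/Michaelis-like form, and expression is modelled from the
# weighted occupancies of activators and repressors. All constants are
# illustrative, not fitted values.
import numpy as np

def occupancy(tf_concentration, relative_affinity):
    # P(site bound) = [TF]*K / (1 + [TF]*K) for a single site in isolation.
    x = tf_concentration * relative_affinity
    return x / (1.0 + x)

def expression(activator_occ, repressor_occ, w_act=3.0, w_rep=4.0):
    # Logistic link from weighted activator/repressor occupancy to expression.
    drive = w_act * np.sum(activator_occ) - w_rep * np.sum(repressor_occ)
    return 1.0 / (1.0 + np.exp(-drive))

# Two activator sites (strong and weak affinity) and one repressor site.
act = occupancy(np.array([1.0, 1.0]), np.array([2.0, 0.3]))
rep = occupancy(np.array([0.5]), np.array([1.5]))
print(f"predicted expression: {expression(act, rep):.2f}")
```

Fitting such a model amounts to finding the affinities and interaction weights that best reproduce the measured expression across many enhancers, which is how mechanistic hypotheses (for example, short-range repression) can be compared.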
Both the thermodynamic and classic regression models rely on features based on pre-existing knowledge (for example, a given set of PWMs). However, features can also be learned de novo directly from the DNA sequences. This requires a much larger sample size; therefore, genome-wide assays that identify hundreds to thousands of candidate enhancers have been of tremendous importance. In this regard, more than 30,000 chromatin accessibility peaks were used to train a gapped k-mer support vector machine (gkm-SVM) classifier64, distinguishing chromatin accessibility peaks from randomly sampled genomic background sequences. The gkm-SVM model uses vectors of all possible 10-mers with 6 informative bases (that is, the remaining 4 positions within each 10-mer are treated as gaps) as features. It reached a sensitivity of ~70% at a false positive rate of ~5% using MPRA data as ground truth64.
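The sketch below illustrates gapped k-mer counting with a tiny l = 4 and k = 2 so the feature space stays small (the actual gkm-SVM uses l = 10, k = 6); the resulting count vectors are what would be fed to the SVM.

```python
# Illustrative gapped k-mer featurization in the spirit of gkm-SVM.
# Gapped (non-informative) positions are written as '.'.
from itertools import combinations
from collections import Counter

def gapped_kmers(seq, l=4, k=2):
    counts = Counter()
    for i in range(len(seq) - l + 1):
        window = seq[i:i + l]
        # Every choice of k informative positions within the l-mer.
        for informative in combinations(range(l), k):
            key = "".join(window[j] if j in informative else "."
                          for j in range(l))
            counts[key] += 1
    return counts

features = gapped_kmers("GATTACAGATTACA")
print(features.most_common(5))
```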
In conclusion, sequence-to-function models leveraging information that is shared between enhancers active in the same cell type can reach high accuracies to classify functional enhancers.
Deep neural networks
Since the mid-2010s, there has been a boom in sequence-to-function models employing DNNs (Table 1). Technically, these models take candidate regulatory sequences as input and predict biological properties in either a regression or classification task. For instance, a model might predict ‘whether’ a region is active as an enhancer (classification) or ‘to what extent’ it is active (regression). A single input can have multiple output labels; a single genomic region may be active in several cell types (Fig. 2a–c). The main architectures used for DNN sequence-to-function models are convolutional neural networks (CNNs), recurrent neural networks (RNNs) and transformers65,66,67 (Table 1 and Fig. 2b).

a, A dataset of genomic regions is split into training, validation and test sets. Each genomic region has a label. This label can be binary (for example, a region is accessible or not, represented by 0/1) or can be a scalar value (for example, the level of chromatin accessibility, represented by natural numbers). A label can be an array of values, representing a genomics measurement at multiple genomic bins over the input sequence (such as the chromatin accessibility profile). Finally, each sequence can have labels for multiple classes, representing different cell states. This is indicated by stacked boxes. The type of input data used by various genomics sequence-to-function models is indicated. b, Labelled data is used to train deep neural networks. In genomics, supervised sequence-to-function models make use of convolution optionally combined with dilation, long short-term memory (LSTM) or attention architectures. c, A loss function needs to be defined to map the output of the model to a number representing the error of the model; this error is minimized during the training step. An illustration of binary cross entropy (BCE) with the log probability of data labelled as 1 (green) and 0 (red) and cosine similarity is shown. The performance of the model is evaluated using various metrics; Pearson correlation coefficient (PCC) and precision–recall (PR) curves are illustrated. d, After the training step, predictions can be explained using explainable artificial intelligence techniques. Top, representation of attribution scores, where the scale of each nucleotide is proportional to its importance for a particular prediction of the model. Bottom, representation of in silico saturation mutagenesis (ISM), where each nucleotide is mutated to each of the three other possible nucleotides and the effect on model output is plotted.
CNNs use convolutional filters that are tuneable during training. These filters are scanned over the one-hot encoded input (a numerical representation in which each nucleotide is transformed into a binary vector with four positions, where one position is ‘1’, indicating the presence of a specific nucleotide, and the other positions are ‘0’), producing a local weighted sum of the input features66. For the first convolutional layer, this is analogous to scanning a PWM over a DNA sequence; in the first layer, each nucleotide is seen in the context of other nucleotides, the number of which is determined by the filter size. However, filter sizes are generally small to match local patterns such as TFBSs. Therefore, many convolutional layers need to be stacked together to reach a large context size, which increases the number of trainable parameters and consequently makes the model more prone to overfitting with limited training data. To overcome these limitations, dilated convolutions68 or pooling layers can be adopted, increasing the context size by either skipping or aggregating consecutive input features.
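The following numpy sketch makes the PWM analogy explicit: a one-hot encoded sequence is scanned by a single, here untrained and randomly initialized, convolutional filter. Trained CNNs apply hundreds of such filters, followed by non-linearities and further layers.

```python
# One-hot encoding plus a single convolutional filter scan, the PWM-like
# operation of a CNN's first layer (illustrative numpy version).
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x

rng = np.random.default_rng(1)
seq = "TTGACGTCATT"
x = one_hot(seq)                  # shape (sequence length, 4)
filt = rng.normal(size=(4, 4))    # one untrained filter of width 4

# Valid cross-correlation: a local weighted sum at every offset.
w = filt.shape[0]
activations = np.array([np.sum(x[i:i + w] * filt)
                        for i in range(len(seq) - w + 1)])
print(np.round(activations, 2))
```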
RNNs are networks specifically designed for sequential data66. In genomics, RNNs often make use of long short-term memory (LSTM). The LSTM block maintains an internal state, which the model updates sequentially as it processes outputs from the previous layer. At each step (representing a window of the input sequence), the internal state is updated based on the current input, effectively capturing the dependencies between successive steps66. LSTM blocks can be preceded by a convolutional layer to first capture local patterns and later model the dependencies of these patterns25,27,28,69,70. Owing to the sequential nature of RNNs, the LSTM step cannot be parallelized within a single sample, limiting efficiency. This mainly concerns the applicability of RNNs to longer input sequences, where memory constraints limit batching across samples71.
The transformer architecture is more efficient than RNNs and has a greater capacity to represent large sequence lengths71. In transformers, each input segment is mapped into an embedding space, for example, using a block of convolutional layers72,73. These embeddings are dynamically adjusted based on the contextual relationships formed across the entire input sequence. The degree to which one segment influences another is determined by the attention mechanism, a learnable component that assigns varying levels of importance to different segments during training71, enabling the modelling of a much larger sequence context (Table 1).
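A stripped-down illustration of the attention mechanism is shown below, omitting the learned query, key and value projections, multiple heads and positional information of real transformers: each segment embedding is re-weighted by its similarity to all other segments.

```python
# Minimal scaled dot-product self-attention over sequence embeddings
# (conceptual sketch; real models add learned projections and many layers).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """X: (segments, d) embeddings; returns context-mixed embeddings."""
    d = X.shape[1]
    weights = softmax(X @ X.T / np.sqrt(d))  # how much each segment attends to the others
    return weights @ X

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 8))      # 6 input segments, embedding dimension 8
print(self_attention(X).shape)   # (6, 8)
```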
Explainable artificial intelligence can be used to obtain gene regulatory insights from DNNs74 (Fig. 2d). CNNs can capture short sequence motifs relevant for regulation within their first convolutional layer27,74. Additionally, probing DNNs through both forward and backward propagation enables inference of the effect of each nucleotide on the prediction74. One effective method is in silico saturation mutagenesis, where each nucleotide in a region is systematically mutated to observe changes in the model predictions. Additionally, the model’s gradient with respect to an input can be evaluated to assess the effect on the prediction of infinitesimally small input changes. Explanations can be visualized using plots, where the height of each nucleotide is adjusted based on its contribution to the model’s prediction. This visualization offers an intuitive representation of how individual nucleotides influence the model’s output, potentially identifying important TFBSs (Fig. 2d).
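In silico saturation mutagenesis is straightforward to express in code; in the sketch below, a toy motif-counting 'model' stands in for a trained DNN.

```python
# In silico saturation mutagenesis: mutate every position to every other
# base and record the change in model output. `model_predict` is a
# stand-in for any trained sequence-to-function model.
import numpy as np

BASES = "ACGT"

def ism(seq, model_predict):
    ref = model_predict(seq)
    effects = np.zeros((len(seq), 4))
    for i in range(len(seq)):
        for j, b in enumerate(BASES):
            if b == seq[i]:
                continue
            mutant = seq[:i] + b + seq[i + 1:]
            effects[i, j] = model_predict(mutant) - ref
    return effects  # (length, 4) matrix of mutation effects, as in Fig. 2d

# Toy oracle: counts occurrences of a 4-bp 'motif'.
toy_model = lambda s: float(s.count("GATA"))
print(ism("TTGATATT", toy_model))
```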
From a biological standpoint, sequence-to-function models employing DNNs can be subdivided into three categories: those that predict a single modality in a single cell type, those that predict a single modality across multiple cell types, and large models that predict many modalities across many biological conditions.
Modelling a single modality in a single cell type
DeepBind75, one of the first genome-based CNNs, takes 14–101-bp DNA (or RNA) sequences as input and predicts TF binding affinity based on protein binding microarray data. Inspired by DeepBind, one of the first ‘niche’ DNNs was successfully trained on chromatin immunoprecipitation with high-throughput sequencing (ChIP–seq) data to predict functional TP53 enhancers5.
Similarly, BPNet68 predicts TF binding but its output is more sophisticated than that of DeepBind; it takes 1-kb DNA sequences as input and separately predicts ChIP–nexus binding profile shape and coverage at single base pair resolution. Like DeepBind, BPNet is a CNN but makes use of dilated convolution to increase its receptive field. BPNet can predict the binding profiles of SOX2, OCT4 (also known as POU5F1), KLF4 and NANOG in embryonic stem cells68. A distance-dependent bias towards cooperative binding could be revealed by inserting TFBSs at varying relative distances in random background sequences and examining model predictions68. Indeed, the use of simulations provides a powerful way to learn cis-regulatory rules and is conceptually close to enhancer design.
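A hedged sketch of this kind of simulation follows: two placeholder motifs are implanted at varying spacing into random backgrounds, and the predictions of a toy model are averaged across backgrounds to isolate the distance effect, conceptually as in the BPNet analysis (the motifs, positions and oracle here are all invented).

```python
# Motif-spacing simulation: implant two motifs at varying distances in
# random backgrounds and read out the (here toy) model's prediction.
import numpy as np

rng = np.random.default_rng(3)

def random_background(n):
    return "".join(rng.choice(list("ACGT"), n))

def implant(background, motif, pos):
    return background[:pos] + motif + background[pos + len(motif):]

def distance_scan(model_predict, motif_a, motif_b, length=200, n_backgrounds=32):
    scores = {}
    for d in range(5, 100, 5):                      # spacing between the motifs
        vals = []
        for _ in range(n_backgrounds):
            s = random_background(length)
            s = implant(s, motif_a, 50)
            s = implant(s, motif_b, 50 + len(motif_a) + d)
            vals.append(model_predict(s))
        scores[d] = float(np.mean(vals))            # average out background effects
    return scores

def toy_oracle(s):
    # Toy model rewarding the two motifs only when they are close together.
    i, j = s.find("TTTAT"), s.find("CATTGT")
    return float(i >= 0 and j >= 0 and abs(j - i) < 40)

print(distance_scan(toy_oracle, "TTTAT", "CATTGT"))
```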
Finally, DeepSTARR76 is a CNN that takes 249-bp sequences as input and predicts developmental and housekeeping enhancer activity as measured by genome-wide STARR sequencing in Drosophila S2 cells. In this work, the authors selected ~200 sequences with variable prediction scores and measured their activity in vitro. The activity of these sequences was consistent with predictions by the model, highlighting the power of DNNs to make accurate predictions even on left-out or random data76.
Modelling a single modality across multiple cell types
A second category consists of models that predict chromatin accessibility (or another epigenomic track) across multiple cell types. Owing to the multi-class, multi-label nature of these models, they can learn features important for cell-type-specific chromatin accessibility in the context of other cell types.
DeepMel27, DeepMel270, DeepFlyBrain25, DeepLiver28 and DeepBrain77 fit in this category (we will refer to these models as ‘DeepTopic’ models). These models take 500-bp DNA sequences as input and perform binary multi-label classification of cell-type-specific chromatin accessibility peaks using either a combination of convolutional and LSTM layers25,27,28,70 or only convolution (in the case of DeepBrain77). DeepLiver introduces a further variation: after training on accessibility data, it uses transfer learning on MPRA data to discern additional features related to enhancer activity28. Regarding the training data, DeepTopic models do not learn directly from raw single-cell ATAC sequencing data but rather from a latent representation that is learned beforehand using topic modelling78,79. Each topic represents both a combination of regions that are co-accessible across cells and a combination of cells with a similar chromatin accessibility profile78,79. These models can accurately classify chromatin accessibility peaks across cell types of human melanoma states27,70, the Drosophila brain25, the mouse liver28, the mouse and human cortex77, and the chicken telencephalon77. Even though these models were not directly trained on enhancer activity, they can learn features relevant to enhancer activity80.
Basset81 is a binary classifier of bulk chromatin accessibility measured across hundreds of cell lines using 600-bp DNA sequences as input. The follow-up model scBasset82 is a binary classifier at the single-cell level with 1,344-bp DNA sequences as input. scBasset has a bottleneck layer to generate a lower dimensional representation of the cells and can infer cell-type-specific importance of individual TFBSs, denoise single-cell ATAC sequencing data and perform batch effect correction.
In contrast to the DeepTopic and (sc)Basset classification models, AI-TAC83 is a regression model that takes 251-bp DNA sequences as input and predicts the level of chromatin accessibility across 81 mouse immune cell types83. Using this model, a combinatorial set of motifs underlying immune cell states was identified83.
Modelling many modalities across various biological contexts
The third category consists of models with a larger number of parameters (for example, 30–250 million) trained on multiple modalities across many different biological contexts. In practice, these models are trained on large publicly available data repositories such as ENCODE84,85,86,87, Roadmap Epigenomics88 and Cistrome89,90. DeepSEA91, DanQ69 and Sei92 are multi-label binary classifiers predicting whether a region (of 1–4 kb) is bound by a TF, has a certain histone mark on its surrounding nucleosomes or is accessible. Basenji93, Enformer72 and Borzoi73 are sequence-to-function models with progressively larger input sizes of 131 kb, 200 kb and 523 kb, respectively. Similar to DeepSEA, DanQ and Sei, they predict multiple functional modalities from DNA sequences, including TF binding, histone modifications and chromatin accessibility. However, Basenji, Enformer and Borzoi stand out by predicting the levels of these modalities through regression rather than classification tasks. Moreover, these models predict gene expression levels as either cap analysis of gene expression (CAGE) sequencing at the TSS72,93 or RNA sequencing across the gene body73. To achieve the large receptive field crucial for accurately modelling gene expression, Basenji employs dilated convolution techniques93. By contrast, Enformer and Borzoi leverage the transformer architecture72,73; notably, analysing the attention weights learned by the transformer models provides insights into candidate enhancers that might be regulating gene expression72,73. Based on Enformer, Epiformer29 was trained to specifically predict cell-type-specific chromatin accessibility in the human brain29.
DNA language models
DNA language models are gaining traction in genomics94,95,96,97,98,99. These models are trained on vast amounts of unlabelled data in a self-supervised fashion. Specifically, masked genomic language models are trained directly on the DNA sequences by predicting masked segments, or ‘tokens’, from the surrounding context. This self-supervised approach enables the models to learn and reconstruct intricate dependencies within the genome99. These learned dependencies are encoded in high-dimensional embeddings, which can serve as rich features for efficient training of supervised models on specific tasks (fine-tuning)100. For example, SegmentNT100 fine-tuned the Nucleotide Transformer95 to segment the genome into several regulatory categories, including the prediction of enhancers and promoters99,101.
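The sketch below shows how masked training examples for such a model could be constructed, assuming a simple non-overlapping 3-mer tokenization; tokenization schemes vary between published models, so this is illustrative only.

```python
# Constructing masked-token training data for a DNA language model:
# 3-mer tokenization followed by random masking. A transformer would
# then be trained to recover the masked tokens from context.
import random

def tokenize_3mers(seq):
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    random.seed(seed)
    masked, targets = [], {}
    for i, t in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = t       # the model must predict these from context
        else:
            masked.append(t)
    return masked, targets

tokens = tokenize_3mers("ATGCGATTACAGATTACAGGG")
masked, targets = mask_tokens(tokens)
print(masked, targets)
```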
Designing enhancers and promoters
Sequence-to-function models as oracles
DNN sequence-to-function models can generalize to unseen data
Typically, cross-validation is used to evaluate the performance of a model on left-out data. Moreover, sequence-to-function models can generalize to data outside the genome. Enformer, DeepMel, DeepLiver and gkm-SVM accurately predict the effect sizes of mutations from in vitro saturation mutagenesis assays of enhancers27,28,64,72. The predictive accuracy is particularly high at locations of TFBSs, where changes can substantially alter gene expression. Similarly, DeepSTARR can accurately predict the activity of hundreds of random DNA sequences76. Furthermore, DeepMel was used to identify orthologous enhancers across species27; by analysing evolutionary changes, DeepMel provided insights into which specific alterations contribute to differences in regulatory activity27. Similarly, DeepBrain77 could map orthologous cell types between human and chicken, AI-TAC83 could predict chromatin accessibility across human and mouse immune cells, and using an SVM, binding of Twist across fly species could be classified102.
Gene expression variation is another task on which models can be evaluated for their ability to predict the effect of unseen data. Along the same lines, Enformer72 and Borzoi73 can predict the influence of genetic variants on gene expression to a certain extent; these models begin to distinguish expression quantitative trait loci from negative controls72,73 but still require improvement to fully model cross-individual variation103,104. Together, these experiments show that high-fidelity sequence-to-function models, capable of generalizing to unseen DNA sequences, can function as powerful biological ‘oracles’. By formulating a cost function based on predictions by the oracle, sequences can thus be optimized towards achieving target enhancer activity (Fig. 3). Of note, enhancers likely represent sequences close to local optima of the landscape spanned by all possible DNA sequences. Below, we provide an overview of recent research where DNNs are used to design synthetic enhancers or promoters.

An identical seed sequence is optimized towards two different cell types (left versus right). For each cell type, a different cost function is formulated, representing separate cell-type-specific optimization landscapes with enhancers represented by circles close to local minima. Nucleotide importance scores for enhancer activity in the two cell types are illustrated along the optimization process together with modifications compared to the starting seed (*). At the start, the seed sequence has binding sites for a repressor unique to each cell type (represented by negative importance scores, that is, letters facing downwards). The repressor binding sites are removed using a single nucleotide change. Activator-binding sites (positive importance scores, that is, letters facing upwards) are created by additional changes. Eventually, the activity of the synthetic enhancer is illustrated as a green signal in an embryo.
Using the oracle output to design promoters and enhancers
The type of training data used to fit the oracle model is critical for designing synthetic cis-regulatory elements as it must be closely related to the desired output (Supplementary Table 1). In simpler model systems like yeast, this alignment between training data and the target biological outcome is often straightforward. The unicellular nature of yeast allows for direct measurement of promoter activity; for example, a CNN was trained to predict reporter gene expression based on the expression induced by random 80-mers105. Subsequently, the vast landscape of all possible 80-mers was navigated by a genetic algorithm using the output of the trained model as a fitness function, resulting in synthetic promoters exhibiting substantially higher reporter expression levels compared to natural promoters105.
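A toy version of this optimization loop is sketched below, with a trivial stand-in fitness function in place of the trained CNN; selection, crossover and mutation follow the standard genetic-algorithm recipe rather than the exact published hyperparameters.

```python
# Genetic-algorithm search over 80-mers with a model as the fitness
# function (toy sketch; the real study used a CNN trained on millions
# of random 80-mers).
import random

BASES = "ACGT"
random.seed(4)

def fitness(seq):
    # Stand-in for the trained oracle: reward TATA content (toy).
    return seq.count("TATA")

def mutate(seq, rate=0.05):
    return "".join(random.choice(BASES) if random.random() < rate else b
                   for b in seq)

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

# Random starting population of 80-mers.
pop = ["".join(random.choice(BASES) for _ in range(80)) for _ in range(100)]
for generation in range(50):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:20]                                   # selection
    offspring = [mutate(crossover(random.choice(parents),
                                  random.choice(parents)))
                 for _ in range(80)]                     # recombination + mutation
    pop = parents + offspring

best = max(pop, key=fitness)
print("best fitness:", fitness(best))
```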
In mammalian or Drosophila cell lines, although more complex than yeast, it is often still possible to measure enhancer activity directly. For example, randomly generated sequences that score high with the DeepSTARR model were experimentally validated as active enhancers, being as strong as the strongest genomic enhancer in Drosophila S2 cells106. In mammalian cell lines, cell-line-specific enhancers were designed for K562, HepG2 or SK-N-SH cells107,108; for this, a multi-task regression CNN was trained on MPRA data and used in a cost function to maximize enhancer activity in the cell line of interest while minimizing off-target activity107,108. Through several search space optimization methods (AdaLead109, Simulated Annealing110, Fast SeqProp111 and Deep Exploration Networks112), thousands of candidate enhancers were designed and tested using another round of MPRAs. The optimized sequences were both stronger and more cell-line specific than genomic sequences. To improve the enhancers, the CNN was further trained on the additional data acquired from the synthetic enhancers. Using this iterative approach, the resulting enhancers had an even larger average difference in activity across the cell lines108.
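A minimal sketch of such a specificity-aware cost function follows, assuming the multi-task model outputs one predicted activity per cell line; the weighting and the use of the maximum off-target activity are illustrative choices, not the published formulation.

```python
# Specificity-aware cost: maximize predicted activity in the target cell
# line while penalizing activity in the others. `predictions` stands in
# for the multi-task CNN output.
import numpy as np

def specificity_cost(predictions, target_idx, off_target_weight=1.0):
    """predictions: array of predicted activities, one per cell line."""
    on = predictions[target_idx]
    off = np.delete(predictions, target_idx)
    return -(on - off_target_weight * off.max())   # lower cost = better design

preds = np.array([2.1, 0.3, 0.4])    # e.g. K562, HepG2, SK-N-SH (illustrative values)
print(specificity_cost(preds, target_idx=0))
```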
In more complex multicellular organisms, where it is often more challenging to directly measure enhancer activity, chromatin accessibility, either alone or in combination with a limited amount of enhancer activity data, is sufficient for the design of cell-type-specific enhancers. For example, separate CNNs were first trained, one for each of five Drosophila melanogaster embryonic tissues (including the central nervous system, epidermis, gut and muscle), to predict pseudobulk chromatin accessibility levels across the fly genome. Next, using transfer learning, a second set of CNNs was trained that classified genomic regions as ‘active’ or ‘inactive’ enhancers within the same tissues106. Subsequently, the output of these sequence-to-enhancer-activity models served as a cost function for a ‘shotgun’ optimization approach. Here, three billion random DNA sequences were generated and evaluated and, for each tissue, 8 out of the top 3,000 sequences with the highest predicted enhancer activity were selected based on visual inspection of motif content and were tested using in vivo enhancer reporter assays. Notably, around 70% drove cell-type-specific activity in the target tissue106.
Enhancer models trained on cell-type-specific chromatin accessibility data alone were also sufficient to decipher and design enhancers80 using DeepFlyBrain25. Two greedy search algorithms were employed for this purpose: in silico evolution and motif embedding80. Both start from randomly generated seed DNA sequences matching the local guanine–cytosine content of genomic candidate enhancers. In silico evolution optimizes DNA sequences by iteratively making the single nucleotide change that increases the prediction score the most, whereas motif embedding entails implanting essential TFBSs for the desired cell type one by one, each at the specific location that is predicted to cause the greatest increase in prediction score. These approaches successfully generated enhancers for Kenyon cells and perineurial glial cells in the fly brain80. Only 10 to a maximum of 20 mutations were sufficient to achieve cell-type-specific enhancer activity80. Similarly, using DeepMel270, enhancers specific for the melanocytic melanoma state in human were designed80.
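In silico evolution reduces to a short greedy loop, as sketched below; here, a toy scoring function that rewards matches to an E-box-like motif stands in for DeepFlyBrain or DeepMel2.

```python
# Greedy in silico evolution: at each iteration, try all single-nucleotide
# changes and keep the one that most increases the oracle's prediction.
BASES = "ACGT"
TARGET = "CAGCTG"   # toy E-box-like motif

def toy_oracle(seq):
    # Best partial match to the motif anywhere in the sequence.
    return max(sum(a == b for a, b in zip(seq[i:i + len(TARGET)], TARGET))
               for i in range(len(seq) - len(TARGET) + 1))

def in_silico_evolution(seq, model_predict, n_iterations=15):
    for _ in range(n_iterations):
        best_seq, best_score = seq, model_predict(seq)
        for i in range(len(seq)):                 # try every single-base change
            for base in BASES:
                if base == seq[i]:
                    continue
                candidate = seq[:i] + base + seq[i + 1:]
                score = model_predict(candidate)
                if score > best_score:
                    best_seq, best_score = candidate, score
        if best_seq == seq:                       # converged: no change helps
            break
        seq = best_seq
    return seq

print(in_silico_evolution("A" * 30, toy_oracle))
```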
In the genome, enhancers do not function alone113,114,115; therefore, to consider the potential effect of surrounding regulatory elements, models with a larger receptive field are needed and designed elements must be tested in their genomic context. Such models are still limited in performance103,104,116; yet, as a proof of principle, genomic edits were designed for the locus of PPIF to either maximize or minimize the expression of this gene or cause differential expression between two cell lines (THP-1 and Jurkat cells)117. Either the THP-1 or Jurkat CAGE head of Enformer72 was used, and a total of 185 edits ranging from 4 to 10 bp were tested and integrated into the genome of either of the two cell lines, 20–100 bp upstream of the gene’s TSS. The expression of the gene could be substantially altered by these changes: specifically, edits creating ATF4-binding sites increased expression by up to 200%, whereas edits introducing ZEB-binding sites or FOX-binding sites resulted in a decrease of 80% and 100%, respectively117. Notably, edits creating binding sites for CEBPA, a TF specific to THP-1 cells, led to a twofold difference in PPIF expression in THP-1 cells compared to Jurkat cells117.
Fundamental insights gained through enhancer design experiments
Besides testing whether sequence-to-function models have the necessary and sufficient knowledge to generate synthetic enhancers, fundamental insights into enhancer function can be gained from enhancer design experiments. In particular, such experiments confirm that all information for cell-type-specific enhancer activity is encoded by particular combinations and arrangements of TFBSs. Furthermore, they highlight the importance of having a thorough understanding of the ‘background’ DNA sequence used to design enhancers.
The notion that all information needed for cell-type-specific enhancer activity is encoded by combinations of TFBSs and, more precisely, their identity, copy number, arrangement and affinity is validated by several findings. First, enhancers that were specifically active in either Kenyon cells or perineurial glia in the fly brain were designed starting from identical seed sequences with few mutations that only generated TFBSs of activator TFs, or destroyed TFBSs of repressor TFs, specifically expressed in the cell type of interest80. Second, embedding of essential TFBSs of the cell type of interest is sufficient to generate an active enhancer, although the distance between TFBSs and their individual affinity are important76,80. Third, based on the learned distance rules, a minimal enhancer was generated consisting solely of a Mef2-binding site positioned 5 bp upstream of an Eyeless-binding site and a Onecut-binding site placed 3 bp downstream, and this enhancer fully recapitulated Kenyon cell enhancer activity80. Similarly, designed enhancers minimal in length (down to 50 bp) were shown to have activity equivalent to longer enhancers (145 bp)108. Fourth, attempts to optimize already functional synthetic enhancers revealed that this is only possible by generating additional TFBSs108. Moreover, efforts to destroy enhancer activity by making sequence edits to functional synthetic enhancers without touching TFBSs showed that this was only possible by generating TFBSs for repressors specifically expressed in the cell type of interest80. Fifth, enhancers active in two cell types were generated by supplementing enhancers active in one of the cell types with TFBSs of TFs specifically expressed in the other80. Finally, enhancers tend to need more than one TFBS, except for certain cases where a single TFBS is sufficient, for example, enhancers consisting of a single TP53-binding site5,108. Thus, these studies illustrate how synthetic design can be leveraged to understand enhancer rules.
Another insight is that a critical challenge in enhancer design lies in the selection of the starting seed sequence. Random DNA sequences often contain unintended binding sites for repressor TFs expressed in the target cell type. For instance, randomly generated sequences for melanoma enhancers almost always harbour ZEB2-binding sites80 (a 5-bp consensus site, expected to occur by chance once every 1,024 bp). Therefore, it is crucial to have a thorough understanding of the background sequence, using models that consider both activator and repressor sites for accurate predictions.
If the right combination of TFBSs is all it takes to generate an enhancer, one might ask why artificial intelligence is needed in the first place. In this context, it has been argued that enhancer activity is dependent on the combinatorial binding of TFs; therefore, to decipher the enhancer code, TFBSs should be modelled in their cis-regulatory context (that is, the context of other TFBSs)118. Furthermore, even though TFBS arrangement (for example, order, spacing and orientation), also referred to as the motif syntax, might be important for enhancer activity, this is often a ‘soft syntax’; that is, there is a specific preference for certain TFBS arrangements, but the arrangement is not exact. For these reasons, artificial intelligence is well suited for modelling enhancers because it considers entire enhancer sequences and has the potential to capture soft syntax rules118, which are otherwise difficult to discover in the first place. Nevertheless, we anticipate that the relative ease of discovering enhancer logic through deep learning models and enhancer design will lead to improved mechanistic (that is, ‘white box’) models in the future.
Using the gradient of the oracle to design enhancers
The enhancer design techniques described so far all make use of the oracle output. Given that DNNs are inherently differentiable, their gradients with respect to the input can also be harnessed. This allows for the computation of the direction in which the input should be adjusted to maximize the model’s output — known as gradient ascent. However, gradient ascent cannot be applied directly to DNA sequences because they are represented by an array of categorical variables with classes ‘A’, ‘C’, ‘T’ and ‘G’. Thus, categorical DNA sequences must be approximated by a continuous distribution (for example, the Gumbel–Softmax distribution119, whereby the values for each nucleotide are referred to as logits). The continuous approximation can then be directly optimized using gradient ascent and a discrete DNA sequence can be regenerated from this approximation (often by simply taking the nucleotide with the highest value for each position, that is, argmax operation). Based on this principle, a method named Ledidi120 was developed to design a small set of edits on a sequence to change the model’s output. Using this technique, edits were proposed that eliminate the binding of JunD in a cell-line-specific manner. For this purpose, Basenji93 was used as an oracle and a loss function was formulated that minimizes the number of sequence edits while minimizing the predicted binding of JunD120.
However, two problems have been identified with the approaches that directly use a continuous approximation of the input for gradient ascent111,121. The first problem is that sequence-to-function models are only trained on discrete inputs; therefore, there is no guarantee that they will also function on a continuous approximation. Furthermore, an optimized continuous input does not necessarily relate to an optimal discrete input. To account for this discrepancy, Ledidi updates the sequence using the continuous approximation and selects the iteration at which the discrete DNA sequence is most fit120. A second problem lies in the use of the Softmax function to approximate discrete DNA sequences, which can lead to vanishing gradients when the difference between the nucleotide logits becomes large (that is, the model becomes more ‘certain’ about the nucleotide at a certain position).
To address the challenge of continuous versus discrete inputs, a straight-through estimator method named SeqProp was proposed121. In SeqProp, instead of directly using the continuous approximation as input to the model, a discrete sequence is first sampled from this approximation. This sampled sequence serves as the model’s input for loss function computation, whereas the continuous approximation is updated based on the gradient of the loss121. To address the problem of vanishing gradients, a normalization strategy was implemented in a method named Fast SeqProp111. Using these two adaptations, fitter local optima could be reached in fewer iterations compared to using the continuous input directly, either with or without normalization, or using sampling without normalization111.
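A PyTorch sketch of the straight-through idea follows: the forward pass uses a discretized sequence while the gradient flows through the softmax probabilities. For brevity, sampling is replaced by an argmax and the oracle is a trivial differentiable stand-in, so this is a conceptual illustration rather than a reimplementation of (Fast) SeqProp.

```python
# Straight-through gradient ascent on sequence logits: forward pass on a
# discrete one-hot sequence, backward pass through the soft probabilities.
import torch

L = 20
logits = torch.randn(L, 4, requires_grad=True)

def oracle(one_hot):
    # Toy differentiable oracle: prefers 'G' at every position.
    return one_hot[:, 2].sum()

opt = torch.optim.Adam([logits], lr=0.1)
for step in range(100):
    probs = torch.softmax(logits, dim=-1)
    idx = torch.argmax(probs, dim=-1)                     # discretize
    hard = torch.nn.functional.one_hot(idx, 4).float()
    # Straight-through: forward value equals `hard`, gradient flows via `probs`.
    x = hard + probs - probs.detach()
    loss = -oracle(x)                                     # gradient ascent on the oracle
    opt.zero_grad()
    loss.backward()
    opt.step()

designed = "".join("ACGT"[int(i)] for i in torch.argmax(logits, dim=-1))
print("designed sequence:", designed)
```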
Employing a variety of search space optimization methods or the model’s gradient with respect to the input helps DNN sequence-to-function models serve as accurate oracles that can be used to design functional enhancers. Another strategy is generative artificial intelligence.
Generative artificial intelligence for enhancer design
Designing synthetic enhancers using cost functions from oracles still has limitations. First, a greedy search approach, such as in silico evolution, can generate sequences that fall outside the natural distribution of functional biological sequences (Supplementary Table 1). This drawback could yield unexpected results, where designed enhancers do not function when tested experimentally even though the oracle predicts their functionality. Second, traditional optimization techniques often focus on local improvement, potentially missing better solutions that require a broader exploration of the sequence space. Further investigation through large-scale enhancer reporter assays is needed to test the extent to which these are concerns for enhancer design. Tackling these potential limitations, generative artificial intelligence approaches are emerging as promising methods for enhancer design. State-of-the-art generative models make use of a variety of architectures, including generative adversarial networks (GANs)122, diffusion models123,124, flow matching125 and autoregressive language models126,127,128 (Fig. 4a–c).

a, Representation of training and synthetic design of cis-regulatory elements (CREs) using a generative adversarial network. During training, both genomic sequences (examples of enhancers active in the cell type of interest) and synthetically generated sequences by a generator network are used as input into a discriminator trained to discriminate genomic sequences from synthetic ones. Both generator and discriminator are trained at the same time to produce a generator that can produce sequences indistinguishable from genomic examples, generating synthetic CREs. b, Representation of training and synthetic design of CREs using a diffusion model for discrete DNA sequences as a probability distribution on the standard 3-simplex. During training, noise is gradually added to the example CRE sequences while a model is trained to remove this noise. At inference, the denoising network generates novel synthetic CREs. c, Representation of training and synthetic design of CREs using a DNA language model. During training, sequences are tokenized (for example, 3-mer tokenization), part of the sequence is masked and (optionally) a prompt, representing biological features of the sequence, is appended. A model, often using the transformer architecture, is trained to predict the masked tokens. At inference, a synthetic CRE is iteratively generated by the model optionally starting from a prompt.
Generative adversarial networks
GANs122 consist of two neural networks, a generator and a discriminator (Fig. 4a). The generator creates sequences whereas the discriminator evaluates them, with the aim of producing realistic sequences that are indistinguishable from genomic ones. For example, training a GAN on a dataset of 14,098 experimentally verified promoters enabled the design of 50-bp synthetic bacterial promoters, of which up to 70% were functional when tested experimentally129. Additionally, a GAN was used to design realistic background sequences for promoters containing user-defined TFBSs130. Extending to yeast, ExpressionGAN131, a network capable of generating regulatory DNA with prespecified target mRNA levels, was introduced: a generator of cis-regulatory elements was coupled to a predictor trained to model gene expression using both regulatory and coding regions as input131. Synthetic constructs spanning three orders of magnitude of predicted gene expression levels were designed and showed a high correlation (Spearman’s rho of 0.7) with measured levels of expression in yeast. Moving to more complex model organisms, GANs were used to generate synthetic enhancers in the Drosophila brain and melanoma cell lines80. The GAN-generated sequences were subsequently evaluated using DeepFlyBrain25 and DeepMel2, respectively. Only the synthetic enhancer sequences that received high scores from these computational ‘oracles’ demonstrated substantial cell-type-specific activity.
Although successful, GANs are still limited: they need a reliable oracle (discriminator) for training, frequently fail to converge and can suffer from mode collapse, an issue where the generator produces a limited variety of outputs132,133. These problems pose limitations in synthetic sequence design, resulting in the replacement of GANs by diffusion models132.
Diffusion models
Diffusion models123,124 are a class of generative models that learn to produce data samples through a series of incremental transformations (Fig. 4b). The training of diffusion models includes a forward and a reverse process. The forward process involves a sequential addition of noise to the input (diffusion) whereas the reverse process aims to undo the diffusion process to reproduce the original input. When applied to a dataset of candidate enhancer sequences, the model can learn to reverse the noise addition steps, recovering the original enhancers. Once trained, the diffusion model generates new enhancer sequences by starting with random noise and iteratively applying the learned reverse process. However, diffusion models cannot directly be applied to discrete data such as DNA. Possible solutions include modifying the input or modelling space134 or mapping the discrete input into a continuous latent space135,136,137.
Modifying the input or modelling space is exemplified by the Dirichlet diffusion score model (DDSM)134. DDSM represents individual nucleotides as the vertices of the standard 3-simplex (that is, a regular tetrahedron with unit edge lengths), with its interior representing all possible probability distributions over the four nucleotides modelled by the Dirichlet distribution134 (Fig. 4b). DDSM enables diffusion in this probability simplex space134. One example involves generating 1,024-bp human promoter sequences, conditioned on the transcription initiation signal profile obtained from CAGE data. The promoters were then validated in silico based on the H3K4me3 level predicted by an independent model, Sei92.
BitDiffusion135 maps discrete DNA sequences into a continuous representation by a transformation that casts binary bits as real numbers (‘analogue bits’). These analogue bits are used as the input and the output of the diffusion model, and a simple thresholding operation is performed to regenerate discrete data. Using BitDiffusion, DNA-Diffusion137 generated cell-type-specific regulatory sequences based on chromatin accessibility data across three cell lines (GM12878, K562 and HepG2)137. For this purpose, the model was trained on a DNase I hypersensitive site index dataset30. During training, DNase I hypersensitive site peaks were provided with the corresponding cell-type labels to condition the model to generate cell-type-specific enhancers. Once trained, 100,000 candidate enhancers were generated per cell type. The synthetic sequences were evaluated across various genomic characteristics, including TFBS composition, predicted cell-type-specific chromatin accessibility and cell-type-specific enhancer activity, using state-of-the-art sequence-to-function models (that is, ChromBPNet138 and Enformer72). These in silico validations revealed that the generated sequences recapitulate endogenous properties of genomic enhancers. DiscDiff136 makes use of a variational autoencoder, upstream of the diffusion operation, to map the discrete input into a continuous latent representation136. In this way, 50,000 candidate promoter sequences were generated across 15 species with properties of genomic promoters, as illustrated by their motif composition, the distance of their latent distribution to genomic examples and the chromatin profiles predicted by Sei92.
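The analogue-bit encoding itself is easy to sketch: each nucleotide maps to two real-valued bits in {-1, +1}, noise is handled in this continuous space, and thresholding recovers a discrete sequence. The particular bit assignment below is an arbitrary illustrative choice.

```python
# 'Analogue bit' encoding in the spirit of BitDiffusion: nucleotides as
# pairs of real-valued bits, with thresholding to regenerate discrete DNA.
import numpy as np

BITS = {"A": (-1, -1), "C": (-1, 1), "G": (1, -1), "T": (1, 1)}
INV = {v: k for k, v in BITS.items()}

def to_analog_bits(seq):
    return np.array([BITS[b] for b in seq], dtype=float)

def from_analog_bits(x):
    hard = np.where(x > 0, 1, -1)               # simple thresholding
    return "".join(INV[tuple(row)] for row in hard)

x = to_analog_bits("GATTACA")
noisy = x + np.random.default_rng(5).normal(scale=0.4, size=x.shape)
print(from_analog_bits(noisy))   # usually recovers 'GATTACA' at this noise level
```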
Extending the in silico validation, a comprehensive suite of evaluation metrics was introduced to assess the functional similarity, sequence similarity and regulatory composition of generated sequences139. Eventually, DNA discrete diffusion (the implementation of discrete diffusion140 for genomic sequences) was benchmarked on three high-quality functional genomics datasets spanning human promoters and fly enhancers139. The authors demonstrated that DNA discrete diffusion outperforms existing methods (such as BitDiffusion and DDSM) in capturing the diversity of cis-regulatory grammars and generating sequences that more accurately reflect the properties of genomic regulatory DNA139.
Flow matching
Generative models based on flow matching125 have emerged as another approach for DNA sequence design with promising potential141. These models are conceptually similar to diffusion models; by leveraging normalizing flows142, they model the underlying unknown probability density of the data by learning a set of transformations (‘flow’) that map a simple and known distribution (for example, a Gaussian) onto one that fits the data. After training, realistic samples can be generated by sampling from the starting distribution and applying the set of learned transformations. Like diffusion, typical implementations of flow matching cannot be directly applied to discrete data. To address this limitation, the Dirichlet flow matching143 approach was introduced; like DDSM, this approach relaxes the input data by representing it as a mixture of probability distributions on the simplex, resulting in an improvement over DDSM on the promoter design task134 (in terms of generating synthetic promoters that recapitulate H3K4me3 profiles of genomic promoters as predicted by Sei92; mean squared error of 0.0269 versus 0.0334 for DDSM). In addition, effective class-conditional generation was demonstrated via guided Dirichlet flow matching to design cell-type-specific enhancers, evaluated in silico by measuring the distance in embedding space between the generated samples and enhancer sequences from the genome25,70.
Autoregressive language models
Autoregressive language models learn the conditional probability of the next token based on previous ones (Fig. 4c). For example, regLM144 is a framework to design synthetic promoters and enhancers with desired properties, such as high, low or cell-type-specific activity, using a DNA language model144. The authors label-encoded biological activity using ‘prompt tokens’ that were prefixed to the DNA sequence and trained a HyenaDNA model97 to perform next-token prediction starting from the biological prompts. Once trained, the model can generate promoters or enhancers with the prompted level of activity. However, a sequence-to-function model, trained on MPRA data, was still used as an oracle to select promising candidates from the generated sequences.
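Conceptually, prompt-conditioned generation is a simple sampling loop, as sketched below; a toy conditional distribution stands in for the trained language model, and the '<HIGH>' prompt token is a hypothetical label rather than regLM's actual vocabulary.

```python
# Prompt-conditioned autoregressive generation: a label token is prefixed
# and bases are sampled one at a time from the model's conditional
# distribution. `next_token_probs` is a placeholder for a trained model.
import numpy as np

rng = np.random.default_rng(6)
BASES = "ACGT"

def next_token_probs(context):
    # Toy conditional model: the hypothetical '<HIGH>' prompt biases towards G/C.
    if context.startswith("<HIGH>"):
        return np.array([0.15, 0.35, 0.35, 0.15])
    return np.array([0.25, 0.25, 0.25, 0.25])

def generate(prompt, length=50):
    seq = ""
    for _ in range(length):
        p = next_token_probs(prompt + seq)
        seq += rng.choice(list(BASES), p=p)
    return seq

print(generate("<HIGH>"))
```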
Generative artificial intelligence provides a promising avenue for the generation of novel synthetic promoters and enhancers. Nevertheless, employing the most advanced models for synthetic design does not compensate for the need for experimental validation of synthetic DNA sequences to fully appreciate the performance of the model.
Evaluation of synthetic enhancer activity
Two aspects are important when evaluating synthetic enhancers: specificity and strength. The designed enhancer must be specifically active only in the target cell types. Additionally, the level of transcription the enhancer induces must be strong enough to drive sufficient gene expression without being so high as to disrupt normal cellular processes. The latter is especially important in the context of gene therapy, where a therapeutic product potentially needs to be expressed at a specific level. Currently, there is a trade-off between both aspects: either synthetic enhancers are tested using fluorescent reporter assays, where the activity of the enhancer is visualized through fluorescence microscopy80,106, or synthetic enhancers are tested using MPRAs107,108. On the one hand, using microscopy, the cell type (and/or developmental time) specificity of the enhancer is easily evaluated, especially when the enhancer is integrated into the organism’s genome at an early developmental stage, avoiding mosaicism. On the other hand, these assays are limited in quantifying enhancer strength, as fluorescence intensity may not linearly correlate with transcriptional output, and it is difficult to control for enhancer copy number. MPRAs provide a quantitative measurement of enhancer strength, controlling for copy number, but are typically performed in bulk populations of cells and mostly in a handful of cell types at a time (limiting their ability to measure enhancer specificity). Furthermore, not all cell types are easily accessible to MPRAs. A challenge remains in combining the strengths of these two approaches. Future developments in the field may involve the use of single-cell MPRAs145,146 and the use of primary cells to fully quantify both the specificity and strength of synthetic enhancers.
Some studies evaluate synthetic enhancers only computationally134,137,139,143,147, mostly based on sequence diversity, motif content and similarity to genomic enhancers. In some cases144, these relatively simple metrics are combined with prediction scores from independent sequence-to-function models. Despite being valuable (for example, as a first quality filter), such metrics should be used in concert with experimental validation.
Outlook
Classical genetics studies and high-throughput genomic assays combined with computational modelling have resulted in a comprehensive understanding of the rules underlying enhancer activity. From these studies, the location, orientation, affinity, action (activating versus repressing), number and identity of TFBSs, and the distance between them are the most important features. The necessity and sufficiency of these rules can be assessed through synthetic enhancer design, using a model as a guiding oracle and testing cell-type-specific activity experimentally. The workflow of design and evaluation using sequence-to-function models is an important tool in the toolbox of genomics researchers, facilitated through easy-to-use software packages (Box 1).
Genomic models have already shown many successes in enhancer interpretation and design, including for fly genomes80,106, where data are more limited owing to the smaller genome size (compared with mammals); this limitation can be marginally alleviated using phylogenetic augmentation148. However, De Boer and Taipale11 have argued that the performance of models trained on genomic sequences is generally overestimated owing to data leakage caused by sequence homology11. In addition, genomic sequence diversity is insufficient to explore all possible features underlying enhancer activity. Using randomly synthesized DNA sequences as training data instead, after testing their activity, poses a possible solution11. This provides interesting prospects, not only owing to the increased number of diverse training samples but also by designing specific sequences to further improve model performance in an active learning paradigm. Although interesting, this is limited to cell types in which enhancer activity can be measured in a high-throughput manner (that is, mostly cell lines that are easily transfectable, proliferative enough to produce a high number of cells and homogeneous enough to measure activity in bulk); therefore, further optimization of single-cell and in vivo MPRAs is needed145,146,149.
In line with this reasoning, it seems intuitive that large genomic datasets comprising as many different cell types and states as possible would result in better models for cell-type-specific enhancer design. Nevertheless, ‘niche’ models trained on a specific system28,77 or even a single cell type68 already provide sequence interpretations with deep insight. The transition of training data from bulk to single cells is beginning to provide fine-grained models in this regard. A short-term challenge is to train foundation models that cover a larger universe of cell types while maintaining high-resolution interpretations for each individual cell type.
Besides training data, there are important technical aspects to consider when designing a sequence-to-function model for enhancer design, for instance, the choice between regression and classification. Classification models are more prone to learning discrete sets of features that optimize the decision boundaries between classes, rather than a continuous non-linear function that maps input to output variability, as in regression. In practice, this might result in the exploitation of TFBSs that are not part of the cell type’s enhancer logic by attributing a negative feature importance to them, as they decrease the likelihood of the positive class.
In this Review, we focused on the interpretation and design of single enhancers; however, genes are usually regulated by multiple enhancers that interact either additively113 or multiplicatively114,115, potentially conferring robustness150. Furthermore, expression is dependent on intrinsic promoter activity151, enhancer–promoter interactions151,152 (which are mostly non-linear113,152,153), enhancer–promoter distance115,153,154,155 and enhancer–enhancer distance115,155. Therefore, modelling the expression profile of a target gene is more challenging than modelling individual enhancer activity. Specialized architectures, for example, the combination of the transformer architecture and U-net in Borzoi73 or Scooby156, are needed to reflect both local features (individual TFBSs) and global features (individual enhancers). Next to cis-regulatory interactions, the steady-state transcriptome is further influenced by multiple biological aspects, including promoter-proximal pausing, RNA processing and degradation, thus adding complexity to modelling gene expression157. To address part of this multifactorial process, transcriptional initiation was modelled based on precision nuclear run-on cap (PRO-cap) data, revealing the rules of transcription initiation and, separately, the features important for the strand-specific location of the gene’s TSS and for total initiation activity157. We envision that sequence-to-expression models can be used to generate artificial genomic loci driving specific levels of gene expression, potentially leading to the design of an entire novel cell state with a designed transcriptome. However, the use of sequence-to-expression models for locus design is currently still limited, both because of the technical challenges related to producing and testing large DNA sequences and because these models do not always generalize well103,104 and often miss important distal regulatory elements116.
Sequence-to-function models are now also widely established for protein modelling158. Models that can accurately predict protein–protein and protein–DNA interactions, such as AlphaFold 3 (ref. 159), will provide further insights into genomics. For example, such models could predict the effect of non-coding and coding mutations on TF binding, cooperative binding of multiple TFs and how TF binding is affected by co-factors. These models could be included in a fitness function when designing synthetic DNA sequences or even be used to jointly design enhancer sequences with synthetic TF complements.
Next to the fundamental insights that can be gained from modelling and designing enhancers, synthetically designed enhancers have practical applications. Being programmed to be active in a specific cell type or state, synthetic enhancers are invaluable for gene therapy applications (Box 2). In the context of biomedical research, enhancers can be designed to serve as markers for specific cell types by driving a fluorescent reporter gene, enabling the visualization and tracking of cell types in vivo. In line with this, enhancers could be designed to only drive gene expression after a certain cell-state change, for example, neuronal activity, cell–cell signalling, hypoxia or malignant cell-state switching.
Thus, we anticipate that improvements in data (for example, single-cell MPRAs potentially on large libraries of random DNA sequences), improvements in models (such as sequence-to-gene expression models that generalize well and that can be applied to single-cell datasets156) and improvements in benchmarking studies (for example, to evaluate whether generative artificial intelligence produces higher quality synthetic enhancers compared to oracle-based optimization methods) will provide further advances in the field of enhancer design.