From benign to pathogenic variants and vice versa: pyrimidine transitions at position -3 of TAG and CAG 3′ splice sites

Introduction

Prediction of phenotypic consequences of somatic and germline mutations or natural DNA polymorphisms is a major challenge in biology and medicine. Despite considerable resources dedicated to this field, only a small fraction of sequence variants described to date has been associated with altered interactions of variant alleles with cellular components. The number of ‘likely benign’, ‘likely pathogenic’ or ‘variants of uncertain significance’ deposited in multiple databases has risen dramatically in the last decades [1]. As a result, the translational potential of genomics has remained limited by our inability to reliably predict which variants lead to actionable phenotypes, often prohibiting accurate diagnosis and counseling. This challenge is magnified by realization that even identical mutations at the same position of traditional splice-site consensus sequences, such as those discussed in this work, may have unexpected or even opposite phenotypic effects, depending on their genomic context.

Splicing outcomes can be affected by formation of RNA structures by nascent transcripts and can be modified even by a single-nucleotide change [2,3,4,5,6,7,8,9,10,11]. RNA secondary structures may inhibit or promote RNA processing or even replace a need for protein splicing factors (reviewed by [11]). Their formation involves both the traditional splicing signals (splice sites, branch points and polypyrimidine tracts), for example by bringing adjacent splice sites to proximity [5], and the auxiliary splicing motifs in exons [6, 12]. However, the importance of intramolecular RNA base-pairing at individual splice-site positions is poorly understood.

A striking example of the functional dichotomy of splice-site mutations are pyrimidine transitions at position -3 relative to intron-exon junctions. This position forms a part of the 3 splice site (3ss) consensus sequence YAG/G (where Y is cytosine or uridine and / is the intron-exon boundary). During evolution, nucleotide preferences at position -3 coevolved with increasingly more complex spliceosomes: whereas many yeast species prefer uridines ([13] and refs. therein), mammalian 3ss generally favor cytosines while other lineages such as some nematodes have almost exclusively cytosines ([14,15,16,17] and refs. therein). As an example, Caenorhabditis elegans 3ss are defined by a highly conserved octamer UUUUCAG/R where -3C must remain adjacent to the AG/R motif for efficient splicing [16]. In the human genome, the number of CAG 3’ss exceeds TAG 3’ss by a factor of ~2.2 [14, 18]. The CAG and TAG 3ss were initially regarded as functionally equivalent and transitions -3T > C or -3C > T as splicing-neutral [19,20,21]. However, increasing numbers of reports have now convincingly shown that mutations -3C > T do not only induce exon skipping [18, 22,23,24,25], but also promote exon inclusion [18, 26, 27] and/or activate silent or cryptic 3’ss [18, 28]. Testing a large number of human exogenous transcript pairs clearly showed that when CAG or UAG 3ss of minigene mid-exons were cloned between identical exons and intronic segments, exon skipping was a preferred outcome for mutations -3C > T while mutations -3T > C usually improved exon inclusion [18]. A small number of inefficient CAG 3ss (termed here ‘laggard’ 3ss) do not obey the accepted trend that -3T > C transitions are typically less detrimental than -3C > T transitions [18, 28]. It remains unclear, however, if this bias can explain the higher abundance of CAG 3ss in mammalian genomes. In addition, no ab initio tools exist to identify anomalous YAG 3ss that increase exon skipping when mutated from UAG to CAG. Finally, it has been unclear why the non-canonical -3T alleles can, in some cases, promote exon inclusion as compared to the -3C alleles and thus become superior to canonical -3C alleles.

To address the last question, this study has compared base-pairing probabilities (BPPs) of transcript pairs with laggard CAG 3ss and canonical 3ss. Even the small number of informative transcript pairs (n = 22) has revealed higher average BPPs across intron-exon junctions of laggard CAG 3ss (ie. 3ss with the hierarchy in splicing efficiency of UAG > CAG) as compared to 3ss with the canonical order CAG > UAG. The maximum discrimination was observed for positions -5 to -1 relative to 3ss, consistent with the involvement of U2AF1. These results suggest that the accessibility of pyrimidine bases at position -3 can control not only splicing efficiency but also clinical outcome of these mutations on a scale benign to pathogenic or vice versa.

Materials and methods

Nucleotide sequences

Sequences of primary transcripts with CAG 3ss that preferred exon skipping (laggard CAG 3ss) as compared to UAG 3’ss counterparts are shown at the top of Table 1. Sequences of primary transcripts with UAG 3ss that preferred exon skipping as compared to CAG 3ss counterparts are shown at the bottom of Table 1. The transcripts were identified previously in our study of >80 minigene constructs [18], which were mostly derived from established human disease genes that sustained transitions –3 C > T or –3 T > C implicated in the phenotype. Disease genes, clinical phenotypes, splicing patterns and nucleotide sequences were compiled in Supplementary Table S5 of ref. [18]. The splicing patterns of minigene constructs were shown in Fig. 4 of the same reference. Each mutated 3ss sequence was confirmed by Sanger sequencing of plasmid constructs [18] and was identical to reference sequences of human genes [29].

Table 1 Nucleotide sequences of 3 splice sites in 22 informative transcript pairs and exon skipping preferences of their -3C/T alleles
Full size table

Computing base-pairing probabilities

BPPs for individual RNA sequences were defined previously [30]. Briefly, a BPP pij (i < j) is equal to the probability that the ith and jth nucleotides of a sequence form a base pair and can be interpreted as a confidence measure of predicted base pairs. BPPs were computed using CentroidFold [31], employing both CONTRAfold [32] and McCaskill [33] algorithms.

RNA secondary structure predictions

Minimum free energy and centroid predictions were carried out using RNAfold and Centroid [31, 34]. Unlike the minimum free energy, which is regarded as a maximum likelihood estimator, the Centroid estimator considers the entire distribution over the solutions instead of only the solution with the highest probability [35]. Paired, wobble and unpaired configuration at position -3 was recorded for each transcript in both groups of 3’ss using most stable minimum free energy and Centroid predictions (Table 1, Supplementary Fig. S1).

PU values

PU (probability of unpaired) values were computed for 100 nucleotides covering 3ss and 30 nucleotides of flanking sequences in each direction, as described [6, 36]. Briefly, the PU value for the region a to b in an RNA sequence is defined as e^([Eall − Eunpaired]/RT), where Eall is the free energy of the ensemble of all structures, Eunpaired is the free energy of the ensemble of all structures that have the complete region a to b unpaired, R is the universal gas constant, and T is the temperature [6]. Eall and Eunpaired values were computed using RNAfold [34]. PU values were also implemented in the NIPU server at http://rna.informatik.uni-freiburg.de/NIPU/Input.jsp [6]. PU values range between 0 (completely base-paired) and 1 (completely unpaired).

Statistical analysis

BPP and PU values were averaged and means and standard deviations of the two groups of 3ss were compared using an unpaired t-test. Nucleotide distribution across 3ss and distribution of paired and unpaired nucleotides in most stable structures was compared using χ2 tests.

Results

Figure 1a summarizes tested minigene transcript pairs. Each pair had two transcript versions that differed only by a pyrimidine at position -3 of 3ss of middle exons. By screening over 40 minigene pairs containing transitions -3C > T or -3T > C, 22 pairs were informative (Table 1 and ref. [18]). They consisted of 18 pairs where UAG 3ss generated more exon skipping than their CAG 3ss counterparts and 4 pairs where CAG 3’ss produced more exon skipping than UAG 3ss counterparts [18]. The screening indicated that human -3T > C transitions are usually, but not always, less detrimental than -3C > T transitions and identified a small group of non-canonical, ‘laggard’ CAG 3ss that produced lower exon inclusion in mature transcripts than their mutated UAG versions (Table 1).

Fig. 1
figure 1

CAG 3 splice sites that produce more exon skipping than their UAG counterparts are associated with higher base-pairing probabilities across introns-exon boundary. a Schematics of plasmid construct pairs (n = 22). Each minigene had identical first and third exons. The tested exon in the middle had either CAG or UAG 3ss, but no other sequence changes [18]. b Average BPPs across laggard and canonical 3ss (n = 4 and 18, respectively) and their allelic counterparts. Native 3ss are shown as solid lines, dashed lines represent BPP values for alternate pyrimidines. Asterisks represent the region with significant differences between the two groups of 3ss. c, d Mean BPP values for the indicated regions and associated P-values for McCaskill (c) and CONTRAFold (d) algorithms. *P value < 0.05, ***P < 0.0001 (unpaired t-tests). e Mean BPPs across laggard and canonical 3ss and across their allelic counterparts, as computed using CONTRAFold

Full size image

Figure 1b shows BPPs for the two groups of 3’ss and their allelic variants with alternate pyrimidines at position -3. Using the McCaskill algorithm [33], the laggard CAG 3’ss had higher BPPs at positions -5 to -1 than 3ss that promoted exon skipping upon transitions -3C > T (Fig. 1b, c). This tendency was also observed for exon positions +2 and +3, but not for the first exon nucleotide. CONTRAfold [32] identified a similar increase for laggard CAG 3ss although the exonic peak was lower (Fig. 1d, e). BPPs of the same transcripts mutated at position -3 to the other pyrimidine maintained higher values at positions -5 to -1 (Fig. 1b–e).

The distinct BPP profiles between laggard CAG 3ss and canonical CAG 3ss were also found with a measure of RNA single-strandedness computed as the probability that all bases in the sequence are unpaired (termed PU values) [6]. The average PU values are elevated for auxiliary splicing motifs in exons that promote exon inclusion in mature transcripts [6]. PU profiles across laggard CAG 3ss showed low values between position -5 and +2, consistent with their higher CONTRAFold- and McCaskill-derived BPPs (Fig. 2a). A decrease of PU values could be seen also further upstream (Fig. 2a, b).

Fig. 2
figure 2

PU values across laggard and canonical 3 splice sites. a Mean PU values across 3ss sequences of the two groups of 3ss. b Comparison of average PU values for the indicated positions relative to the intron-exon junction (vertical line). **P value < 0.001, ***P < 0.0001 (unpaired t-tests). NS not statistically significant

Full size image

Nucleotide distribution of the two groups of 3ss showed that laggard CAG 3ss lacked adenines and, to a lesser extent, uridine upstream (Fig. 3a) but not downstream (Fig. 3b) of 3ss. At positions -19 to -3 relative to the intron-exon junction, adenine was completely absent (Fig. 3c), although this could result from a chance since adenine is depleted in this region of human introns (Fig. 3d). Nevertheless, the altered nucleotide composition of laggard 3ss is likely to influence secondary pre-mRNA structures formed during or after transcription by reducing weaker A:T base-pairing and increasing stronger C:G base-pairing, consistent with their higher BPPs (cf. Figs. 1b and 2a).

Fig. 3
figure 3

A lack of adenines and uridines between positions -3 and -20 of laggard CAG 3 splice sites. a, b A lack of adenines and, to a lesser degree, uridines upstream (a) but not downstream (b) of the intron-exon boundary. χ2 values for 2 × 4 contingency tables were 31.7 (P < 0.0001) (a) and 4.2 (P = 0.2) (b). c Adenines were absent just upstream of laggard CAG 3ss. d Nucleotide distribution upstream of 195,404 human 3’ss

Full size image

Figure 4a shows examples of most stable local structures for laggard CAG 3ss. In EPS15, the transcript with the highest average BPP at positions -5 to -1 (0.98), a six base-pair stem sequestering 3ss was further extended by three consecutive C:G pairs at exon positions +2 through +4. In ABCA4, the transcript with the second highest BPPs in this region (0.85), the 3ss is sequestered in a continuous seven base-pair stem. In both transcripts, the stems involving 3ss are more stable with the C allele as opposed to the U allele (Fig. 4b) and their length is close to a threshold required for rapid annealing of DNA or RNA [37]. The remaining transcript pairs where UAG 3’ss were superior to CAG 3ss had shorter stems but the stems included the full YAG 3ss consensus in each case. In each laggard CAG 3ss, the hairpin structures had either four- or six-nucleotide loops; loops of the same size consistently inhibited splicing in Saccharomyces cerevisiae when introduced at 5ss and branch point regions [4]. In contrast, RNA secondary structure predictions for eighteen 3ss that preferred canonical CAG showed incompletely paired YAG 3ss motifs at position -3 in 11 (50%) cases (χ2 = 5.5, P = 0.02; Table 1 and Supplementary Fig. S1). The higher frequency of paired interactions involving 3ss consensus in laggard CAG 3ss than canonical 3ss suggests that if -3C is paired, laggard CAG 3ss could occur more likely than in the unpaired context.

Fig. 4
figure 4

Laggard CAG 3 splice sites may require longer stem structures across intron-exon junctions. a Transcripts with cytosine alleles. b Transcripts with uridine alleles. Secondary structure predictions were carried out by RNAfold [34, 50]. Predicted structures of canonical (laggard UAG) 3’ss are shown in Supplementary Fig. S1

Full size image

In conclusion, independent profiling of BPPs and PU values across the two groups of 3ss identified significant increase in predicted base-pairing in the group of transcripts where UAG 3ss were, atypically, more efficient than their CAG 3ss versions. Therefore, RNA secondary structure has a potential to alter the hierarchy in intrinsic efficiency of human 3’ss from canonical CAG > UAG(>AAG > GAG) to non-canonical UAG > CAG(>AAG > GAG) (3ss in parentheses have not been tested in this work). As a result, the same C > T or T > C mutations at position -3 of 3ss can have distinct phenotypic outcomes in different sequence and structural contexts.

Discussion

Becoming laggard CAG 3 splice sites

This study shows that human CAG 3ss that include downstream exons less efficiently in mature transcripts than their UAG 3ss counterparts are more structured across intron-exon boundaries and suggests that prediction of phenotypic outcomes of pyrimidine transitions at position -3 could be improved by considering intramolecular base-pairing. In other words, the higher BPPs across laggard CAG 3ss can reduce their splicing efficiency to the extent that they become worse performers than the UAG versions of the same transcripts. This would explain why a subset of -3T > C transitions in human disease genes can, atypically, reduce exon inclusion [18, 26, 27], which may lead to a more frequent disease occurrence, earlier onset, faster disease progression and/or greater severity. However, it remains to be tested whether the -3C/T transitions are more likely to be functional if the secondary structure of the pre-mRNA is permissive, ie. less paired around 3ss (Table 1 and Supplementary Fig. S1).

The BPP differences between the two groups of 3ss were centered on intron positions -5 to -1 (Fig. 1b–e). Given the small number of laggard CAG 3ss identified so far, it cannot be excluded that altered BPP and/or PU values at these 3ss extend to other exon or intron positions or splicing motifs, such as branch point sequences and polypyrimidine tracts (Figs. 1b, e, and 2a). The 22 transcript pairs examined in this study were identified by screening >40 minigene pairs with -3 C > T or -3T > C transitions, indicating that about a half of the pairs were not informative, ie. with zero or 100% exon inclusion for each allele or identical exon inclusion levels [18]. Such 3ss may still display -3C/T allelic differences in splicing efficiency, which could be unmasked by 3ss weakening [18]. These figures suggest that identification of one additional laggard CAG 3ss would require examination of about 10 more transcript pairs in transfection studies. The fraction of human 3’ss with the canonical order CAG > UAG > AAG > GAG in splicing efficiency is thus unlikely to exceed 90% and is probably lower than that. Because the number of -3C > T or -3T > C transitions in patients with genetic disorders is growing, approaching a hundred of reported cases (compiled in ref. [18]), the expansion of this study to obtain a larger dataset should be feasible in the future. The sample expansion could be facilitated by focusing on -3C/T variants incontrovertibly associated with QTLs, ultimately providing more robust evidence and improving ab initio predictions of laggard CAG 3ss. At present, BPP values alone are unlikely to provide sufficient discrimination power to distinguish benign and pathogenic variants. Moreover, it remains unclear to what extent a relatively small depletion of uridines upstream of laggard CAG 3ss (Fig. 3a), potentially reducing U2AF interactions, rather than secondary structure constraints could switch CAG versus UAG 3ss preferences in splicing efficiency.

Although hairpin structures can influence splice site selection from yeasts to humans [3, 4, 9, 10, 38], it has not been possible to reliably predict without testing whether a particular structure impedes or promotes exon inclusion. Four- or six-nucleotide loop hairpins sequestering 5ss and branch points always inhibited RNA splicing in vivo and in vitro [4], consistent with loop sizes predicted for laggard CAG 3ss (Fig. 4), but only rarely for canonical (laggard UAG) 3ss (Supplementary Fig. S1). Establishing a larger group of laggard CAG 3’ss and their local folding patterns should help define molecular interactions at this position and 3ss responses to dynamic secondary structure formation across intron-exon junctions.

As compared to position -3, anomalous behavior of C/T variants is less well documented in polypyrimidine tracts, which are usually located between lariat intron branch points and position -5. Here, uridines are preferred over cytosines, particularly in shorter polypyrimidine tracts [39]. Although unpublished or anecdotal findings suggest that this is not always the case, systematic studies of the splicing impact of C/T transitions at this location have not been available. Identification of anomalous mutations in these upstream locations in genetic disease is likely to be more arduous than anomalous mutations at position -3 of 3ss since upstream transitions may often lead to less significant splicing alterations and only mild or low-penetrance phenotypes [40]. Their pathogenicity could be merely manifested by an overrepresentation of cytosines over uridines in disease phenotypes as compared to controls, as suggested for pheochromocytomas [41]. The milder character of C/T transitions in polypyrimidine tracts could reflect structural preferences of the large subunit of U2AF (U2AF2) for pyrimidines and for hydrogen bonds of uracil edges as opposed to sequence-specific recognition strategies of other interacting regulators, such as PTB or SXL [42].

The role of U2AF in generating laggard CAG 3ss

Static models of RNA secondary structure suggest that position -3 accessibility may expose intrinsic binding differences between CAG and UAG 3ss: if -3 positions are paired, differences between -3C and -3U binding to trans-acting factors could be masked (Fig. 4). The hottest candidate for this interaction is the small subunit of U2AF, or U2AF1, which forms a heterodimer with U2AF2 and contacts 3ss early during spliceosome assembly [43,44,45,46]. In high resolution ultraviolet crosslinking and immunoprecipitation studies with the wild-type, U2AF1 showed strongest signals for terminal five nucleotides of the introns [47]. This preference was maintained for U1AF1 with a cancer-associated substitution S34F, but S34F shifted the most frequently crosslinked nucleotide by a single position and altered pyrimidine distribution at position -3 in favor of CAG 3ss [47]. -3C was the most common nucleotide preceding S34F-promoted exons in independent studies [48]. For yeast U2AF1, UCAGNU RNAs had consistently higher dissociation constants and weaker binding in isothermal titration calorimetry studies than UUAGNU oligomers (where italics denote variant positions and N is any nucleotide) [49], but uridine is preferred in yeasts, unlike in humans [17]. The binding preferences of yeast U2AF1 to 3ss were also affected by substitution S34F [49], nevertheless comparable calorimetry data for human U2AF1 complexed with a short fragment of U2AF2 have not been available. Apart from U2AF subunits, one cannot exclude other interactions between the same pre-mRNA segment and small nuclear RNAs or proteins. The reported preference in binding of wild-type U2AF1 to UAG 3ss [43,44,45,46,47, 49] would not explain the canonical order in the efficiency of human 3ss usage (CAG > UAG > AAG > GAG) [18, 19].

Finally, the variable outcome of -3C/T transitions supports a speculation that the selection pressure at position -3 has been relatively mild during recent evolution, but could be stronger earlier, such as during evolution of single-cell eukaryotes. Distribution of pyrimidines at position -3 did not appear to play a role in compensatory responses of human traditional or auxiliary splicing motifs when comparing highly conserved exons encoding calcium- and zinc-coordinating residues in metalloproteins [18].

In conclusion, if more structured, the more abundant and generally more splice-proficient CAG 3ss may turn into “laggards” and skip the downstream exon more than their intrinsically weaker UAG 3’ss counterparts. This work identifies a collection of 3ss that provide a starting point for exploring structural requirements for their usage in much greater detail, which should facilitate our understanding of structural interactions that involve position -3. These results also suggest that prediction of splicing and clinical outcomes of DNA mutations and polymorphisms in mammalian genes may never be 100% accurate without considering RNA structure of primary transcripts, particularly across traditional and auxiliary splicing motifs.

Related Articles

Leveraging large language models to assist philosophical counseling: prospective techniques, value, and challenges

Large language models (LLMs) have emerged as transformative tools with the potential to revolutionize philosophical counseling. By harnessing their advanced natural language processing and reasoning capabilities, LLMs offer innovative solutions to overcome limitations inherent in traditional counseling approaches—such as counselor scarcity, difficulties in identifying mental health issues, subjective outcome assessment, and cultural adaptation challenges. In this study, we explore cutting‐edge technical strategies—including prompt engineering, fine‐tuning, and retrieval‐augmented generation—to integrate LLMs into the counseling process. Our analysis demonstrates that LLM-assisted systems can provide counselor recommendations, streamline session evaluations, broaden service accessibility, and improve cultural adaptation. We also critically examine challenges related to user trust, data privacy, and the inherent inability of current AI systems to genuinely understand or empathize. Overall, this work presents both theoretical insights and practical guidelines for the responsible development and deployment of AI-assisted philosophical counseling practices.

Aerospace medicine in China: advancements and perspectives

With the rapid growth of China’s space industry, long-term manned space missions face challenges from the complex space environment, posing risks to human health. Aerospace medicine, a key field, addresses these risks by researching the impacts of space on biochemical changes, cognitive abilities, and immune systems. This article reviews China’s aerospace medicine research, summarizing efforts from various institutions and offering insights for future developments in the field.

Faster implicit motor sequence learning of new sequences compatible in terms of movement transitions

New information that is compatible with pre-existing knowledge can be learned faster. Such schema memory effect has been reported in declarative memory and in explicit motor sequence learning (MSL). Here, we investigated if sequences of key presses that were compatible to previously trained ones, could be learned faster in an implicit MSL task. Participants trained a motor sequence before switching to a completely new sequence, to a compatible sequence with high overlap in ordinal positions, or to an incompatible sequence with low overlap, while the compatible and incompatible sequences had the same overlap in movement transitions. We observed accelerated learning in the Compatible and Incompatible groups compared to the New group, if participants trained for 3 sessions before switching to the altered sequence. Overall, our study suggests facilitative learning of implicit motor sequences that are compatible in movement transitions, if the previous sequence has been trained extensively.

Probabilistic machine learning for battery health diagnostics and prognostics—review and perspectives

Diagnosing lithium-ion battery health and predicting future degradation is essential for driving design improvements in the laboratory and ensuring safe and reliable operation over a product’s expected lifetime. However, accurate battery health diagnostics and prognostics is challenging due to the unavoidable influence of cell-to-cell manufacturing variability and time-varying operating circumstances experienced in the field. Machine learning approaches informed by simulation, experiment, and field data show enormous promise to predict the evolution of battery health with use; however, until recently, the research community has focused on deterministic modeling methods, largely ignoring the cell-to-cell performance and aging variability inherent to all batteries. To truly make informed decisions regarding battery design in the lab or control strategies for the field, it is critical to characterize the uncertainty in a model’s predictions. After providing an overview of lithium-ion battery degradation, this paper reviews the current state-of-the-art probabilistic machine learning models for health diagnostics and prognostics. Details of the various methods, their advantages, and limitations are discussed in detail with a primary focus on probabilistic machine learning and uncertainty quantification. Last, future trends and opportunities for research and development are discussed.

Too big to purge: persistence of deleterious Mutations in Island populations of the European Barn Owl (Tyto alba)

A key aspect of assessing the risk of extinction/extirpation for a particular wild species or population is the status of inbreeding, but the origin of inbreeding and the current mutational load are also two crucial factors to consider when determining survival probability of a population. In this study, we used samples from 502 barn owls from continental and island populations across Europe, with the aim of quantifying and comparing the level of inbreeding between populations with differing demographic histories. In addition to comparing inbreeding status, we determined whether inbreeding is due to non-random mating or high co-ancestry within the population. We show that islands have higher levels of inbreeding than continental populations, and that this is mainly due to small effective population sizes rather than recent consanguineous mating. We assess the probability that a region is autozygous along the genome and show that this probability decreased as the number of genes present in that region increased. Finally, we looked for evidence of reduced selection efficiency and purging in island populations. Among island populations, we found an increase in numbers of both neutral and deleterious minor alleles, possibly as a result of drift and decreased selection efficiency but we found no evidence of purging.

Responses

Your email address will not be published. Required fields are marked *