Chimeric origins and dynamic evolution of central carbon metabolism in eukaryotes
Main
The origin of eukaryotes represents a defining event in the history of life that occurred between 2.6 and 1.2 billion years ago (Ga)1,2,3,4,5, possibly coinciding with rising oxygen levels in the atmosphere of the Earth6,7,8. One of the key steps during eukaryogenesis involved symbiosis between a member of the Asgardarchaeota9,10,11,12 and a bacterial partner related to the Alphaproteobacteria that evolved to become the mitochondrion13,14.
To explain the evolutionary driving forces underlying eukaryogenesis, many models have been proposed15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 that differ with respect to the identity and number of partners involved and the nature of their initial interactions, ranging from syntrophy to phagocytosis and parasitism. The discovery of the Asgardarchaeota lent support to hypotheses invoking a syntrophic relationship between at least one archaeal and bacterial partner26,28, and metabolic capabilities inferred for the asgardarchaeal ancestor of eukaryotes have inspired updated hypotheses about the syntrophic interactions between the archaeal and bacterial partners during the early stages of eukaryogenesis24,25,30. These syntrophy-based hypotheses suggest that one partner may have been dependent on the other as an external electron sink, but make distinct predictions about the types of metabolites exchanged, the origin of eukaryotic cell membranes, the timing of mitochondrial acquisition, the mechanism of mitochondrial uptake and the origin of the nucleus26,28,31,32,33. However, testing these hypotheses with current data is challenging, in part because the evolutionary origin of eukaryotic metabolism remains understudied.
Previous genomic analyses have suggested that ‘informational’ genes (those involved in translation, replication and transcription) generally have archaeal origins, while ‘operational’ genes (those involved in metabolism) derive predominantly from bacteria, particularly from the premitochondrial endosymbiont34,35,36,37,38,39. This has led to the hypothesis that, during eukaryogenesis, the archaeal host metabolism was replaced by counterparts from the endosymbiont15. However, considering that syntrophy relies on metabolic repertoires from both partners, archaeal gene contributions to eukaryotic metabolisms might be expected.
To assess current models on the origins of eukaryotic cells and the evolution of plastids, we here analyse the origins of eukaryotic central carbon metabolism (CCM), comprising four main pathways: the Embden–Meyerhof–Parnas (EMP) and the Entner–Doudoroff glycolytic pathways, the pentose phosphate pathway (PPP), the pyruvate/acetate conversions into acetyl-CoA and the tricarboxylic acid (TCA) cycle (Supplementary Discussion). While it has previously been suggested that many TCA cycle enzymes, as well as enzymes involved in pyruvate conversions, were present in the last eukaryotic common ancestor (LECA) and trace their origin back to Alphaproteobacteria13,40,41,42, the evolutionary origins of glycolysis and PPP in eukaryotes remain unresolved20,40. Furthermore, several genes involved in eukaryotic metabolism appear to have origins unrelated to either symbiotic partner, potentially reflecting independent horizontal gene transfer (HGT) acquisitions either before or after the radiation of the extant eukaryotic lineages43,44,45, further complicating phylogenetic analyses. The metagenomics-based discovery of new archaeal and bacterial lineages during the past decades, including the Asgardarchaeota, has provided a wealth of new information to address the origins and evolution of eukaryotic CCM within the context of a more broadly sampled tree of life9,46,47,48,49. Our comprehensive phylogenetic analyses reveal a much more complex pattern of evolution than previously anticipated and identify a chimeric CCM that includes contributions of archaeal origin to the LECA proteome. The distribution of CCM enzymes across the eukaryotic tree of life illuminates the subsequent highly dynamic evolution of these enzyme repertoires shaped by gene loss, endosymbiotic gene transfers (EGTs) and gene replacements.
Results
Central carbon metabolism of LECA
We selected a balanced and representative set of 207 eukaryotic proteomes that cover currently known taxonomic diversity and lifestyles, including anaerobic eukaryotes and eukaryotes with primary and higher-order plastid organelles (Fig. 1a and Supplementary Data 1). To compare gene trees of CCM enzymes with the eukaryotic tree of life, we first reconstructed a species tree on the basis of a manually curated set of 317 concatenated phylogenetic markers50. We used both maximum-likelihood and Bayesian approaches combined with trimming of heterogeneous sites to evaluate the robustness of support for major eukaryotic clades (Methods; Fig. 1, Extended Data Fig. 1 and Supplementary Discussion). The resulting tree comprises three major supergroups (Fig. 1a): Excavata, Amorphea and Diaphoretickes. Although we rooted the tree between Excavata and the other groups for visualization, the placement of the root remains under debate48,51,52,53,54,55 and our interpretations of gene family origins do not assume a particular root position. Excavata include Jakobids (within Discoba) which, together with Mantamonas (within CRuMs as part of Amorphea), possess some of the most gene-rich mitogenomes among eukaryotes56,57,58,59,60. Another lineage of the Excavata is Metamonada, members of which are anaerobic and contain mitochondrial-related organelles that have been entirely lost in some representatives61,62,63,64. Metamonada often form long branches in phylogenetic trees, which hampers the phylogenetic placement of some putative member lineages such as the Anaeramoebae65, which in our analyses alternatively branch with Amorphea clades under a subset of tree inference parameters (Extended Data Fig. 1b,c). Amorphea include Obazoa (Fungi, Metazoa and various protists), Amoebozoa and other putative taxa such as CRuMs, Malawimonada and Ancyromonada.
The monophyly of Amorphea is not fully stable: specifically, while Ancyromonadida most consistently places within Amorphea, we also observed its clustering sister to Diaphoretickes (Extended Data Fig. 1c and Supplementary Discussion). Diaphoretickes form a stable group composed of two well-supported monophyletic clades: the Cryptista–Archaeplastida (the last including Chloroplastida, Rhodophyta, Glaucophyta, Picozoa and Rhodelphis) and SAR (Stramenopiles, Alveolata and Rhizaria). Apart from these, Haptista, Telonemia, Hemimastigophora and Ancoracysta (the last now classified within the phylum Provora66) showed varying placements within Diaphoretickes, depending on site filtering and phylogenetic methods (Extended Data Fig. 1). Overall, the inferred eukaryotic tree of life, and the inference of the three major supergroups, Excavata, Amorphea and Diaphoretickes, provide a solid framework for interpreting individual gene trees, defining LECA versus post-LECA clades and thereby determining the relative timing of gene acquisitions.

Left, maximum-likelihood phylogeny of the eukaryotic tree of life based on the concatenation of 317 phylogenetic markers. The tree is unrooted, but drawn with Excavata at the root for ease of visualization. The concatenated multiple sequence alignment (MSA) consisted of 207 taxa and 97,680 positions, and the tree was built with IQ-TREE.2.1.2 under the LG + C60 + G model and using optimized ultrafast bootstrap (UfBoot2 -bnni). Annotation corresponds to characteristic traits (see legend). The extended tree and additional phylogenetic analyses of the eukaryotic species tree are provided in Extended Data Fig. 1. Right, global phylogenetic distribution of CCM enzymes across the eukaryotic species tree. HK, hexokinase; GLK, glucokinase; ROK, repressor protein, open reading frame, sugar kinase (that is, hexokinase); ADPGK, ADP-dependent glucokinase; GPI, glucose-6-phosphate isomerase; PFKA, 6-phosphofructokinase (A); ALDO, fructose-bisphosphate aldolase, Class 1; FBA, fructose-bisphosphate aldolase; TPI, triosephosphate isomerase; GAPDH, glyceraldehyde 3-phosphate dehydrogenase; PGK, phosphoglycerate kinase; GPMA/GPMB/GPMI/APGM, 2,3-bisphosphoglycerate phosphoglycerate mutases; ENO, enolase; PK, pyruvate kinase; G6PD, glucose 6-phosphate dehydrogenase; PGLS, 6-phosphogluconolactonase; PGL, 6-phosphogluconolactonase; PGD, 6-phosphogluconate dehydrogenase; RPE, ribulose-phosphate 3-epimerase; RPIA/B, ribose 5-phosphate isomerase A/B; TKT, transketolase; TAL, transaldolase; PRPS, ribose-phosphate pyrophosphokinase; PDH, pyruvate dehydrogenase complex; POR, pyruvate-ferredoxin/flavodoxin oxidoreductase; LDH, lactate dehydrogenase; ACS, acetyl-CoA synthetase; CS/GLTA, citrate synthase; ACO, aconitate hydratase; IDH, isocitrate dehydrogenase; SUC (A/B), 2-oxoglutarate dehydrogenase complex, subunit A/B, succinyl-CoA synthetase complex; SDH (A/B/C/D), succinate dehydrogenase complex (A/B/C/D); FUM (AB/C), fumarate hydratases (AB/C); MDH, malate dehydrogenase; ACL, ATP-citrate lyase (see also Supplementary Data 2). Pie charts group consecutive, isofunctional and protein complex enzymes, and different grey shading indicates the proportion of taxa from the respective taxonomic level bearing the gene. Orthogroups were manually selected from phylogenetic trees. Asterisks denote those trees containing paraphyletic clades with unclear origins (Supplementary Discussion).
Next, we inferred the phylogenies of 64 gene families encoding enzymes involved in the CCM of eukaryotes and evaluated the evolutionary origins of eukaryotic homologues in each phylogeny (Supplementary Figs. 1–32 and Supplementary Discussion). CCM enzyme gene family membership was determined using protein model annotations based on the KEGG orthology database (Supplementary Data 2), providing the starting point for the collection of homologues for phylogenetic inferences. Phylogenetic analyses were performed iteratively to maximize resolution and flag problematic data as well as putative eukaryotic contaminations (Extended Data Fig. 2 and Supplementary Data 3; Methods). For each CCM enzyme family, we manually identified putative ancestral clades in eukaryotes, including those containing organisms from at least two major groups, Excavata, Amorphea and Diaphoretickes (that is, potential LECA clades). The distribution of these orthogroups was mapped onto the eukaryotic species tree (Fig. 1), showing the widespread distribution of these enzymes in most eukaryotic groups and suggesting the presence of a canonical CCM in LECA.
Specifically, our phylogenetic analyses suggest that nine out of ten EMP glycolysis (Fig. 1 and Supplementary Figs. 1–12) and seven out of eight PPP (Supplementary Figs. 13–22 and Supplementary Discussion) enzymatic steps comprise putative LECA clades. By contrast, phylogenies of Entner–Doudoroff glycolytic enzymes seem to indicate that these enzymes were acquired later during eukaryotic evolution in photosynthetic eukaryotes and representatives with secondary endosymbionts67 (Supplementary Fig. 21). For pyruvate/acetate conversions, we observed that the pyruvate dehydrogenase complex (PDH, formed by PDHA/B/C/D subunits) as well as acetyl-CoA synthetase (ACS) and lactate dehydrogenases (LDH) were present in LECA. In contrast, the respective analogous enzymes for these reactions (pyruvate formate lyase, pyruvate-ferredoxin/flavodoxin oxidoreductase (POR) and ADP-forming acetyl-CoA synthetase (ACDA/B)) were found in extant anaerobic Metamonada, Archamoebae and Breviatea and some aerobic organisms (Fig. 1, Supplementary Figs. 23–27 and Supplementary Discussion). Most of these latter cases probably reflect post-LECA acquisitions with subsequent transfer among eukaryotes, while others (such as POR) may have been present in LECA45. The phylogenies of the reverse TCA cycle enzymes, defined by ATP-citrate lyase subunits (ACLA/B) or their fused version (ACLY; Fig. 1 and Supplementary Fig. 28), suggested that both were probably present in LECA as proposed previously68. Finally, for the TCA cycle, all phylogenies for the ten enzymatic steps showed clear LECA clades (Fig. 1, Supplementary Figs. 29–37 and Supplementary Discussion), indicating that the TCA cycle was present in LECA. Yet, the TCA cycle was nearly absent in Metamonada, Archamoebae, Microsporidia and partially in Breviatea, in line with suggested secondary losses in these eukaryotic clades64.
As previously observed64,65,69,70, mitochondrial-derived PDHD and fumarate hydratase C (FUMC, see below) of metamonads branch within LECA clades (Fig. 1 and Supplementary Figs. 23 and 35b), consistent with the prevailing view that these organisms secondarily lost mitochondria, rather than that they never had them55. Notably, the phylogeny of some enzymatic steps associated with these metabolic pathways revealed the presence of independent orthogroups coding for enzymes predicted to perform the same reactions. Examples include phosphoglycerate mutases (in EMP), citrate synthase (CS), aconitases (ACO) and isocitrate dehydrogenases (IDH, in TCA), suggesting that LECA harboured metabolic redundancy for these metabolic steps (Fig. 1 and Supplementary Figs. 9, 10, 29, 30 and 31). Altogether, these phylogenies indicate that LECA had the complete set of enzymes needed for the CCM.
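The supergroup criterion described in this section for flagging potential LECA clades (members drawn from at least two of Excavata, Amorphea and Diaphoretickes) can be illustrated with a minimal sketch. This is not the procedure used here, which relied on manual inspection of each phylogeny; the lineage-to-supergroup mapping, the function name and the example clades are hypothetical and only illustrate the rule.

```python
# Hypothetical, partial mapping from lineage to supergroup (illustration only;
# a real analysis would cover every taxon in the species tree).
SUPERGROUP = {
    "Discoba": "Excavata",
    "Metamonada": "Excavata",
    "Amoebozoa": "Amorphea",
    "Metazoa": "Amorphea",
    "Fungi": "Amorphea",
    "SAR": "Diaphoretickes",
    "Archaeplastida": "Diaphoretickes",
}


def is_potential_leca_clade(lineages, min_supergroups=2):
    """Return True if the clade members span at least `min_supergroups` supergroups.

    Lineages missing from the mapping are ignored rather than guessed.
    """
    found = {SUPERGROUP[name] for name in lineages if name in SUPERGROUP}
    return len(found) >= min_supergroups


# A clade spanning Excavata and Amorphea passes the criterion;
# a clade restricted to Diaphoretickes does not.
spanning = is_potential_leca_clade(["Discoba", "Metazoa"])
restricted = is_potential_leca_clade(["SAR", "Archaeplastida"])
```

Such a filter only nominates candidates; whether a clade truly dates back to LECA still depends on tree topology, support values and the possibility of eukaryote-to-eukaryote transfer.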
Prokaryotic origins of eukaryotic CCM
Next, we inferred the prokaryotic donor lineages for CCM enzymes present in LECA (summarized in Fig. 2a and detailed origins and distribution in Fig. 2b, Extended Data Fig. 3 and Supplementary Data 4), classifying gene trees by inferred donor lineage including Asgardarchaeota, the mitochondrial and chloroplast endosymbionts and other prokaryotic taxa. In what follows, we discuss examples of particular interest for each category. The associated phylogenetic trees are presented in Fig. 3a–d and Supplementary Figs. 1–37, with Fig. 3e depicting the general taxonomic composition of prokaryotic sister groups to enzymes present in LECA.

a, CCM pathways highlighting the proposed origins by colours. Enzyme names (see Fig. 1, Supplementary Discussion and Supplementary Data 2) in bold denote those enzymes potentially present in LECA. Edd, phosphogluconate dehydratase, dihydroxy-acid dehydratase; Eda, 2-dehydro-3-deoxyphosphogluconate aldolase; Azf, NAD+ dependent glucose-6-phosphate dehydrogenase; Acd, acetate–CoA ligase; AceA, isocitrate lyase; Pfl, formate C-acetyltransferase (Supplementary Data 2). Asterisks indicate tentative inferences. b, Phylogenetic profile of eukaryotic orthogroups manually selected from phylogenetic trees. Coloured cells indicate presence of genes with a proposed origin (see legend). Unknown origins (black cells) refer to unresolved phylogenies. Columns of grey cells indicate sequences not assigned to orthogroups (unclassified); they are shown if they are present in >80 eukaryotes. Bold enzyme names are those hypothesized to be present in LECA, and grey and light-brown denote isofunctional and protein complex enzymes, respectively. The eukaryotic tree of life includes fast-evolving and low genome completeness taxa such as Microsporidia and Picozoa (Extended Data Fig. 3). Raw data for the presence/absence profile are provided in Supplementary Data 4.

a–d, Examples of potential archaeal contributions from Asgardarchaeota (a), contributions from Alphaproteobacteria and Cyanobacteria (b), other prokaryotic origins (from known or unknown donor) (c) and Asgardarchaeota to eukaryote vertical gene transfer (VGT) versus HGT (d). Discontinuous lines indicate simplified tree topology after pruning the relative branches of interest. Numbers in brackets denote the number of sequences in the respective clades. 2ryCHSRA denotes secondary endosymbionts including Cryptista (C), Haptista (H), Stramenopile (S), Rhizaria (R) and Alveolata (A), Rhodophyta (Rhod), Chloroplastida (Chlor), photosynthetic (Ph) Archaeplastida. Phylogenies were built with IQ-TREE.2.1.2 under the LG + C20 + G + F model and using optimized ultrafast bootstrap (NNI UfBoot2). Extended trees are provided in Supplementary Figs. 1–37. Complete enzyme gene names defined in the text are: ADPGK, APGM and GPMI, ENO, RPIA, PDHA/B/C/D, PRPS, HK, PK, CS and ACDA. e, General taxonomic composition of sister group(s) of selected potential LECA clades. First and second panels (1) depict the taxonomic composition of the first sister group to a LECA clade, while the third panel (2) depicts the composition of the LECA closely related sister groups (see legend for visual explanation). Bar length is the sum of the respective presences of a taxon. ‘Absolute presence’ counts the presence of a specific taxon in the sister group, while ‘proportional presences’ represent the proportion of a taxon respective to the size of the sister group(s). Those taxa whose proportional presence in the single sister group was ≥1 were shown in the plot. Different coloured stacked bars refer to different pathways and prokaryotic phyla were sorted according to species relationships. BS, bootstrap support.
Asgardarchaeal host contributions
In four EMP phylogenies, putative LECA clades branch sister to Asgardarchaeota (Figs. 2 and 3a and Supplementary Figs. 4, 12 and 13): ADP-dependent glucokinase (ADPGK, acting in the first and third step of EMP71), two 2,3-bisphosphoglycerate-independent mutases (APGM and GPMI, analogous enzymes, see also Supplementary Discussion) and enolase (ENO). In three of these phylogenies, eukaryotes are nested within Archaea sister to sequences from Asgardarchaeota, consistent with the putative inheritance of these enzymes from the asgardarchaeal ancestor of eukaryotes (ADPGK 76%, APGM 89% and ENO 39% UfBoot2). In contrast, the topology of GPMI_1 shows a sister relationship between a few asgardarchaeal and eukaryotic sequences. In the PPP, the ribose 5-phosphate isomerase A (RPIA) phylogeny shows a large eukaryotic cluster containing species from the Amorphea and few Discoba branching next to Asgardarchaeota (97%; Figs. 2 and 3a and Supplementary Fig. 16). Together, this may indicate that Asgardarchaeota have contributed gene families to the CCM of eukaryotes.
Other cases of potential archaeal origins remain more speculative. Eukaryotic ATP-citrate lyase subunits (ACLA/B) or their fused version (ACLY) appear to be present in various Asgardarchaeota and might therefore have been inherited by eukaryotes through the host lineage. However, the eukaryotic ACLA/B and ACLY clades branch with DPANN72,73 and Thermoplasmatota-E3, respectively, suggesting independent origins from different archaeal groups (Supplementary Fig. 28 and Supplementary Discussion). Eukaryotic malate dehydrogenase LECA paralogues operate in the cytoplasm (MDH1) and in the mitochondria (MDH2)74. The phylogeny of the MDH family in combination with conserved spliceosomal intron positions (Malin75 LECA intron probability 0.6; Methods) suggests that MDH1 and MDH2 originated by duplication before LECA, with a clade containing TACK76 archaeal sequences and Baldarchaeota sequences as sister groups (Supplementary Fig. 36, Supplementary Data 5 and Supplementary Discussion). However, the long branches characterizing these phylogenetic relationships render the archaeal origin of MDH1/2 tentative.
Lastly, in three phylogenies, we observed the clustering of a limited number of eukaryotes with Asgardarchaeota (Fig. 3d). The pyruvate kinase (PK) phylogeny contains a phylogenetic group, PK_5, mainly composed of Amoebozoa (plus a few others), nested within an asgardarchaeal clade (Fig. 3d and Supplementary Fig. 12d). Furthermore, the phylogeny of GPMI showed, in addition to a clear LECA clade (GPMI_1), a small monophyletic group of Asgardarchaeota with some eukaryotes (labelled as GPMI_2; Fig. 3a, Supplementary Fig. 10 and Supplementary Discussion). The third case is the ACDA/B enzyme family, which is involved in the reversible conversion of acetate into acetyl-CoA using ATP (analogue of ACS) (Fig. 2a). We found a fully supported monophyletic group (100% UfBoot2, ACD3) containing Lokiarchaeia and Fornicata, a lineage within Metamonada (Fig. 3d and Supplementary Fig. 27). The clustering of these eukaryotic clades with Asgardarchaeota would be consistent with orthogroups that were present in LECA and subsequently underwent loss during eukaryotic evolution, resulting in their absence from model organisms77 (that is, vertical gene transfer from Asgardarchaeota to LECA of PK). Alternatively, this clustering might reflect a post-LECA transfer from Asgardarchaeota to the ancestor of one of these eukaryotic groups with subsequent transfer between eukaryotes. Overall, and despite the sometimes limited resolution of single-gene trees, these cases demonstrate a previously underappreciated role of the asgardarchaeal host cell in shaping eukaryotic CCM.
Alphaproteobacterial and cyanobacterial endosymbiotic contributions
The next category involves putative endosymbiotic contributions to LECA and the Archaeplastida ancestor through EGT78. Indeed, many CCM phylogenies recovered Alphaproteobacteria as sister clades to eukaryotes (Figs. 2 and 3b). We found two alphaproteobacterial contributions to the EMP (fructose-bisphosphate aldolase (ALDO) and triosephosphate isomerase (TPI)), one to the PPP (ribose phosphate pyrophosphokinase (PRPS)), four in pathways related to pyruvate conversions (PDHA/B/C/D) and ten to the TCA cycle (IDH1, 2-oxoglutarate dehydrogenase subunits, SUCA/B, succinyl-CoA synthetase alpha/beta subunits, LSC1/2, succinate dehydrogenase subunits, SDH1/2/3/4 and FUMC; Supplementary Figs. 5a, 6, 22, 23 and 31–35). Thus, most of the potential contributions from alphaproteobacteria seem to operate in the mitochondria (Fig. 2a).
We also found several likely cyanobacterial contributions to the CCM of photosynthetic eukaryotes (Fig. 2a). Specifically, four enzymes of the EMP (glucose-6-phosphate isomerase (GPI), ALDO class II, FBA, glyceraldehyde 3-phosphate dehydrogenase (GAPDH) and phosphoglycerate kinase (PGK)), six of the PPP (ribulose-phosphate 3-epimerase (RPE), RPIA, PRPS, transketolase (TKT), phosphogluconate dehydratase (EDD) and 2-dehydro-3-deoxyphosphogluconate aldolase (EDA)) and four among pyruvate conversions (PDHA/B/C/D) have topologies consistent with being derived by EGT from cyanobacteria (Supplementary Figs. 3, 5b, 7, 8 and 17–23). These phylogenetic clusters also contain photosynthetic eukaryotes with higher-order plastids (for example, from secondary and tertiary endosymbioses). The PDHC/D phylogenies revealed a green algae plastid EGT in Chlorarachnea, as well as red algae plastid EGTs in Cryptophyceae, Haptophyta, Myzozoa and Gyrista (chromist lineage79; Fig. 3b and Supplementary Fig. 23). Similarly, the phylogenies of PGK and RPE (Supplementary Figs. 8 and 17) also comprise plastid EGTs, whereas the phylogenies of PFK2, ENO, RPIA, LSC2 and ALDO showed monophyletic groups comprising red-algae-derived lineages within LECA clades, indicative of nucleus-to-nucleus EGTs (Supplementary Figs. 4, 5, 11 and 32). These observed EGTs involved various sister groups, with eukaryotic taxa ranging from red and green algae to non-photosynthetic organisms, suggesting complex evolutionary histories. Thus, PDHC/D phylogenies support distinct secondary endosymbiosis events involving green and red algae, respectively, and serial endosymbioses in red lineages50,79.
The phylogenetic signal for the sister relationship between Alphaproteobacteria or Cyanobacteria and eukaryotes is not always unequivocal13 (Supplementary Discussion): in phylogenies of TPI and PRPS, eukaryotes are sister to Alpha/Gammaproteobacteria, while trees of LSC2 and ACS, recover genes of other prokaryotic clades interspersed between the alphaproteobacterial/LECA clades. Furthermore, the PDHD and FUMC phylogenies include divergent eukaryotic sequences which branch within the Alphaproteobacteria clade rather than within the LECA clade (especially Excavata taxa; Supplementary Figs. 23 and 34). Similarly, cyanobacterial contributions are not always highly supported (for example, RPE, EDD and EDA; Supplementary Figs. 17 and 21). Nevertheless, our analysis shows that alphaproteobacterial contributions to LECA mainly operate in the TCA pathway, and that cyanobacterial EGT contributions to the Archaeplastida ancestor often comprise enzymes of the EMP glycolysis and the PPP. The PDH complex and PRPS phylogenies revealed contributions derived from both Alphaproteobacteria and Cyanobacteria (Figs. 2 and 3b), potentially illustrating the importance of pyruvate and ribose phosphate metabolisms in these endosymbioses.
Contributions to LECA CCM from other prokaryotic lineages
Besides contributions of the asgardarchaeal host and the alphaproteobacterial endosymbiont to the LECA proteome, several phylogenetic trees indicate donations from other prokaryotic lineages, some with good support (>95% UfBoot2). In some cases, these donations included enzyme families that lacked respective homologues in Asgardarchaeota and Alphaproteobacteria. Examples comprise phylogenies of glycolytic enzymes (such as hexokinase (HK), 6-phosphofructokinase 1 (PFKA), GAPDH, PGK and PK), enzymes involved in the PPP (glucose-6-phosphate 1-dehydrogenase (G6PD), 6-phosphogluconolactonase (PGLS), RPE, TKT and transaldolase (TAL)), as well as TCA cycle enzymes (CS, IDH1 and FUMA/B) (Figs. 2 and 3c). Potential donors that we identified included Chlamydia (TAL), Planctomycetota–Verrucomicrobiota (PGK, G6PD and RPE), Fusobacteriota (PK), Cyanobacteria, Dependentiae (for two independent donations of LDH) and Chloroflexota (CS) (Fig. 3c and Supplementary Figs. 12, 13, 17, 29 and 36). However, several phylogenies displayed a mixed composition of sister groups, which hindered the identification of the donor for the respective clade (for example, GAPDH; Supplementary Fig. 7). Hence, we tallied for each LECA orthogroup the occurrence of prokaryotic taxa in its sister group, which, besides a clear archaeal and proteobacterial signal, also revealed the presence of recurrent phyla in these sister groups, including Myxococcota–Desulfobacterota, Bacteroidota and Acidobacteriota (Fig. 3e). These examples suggest that prokaryotes other than Asgardarchaeota and Alphaproteobacteria have contributed to the assembly of CCM during and after eukaryogenesis.
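The tally of prokaryotic taxa across sister groups can be sketched as follows. This is a minimal illustration under one reading of the 'absolute' and 'proportional' presence measures described for Fig. 3e: absolute presence is taken as the number of sister groups in which a phylum occurs, and proportional presence as its share of sequences in each sister group, summed over orthogroups. That reading, the function name and the toy orthogroup labels are assumptions, not the scripts used for the figure.

```python
from collections import Counter, defaultdict


def tally_sister_groups(sister_groups):
    """Tally prokaryotic phyla across the sister groups of LECA orthogroups.

    `sister_groups` maps an orthogroup name to the list of phylum labels of
    the sequences in its sister group (one label per sequence).
    Returns (absolute, proportional) dictionaries keyed by phylum.
    """
    absolute = Counter()               # number of sister groups containing the phylum
    proportional = defaultdict(float)  # summed fraction of each sister group
    for members in sister_groups.values():
        if not members:
            continue  # skip orthogroups without a resolved sister group
        for phylum, n in Counter(members).items():
            absolute[phylum] += 1
            proportional[phylum] += n / len(members)
    return dict(absolute), dict(proportional)


# Toy input with hypothetical orthogroup names.
example = {
    "OG_A": ["Asgardarchaeota"] * 3 + ["Bacteroidota"],
    "OG_B": ["Alphaproteobacteria", "Alphaproteobacteria"],
    "OG_C": ["Asgardarchaeota", "Alphaproteobacteria"],
}
absolute, proportional = tally_sister_groups(example)
```

Summing proportions rather than raw counts keeps a single large, well-sampled sister group from dominating the signal, which is one plausible reason for reporting both measures.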
Gene families of unresolved origins
In several (at least 12) phylogenetic reconstructions, it was not possible to clearly denote LECA clades because of paraphyletic branching of eukaryotic and prokaryotic sequences resulting in unresolved sister groups. While the phylogenetic signal was limited in some cases (phosphoglycerate mutases GPMA and GPMB or 6-phosphogluconolactonase (PGL)), others recovered consistent and robust topologies across a range of datasets and analyses (that is, glucokinase (GLK), GPI, PGD, EDA/D, PRPS1 and ACS; Supplementary Figs. 1, 3, 16, 22 and 26). For example, our phylogeny of PRPS recovers an unresolved eukaryotic group (PRPS1) that might be derived from archaeal PRPS. However, this is speculative because of the presence of interspersed bacterial groups (Fig. 3b, Supplementary Fig. 22 and Supplementary Discussion). Similarly, the GPI phylogeny is consistent with previous work that also resolved paraphyletic clades for eukaryotic homologues80,81 (Supplementary Fig. 3). Our investigation of the conservation of spliceosomal introns82 across the MSA of eukaryotic GPI showed several conserved intron positions across eukaryotic clades, suggesting that this enzyme may in fact have been present in LECA (Malin probability >0.5; Supplementary Fig. 3c and Supplementary Discussion). On the other hand, homologous recombination events between paralogues of different origins may explain some of the observed patterns78 (for example, potential recombinant region in GPI; Supplementary Fig. 3d,e and Supplementary Discussion). Thus, the evolutionary origins of these latter eukaryotic gene families remain unresolved.
CCM remodelling by transfer, loss, replacement and targeting
We next investigated post-LECA evolution of CCM enzymes, including their correlative distribution across the eukaryotic tree and their predicted organellar localization as inferred from organelle targeting sequences. The analysis of CCM enzyme distribution across the tree revealed that orthogroup repertoires vary between distinct eukaryotic clades (Fig. 2b). We identified cases of both independent replacement and differential retention of isofunctional enzymes (for example, HK/GLK/ADPGK, ALDO/FBA, GPMA/GPMB/GPMI/APGM, RPIA/RPIB, PDH/POR, ACS/ACDAB, ACO/ACO2, IDH1/IDH2/IDH3 and FUMAB/FUMC; Fig. 2b). The evolutionary history of enzymes of inferred asgardarchaeal origin, such as ADPGK, APGM and RPIA, suggests that these genes were present in LECA but subsequently replaced in some eukaryotic taxa by horizontally acquired homologous or analogous enzymes of bacterial origin (HK, GPMA/B and RPIB, respectively) (Fig. 2b). A correlation network analysis (±0.5 phi coefficient cut-off; Methods) of orthogroups and lifestyle characteristics (for example, anaerobic, primary and secondary endosymbiosis) suggests that CCM enzyme repertoires partially reflect eukaryotic lifestyles (Fig. 4a, Extended Data Fig. 4 and Supplementary Discussion). Correlated distributions among photosynthetic eukaryotes (arising from primary and secondary endosymbioses) are mainly related to EMP/EDP and PPP, while correlations regarding aerobic/anaerobic lifestyle usually involved pyruvate/acetate conversions and TCA cycle enzymes (Fig. 4a). Phylogenies of POR, ACDA/B, GPI, FBA and RPIB displayed orthogroups involving anaerobic eukaryotes (Fig. 4a), suggesting adaptations to anoxic conditions.
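For readers unfamiliar with the phi coefficient, the sketch below shows how pairwise correlations between binary presence/absence profiles could be computed and filtered at the ±0.5 cut-off to give network edges. It is a minimal illustration rather than the pipeline used here: the function names and toy profiles are hypothetical, constant profiles are assigned a coefficient of 0 by convention, and the modularity clustering performed in Gephi is not reproduced.

```python
from math import sqrt


def phi_coefficient(x, y):
    """Phi coefficient between two equal-length binary (0/1) profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # A profile that is constant across taxa has no defined correlation;
    # returning 0.0 here is a convention chosen for this sketch.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


def correlated_edges(profiles, cutoff=0.5):
    """Return (a, b, phi) for every pair with phi > cutoff or phi < -cutoff."""
    names = sorted(profiles)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            phi = phi_coefficient(profiles[a], profiles[b])
            if phi > cutoff or phi < -cutoff:
                edges.append((a, b, round(phi, 3)))
    return edges


# Toy profiles over six hypothetical taxa (1 = present, 0 = absent).
toy_profiles = {
    "anaerobic": [1, 1, 0, 0, 0, 0],
    "POR": [1, 1, 0, 0, 0, 0],
    "PDH": [0, 0, 1, 1, 1, 1],
    "ENO": [1, 1, 1, 1, 1, 1],
}
toy_edges = correlated_edges(toy_profiles)
```

In this toy input, an enzyme restricted to the anaerobic taxa yields a positive edge with the lifestyle trait, an enzyme restricted to the remaining taxa yields a negative edge, and a universally present enzyme yields no edge at all.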

a, Left, correlative networks for the distribution of those orthogroups with higher (red edges) or lower (blue edges) phi coefficients than 0.5 and −0.5, respectively. Light pink indicates orthogroups including anaerobic eukaryotes. Clusters were obtained by modularity using Gephi. Right, phylogenetic profile of the respective correlated orthogroups indicating their evolutionary origins (cell colour) and targeting signal (cell shape). Taxonomic tree is a subselection of representatives and is annotated with characteristic traits of the respective taxa (see legend). b, Distribution of targeted proteins along the eukaryotic tree of life and the CCM. Bars represent the proportion of sequences with the respective targeting. noTP, no transit peptide; mTP, mitochondrial transit peptide; cTP, chloroplast transit peptide; luTP, thylakoid luminal transit peptides; multiTP, multiple transit peptides (see legend). LuTPs were clustered together with cTPs. Only sequences from selected orthogroups (Fig. 2b) were used for this analysis. Raw data for these plots provided in Supplementary Data 4.
Most enzymes of the EMP and PPP, as well as the key enzymes of the reverse TCA cycle, do not encode obvious targeting signals, whereas most of the enzymes involved in pyruvate conversions and the TCA cycle appear to be targeted to mitochondria (Fig. 4b). Nevertheless, exceptions exist, indicating potential sub- or neo-functionalization of certain enzymes. For instance, GPMA and GPMB are typically found in both the cytoplasm and mitochondria/chloroplast, whereas their analogous enzymes, GPMI and APGM, do not exhibit mitochondrial targeting sequences (Fig. 4b). Likewise, in agreement with their general targeting patterns, MDH2 is generally associated with mitochondrial functions, whereas MDH1 tends to be associated with cytoplasmic activities74, although the reverse is true in some taxa (Fig. 4b and Extended Data Fig. 3). Not all proteins of alphaproteobacterial origin are targeted to the mitochondria (for example, ALDO and TPI) and conversely some enzymes of non-alphaproteobacterial origin appear to have mitochondrial targeting signals (for example, CS, ACNB and IDH1/2; Figs. 3 and 4b), illustrating retargeting of CCM enzymes. The following two cases exemplify the complexity of post-LECA retargeting of CCM: chloroplast and mitochondrial glycolysis (Fig. 4b and Extended Data Fig. 5). In Archaeplastida, the genes coding for EMP (and PPP) enzymes are targeted to the cytoplasm and chloroplast, respectively83. In particular, our results highlight the frequent duplication and subsequent relocation of ‘nuclear’ genes to the photosynthetic organelle (Extended Data Fig. 6 and Supplementary Discussion). Similarly, the parallel glycolysis in cytoplasm and mitochondria described in SAR84,85 appears to be specific to secondary endosymbionts and involves the lower glycolysis, between TPI and PK (Extended Data Figs. 5 and 7 and Supplementary Discussion).
Therefore, the distributions of targeted proteins across the CCM enzymes illustrate the general compartmentalization of these pathways in eukaryotic cells. However, the targeting of proteins is not always in agreement with their origins, suggesting an ongoing process of retargeting during the evolution of eukaryotes86,87.
Discussion
Our phylogenetic analyses demonstrate that a complete set of eukaryotic CCM enzymes was probably present in LECA. These enzymes originated from a variety of sources, including not only contributions from the alphaproteobacterial symbiont but also from the asgardarchaeal host and other prokaryotic donor lineages (Figs. 2a, 3e and 5). We found six putative contributions from Asgardarchaeota to the CCM of LECA, within the EMP and PPP (Fig. 2a): ADPGK, GPMI, APGM, ENO, PK and RPIA, which is in contrast to previous work postulating that Asgardarchaeota did not contribute to eukaryotic CCM12 or that ENO was the only eukaryotic enzyme within carbon metabolism to be of archaeal origin88,89. Even more salient is the potential archaeal affiliation of MDH1/2 and ACLA/B/Y which are involved in the TCA and reverse TCA cycles, and which might therefore represent archaeal host contributions that became integrated into the mitochondrial TCA cycle. With the exception of ENO, these asgardarchaeal host contributions are patchily distributed in extant eukaryotes, apparently because of independent horizontal replacement events. This, combined with limited taxon sampling of prokaryotes and microbial eukaryotes in previous studies, might explain why these contributions went undetected. Our findings on asgardarchaeal contributions to the CCM strengthen the idea that eukaryotic metabolism emerged from the integration of genes from both symbiotic partners (Fig. 5), rather than being derived solely from the mitochondrial progenitor89.

Schematic representation of the tree of life illustrating the evolutionary history of the assembly of CCM in eukaryotes. Please note that we only depict selected major routes of HGT, which is pervasive in eukaryotic, archaeal and bacterial evolution. PVC, Planctomycetes, Verrucomicrobia and Chlamydiae122; CPR, Candidate Phylum Radiation123. Credit: Icons from PhyloPic under a Creative Commons license CC0 1.0: protist (Andalucia godoyi), Matus Valach; human, T. Michael Keesey; fungus, Guillaume Dera; protist (Colponema vietnamica), Guillaume Dera; tree, Gabriele Midolo.
We found 17 putative alphaproteobacterial contributions, most of which are predicted to operate in the mitochondria (except for TPI, ALDO and PRPS; Fig. 2a). This finding is reminiscent of the evolutionary mosaicism previously reported for another essential process in eukaryotes, iron–sulfur cluster biosynthesis, in which the mitochondrial steps are predominantly alphaproteobacterial in origin, while the cytosolic steps are carried out by enzymes of varying evolutionary affinities90. While eukaryotes appear to have several CCM genes acquired from different individual bacterial taxa other than alphaproteobacteria, no additional dominant source of gene donations is apparent (Fig. 3e). We nonetheless note that a substantial number of phylogenies displayed a mixed composition of sister groups. This may be due to lack of phylogenetic signal, undersampling (that is, lack of sequence data) of relevant prokaryotic taxa and the ongoing evolution of, and HGT within and between, archaea, bacteria and eukaryotes91,92,93,94,95,96. Furthermore, we identified certain lineages within the sister groups that have previously been suggested to have exchanged genes with stem eukaryotes, such as Chlamydiota45 and Myxococcota97. This diversity of potential donors highlights the mosaicism of the CCM in eukaryotes including contributions from additional prokaryotic sources.
The cyanobacterial contributions representing EGTs from the chloroplast to the Archaeplastida ancestor operate in the EMP and PPP (Fig. 2a), which are connected to the Calvin cycle83. The evolutionary origins of both chloroplast and cytoplasmic versions of the EMP and PPP in Archaeplastida show a general prevalence of nuclear gene duplications over the genes originating from the chloroplast. The predominant process appears to be one in which nuclear genes were duplicated, with one copy relocated to the photosynthetic organelle, which might have promoted the genome reduction of the endosymbiont78,86,98. Similarly, while the targeted localization of glycolytic enzymes to the mitochondria has previously led to the suggestion of an endosymbiotic origin of glycolysis84,85, our work does not support this conclusion. Instead, our data indicate that CCM enzymes have been retargeted between cytosol, mitochondrion and plastid many times independently during the evolution of eukaryotes, revealing an ongoing remodelling of eukaryotic CCM.
Our results show that investigating the origin of eukaryotic metabolism is crucial to inform our understanding of eukaryogenesis and the impact of the two primary endosymbiotic events that occurred during the origin and diversification of eukaryotes. The archaeal contributions we identify are not consistent with the view that eukaryotic metabolism is exclusively of bacterial origin34,35,36,37,38,39. Instead, they suggest that eukaryotic CCM is the result of an integration of host and symbiont contributions and continuous HGT (Fig. 5). The observation that most enzymes of archaeal ancestry are cytosolic and operate in the EMP and PPP, while genes of alphaproteobacterial origin function in the TCA within the mitochondrial organelle (Fig. 2a), is consistent with symbiogenetic models of eukaryogenesis: that is, models that invoke an archaeal origin of the eukaryotic cytoplasm and an alphaproteobacterial origin of the mitochondrion15,21,24,26,28. Specifically, our results support the view of syntrophic interactions between host and endosymbiont, in which the archaeal partner produced reducing equivalents by the degradation of organic substrates via glycolysis which, in the absence of a suitable electron acceptor, were shuttled to a bacterial symbiont which contributed a TCA cycle and an electron transport chain24,25,26,30,99. While we could not identify a third dominant donor lineage, our results suggest that some CCM enzymes present in LECA have other phylogenetic origins among prokaryotes, which may be due to transient interactions with other prokaryotes before and during eukaryogenesis. Given our results and the previously undetected asgardarchaeal host contributions to the eukaryotic CCM, we expect future studies analysing the gene origins of additional metabolic pathways to further inform symbiogenetic models for the origin of the eukaryotic cell.
Methods
Dataset construction
Initial proteome selection, annotation and redundancy filtering in the core dataset
We assembled a representative and balanced dataset of selected proteomes comprising 483 archaea, 487 bacteria (5 archaeal and 95 bacterial phyla) and 224 eukaryotic proteomes, which we refer to as the core database (Supplementary Data 1). We collected representatives of all major eukaryotic clades available in 2021, selected on the basis of proteome quality (that is, completeness and prevalence of contamination). For archaea and bacteria, we preferentially selected type strains and high-quality metagenome-assembled genomes (MAGs) (based on completeness (>90%) and contamination (<5%) scores). In addition, we added MAGs representing taxa that did not fulfil our stringent quality criteria, such as genome-reduced DPANN and CPR, which otherwise would not be present in our core database. Each proteome was annotated with eggNOG-mapper v.2.1.4-2 (MMseqs search mode100,101), KOFAM_SCAN v.1.3.0 (ref. 102) (-f mapper-one-line, e value 1 × 10−3) and HMMSEARCH (HMMER.3.2.3 (ref. 103), e value < 1 × 10−3, selecting best i-value hit) against the KO.hmm database104. We also performed DIAMOND v.2.0.6 (ref. 105) protein sequence searches against NCBI_nr release 244. To identify sequences for metabolic gene trees, we primarily used KEGG Orthology (KO; Supplementary Data 2) annotations, prioritizing KOFAM classifications. In instances where KOFAM annotation was absent, we relied on HMMSEARCH annotations. The respective sequences were additionally annotated with TargetP v.2.0 (ref. 106).
Eukaryotic proteomes were downloaded and manually selected from EukProt v.3 (ref. 107). As this selection includes a variety of sequencing methods (genomes, transcriptomes and single-cell genomes), redundant and truncated sequences were filtered out uniformly. For each proteome, we first used MMseqs2 (options easy-cluster, --cluster-mode 2, --cov-mode 1, -c 1, --min-seq-id 0.95; ref. 100) and then used a custom script (read_clusters_mmseqs_declusterization.py) to redefine clusters.
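The annotation priority described above for KO assignment (KOFAM first, HMMSEARCH only when KOFAM gives no annotation) can be illustrated with a minimal sketch. The function name, the protein identifiers and the placeholder labels KO_A, KO_B and KO_C are hypothetical and are not taken from the study's scripts.

```python
def assign_ko(kofam_ko, hmmsearch_ko):
    """Return the KO for one protein, preferring the KOFAM assignment.

    Either argument may be None when the respective tool reported no hit.
    """
    if kofam_ko is not None:
        return kofam_ko
    return hmmsearch_ko  # may still be None if neither tool annotated it


# Hypothetical per-protein annotations: (KOFAM result, HMMSEARCH result).
proteins = {
    "prot_1": ("KO_A", "KO_B"),  # both tools report a KO: KOFAM wins
    "prot_2": (None, "KO_C"),    # KOFAM absent: fall back to HMMSEARCH
    "prot_3": (None, None),      # no annotation from either tool
}
assigned = {pid: assign_ko(k, h) for pid, (k, h) in proteins.items()}
print(assigned)
```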
Curation of proteomes from eukaryotic contaminations
We performed phylogenies of eukaryotic phylogenetic markers (see below) and identified prominent contaminations in the proteomes of some taxa in our dataset (Supplementary Data 3). Among others, these seem to result from difficulties in obtaining axenic cultures (for example, Telonemia108). To detect and filter out these contaminant sequences, we implemented the following workflow: first, we clustered protein families using Broccoli v.1.2.1 (ref. 109), using representative non-redundant eukaryotic proteomes. For each orthogroup, we aligned sequences with MAFFT-auto v.7.453 (ref. 110) and trimmed the MSA with trimAl 1.4.22 (ref. 111) (-gt 0.2), removing sequences with coverage <35% (custom script). We then used FastTree v.2.1.11 (ref. 112) (-lg) for inferring the phylogeny of each orthogroup. Finally, we used a custom ETE113 script to identify contaminations, defined as cases in which certain eukaryotic taxa formed a monophyletic group together with the known contaminants. Specifically, the following contaminants were removed: kinetoplastid sequences were detected in several eukaryotic proteomes including Lapot gusevi, Colponemids and Telonemia, among others, and Apusomonadida sequences were detected in proteomes of Choanocystis sp. and Colponema vietnamica. For kinetoplastid contamination, truncated contaminant sequences remained after this filtering and, thus, we additionally filtered out those sequences that were taxonomically assigned to kinetoplastids on the basis of the National Center for Biotechnology Information (NCBI) and eggNOG annotations (Supplementary Data 3).
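The coverage filter applied after trimming can be sketched as follows. This is a minimal illustration only: it assumes that coverage means the fraction of alignment columns in which a sequence has a residue rather than a gap, and the function names and toy alignment are hypothetical. The custom script used in the study is not reproduced here.

```python
def coverage(aligned_seq, gap_chars="-."):
    """Fraction of alignment columns in which the sequence has a residue."""
    if not aligned_seq:
        return 0.0
    residues = sum(1 for c in aligned_seq if c not in gap_chars)
    return residues / len(aligned_seq)


def filter_by_coverage(alignment, min_coverage=0.35):
    """Keep sequences whose coverage is at least min_coverage.

    alignment: dict mapping sequence id -> aligned (gapped) sequence.
    """
    return {
        name: seq
        for name, seq in alignment.items()
        if coverage(seq) >= min_coverage
    }


# Toy trimmed alignment (hypothetical identifiers and sequences).
toy_alignment = {
    "taxonA_1": "MKTAYIAKQR",  # 10 of 10 columns covered
    "taxonB_1": "MK--YIA-QR",  # 7 of 10 columns covered
    "taxonC_1": "-------KQR",  # 3 of 10 columns covered, removed
}
kept = filter_by_coverage(toy_alignment)
print(sorted(kept))
```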
KO homologies
Single KO families are not always sufficient for inferring deep evolutionary history of enzymes because they are sometimes defined on relatively shallow levels. Therefore, we inferred the homology across KO families and combined homologous families when necessary (Supplementary Data 2). We clustered all sequences from the core dataset by KO annotation and further analysed those KO families with more than ten sequences. Specifically, sequences for each KO were aligned with MAFFT-auto v.7.453 (ref. 110), trimmed using trimAl 1.4.22 (ref. 111) (-gt 0.35) and again curated with trimAl (-maxidentity 0.85 -seqoverlap 80 -resoverlap 0.5). Next, we made individual hidden Markov models (HMMs) with the HH-suite 3.1.0 package114, using HHMAKE (-M 50). We combined all the resulting KO.hhm (14,744) into a single HH-suite database. Then, we performed HHSEARCH of KO.hhm of interest against our HH-suite database (Supplementary Data 2). Finally, we merged those KOs that were relevant for inferring the evolutionary history of certain families.
Investigating the origins of LECA clades using expanded dataset
To improve identification of prokaryotic origins of eukaryotic KO families, we searched potential LECA gene families (preliminarily identified from initial trees, see below) against a broader set of prokaryotic (NCBI-GTDB) and virus (NCBI) proteomes. We assembled a local dataset including all translated genomes from NCBI that have GTDB49 annotation and whose genome completeness was >75% and genome contamination was <5% (a total of 187,681 prokaryotic proteomes, which were over-represented in phyla such as Proteobacteria, Firmicutes and Actinobacteria, among others; Supplementary Data 1). Additionally, we added viruses from NCBI (a total of 44,889 viral proteomes). We refer to this database as the expanded dataset. The workflow was as follows: we first screened potential LECA clades across the preliminary phylogenies of CCM enzymes (see below) and built the respective protein HMMs using exclusively eukaryotic sequences. Then, we performed HMMSEARCHES (e value 1 × 10−5) of these eukaryotic HMMs against the expanded prokaryotic and viral datasets. To avoid over-representation of taxa, for each HMMSEARCH we selected the top 15 sequences for each taxonomic class until we collected a total of 150 sequences. Then, we added these sequences to our original set of sequences from the core dataset (removing redundant sequences at a 97% identity threshold using trimAl). These extended searches provided potential donors that were overlooked in the core dataset (for example, LDH phylogeny).
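One way to read the hit-selection rule above (at most 15 sequences per taxonomic class, stopping at 150 sequences in total) is sketched below. Ranking hits by e value and filling class quotas in that order is an assumption, as are the function name and the toy hit list; the study's own selection script may differ in detail.

```python
from collections import defaultdict


def select_hits(hits, per_class=15, total=150):
    """Pick low-e-value hits while capping each taxonomic class.

    hits: iterable of (sequence_id, taxonomic_class, e_value) tuples.
    Returns the selected sequence ids in order of increasing e value.
    """
    taken_per_class = defaultdict(int)
    selected = []
    for seq_id, tax_class, _evalue in sorted(hits, key=lambda h: h[2]):
        if len(selected) >= total:
            break
        if taken_per_class[tax_class] >= per_class:
            continue  # this class has already filled its quota
        taken_per_class[tax_class] += 1
        selected.append(seq_id)
    return selected


# Toy search result: one over-represented class and one rare class.
toy_hits = [
    (f"alpha_{i}", "Alphaproteobacteria", 10.0 ** -(100 - i)) for i in range(40)
] + [(f"bac_{i}", "Bacilli", 10.0 ** -(30 - i)) for i in range(5)]

selected = select_hits(toy_hits)
print(len(selected), selected[:3])
```

In this toy case the over-represented class is capped at 15 sequences, so the five hits from the rare class are retained even though their e values are much weaker.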
Phylogenetic analyses
Eukaryotic tree of life phylogenies
The eukaryotic tree of life was reconstructed by the concatenation of the alignments of phylogenetic markers that were carefully and individually assessed and curated through iterative phylogenetic reconstructions. We first assembled protein HMMs (MAFFT-auto v.7.453 (ref. 110), trimAl 1.4.22 (ref. 111) -gt 0.4, HMMBUILD103) using the sequences for 320 markers provided in ref. 50. For each phylogenetic marker HMM, we performed an HMMSEARCH (e value 1 × 10−15) against the eukaryotic proteomes and extracted the top ten sequences of each taxon sorted by individual e-value per domain. We performed an initial phylogeny using MAFFT-auto v.7.453 (ref. 110), trimming with trimAl 1.4.22 (ref. 111) (-gt 0.70) and FastTree v.2.1.11 (ref. 112) (-lg) to identify the orthogroup in question and remove spurious and/or long-branching sequences. Then, we performed two other rounds of phylogenies using MAFFT-L-INS-i v.7.453 (ref. 110), BMGE 1.12 (ref. 115) (-h 0.55), MSA cover >35% and built the gene tree with IQ-TREE 2.1.2 using ultrafast bootstrap with the best-fitting empirical or mixture model116,117 (-bb 1000 -mset LG -madd LG + C10,LG + C20,LG + C10 + R + F,LG + C20 + R + F). These two rounds were used to identify and remove contaminating sequences and select a single orthologue per taxon on the basis of the phylogenetic position and the sequence length relative to the total alignment length (note that three phylogenetic markers were excluded because of low phylogenetic resolution). We finally concatenated 317 markers which were individually aligned with MAFFT-L-INS-i v.7.453 (ref. 110) and trimmed with BMGE 1.12 (ref. 115) (-h 0.55). Phylogenetic analyses were based on IQ-TREE 2.1.2 (ref. 117) (see below). Taxa with a concatenation coverage <50% as well as fast-evolving taxa such as Microsporidia were excluded for analyses focusing on the eukaryotic tree of life (Fig. 1 and Supplementary Data 1).
We first reconstructed a phylogeny using corrected UFBoot2 and the LG + C60 + G mixture model (-mset LG -madd LG + C60 + G --score-diff all -bb 1000 -bnni) with IQ-TREE 2.1.2 (refs. 116,117). We then gradually removed heterogeneous sites using the ‘alignment_pruner.pl’ script (--chi2_prune 0-0.9; https://github.com/novigit/davinciCode/blob/master/perl), followed by phylogenetic inferences using IQ-TREE 2.1.2 (refs. 116,117) (-mset LG -madd LG + C60 -bb 1000). We additionally reduced the MSA to 148 selected eukaryotes and performed a Bayesian phylogeny using PhyloBayes 3 (ref. 118) (-catfix C60, -gtr), although the chains did not converge (11,700 generations, maxdiff = 1, meandiff = 0.03).
CCM enzyme phylogenies
We used the metabolic maps of glycolysis, PPPs, Entner–Doudoroff pathway, pyruvate metabolisms and TCA cycle provided by KEGG (https://www.genome.jp/kegg/pathway.html) and determined their distribution across our core dataset to select those KOs that were present in eukaryotes (Supplementary Data 2). Instances such as glyceraldehyde-3-phosphate ferredoxin oxidoreductase and ketoglutarate dehydrogenase/multifunctional 2-oxoglutarate metabolism enzyme among other enzymes, were not found in eukaryotes and excluded in downstream analyses.
To reconstruct refined gene tree phylogenies, we performed three main steps (Extended Data Fig. 2). In the preliminary phase, we built an initial and curated phylogeny using the core dataset, by using a strict MSA covering threshold (>80%), visual inspection of the MSA and removing terminal long branches (that is, branches longer than six times the mean of all the terminal branch lengths, as assessed using the script ‘read_terminalbranchlength.py’). Final trees were obtained with IQ-TREE 2.1.2 using the best models and optimized UFBoot2 (-mset LG -madd LG + C20 + G + F -bb 1000 -bnni -alrt).
Then, we manually inspected the trees and identified potential LECA clades (including cases such as GPI and PGD) to build a eukaryotic-specific HMM. These HMMs were then used for the extension phase, in which we made HMMSEARCHES of each ‘LECA’ HMM against our local NCBI database including prokaryotes and viruses and added the 150 top sequences to our final set of sequences obtained in the previous phase (see above, expanded dataset).
The final phase consisted of two kinds of reconstructions. One was based on strict trimming (MSA cover >80%) to obtain consistent sister group relationships and definitions of LECA clades, while the other was based on inclusive trimming (MSA cover >20%) to include truncated sequences in the absence/presence profiles across eukaryotes. Final phylogenies are based on MAFFT-L-INS-i alignments trimmed with trimAl 1.4.22 (-gt 0.7) and IQ-TREE 2.1.2 using empirical and mixture models (-mset LG | -m LG + C20 + G + F -bb 1000 -bnni -alrt -nstop 500 -pers 0.2). Trees were manually rooted as described in the Supplementary Information. We preferentially used outgroup rooting, but when this was not possible, we chose an arbitrary root to ease visualization of sister groups of interest. In addition, some phylogenies required further refinements including addition of an outgroup (GLK, HK, ADPGK, ACO, MDH, LDH, SDH and LSC), extraction and phylogeny of single Pfam domains (PFK, H6PD, PGLS, ACDAB and ACLAB/Y), subselection of sequences for refined phylogenies (TPI, PK, EDD, EDA, TAL, TKT, PDHD, POR and FUMAB/C) and conservation of introns (GPI, PGD, ACS and MDH/LDH). All trees were annotated and visualized with Interactive Tree Of Life (iToL)119. All alignments and raw tree files are available via Zenodo120 (https://doi.org/10.5281/zenodo.10991068).
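The long-branch criterion used in the preliminary phase (terminal branches longer than six times the mean terminal branch length) can be illustrated with a short sketch. It is not the ‘read_terminalbranchlength.py’ script itself: it assumes that terminal branch lengths have already been extracted from the tree into a dictionary, and that the mean is taken over all terminal branches including the candidate long branches. All names and values are hypothetical.

```python
from statistics import mean


def long_terminal_branches(terminal_lengths, factor=6.0):
    """Return leaves whose terminal branch exceeds factor x the mean.

    terminal_lengths: dict mapping leaf name -> terminal branch length.
    """
    if not terminal_lengths:
        return []
    threshold = factor * mean(terminal_lengths.values())
    return sorted(
        leaf for leaf, length in terminal_lengths.items() if length > threshold
    )


# Toy tree: nine short terminal branches and one very long one.
toy_lengths = {f"leaf_{i}": 0.1 for i in range(9)}
toy_lengths["leaf_long"] = 5.0

flagged = long_terminal_branches(toy_lengths)
print(flagged)
```

Because the long branch itself raises the mean, this criterion can only flag a leaf when the tree has more than six leaves, which is one reason to treat it as a coarse first-pass filter alongside visual inspection.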
Domain extraction for phylogenetic reconstructions
For PFK, H6PD and PGLS phylogenies, we extracted the respective Pfam domain of interest inferred with HMMSCAN. For the case of ACLAB/Y and ACDAB, we first built the respective MSAs including all homologues (fused and separate genes) using MAFFT-L-INS-i v.7.453 and trimAl 1.4.22 (-gt 0.4). Then, we split the MSA into the respective subunits (ACLA/ACLB and ACDA/ACDB) and built an HMM with HMMBUILD. Then, we aligned the set of sequences to the HMM using HMMALIGN (--trim) and converted the HMMALIGN output file (that is, the ‘.sto’ file) into unaligned sequences, which were used for phylogenetic analyses. A similar approach was used for the separated mitochondrial pyruvate carrier subunits (MPC1/2). Note that the phylogeny of the AcdB/AclA subunit in Supplementary Fig. 27 is based on the extracted ATP-grasp Pfam domain. Final phylogenies were conducted as described above.
Analyses for shared introns
We examined shared spliceosomal intron positions in the GPI, PGD, ACS and MDH/LDH gene families to assess the potential monophyly of eukaryotic sequences (see GPI section in Supplementary Discussion for further contextualization). To identify the extent of conserved spliceosomal intron positions for our genes, we searched the HMM of interest against a set of previously selected proteomes for which genome data were available121. Then, we made preliminary trees using MAFFT v.7.453 (default), trimAl 1.4.22 (-gt 0.7) and FastTree v.2.1.11 (-lg), from which we selected the eukaryotic orthogroups of interest to investigate shared introns. We realigned the selected sequences using MAFFT-L-INS-i v.7.453 and used imapper (https://github.com/JulianVosseberg/imapper) to infer the table of shared intron positions. Finally, we made a phylogeny including closely related prokaryotic sequences previously obtained, using MAFFT-L-INS-i v.7.453, trimAl 1.4.22 (-gt 0.7) and IQ-TREE 2.1.2 (-m LG + C20 + G + F -B 1000 -alrt 1000 -bnni) and mapped the intron positions with sufficient conservation. For putative single-gene families such as GPI and PGD we mapped those positions with more than four shared introns in the same phase relative to their codon, while for putative paralogous gene families such as ACS and MDH we mapped those positions that shared introns between two subfamilies and where each of them contained at least four taxa sharing the respective intron. In addition, to obtain the probabilities for the presence of introns in LECA, we used Malin75, a maximum-likelihood analysis of intron evolution and conservation, using as input the species tree and intron gain/loss rates provided by ref. 121 and an intron table generated with imapper scripts (Supplementary Data 5).
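The conservation threshold for single-gene families (positions with more than four shared introns in the same phase) can be sketched as below. The input format, a list of (taxon, alignment position, phase) records, is an assumption made for illustration and is not the imapper output format; counting one intron per taxon per site is likewise an assumption.

```python
from collections import defaultdict


def conserved_intron_positions(introns, min_taxa=5):
    """Return (position, phase) sites shared by at least min_taxa taxa.

    introns: iterable of (taxon, alignment_position, phase) tuples, where
    phase is 0, 1 or 2 relative to the codon. 'More than four' shared
    introns corresponds to min_taxa=5.
    """
    taxa_by_site = defaultdict(set)
    for taxon, position, phase in introns:
        taxa_by_site[(position, phase)].add(taxon)
    return sorted(
        site for site, taxa in taxa_by_site.items() if len(taxa) >= min_taxa
    )


# Toy intron table (hypothetical taxa and positions).
toy_introns = (
    [(f"taxon_{i}", 120, 0) for i in range(5)]    # 5 taxa, same phase: kept
    + [(f"taxon_{i}", 120, 1) for i in range(2)]  # same position, other phase
    + [(f"taxon_{i}", 300, 2) for i in range(4)]  # only 4 taxa: not kept
)
conserved = conserved_intron_positions(toy_introns)
print(conserved)
```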
Topology support along MSA partitions of GPI
In the MSA of GPI, we identified a shared intron between Cryptophyceae and the chloroplast clade. We made phylogenies of the 20 positions downstream and upstream of the intron position (at position 1,915) and of the rest of the carboxy terminus, which yielded different topologies, suggesting a recombination event between nuclear and plastid paralogues. To verify that the topology is specific to the recombined region, we made a subselection of 67 representative sequences, aligned them using MAFFT-L-INS-i v.7.453 and trimmed the alignment with BMGE 1.12 (-h 0.55). Then, we split the MSA into partitions of 14 positions, to span the potentially recombined region identified visually. We performed phylogenies of each partition using IQ-TREE 2.1.2 (-bb 1000) with LG + G + F and C20 + G + F models. Finally, we read the *ufboot files with a custom ETE4 script, to investigate the UFBoot2 support of the topologies of interest: Cryptophyceae + Chlamydia or Cryptophyceae-only sequences branching monophyletically with plastid (Archaeplastida + Cyanobacteria) or with potential LECA paralogues.
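Splitting an alignment into fixed-width partitions, as done here with a width of 14 positions, can be sketched as follows. The sketch assumes consecutive, non-overlapping column blocks starting at the first column and uses a toy alignment; how the partitions were anchored relative to the recombined region in the study is not specified beyond the description above.

```python
def split_alignment(alignment, width=14):
    """Split an alignment into consecutive column blocks of `width`.

    alignment: dict mapping sequence id -> aligned sequence (equal lengths).
    Returns a list of (start, end, sub_alignment) tuples with 0-based,
    end-exclusive column coordinates; the last block may be shorter.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop() if lengths else 0
    blocks = []
    for start in range(0, n_cols, width):
        end = min(start + width, n_cols)
        sub = {name: seq[start:end] for name, seq in alignment.items()}
        blocks.append((start, end, sub))
    return blocks


# Toy 30-column alignment (hypothetical sequences).
toy_msa = {
    "seq_1": "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL",
    "seq_2": "ACDEFGHIKLMNPQ--TVWYACDEFGH-KL",
}
partitions = split_alignment(toy_msa)
print([(start, end) for start, end, _ in partitions])
```

Each sub-alignment could then be written to its own file and passed to a separate tree inference, so that bootstrap support for a given topology can be compared across partitions.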
Orthogroup definition and correlations
For the definition of orthogroups, we manually selected the sequences forming a monophyletic clade in the respective trees built using inclusive trimming (see above). We compared those with trees built using strict trimming to identify and add sequences that were not correctly placed in the expanded datasets, for example, Metamonada in the PDHD phylogeny. These manually selected orthogroups were used for plotting the phylogenetic profile (in Fig. 2b) as well as for plotting the presence of targeting signals (Fig. 4b). To infer the correlative distribution of these orthogroups, we converted the orthogroup distribution table into absences (0) and presences (1) and inferred the phi correlation coefficient using the sklearn.metrics.matthews_corrcoef Python function.
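For two binary presence/absence profiles, the phi coefficient can be computed directly from the 2 × 2 contingency table. The sketch below is a dependency-free illustration of that calculation and is not the study's code, which used sklearn.metrics.matthews_corrcoef; the orthogroup profiles are hypothetical. Returning 0 when a profile is constant is a convention adopted here for the undefined case.

```python
from math import sqrt


def phi_coefficient(x, y):
    """Phi coefficient between two equal-length 0/1 presence vectors."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # undefined when a profile is constant; 0 by convention
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical presence (1) / absence (0) profiles across six taxa.
og_a = [1, 1, 1, 0, 0, 0]
og_b = [1, 1, 1, 0, 0, 0]  # same distribution as og_a
og_c = [0, 0, 0, 1, 1, 1]  # complementary distribution
og_d = [1, 1, 0, 1, 0, 0]  # partially overlapping distribution

print(phi_coefficient(og_a, og_b))
print(phi_coefficient(og_a, og_c))
print(phi_coefficient(og_a, og_d))
```

Pairs of orthogroups with coefficients above 0.5 or below −0.5 would correspond to the positively and negatively correlated edges drawn in the network in Fig. 4a.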
Ethics statement
This research did not involve animals or humans and no new data have been generated. Furthermore, the information provided here does not pose a threat to public health, safety or security, animals, plants or the environment.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.