Chromosome-level genome assembly of Jaguar guapote (Parachromis manguensis) by massive parallel sequencing

Chromosome-level genome assembly of Jaguar guapote (Parachromis manguensis) by massive parallel sequencing

Background & Summary

Parachromis managuensis (NCBI Taxonomy ID: 172535), commonly known as the jaguar guapote, belongs to the order Perciformes and the family Cichlidae within the genus Parachromis. This species is native to rivers and lakes in Nicaragua, Honduras, Costa Rica, and other Central American countries1. P. managuensis is a predominantly carnivorous, bottom-dwelling fish that helps regulate the overpopulation of smaller fish in freshwater ecosystems. It is recognized for its rapid growth, tolerance to low oxygen levels, high yield, strong disease resistance, and excellent meat quality2. These attributes have contributed to its increasing prominence as an economic fish species, leading to a significant expansion in its aquaculture. It is therefore regarded as an important tropical fish.

Research on P. managuensis at the molecular level is limited. In 2015, the mitochondrial genome of P. managuensis was sequenced3. Subsequently, in 2018, investigations into the adaptive evolution of P. managuensis at the transcriptomic level were conducted4. However, studies of its genome have not yet been reported, which significantly hinders the advancement of breeding programs and the development of desirable aquacultural traits. Recent advancements in sequencing technologies have markedly improved genomic research. Notably, Pacific BioSciences (PacBio)‘s Circular Consensus Sequencing (CCS) mode, available on the Sequel II platform, offers long read lengths (10–20 kb) and high accuracy (over 99%), thereby greatly benefiting de novo assembly studies of both plant and animal genomes5,6. High-quality reference genome sequences serve as the foundational prerequisite for advancing both fundamental and applied research in plants and animals. For example, the recent cat genome study, AnAms1.0, revealed over 1,600 novel protein-coding genes when compared to the previously utilized felCat9 assembly7. Furthermore, AnAms1.0 exhibits a higher count of structural variants and intact olfactory receptor genes compared to felCat9. These advancements play a crucial role in the discovery of novel genetic traits and in fostering advancements in veterinary medicine.

In this study, we report a chromosome-level genome assembly of P. managuensis by combining short sequencing reads, PacBio high-fidelity (HiFi) long reads, and chromosomal conformational capture (Hi-C) sequencing data. This high-quality assembled genome not only facilitates population genetic research and evolutionary analysis of P. managuensis but also provides important resources for optimizing genetic breeding.

Methods

Sampling, DNA and RNA extraction

Samples of P. managuensis were collected from Dagun Village, Lancang Lahu Autonomous County, Pu’er City, Yunnan Province (coordinates: E 100°27′95.11″, N 22°93′99.27″). Tissue samples were promptly collected, snap-frozen in liquid nitrogen, and then stored at −80 °C. DNA and RNA extraction, library construction, and sequencing in this study were performed using standard experimental and analytical protocols provided by NextOmics Biosciences (Wuhan, China).

Long read DNA preparation and sequencing

A total of 8 μg of high-quality genomic DNA was extracted from muscle tissue using a Qiagen DNeasy Blood and Tissue Kit (Qiagen, USA) according to the manufacturer’s instructions. The quality and concentration of the extracted DNA were assessed using a NanoDrop One spectrophotometer (Thermo Scientific, USA) and 1% agarose gel electrophoresis. PacBio long insert libraries were prepared using the SMRTbell Express Template Prep Kit 2.0 according to manufacturers’ instructions, with an insert size of approximately 20 kb. The libraries were sequenced on the PacBio Sequel II system in CCS mode. Subreads were processed with SMRTLink (v11.1.0)8 using the parameters “–minPasses 3 –minPredictedAccuracy 0.99 –minLength 500”, producing approximately 21.70 Gb HiFi reads with an N50 size of 17,831 (Table 1). The parameter “minPredictedAccuracy” set to 0.99 in the context of PacBio SMRTLink software means that, during the data processing of sequencing reads, only those reads that have a predicted accuracy of 99% or higher will be retained for further analysis. This filtering step helps ensure that only high-quality reads, with a high likelihood of correct base calls, are included in the downstream processes, thereby improving the overall accuracy of the genome assembly.

Table 1 Summary of DNA sequencing data of P. managuensis genome.
Full size table

Short read DNA preparation and sequencing

The extracted DNA (~5 μg) was randomly sheared into approximately 350 bp fragments, and a short fragment library was constructed using the MGIEasy Universal DNA Library Prep Set (MGI, China). Sequencing was conducted on the MGISEQ T7 platform (MGI, China), resulting in a total of 63.45 Gb of short sequencing reads, each 150 bp in length (Table 1).

Hi-C DNA library preparation and sequencing

A Hi-C library was generated using the DpnII restriction enzyme (GrandOmics, China). Muscle tissue samples were treated with 1% formaldehyde at room temperature for 10–30 minutes to crosslink chromatin-interacting proteins. Subsequently, the DNA was digested with the restriction enzyme, and the 5′ overhangs were repaired with a biotinylated residue. A paired-end library with insert sizes of approximately 300 bp was prepared and then sequenced on the Illumina NovaSeq platform (Illumina, USA). A total of 151.09 Gb of clean data was obtained from 151.28 Gb of sequencing data using the software fastp (v0.19.5)9 with parameters “-w 20 –length_required 100” (Table 1). The “-w” parameter specifies the number of threads, while “–length_required 100” ensures that reads shorter than 100 bp are discarded.

RNA library preparation and sequencing

For the purpose of RNA sequencing, we extracted total RNA from gill and ovary tissues using the TRIzol reagent (Invitrogen, USA) following the manufacturer’s protocol. RNA purity was assessed with a NanoPhotometer spectrophotometer (IMPLEN, CA, USA), while RNA concentration was quantified using the Qubit RNA Assay Kit with a Qubit 2.0 Fluorometer (Life Technologies, CA, USA). RNA-seq libraries were prepared using the TruSeq Stranded mRNA Library Prep Kit (Illumina, USA) according to the manufacturer’s instructions. Sequencing was performed on a HiSeq X-Ten instrument, generating 150 bp paired-end reads.

Genome size estimation

The genome size of P. managuensis was estimated through K-mer profiling. First, raw short sequencing reads underwent quality control using fastp (v0.19.5)9. Subsequently, a K-mer depth distribution curve was generated with Jellyfish (v1.1)10 using clean short sequencing reads. Base on the method described in the orange-spotted grouper study11, the genome size, G, can be calculated from the formula G = K_num/K_depth, where the K_num is the total number of K-mer, and K_depth denotes the frequency occurring more frequently than the other frequencies. In this study, K was 21, K_num was 52,215,706,815 and K_depth was 56. Therefore, the genome size of P. managuensis was estimated to be 932,423,335 bp (Fig. 1). The estimated heterozygosity rate was about 0.15% and the repeat content was approximately 29.31%, suggesting that it is not a complex genome.

Fig. 1
Chromosome-level genome assembly of Jaguar guapote (Parachromis manguensis) by massive parallel sequencing

K-mer distribution of P. managuensis genome sequencing reads. The X-axis is the depth of K-mer derived from the sequenced reads, and the Y-axis is the frequency of the K-mer depth.

Full size image

De novo assembly and Hi-C assembly

The primary contigs were assembled from HiFi reads using Hifiasm (v0.16.1)12 with default parameters. Three rounds of error correction were performed with NextPolish (v1.4.1)13 with both HiFi reads and short reads. This process resulted in 346 contigs, a total size of 896.65 Mb, and an N50 length of 5.14 Mb. To anchor contigs onto chromosomes, BWA (v0.7.12)14 was used to align the Hi-C clean data to the assembled contigs. Low-quality reads were filtered using the HiC-Pro pipeline15 with default parameters. The remaining valid reads were employed to anchor chromosomes using Juicer16 and the 3d-dna pipeline17, followed by manual correction with Juicebox (v2.13.07)18. Ultimately, approximately 99.10% of the contig sequences were anchored to 24 pseudochromosomes, with only 55 contigs remaining unanchored to chromosomes (Table 2 and Fig. 2). The longest and shortest pseudochromosomes were 56.13 Mb and 26.86 Mb in length, respectively (Table 3). The final genome assembly harbored a total size of 896.66 Mb with a scaffold N50 of 38.19 Mb (Table 2).

Table 2 Summary statistics of P. managuensis assembly.
Full size table
Fig. 2
figure 2

Hi-C assembly of chromosome interactive heat map. The abscissa and ordinate represent the order of each bin on the corresponding chromosome group. The colour block illuminates the intensity of interaction from yellow (low) to red (high).

Full size image
Table 3 Pseudo-chromosome length statistics after Hi-C assisted assembly.
Full size table

Annotation of repetitive sequences

Tandem repeats and interspersed repeats were identified in this study. Tandem Repeat Finder (v4.10.0)19 and GMATA (v2.2.1)20 were used to discern tandem repeats with default parameters. The analysis revealed that simple sequence repeats (SSRs) account for 0.16% of the total genome length, while tandem repeat sequences comprised 0.69% of the genome length (Table 4). Subsequently, RepeatModeler (v1.0.4)21 and MITE-hunter22 were employed to construct a de novo repeat sequence library, which was then used to detect interspersed repeats and low-complexity sequences with RepeatMasker (v4.0.7)23. DNA transposable elements (TEs) were identified through RepeatMasker (v4.0.7). In detail, a total of 325.58 Mb (~36.31%) of repetitive sequences were obtained (Table 4). Among the interspersed repeats, DNA elements were the most prevalent type, accounting for 15.64% of the genome (Table 4).

Table 4 Statistics of repetitive sequences in P. managuensis assembly.
Full size table

Protein-coding genes prediction and function assignment

Gene prediction was performed using a multifaceted approach incorporating transcriptome-based, homology-based, and ab initio methods. For the transcriptome-based prediction, a total of 55.85 Gb of RNA-seq clean reads were aligned to the P. managuensis assembly using STAR (v2.7.3a)24 (Table 5). Stringtie (v1.2.2)25 was then utilized to assemble transcripts based on the alignment results. Afterwards, these assembled transcripts were aligned against the P. managuensis assembly using Program to Assemble Spliced Alignment (PASA) (v2.4.1)26. This process led to the identification of 22,712 genes, which were designated as the “RNAseq-set”. Genomes from five fish species were retrieved from the National Center for Biotechnology Information (NCBI) database for homology-based gene prediction. The species included Oreochromis niloticus (GCF_001858045.2), Maylandia zebra (GCF_000238955.4), Neolamprologus brichardi (GCF_000239395.1), Astatotilapia calliptera (GCF_900246225.1), and Archocentrus centrarchus (GCF_007364275.1). O. niloticus is a well-studied fish with abundant genomic information and holds an important position in the field of fish biology and genomics research. It can provide a broad foundation of reference sequences and functional information for gene annotation of P. managuensis. M. zebra, N. brichardi, A. calliptera, and A. centrarchus all belong to the Cichlidae family and have a relatively close phylogenetic relationship with P. managuensis. These genomes were used as queries to search against the P. managuensis assembly using GeMoMa (v1.9)27, yielding 30,878 genes. Homology predictions were denoted as “Homology-set”. For ab initio prediction, 3,000 high-quality genes from PASA were randomly selected as the training set for model training with AUGUSTUS (v3.2.3)28. AUGUSTUS (v3.2.3)28 was then employed to predict coding regions in the repeat-masked genome, identifying 26,490 genes. The gene models generated by AUGUSTUS were labeled as the AUGUSTUS-set. Finally, all gene models were integrated using EvidenceModeler (v2.1.0)29. The weight of each evidence was set as follows: RNAseq-set > Homology-set > AUGUSTUS-set. The final comprehensive gene set comprised 24,145 genes, with an average of 10.53 exons per gene, an exon length of 167.11 bp, and a coding sequence (CDS) length of 1,760.01 bp. Circos (v0.69)30 was used to visualize the 24 pseudochromosomes, GC content, repetitive sequence density, and gene density (Fig. 3). The values for GC content were determined using SeqKit2 (v2.9.0)31. For the calculation of repetitive sequence density and gene density, bedtools (v 2.29.2)32 was employed and a window size of 1 Mb was set.

Table 5 Summary of RNAseq sequencing data of P. managuensis genome.
Full size table
Fig. 3
figure 3

Circos plot showing the features of the assembled P. managuensis genome. From the inside to the outside, GC content, repetitive sequences density, and gene density, respectively.

Full size image

The integrated gene set was translated into amino-acid sequences and annotated using six public databases. Briefly, amino-acid sequences were aligned to EuKaryotic Orthologous Groups (KOG)33, SwissProt34, Kyoto Encyclopedia of Genes and Genomes (KEGG)35, and the NCBI nonredundant database (NR) using the BLSAT program (v2.7.1+)36 with an E-value cutoff of 1e-05. Protein domains were identified using the InterProScan (v5.30)37 program, and Gene Ontology (GO) terms for each gene were also extracted through InterProScan. Overall, 23,476 (97.23%) of the predicted protein-coding genes were functionally annotated (Fig. 4).

Fig. 4
figure 4

Venn diagram showing the number of genes with homology or functional classification by each method.

Full size image

Data Records

All the raw sequencing data utilized in this study were submitted to the NCBI Sequence Read Archive database with accession numbers SRP52769238 under BioProject number PRJNA1150327. The genome assembly has been deposited GenBank under the WGS accession GCA_040437545.139. Additionally, the genome assembly and annotation were deposited at Figshare40.

Technical Validation

Genome assembly and gene prediction quality assessment

The precision and integrity of the P. managuensis assembly were evaluated using several methods. We utilized Merqury (v1.3)41 with PacBio HiFi long reads, employing a K-mer value of 17-bp to infer the consensus quality value (QV). Merqury assesses genome completeness by first defining reliable K-mers based on K-mer count histograms to set a threshold. Then, it calculates the fraction of these reliable K-mers in the read set that are also in the assembly. The results revealed a QV of 50.95 and a genome completeness of 98.28% (Table 2). Next, minimap2 (v2.24-r1122)42 and BWA (v0.7.12)14 were employed for aligning the HiFi reads and short sequencing clean reads to the P. managuensis assembly, respectively. Notably, 99.99% of HiFi reads and 99.51% of short sequencing reads were aligned to the genome. Referring to the genome study of Epinephelus coioides11 and Neosalanx taihuensis43, the completeness evaluation of the P. managuensis assembly was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO) (v5.5.0)44 with the actinopterygii_odb10 database. Results showed that 98.85% of the expected actinopterygii genes (including 98.05% single and 0.80% duplicated ones) had complete gene coverage, while 0.44% were identified as fragmented (Fig. 5). Furthermore, BUSCO analysis was also conducted to validate the genome annotation quality, which indicated that 95.80% of recognized BUSCOs, comprising 94.89% single-copy and 0.91% duplicated genes, were complete (Fig. 5). In conclusion, the P. managuensis genome assembly has achieved high quality.

Fig. 5
figure 5

BUSCO assessments of P. managuensis genome and gene sets.

Full size image

Related Articles

Energy metabolism in health and diseases

Energy metabolism is indispensable for sustaining physiological functions in living organisms and assumes a pivotal role across physiological and pathological conditions. This review provides an extensive overview of advancements in energy metabolism research, elucidating critical pathways such as glycolysis, oxidative phosphorylation, fatty acid metabolism, and amino acid metabolism, along with their intricate regulatory mechanisms. The homeostatic balance of these processes is crucial; however, in pathological states such as neurodegenerative diseases, autoimmune disorders, and cancer, extensive metabolic reprogramming occurs, resulting in impaired glucose metabolism and mitochondrial dysfunction, which accelerate disease progression. Recent investigations into key regulatory pathways, including mechanistic target of rapamycin, sirtuins, and adenosine monophosphate-activated protein kinase, have considerably deepened our understanding of metabolic dysregulation and opened new avenues for therapeutic innovation. Emerging technologies, such as fluorescent probes, nano-biomaterials, and metabolomic analyses, promise substantial improvements in diagnostic precision. This review critically examines recent advancements and ongoing challenges in metabolism research, emphasizing its potential for precision diagnostics and personalized therapeutic interventions. Future studies should prioritize unraveling the regulatory mechanisms of energy metabolism and the dynamics of intercellular energy interactions. Integrating cutting-edge gene-editing technologies and multi-omics approaches, the development of multi-target pharmaceuticals in synergy with existing therapies such as immunotherapy and dietary interventions could enhance therapeutic efficacy. Personalized metabolic analysis is indispensable for crafting tailored treatment protocols, ultimately providing more accurate medical solutions for patients. This review aims to deepen the understanding and improve the application of energy metabolism to drive innovative diagnostic and therapeutic strategies.

Progressively smaller glacier lake outburst floods despite worldwide growth in lake area

Glacier lake outburst floods (GLOFs) may originate from larger lakes as these continue to grow with ongoing glacier retreat; however, this perception remains poorly supported in available GLOF databases. Here we mapped the areas of 1,686 glacier lakes, just before they drained, across 13 glaciated mountain regions outside polar regions and examined the trends in pre-GLOF lake areas between 1990 and 2023. We found that pre-GLOF lake areas barely changed, or even decreased, regionally through time, even as the total lake area, and thus hazard potential, grew overall. This counterintuitive finding reflects limits to growing GLOF magnitudes, such as the decoupling of lakes and parent glaciers, the development of wide, low-gradient outlets or human management. Across all regions, pre-GLOF lake areas depend on a few ice-dammed lakes, which have produced ten times more reported outbursts and ten times larger outbursts than moraine- and bedrock-dammed lakes. The latter two dam types will continue to impound growing amounts of meltwater, thereby accounting for most of the overall GLOF hazard potential, while ice-dammed lakes will shrink with deglaciation. As these lake types will evolve differently in the twenty-first century, we call for customized simulations of GLOF outflows and impacts, given the growing exposure of critical infrastructure.

Engineered mitochondria in diseases: mechanisms, strategies, and applications

Mitochondrial diseases represent one of the most prevalent and debilitating categories of hereditary disorders, characterized by significant genetic, biological, and clinical heterogeneity, which has driven the development of the field of engineered mitochondria. With the growing recognition of the pathogenic role of damaged mitochondria in aging, oxidative disorders, inflammatory diseases, and cancer, the application of engineered mitochondria has expanded to those non-hereditary contexts (sometimes referred to as mitochondria-related diseases). Due to their unique non-eukaryotic origins and endosymbiotic relationship, mitochondria are considered highly suitable for gene editing and intercellular transplantation, and remarkable progress has been achieved in two promising therapeutic strategies—mitochondrial gene editing and artificial mitochondrial transfer (collectively referred to as engineered mitochondria in this review) over the past two decades. Here, we provide a comprehensive review of the mechanisms and recent advancements in the development of engineered mitochondria for therapeutic applications, alongside a concise summary of potential clinical implications and supporting evidence from preclinical and clinical studies. Additionally, an emerging and potentially feasible approach involves ex vivo mitochondrial editing, followed by selection and transplantation, which holds the potential to overcome limitations such as reduced in vivo operability and the introduction of allogeneic mitochondrial heterogeneity, thereby broadening the applicability of engineered mitochondria.

Tissue macrophages: origin, heterogenity, biological functions, diseases and therapeutic targets

Macrophages are immune cells belonging to the mononuclear phagocyte system. They play crucial roles in immune defense, surveillance, and homeostasis. This review systematically discusses the types of hematopoietic progenitors that give rise to macrophages, including primitive hematopoietic progenitors, erythro-myeloid progenitors, and hematopoietic stem cells. These progenitors have distinct genetic backgrounds and developmental processes. Accordingly, macrophages exhibit complex and diverse functions in the body, including phagocytosis and clearance of cellular debris, antigen presentation, and immune response, regulation of inflammation and cytokine production, tissue remodeling and repair, and multi-level regulatory signaling pathways/crosstalk involved in homeostasis and physiology. Besides, tumor-associated macrophages are a key component of the TME, exhibiting both anti-tumor and pro-tumor properties. Furthermore, the functional status of macrophages is closely linked to the development of various diseases, including cancer, autoimmune disorders, cardiovascular disease, neurodegenerative diseases, metabolic conditions, and trauma. Targeting macrophages has emerged as a promising therapeutic strategy in these contexts. Clinical trials of macrophage-based targeted drugs, macrophage-based immunotherapies, and nanoparticle-based therapy were comprehensively summarized. Potential challenges and future directions in targeting macrophages have also been discussed. Overall, our review highlights the significance of this versatile immune cell in human health and disease, which is expected to inform future research and clinical practice.

The role of rivers in the origin and future of Amazonian biodiversity

The rich biodiversity of Amazonia is shaped geographically and ecologically by its rivers and their cycles of seasonal flooding. Anthropogenic effects, such as deforestation, infrastructure development and extreme climatic events, threaten the ecological processes sustaining Amazonian ecosystems. In this Review, we explore the coupled evolution of Amazonian rivers and biodiversity associated with terrestrial and seasonally flooded environments, integrating geological, climatic, ecological and genetic evidence. Amazonia and its fluvial environments are highly heterogeneous, and the drainage system is historically dynamic and continually evolving; as a result, the discharge, sediment load and strength of rivers as barriers to biotic dispersal has changed through time. Ecological affinities of taxa, drainage rearrangements and variations in riverine landscape caused by past climate changes have mediated the evolution of the high diversity found in modern-day Amazonia. The connected history of the region’s biodiversity and landscape provides fundamental information for mitigating current and future impacts. However, incomplete knowledge about species taxonomy, distributions, habitat use, ecological interactions and occurrence patterns limits our understanding. Partnerships with Indigenous peoples and local communities, who have close ties to land and natural resources, are key to improving knowledge generation and dissemination, enabling better impact assessments, monitoring and management of the riverine systems at risk from evolving pressures.

Responses

Your email address will not be published. Required fields are marked *