The genetic landscape of autism spectrum disorder in an ancestrally diverse cohort

Introduction
Autism spectrum disorder (ASD) is a collection of neurodevelopmental disorders manifested by impaired social communication, repetitive behaviors, and restricted interests1. In addition to these primary symptoms, individuals with ASD often experience comorbidities like intellectual disability, anxiety, depression, attention disorders, and epilepsy2. About 1 in 36 children has been identified with ASD according to the latest estimates from CDC’s Autism and Developmental Disabilities Monitoring (ADDM) Network3.
ASD etiology includes a substantial genetic component, with a large population-based study including 2 million individuals suggesting that approximately 80% of the variation in the phenotype is attributable to genetic factors4. Recent genetic analyses have uncovered that rare variations disrupting gene function, identified through whole exome and whole genome sequencing, have large effect sizes on the disorder5,6,7. However, the genetic variants identified to date only account for a small fraction of the overall disease burden8, and each of the currently known ASD genes accounts for less than ~2% of cases9. Although hundreds of ASD susceptibility genes have been identified, research suggests that there may be 400–1000 genes associated with ASD susceptibility10,11. Thus, fully understanding the genetic architecture of ASD will require continuous efforts to sequence samples from ASD cohorts. Importantly, the majority of studies are focused on single ancestries—most frequently European ancestry—which limits genetic discovery, introduces bias, and misses ancestry-specific effects, reducing generalizability.
We enrolled a modest familial ASD cohort from diverse ancestral backgrounds and performed whole exome sequencing (WES) on a total of 754 individuals from 195 families, including 222 probands with ASD and their family members without ASD. We focused on spontaneous and inherited rare deleterious variants as pathogenic candidates. In total, we identified 92 potentially pathogenic variants in 73 genes that have been previously implicated in ASD or other neurodevelopmental disorders, and 158 potentially pathogenic coding variants in 120 candidate ASD genes. We also identified 34 copy number variants (CNVs) in all individuals with ASD that overlap with known loci. Through this study in a multi-ancestral ASD cohort, we identified potentially pathogenic variants in known ASD or neurodevelopmental disease genes enriched for nervous system development and neurogenesis and novel genes enriched for regulation of signal transduction. Our study underscores the significance of genetic diversity in ASD research and highlights the roles of the identified genes in brain development.
Results
Clinical characteristics of the ASD cohort
A total of 195 simplex and multiplex families who have at least one child diagnosed with ASD were enrolled in our study (Supplementary Data 1). The enrolled families represent diverse ancestral backgrounds, including African American, Asian, Hispanic, Middle Eastern, Native American, and European (Fig. 1A). We used principal component analysis (PCA) to explore the ancestry of the families in the cohort (Fig. 1B). Our cohort clustered across the different subpopulations of the 1000 Genomes project (1000G)12. Given that our cohort does not comprise a specific population, this finding is consistent with expectations. The cohort included a total of 222 individuals with ASD and their family members without ASD (165 fathers, 188 mothers, 5 grandmothers, and 174 siblings), and we observed a male-to-female ratio of 2.7:1 (162 males, 60 females) among individuals with ASD. This is slightly lower than the more recent estimates of ~3:113,14 or previous estimates of ~4:113. Parental age, which is a possible risk factor for ASD15, was not significantly different at the time of birth of individuals with ASD compared to offspring with no ASD (Supplementary Fig. 1). A standardized medical questionnaire was collected from each of the 195 participating families and reviewed along with available medical records for the presence of clinical comorbidities commonly associated with ASD and other neurodevelopmental disorders, including attention deficit/hyperactivity disorder (ADHD), language delay or impairment, cognitive impairment including intellectual disability, specific learning disability, aggression or challenging behaviors, mood disorders (i.e., anxiety, depression, obsessive-compulsive disorder (OCD), bipolar disorder), seizures, and sleep problems. There were 222 individuals diagnosed with ASD and 532 participants without ASD. Of those individuals with ASD where complete information for a specific phenotype was available, 91.72% had language impairment, 83.21% had developmental delay, 71.31% had learning disability, 65.81% had behavioral problems, 49.55% had ADHD, 49.54% had intellectual disability, 27.45% had seizures, and 25% had OCD (Fig. 1C). Other medical comorbidities were seen at lower frequencies, including environmental and food allergies, and respiratory, gastrointestinal, and vision problems. Demographics and clinical information for the cohort are provided in Fig. 1, Table 1 and Supplementary Data 1.

A Pie chart depicting the ancestral diversity of the ASD cohort. Multiple refers to individuals with multiple ancestries. B Principal component analysis (PCA) of the ASD cohort samples combined with the 1000G populations, using the entire ASD cohort (left), the pedigree founders (middle), or the unrelated probands (right). The ASD cohort is represented in yellow. The 1000G populations are: ACB African Caribbeans in Barbados, ASW Americans of African Ancestry in Southwest USA, ESN Esan in Nigeria, GWD Gambian in Western Divisions in Gambia, LWK Luhya in Webuye, Kenya, MSL Mende in Sierra Leone, YRI Yoruba in Ibadan, Nigeria, CLM Colombians from Medellin, Colombia, MXL Mexican Ancestry from Los Angeles, USA, PEL Peruvians from Lima, Peru, PUR Puerto Ricans from Puerto Rico, CDX Chinese Dai in Xishuangbanna, China, CHB Han Chinese in Beijing, China, CHS Southern Han Chinese, JPT Japanese in Tokyo, Japan, KHV Kinh in Ho Chi Minh City, Vietnam, CEU Utah Residents (CEPH) with Northern and Western European Ancestry, FIN Finnish in Finland, GBR British in England and Scotland, IBS Iberian Population in Spain, TSI Toscani in Italia, BEB Bengali from Bangladesh, GIH Gujarati Indian from Houston, Texas, ITU Indian Telugu from the UK, PJL Punjabi from Lahore, Pakistan, STU Sri Lankan Tamil from the UK. Population abbreviations are also defined in Supplementary Data 11. C The prevalence of neurodevelopmental and neuropsychiatric conditions in the ASD cohort. ASD was diagnosed in all 222 probands (100%). Language impairment was the most commonly reported phenotype (91.72%).
Whole exome sequencing and variant discovery in the ASD cohort
We performed WES on samples from 754 individuals, including 222 individuals with ASD. The average read depth was 46X, with no differences in depth of sequencing with respect to phenotypic status, sex, or family relationships (Supplementary Fig. 2A–C). On average, 99.29% and 93.9% of bases were covered at a mean read depth of at least 10X and 20X, respectively (Supplementary Fig. 2D). An average of 86,215 total variants were identified per exome, of which an average of 73,132 were single nucleotide variants (SNVs) and 13,083 were insertions or deletions (indels) (Supplementary Data 2). After applying read depth and quality filters, 77,075 variants per exome remained, of which an average of 65,907 were SNVs and 11,168 were indels (Supplementary Data 2). A detailed summary of our WES data processing and variant filtration pipeline is shown in Fig. 2. We filtered for rare variants with a minor allele frequency (MAF) < 1% in all annotated population databases ((1000G)12, Genome Aggregation Database (gnomAD)16,17, the Greater Middle East Variome project (GME)18, and The Exome Aggregation Consortium (ExAC)19), identifying on average 8433 rare variants per exome, of which 7002 were heterozygous and 1431 were homozygous (Supplementary Data 2). We defined potentially damaging variants as the subset of rare exonic or splice site (referred to as coding) variants that are also predicted to be damaging by at least 1 of the 2 algorithms used: SIFT and PolyPhen-2 HumVar. There was no significant difference in the number of potentially damaging variants between sexes for individuals with ASD in the cohort (Supplementary Fig. 3). To assess for an excess of potentially damaging variants in individuals with ASD compared to individuals without ASD, we performed a burden analysis. We found no difference between individuals with or without ASD in the burden of rare variants with total coding, nondisrupting, missense damaging, or loss of function effects (Supplementary Fig. 4). This outcome is expected, given our modest sample size and the fact that ASD comprises individually rare diseases with genetic heterogeneity, caused by rare alleles of substantial impact. Therefore, observing an excess of these variations requires studying much larger cohorts capable of capturing this heterogeneity. We discovered an average of 5959 novel variants per exome that have not been reported in any of the populations in the public databases that we used for annotation (Supplementary Data 2). Furthermore, we found an average of 52 novel variants per individual that were private (71 for parents, 34 for offspring), meaning they have not been reported in any of the annotated populations and they were not present in any other individual in the cohort (Supplementary Data 3). In total, there were 38,834 novel private variants across all individuals in the cohort (Supplementary Data 3). As expected, more private variants were present in parents compared with offspring (Supplementary Fig. 5). We identified an average of 15 (20 for parents, 9 for offspring) private coding variants per exome, of which an average of 6 (8 for parents, 4 for offspring) per exome were nonsynonymous and predicted to be potentially damaging by at least 1 of the 2 algorithms used, SIFT and PolyPhen-2 HumVar (Supplementary Data 3).

Whole exome sequencing (WES) was performed on 754 individuals from 195 families, including 222 probands with ASD and their family members without ASD (165 fathers, 188 mothers, 5 grandmothers, and 174 siblings). Single nucleotide variants (SNVs) and small insertions or deletions (indels) were called using DeepVariant. Variant quality filtering was performed as described in the Materials and Methods. Rare de novo or inherited (X-linked, homozygous, and compound heterozygous) variants were annotated to identify potentially pathogenic variants. Risk genes were prioritized by disease annotation, specific expression, and pathway enrichment. MAF minor allele frequency. This figure was created with BioRender.com.
Identification of candidate ASD variants
For candidate ASD variant discovery, we initially focused on rare nonsynonymous exonic or splice site variants that were either de novo or segregated with ASD in the family under homozygous, compound heterozygous, or X-linked inheritance. We identified an average of 4 de novo variants (2 coding) per offspring with ASD (Supplementary Data 4). In addition, we identified an average of 155 inherited homozygous variants (38 coding) and 10 compound heterozygous variants in 3 genes per offspring with ASD (Supplementary Data 4). We also identified an average of 16 recessive X-linked variants in male offspring with ASD (8 coding) (Supplementary Data 4). We did not find a significant correlation between the number of de novo variants and maternal or paternal age at birth of an offspring with ASD (Supplementary Fig. 6). In total, we identified 630 genes harboring 1503 rare nonsynonymous exonic or splice site variants that are predicted to be potentially damaging by at least 1 of the 2 algorithms used, SIFT and PolyPhen-2 HumVar (Supplementary Data 5). The shared symptoms among individuals with ASD suggest the existence of a functional convergence downstream of loci that contribute to the condition. To investigate if there is selective expression of at least some of these 630 genes in different brain regions, we conducted specific expression analysis (SEA) using human transcriptomics data from the BrainSpan collection20. We found that genes with variants detected in the individuals with ASD in our cohort were enriched in the thalamus (p = 0.014) (Fig. 3 and Supplementary Data 6), including AR, ATP1A3, SCN1A, and SLC7A3.

Bullseye plot of specific expression analysis (SEA) of genes harboring the prioritized variants across brain regions and development. SEA revealed that genes with possibly damaging variants detected in the ASD cohort were enriched during young adulthood in the thalamus. The color bar shows Benjamini–Hochberg corrected p.
Variants in known ASD or neurodevelopmental disease genes
Table 2 summarizes the potentially pathogenic variants in 73 known ASD or neurodevelopmental disease genes for each individual with ASD after variant prioritization. Out of these genes, 40 are reported in the Simons Foundation Autism Research Initiative (SFARI) Gene database21, and the rest are OMIM-annotated disease genes associated with relevant phenotypes, including neurodevelopmental disorder, intellectual disability, developmental delay, and epilepsy. These genes were significantly enriched in pathways involving nervous system development, neurogenesis, and neuronal differentiation (Supplementary Data 7). We identified 92 unique variants in 68 individuals with ASD (~1–3 per individual). Twenty-six individuals with ASD had coding variants in 19 syndromic ASD genes: CDKL5 (3 probands), DMD (3 probands), BCORL1 (2 probands), and SETD1B (2 probands). ARID1B, ATP1A3, CHAMP1, CNOT1, FRMPD4, HUWE1, KAT6A, KMT2C, MECP2, PACS2, PHF21A, SCN1A, SLC6A1, SMARCA2, TFE3, and ZMYM3 are other syndromic ASD genes harboring variants in single probands. Twenty-three individuals with ASD had coding variants in 21 nonsyndromic ASD genes having a SFARI Gene21 score of 1 or 2: NEXMIF (2 probands) and NLGN4X (2 probands). AR, ARHGEF10, ASTN2, AUTS2, BIRC6, CACNA1F, DLG4, DYNC1H1, IL1RAPL1, ITPR1, OPHN1, PCDHA5, SKI, SLC7A3, SYN1, TOP2B, WNK3, YEATS2, and ZC3H4 are other ASD genes harboring variants in single probands. Thirty-two probands had other coding variants in 33 neurodevelopmental disease genes, with 2 genes—ADGRV1 and ATP7A—having variants in 2 probands each. ACSL4, ARHGAP31, ARMC9, ATP2B3, ATP6AP2, BCAP31, CCDC22, CHD5, DBR1, DCTN1, DHX37, FGD1, HDAC6, IGBP1, KIF1C, MINPP1, MPDZ, NOTCH1, NRG1, OBSL1, PIGG, PLXNA1, SAMD9L, SCN3A, SLC13A3, SRPX2, TMEM151A, TNRC6A, TRIM71, TRNT1, and ZNF148 are other neurodevelopmental disease genes harboring variants in single probands. Three probands had coding variants in two neurodevelopmental genes each: MC-159-5 (ADGRV1 and KIF1C), MC-161-3 (MPDZ and NRG1), and MC-172-3 (OBSL1 and SAMD9L).
Variants in new candidate ASD genes
We identified 158 potentially pathogenic coding variants in 120 candidate ASD genes after variant prioritization (Table 3). Gene ontology analysis revealed that several of the candidate ASD genes are involved in signal transduction and synaptic activity such as DLG3, GABRQ, KALRN, KCTD16, P2RX4, PKP4, SLC8A3, and TENM2 (Supplementary Data 7). Multiple variants were observed in candidate genes: ATG4A, CNGA2, CROCC, FAM47C, FRMPD3, GABRQ, GPRASP1, MAGEC3, MXRA5, OR5H1, PWWP3B, SLITRK4, TRPC5, TSPYL2, and ZNF630. Since we observed more than one potentially pathogenic variant (in known and/or novel genes) in some probands, we also ranked them according to their likelihood of causing the disease in the proband (Supplementary Data 8). In proband MC-017-3, there were two variants found in SCN1A and RBMX2. The SCN1A variant was prioritized over the RBMX2 variant as SCN1A is a known ASD gene, according to the SFARI Gene database21. Similarly, in proband MC-174-3, a variant in HUWE1, a known neurodevelopmental disease gene22,23, was ranked above a variant in another known neurodevelopmental disease gene ATP6AP224,25 based on AlphaMissense scores, and above a variant in the novel gene MTM1.
Copy number variant analysis
Since CNVs are known to play an important role in ASD26, we analyzed CNVs in the ASD cohort. We called CNVs in individuals with ASD using individuals from the cohort who did not have ASD as controls, utilizing CNVkit27. In total, we identified 539 CNVs across all individuals with ASD, including 276 deletions and 263 duplications (Supplementary Data 9 and 10). The average size of a CNV was 243 kb, and there were 15 CNVs encompassing regions that did not include any genes. Out of the identified CNVs, 34 overlapped with known ASD CNVs as defined by the SFARI Gene database21, including the 3q29, 17p11.2, and 22q13.3 loci. Of the called CNVs, 23 also overlapped with syndromic CNVs from the DECIPHER database28. Some of these syndromes, such as Potocki-Lupski syndrome29 and Smith-Magenis syndrome30, are associated with neurodevelopmental phenotypes. Although our data demonstrate an overlap between CNVs and specific genomic regions, this does not imply that the CNVs are causal. Further investigation is needed to establish the pathogenicity of these variants.
Discussion
We performed WES in a modest familial cohort consisting of 754 individuals from 195 families, with at least one child in each family diagnosed with ASD by a neurologist, child psychiatrist, or psychologist. It is important to note that the source of patient ascertainment can introduce bias; for example, recruitment through clinical centers may be skewed towards cases with comorbid conditions31. Furthermore, the difficulty in diagnosing ASD, particularly in patients with severe intellectual disability32, makes it challenging to determine whether the identified variants are exclusively associated with ASD or if they also contribute to broader neurodevelopmental disorders. The families enrolled in the cohort represented diverse ancestral backgrounds, including African American, Asian, Hispanic, Middle Eastern, Native American, and European. Sequencing a diverse cohort offered a broader genetic landscape, reduced bias, captured population-specific alleles, and provided wider global relevance. While our sample size limited in-depth ancestry-specific analyses5,33, future studies with larger samples can expand on this groundwork.
In total we discovered 38,834 novel private variants in the cohort that have not been previously reported. The lack of large public datasets for most of the ancestries represented in our cohort can affect the incidence of observed variants and could contribute to the number of novel private variants detected. We employed a variant filtration and prioritization pipeline that implements established practices in the field and aligns with other large-scale studies6,7,34, including implementing filtering strategies for all inheritance modes, utilizing deleteriousness prediction algorithms, and incorporating gene constraint scores. Due to the modest size of our cohort, we were unable to leverage more sophisticated methods like the Bayesian analysis framework. Our analysis identified 92 potentially pathogenic coding variants in 73 known neurodevelopmental disease genes. The known genes included ASD genes BCORL1, CDKL5, MECP2, and SETD1B, among other neurodevelopmental disease genes (e.g., ADGRV1, ATP7A, CHD5, and SCN3A). In addition, we compared our findings to data from large-scale cohorts6. Out of the 73 genes, we identified overlap with 11 high-confidence ASD genes identified by Fu et al.6, including ARID1B, ATP1A3, AUTS2, DLG4, DYNC1H1, KMT2C, PLXNA1, SCN1A, SKI, SLC6A1, and SMARCA2, strengthening our results. We also identified 158 potentially pathogenic coding variants in 120 candidate ASD genes (e.g., DLG3, GABRQ, KALRN, and NCOR2). For each of our candidate genes, we analyzed published data from Zhou et al.7 to obtain P values and transmission disequilibrium test (TDT) statistic values representing the contribution of de novo and rare inherited loss-of-function variants to ASD risk, respectively. Although the candidate genes did not reach study-wide significance by de novo variant enrichment (requiring p < 0.001), 4 of them—ATF7IP, ATRNL1, HECTD1, and QSER1—passed the Zhou et al.7 TDT filtering step (TDT statistic ≥ 1, within top 20% LOEUF, and A-risk ≥ 0.4). This is unsurprising, given the familial nature of the cohort in this study and the much larger case-control cohort in Zhou et al.7. In addition, 3 of the identified candidate genes—CENPV, HECTD1, and MAP2—overlapped with high-confidence neurodevelopmental disease genes reported by Fu et al.6.
Tables 2 and 3 summarize the variants we identified in each individual with ASD, specifically in known ASD and neurodevelopmental disease genes, as well as in new candidate genes, respectively. Our analysis revealed distinct sets of genes that merit further investigation. Out of 222 individuals with ASD, we identified at least one potentially pathogenic variant in 112 individuals (~50%), out of which 68 individuals have at least one potentially pathogenic variant in a known neurodevelopmental disease gene (~30%). One of the aims of this study was to aid in identifying causative variants in the probands. The broad phenotypic assessment of the probands limited the granularity of our phenotype-genotype correlations. Furthermore, complete phenotype information was not available for all probands. Nevertheless, our findings are consistent with previous reports on the association between mutations in the identified genes and the observed phenotypes in probands, with commonality in language impairment and developmental delay across variants and probands. For example, proband MC-005-3 presented with ASD, seizures, and learning disabilities, in line with phenotypes of patients with pathogenic CDKL5 mutations35. SETD1B mutations have been associated with intellectual developmental disorder with seizures and language delay (MIM # 611055)36,37,38. Probands with variants in SETD1B presented with language impairment (MC-146-3, MC-166-3) and seizures (MC-146-3). For proband MC-124-6, our analysis identified a de novo stopgain mutation in CHAMP1. Mutations in this gene are associated with neurodevelopmental phenotypes, including impaired language and speech (MIM # 616327)39, all of which are present in the proband. MC-106-4 and MC-170-3 have variants in GABRQ, associated with essential tremor and ASD40,41. DLG3 mutations were identified in MC-001-3 and MC-001-4, and have been associated with X-linked intellectual disability42,43. Other interesting genes included HECTD1 (MC-045-3) and HECW1 (MC-161-3), which encode proteins predicted to enable ubiquitin ligase activity44. NCOR2 (with a variant daintified in JC-24-3) encodes a nuclear receptor co-repressor 2 that mediates transcriptional silencing of target genes by promoting chromatin condensation, thus preventing access to basal transcription machinery45,46,47. Sequencing studies in larger cohorts and additional experimental validation will be required to establish causality for the candidate genes that have not been previously linked to ASD.
In conclusion, by sequencing a diverse ASD cohort of individuals from over ten ancestries, this study breaks away from the limitations of single-population analyses and contributes to the ongoing effort of identifying causative genes and variants. While further functional validation is necessary to pinpoint causal variants in probands, these findings provide a valuable roadmap for more targeted future research, which will ultimately deepen our understanding of this spectrum of disorders.
Methods
Subjects and specimens
All human studies were reviewed and approved by the institutional review board (IRB) of the University of Texas Southwestern Medical Center (UTSW), the research committee at the University of Jordan School of Medicine, the ethics committee of the Jordan University Hospital, and the IRB of the Jordan University of Science and Technology. We have complied with all relevant ethical regulations, including the Declaration of Helsinki. Families were primarily recruited from the Dallas Fort Worth area, with some families recruited from Jordan, and written informed consent was obtained from all study participants. Inclusion criteria included a diagnosis of autism spectrum disorder (ASD) by a neurologist, child psychiatrist, or psychologist. Patients with genetically defined syndromes, specifically Fragile X syndrome, Angelman syndrome, Rett syndrome, or Tuberous sclerosis complex, were excluded from study participation. All patients enrolled in the study received a diagnosis of ASD from their referring clinicians, who performed physical and behavioral assessments and administered the following standard ASD diagnostic measures: (1) Autism Diagnostic Observation Schedule, Second Edition (ADOS-2)—a semi-structured, standardized assessment of communication, social interaction, play, and restricted and repetitive behaviors; (2) The Autism Diagnostic Interview-Revised (ADI-R)—this established assessment took ~1.5–3 h to administer, during which an experienced clinical interviewer interviewed a parent or caregiver familiar with the developmental history and current behavior of the individual being evaluated; (3) Diagnostic and Statistical Manual of Mental Disorders (DSM-V). Since the recruitment sources included multiple sites, there may be instances where not all three tests were performed. This, along with inter-site differences, may present potential sources of variance in our study. Blood samples were collected from all available family members by peripheral venipuncture and genomic DNA was isolated from circulating leukocytes using AutoPure (Qiagen, Hilden, Germany) according to the manufacturer’s instructions.
Sample preparation and sequencing
All samples were prepared for sequencing using a custom automated sample preparation workflow developed at the Regeneron Genetics Center (RGC). Genomic DNA libraries were created by enzymatically shearing DNA to a mean fragment size of 200 base pairs using reagents from New England Biolabs. A common Y-shaped adapter (IDT) was ligated to all DNA libraries. Unique, asymmetric 10 base pair barcodes were added to the DNA fragments during library amplification with Kapa HiFi to facilitate multiplexed exome capture and sequencing. Equal amounts of sample were pooled prior to overnight exome/genotype capture with the Twist Comprehensive Exome panel, RGC developed Twist Diversity SNP panel, and additional spike-ins to boost coverage at selected CHIP sites and to cover the mitochondrial genome; all samples were captured on the same lot of oligos. The captured DNA was PCR amplified and quantified by qPCR. The multiplexed samples were pooled and then sequenced using 75 base pair paired-end reads with two 10 base pair index reads on the Illumina NovaSeq 6000 platform on S4 flow cells.
Whole exome sequencing and data processing
Sequencing reads from both exome and genotyping assays in FASTQ format were generated from Illumina image data using bcl2fastq program (Illumina). Following the OQFE (original quality functional equivalent) protocol48, sequence reads were mapped to the human reference genome version GRCh38 using BWA MEM49 in an alt-aware manner, read duplicates were marked, and additional per-read tags were added. For exome data, single nucleotide variants (SNVs) and short insertions and deletions (indels) were identified using a Parabricks accelerated version of DeepVariant v0.10 with a custom WES model and reported in per-sample genome variant call format (gVCF) files. These exome gVCFs were aggregated with GLnexus v1.4.3 using the pre-configured DeepVariantWES setting50 into joint-genotyped multi-sample project-level VCF (pVCF), which was converted to bed/bim/fam format using PLINK 1.951. Depth was calculated using mosdepth52 and coverage was assessed using custom scripts. The percent coverage was calculated as the number of base pair positions sequenced to a given depth divided by the total number of bases sequenced.
VCF files for SNVs and indels were annotated with ANNOVAR53 using allele frequencies from the 1000 Genomes project (1000G)12, the Genome Aggregation Database (gnomAD)16,17, the Greater Middle East Variome project (GME)18, and the Exome Aggregation Consortium (ExAC)19. The variants were also annotated using the Single Nucleotide Polymorphism Database (dbSNP)54, the database of Human Non-synonymous SNVs and Their Functional Predictions and Annotations (dbNSFP)55, and ClinVar56. Annotated VCF files were uploaded into an SQL database for working storage and analysis. Exome data was stored, and analyses were performed on the Texas Advanced Computing Center (TACC) high-performance computing servers, a resource of the University of Texas (Austin, TX).
Variant filtration
Variants having a read depth of ≥ 10 and a genotype quality (GQ) score of ≥ 30 were retained as quality filtered. Rare variants were defined as those with minor allele frequencies (MAF) < 1% in 1000G12, gnomAD v2.116,17, GME18, and ExAC19. When filtering for rare variants, we used the overall population frequency data from the previously mentioned databases. We further refined the analysis by applying the same cutoffs to each sub-population within the dataset as well. Novel variants were defined as variants that are not found in the four aforementioned public datasets. Private variants were defined as novel variants that occurred only in a single individual in our cohort. De novo variants were defined as heterozygous private variants present in individuals with ASD (absent from the exome of the father, the mother, and the sibling(s) when available). To minimize potential false positive de novo calls, we applied additional filtering steps, requiring that de novo variants have the following criteria: (1) GQ ≥ 99, (2) alternate allele depth (AD-Alt) ≥ 10, (3) reference allele depth (AD-Ref) ≥ 10, (4) 0.3 ≤ AD-Alt/read depth (DP) ≤ 0.7, (5) Allele Quality score ≥ 999, (6) length(Alt) ≤ 50 and length(Ref) ≤ 50. Compound heterozygous variants in offspring were defined as inherited heterozygous variants that occurred within the same gene and that were present in heterozygous form in one parent but not the other. All compound heterozygous variants were filtered for AD-Alt ≥ 10, AD-Ref ≥ 10, and 0.3 ≤ AD-Alt/DP ≤ 0.7. Inherited homozygous variants were required to be present in heterozygous form in both the father and the mother, excluding variants that are homozygous in either one of the parents or siblings with no ASD when available, on the assumption of full penetrance. X-linked variants were X chromosome-specific and were required to be present in a male offspring and heterozygous in the mother.
Variant prioritization
Rare variants that are de novo, compound heterozygous, inherited homozygous, or X-linked, were considered to be possibly damaging if they met the following criteria: (1) splice site variants, (2) exonic variants with a predicted protein effect of frameshift indels, nonframeshift indels, stopgain, stoploss, or unknown effect, (3) exonic nonsynonymous SNVs that were predicted to be damaging by at least 1 of the 2 algorithms used: SIFT57,58 and PolyPhen-2 HumVar59. PolyPhen-2 HumVar was chosen over PolyPhen-2 HumDiv because the former is more appropriate for Mendelian variants with drastic effect as we expect for ASD, while the latter is appropriate for common variants of smaller effect size. Possibly damaging variants were compared to the list of genes implicated in ASD from the Simons Foundation Autism Research Initiative (SFARI) Gene 2018 database (using the 2023 Q2 release)21. Variants were also screened for any phenotypic association in the Online Mendelian Inheritance in Man (OMIM) database60. Gene constraint was assessed using pLI, LOEUF, and Z scores from gnomAD v2.116,17. To help assess a variant’s potential pathogenicity, the variants were also annotated with ClinVar data and the number of homozygous carriers in gnomAD v4.116,17. To prioritize candidate disease variants (potentially pathogenic variants), we performed the following steps: (1) If the exact same variant was present in more than one unrelated person, it was excluded; (2) Variants within genes that had a SFARI Gene21 score of 1, 2, or S, or were associated with a neurological phenotype as annotated by OMIM were considered as “known” and the rest were considered as “novel”; (3) Within the “known” and “novel” lists, genes having multiple different variants in different people were prioritized; (4) We prioritized loss-of-function (LoF) variants and nonsynonymous SNVs with high probability of deleteriousness based on scores from prediction tools, including SIFT, PolyPhen-2 HumVar, VEST61,62, CADD63, and phyloP64; (5) We prioritized variants within genes with higher pLI (> 0.5) and lower LOEUF (< 0.5) scores. Steps 3-5 were performed sequentially, therefore, a variant was not required to satisfy all subsequent steps if it passed the initial ones; (6) We filtered out variants with ClinVar significance value as benign or likely benign; (7) We filtered out variants having one or more homozygous carriers in gnomAD v4.116,17. The gene TTN is classified as an ASD gene in the SFARI Gene database21 with a score of 2. However, due to the large size of TTN (coding sequence of 108 kb), we calculated the missense mutation rate for TTN in each of the five probands with prioritized TTN variants (JC-21-3, JC-33-3, MC-014-3, MC-053-3, and MC-061-4) to account for its size. This rate was determined by dividing the total number of base pairs carrying missense mutations in TTN in each proband by the total length of the TTN coding region. Subsequently, we compared this ratio for each proband to the TTN missense mutation rate obtained from gnomAD v4.116,17 (1.23 × 10−5). We found that the TTN missense mutation rate in each of the 5 probands (1.57 × 10−4, 2.50 × 10−4, 2.78 × 10−4, 3.33 × 10−4, and 3.70 × 10−4, respectively) exceeded the gnomAD rate. Consequently, we filtered out the TTN variants from the list of prioritized variants in “known” genes, but they are retained in the list of potentially damaging coding variants (Supplementary Data 5).
Since we observed more than one potentially pathogenic variant (in known and/or novel genes) in some probands, we also ranked them according to their likelihood of causing the disease in the proband. We followed the guidelines issued by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology for the Interpretation of Sequence Variants65. We prioritized variants in known genes over novel genes. Stopgain/stoploss and frameshift variants were ranked over nonsynonymous SNVs, and de novo variants were ranked over other inherited variants. We also annotated the variants with AlphaMissense scores66 and prioritized those with higher scores.
Copy number variant (CNV) analysis
We used CNVkit27 to detect CNVs based on the read depth in ASD samples relative to the average read depth in non-ASD samples in the cohort, using default parameters. Sample MC-064-3 was deemed as an outlier and removed from further analysis for having an unusually high number of CNVs (174 CNVs). The CNV calls segmentation file was filtered to include variants with p < 0.05 and copy number = 0, 1, 3, or 4. Variants were considered deletions if their log2 read depth ratio between the sample and control was ≤ −0.5. Variants were considered duplications if their log2 read depth ratio was ≥ 0.5. If the exact same CNV existed in more than one unrelated proband, it was filtered out. The filtered variants were annotated with known SFARI Gene21 CNVs and DECIPHER28 CNVs. The gnomAD structural variants v4.116,17 frequencies were used to filter out common CNVs with a frequency >1% if the detected CNV completely overlapped with the gnomAD structural variant.
Burden analysis
Nondisrupting variants were defined as exonic synonymous SNVs or exonic nonframeshift indels. The burden of rare LoF and predicted damaging missense variants was analyzed by comparing categories of variants identified in ASD versus non-ASD samples. LoF variants were defined as variants that are exonic or splice site predicted to result in a frameshift indel, a stopgain or stoploss, or splicing error. Missense variants were defined as nonsynonymous exonic or splice site. Missense damaging variants were defined as nonsynonymous SNVs that were predicted to be damaging by at least 1 of the 2 algorithms used: SIFT and PolyPhen-2 HumVar. Comparisons were made between ASD and non-ASD exomes in the above categories for all rare variants.
Principal component analysis
Principal component analysis (PCA) was carried out in PLINK version 1.967 using Phase 3 1000G12 data (populations shown in Supplementary Data 11). PCA input files from our samples were pruned for variants in linkage disequilibrium (LD) with an r2 > 0.2 in a 50 kb window. The LD-pruned dataset was generated using plink –indep-pairwise flag to compute the LD variants. Variants with chromosome mismatches, position mismatches, possible allele flips, and allele mismatches were identified and filtered out. The set of variants that remained was extracted from the 1000 G12 dataset and these were merged with our cohort dataset. PCA was run in PLINK using the –pca flag and the first two principal components were plotted in R. Analysis was performed for the entire cohort, pedigree founders, and probands.
Specific expression analysis
We performed specific expression analysis (SEA) with human transcriptomics data from the BrainSpan collection20 to identify particular human brain regions and/or developmental windows potentially related to ASD pathophysiology along with candidate genes identified in individuals with ASD in this study. For each cell type or brain region, transcripts specifically expressed or enriched were identified at a specificity index (pSI) threshold of pSI < 0.0568. These analyses were performed using the Dougherty lab server (http://genetics.wustl.edu/jdlab/). Lists of candidate genes that overlapped with lists of transcripts enriched in a particular cell type or brain region were finalized using Fisher’s exact test with Benjamini–Hochberg correction. The significance level was set at Q-value < 0.05.
Responses