Multi-omics insights into the molecular signature and prognosis of hypopharyngeal squamous cell carcinoma
Introduction
Head and neck cancer is one of the most prevalent cancers worldwide, with around 870,000 newly diagnosed and 450,000 fatal cases every year1, where nearly 90% of cases are head and neck squamous cell carcinoma (HNSCC)2. HNSCC is heterogeneous in its tumour site, including the lip and oral cavity, larynx, nasopharynx, oropharynx and hypopharynx3.
Hypopharyngeal squamous cell carcinoma (HPSCC) is a subtype of HNSCC defined by its anatomical subsite4,5. Due to the hidden location of the tumour site and its asymptomatic at the earlier stage, two-thirds of HPSCC cases are diagnosed in the advanced stage5,6, leading to poorer overall survival (OS).
Although it is a sporadic cancer accounting only for 2–3% of all HNSCC subtypes7, HPSCC has the worst prognosis in HNSCCs with around 30–35% of the five-year OS rate8. Current therapy for HPSCC includes surgery, radiotherapy, chemotherapy, and biotherapy9. In a retrospective study of 47 advanced HPSCC patients, the response rate to the combined therapy was only 52%10. Nevertheless, to date, there are no effective biological markers to distinguish the high-risk patients requiring aggressive treatments apart from the surgical resection in HPSCC6,9.
The Cancer Genome Atlas (TCGA) and other projects identified four gene expression (GE) subtypes in HNSCC: basal (BA), mesenchymal (MS), atypical (AT), and classical (CL)11,12,13. BA shows high expression of cell cycle and DNA repair genes, MS of epithelial-to-mesenchymal transition and immune response genes, AT of oxidative stress response and metabolic processes genes, and CL of squamous differentiation and oxidative stress response genes11,13,14. These subtypes are linked to different risk factors and patient outcomes11,12,13. However, their existence in HPSCC patients, especially in the Chinese population, remains unreported.
The aetiology of HPSCC is multifactorial, involving tobacco, alcohol, and HPV infection13,15,16. While HPV is linked to HNSCC, especially oropharyngeal squamous cell carcinomas (OSCCs)17,18, its prognostic significance in HPSCC is debated. A study of 1,931 HPSCC patients in the USA identified HPV as a significant risk factor for overall survival (OS)19. However, other studies found no significant association between HPV infection and OS in HPSCC6,20,21,22,23,24. Multi-omics approaches are used to understand molecular alterations and predict immune therapy responses in HNSCC25 and other diseases26,27,28, but they have not been applied to HPSCC. This highlights the need for a comprehensive, multi-dimensional omics study of HPSCC to provide a full molecular landscape of the disease.
In this study, we profiled 103 Chinese HPSCC patients using multi-omics to identify molecular biomarkers and pathways as pathogenesis and prognostic markers. We presented somatic mutation and copy number variation (CNV) landscapes and identified four gene expression subtypes in HPSCC. We found roles for several dysregulated transcriptional and posttranslational modification pathways in tumorigenesis. Additionally, we developed a computational model to predict overall survival (OS) from molecular biomarkers and validated it across two hospitals. Our study offers a comprehensive view of molecular alterations in HPSCC at genomic, transcriptomic, and proteomic levels in non-Western populations, highlighting potential disease mechanisms and therapeutic targets.
Methods
Clinical cohort and assessment
This HPSCC cohort recruited patients who were diagnosed with early- and advanced-stage hypopharyngeal squamous cell carcinoma (HPSCC) for the first time at two hospitals in Beijing and Yunnan. 77 patients were performed tumour resection (74%), with an additional 27 patients receiving resection plus chemotherapy with TPF (Docetaxel, Carboplatin and Fluorouracil) (26%). In total, the HPSCC cohort consisted of 103 patients, ranging from 41 to 77 years in age. There were 3 females and 100 males in this cohort. The patients in Yunnan hospital were from the Yunnan province wide but the patients in Beijing hospital came from all over China. The patients recruited in these two hospitals are all Chinese. Both tumours and the matched normal specimens (from adjacent normal tissue (NAT) or blood) were obtained during the surgical resection and immediately snap frozen and stored. Whole-exome sequencing (WES), RNA sequencing (RNA-Seq), proteome and phosphoproteome sequencing were performed. Follow-up of patients started from the time of sample collection and continued to the most recent time. Detailed information about the patients’ characteristics29 is shown in Table 1.
HPV detection methods
The p16 immunohistochemistry assay was performed by P16/Ki-67 kit (Anbiping). Tumour tissues were embedded by paraffin, fixed by 10% neutral formalin (in phosphate buffer) and sliced to 3–5 μm followed by incubation at 60 °C for 120 min. Tumour tissue slides were then dewaxed and hydrated by dimethylbenzene for 10 min, twice and anhydrous ethanol for 5 min, twice, then by 95% ethanol for 5 min and purified water for 3 min. Tumour tissue sections were conducted antigen retrieval. Briefly, 1000 ml EDTA (pH 8.0) was taken in a pressure boiler, and tissue sections were placed on a heat-resistant plastic staining rack, and they were all put into EDTA. The boiler was heat with 1600 W/210 °C until it started to steam, and the boiler was kept 800 W/130 °C for 2 min. The glass slides were taken out after the boiler cooling down, and washed multiple times with tap water, then were soaked in purified water. The endogenous peroxidase was deactivated by 100 microliters per slide of peroxidase inhibitor solution in room temperature for 10 min and washed with followed by PBS soak.
Primary antibody incubation: PBS was wiped away from the tissue and surrounding area, the tissue was covered by 1 drop (50–100 microliters) of the P16 antibody working solution and incubated at room temperature for 30 min. The tissue sections were wiped off the liquid, and washed with PBS for 5 min, twice. Secondary antibody incubation: PBS was wiped away from the tissue and surrounding area, the tissue was covered by 100 microliters the immunostaining dye (Vbright I) and incubated at room temperature for 20 min. The liquid on tissue slides were wiped off and washed with PBS for 5 min, twice.
DAB staining: 100 microliters freshly prepared DAB staining solution was added onto the tissue and tissue sections were incubated at room temperature for 1–10 min (staining time determined by examining under the microscope). The slides were fully washed with purified water to flush out the staining. The staining was stopped and re-stained with haematoxylin and washed with PBS. The slides were dehydrated, hyalinized, sealed and examined under the microscope.
Criteria for positive staining: In the tissue section, the tumour cells were found to have brown/brownish-yellow colour in the nucleus/cytoplasm of the cell and background no colour staining. Criteria for negative staining: In the tissue section, the tumour cells were not found to have brown/brownish-yellow colour in the nucleus/cytoplasm of the tissue section.
WES
Degradation and contamination of extracted DNA were assessed by 1% agarose gels, and concentration was quantified by Qubit® DNA Assay Kit in Qubit® 3.0 Fluorometer (Invitrogen, USA). Genomics DNA was enriched with 0.4ug with an Agilent liquid capture system (Agilent SureSelect Human All Exon V6) based on the kit protocol. Briefly, random fragmented DNA (180–280 bp) were repaired by Covaris S220 sonicator, and the overhangs remained were converted into blunt ends by exonuclease polymerase. Fragmented DNAs were end paired, phosphorylated, and performed A-tailing and ligation with Illumina pair-end adaptors. Ligated DNAs were amplified in the PCR reaction.
The PCR products with a biotin-labelled probe were captured by magnetic beads and streptomycin to get exons. Captured exons were amplified by PCR reaction to ligate the index tags for sequencing. PCR products were purified by the AMPure XP system (Beckman Coulter, Beverly, USA). DNA concentration was quantified by Qubit®3.0 Fluorometer (Invitrogen, USA) and the size distribution was assessed by NGS3K/Caliper and real-time PCR. Indexed-coded samples were clustered on cBot Cluster Generation System by Illumina PE Cluster Kit (Illumina, USA). The clustered DNA library was sequenced on Illumina NovaSeq 6000 for paired end 150 bp reads using Illumina PE Cluster Kit (Illumina, USA) based on the manufacturer’s protocol. The fluorescent images obtained by Illumina were transformed into short reads by base calling and recorded in FASTQ format.
Somatic variants
After filtering out the low-quality reads, sequence alignment onto the GRCh38 was conducted using BurrowsWheeler Aligner (BWA) 0.7.1730. The Cancer Genome Atlas (TCGA) DNA somatic variant calling workflow was performed. Briefly, the aligned BAM files were sorted and merged by samtools 1.1431, and duplicated reads were marked by the MarkDuplicates tool of GATK 4.2.4.032. Base Quality Score Recalibration (BQSR) was performed to estimate the mismatch errors using GATK. Somatic short variants, including single nucleotide variants (SNVs) and indels, were called by GATK MuTect2. Filtering of somatic variants was conducted by GATK CNNScoreVariants and FIlterVariantTranches tools. Annotation of the variants was done by ANNOVAR33.
CNV analysis of WES
VarScan 2.4.434 was used to detect somatic copy number alteration from match tumour and NAT pairs. The raw output was smoothed and segmented by DNAcopy to perform a circular binary segmentation (CBS) analysis35. GISTIC2.036 was used to identify the copy number alterations. In GISTIC2, the amplitude threshold was used to determine which regions of the genome were considered significantly amplified or deleted. Specifically, copy number actual changes between −0.3 and 0.3 were identified as “0”, indicating that the copy number of the sample was not amplified or deleted in that peak region. The copy number actual changes between 0.3 and 0.9 were identified as “1”, indicating that the sample had low-level copy number gains. The copy number actual changes larger than 0.9 were identified as “2”, indicating amplification. The copy number actual changes between −1.3 and −0.3 were identified as “−1”, indicating shallow deletion. The copy number actual changes less than −1.3 were identified as “−2”, indicating deep deletion. In two-category CNV analysis, copy number gains and amplification were merged as amplification. Shallow deletion and deep deletion were merged as deletion.
Whole transcriptome sequencing
Degradation and contamination of extracted RNA were assessed by 1% agarose gels. Purity and integrity were assessed by NanoPhotometer® spectrophotometer (IMPLEN, CA, USA) and RNA Nano 6000 Assay Kit (Agilent Technologies, CA, USA). The sequencing library was prepared with NEBNext® UltraTM RNA Library Prep Kit for Illumina® (NEB, USA) based on the manufacturer’s protocol. Briefly, purified mRNA was fragmented in elevated temperature with NEBNext First Strand Synthesis Reaction Buffer(5X). First-strand cDNA was synthesized by random hexamer primer and RNase H. Second-strand cDNA was synthesized with DNA Polymerase I and RNase H. NEBNext Adaptors with hairpin loop structure were ligated to DNA fragments after adenylation. The library was purified to obtain 250–300 bp length fragments and amplified with Phusion High-Fidelity DNA polymerase. PCR products were purified with the AMPure XP system and quantified on the Agilent Bioanalyzer 2100 system. The library was sequenced on the Illumina NovaSeq 6000 after the index-coded sample clustering.
High-quality data was obtained from raw data after cutting adapters and filtering out low-quality reads. Reference genome GRCh38 was indexed by Hisat2 2.2.137 and reads were aligned onto the reference genome by Hisat2 2.2.1. The samtools 1.1431 was used to convert SAM files to BAM. The expression genes and transcripts were assembled, quantified, and merged by StringTie 2.1.738, and the output gene count tables were used as input of R packaged DESeq2 to obtain the differentially expressed genes.
Differentially expressed genes and enrichment of pathways
Differentially expressed genes between tumours and NATs were analyzed using DESeq239.
DESeq2 assumes that there are enough sample replicates to estimate the dispersion accurately, samples are independent of each other and the count data follow a negative binomial distribution, which is justified by the overdispersion observed in bulk RNA-Seq data. The raw expressed counts were normalized using a ‘size factor’ to adjust for differences in sequencing depth between samples and fitted a negative binomial generalized linear model to the count data and performed hypothesis testing to identify genes that were differentially expressed between tumour and normal tissues. The Wald tests were used to calculate p values and p values were adjusted with the Benjamin Hochberg method to correct the false positive rate. apeglm40 method was used to shrink the log fold changes, which helps in stabilizing the estimates. Choosing DESeq2 for differentially expressed gene analysis is justified for its robust ‘size factor’ normalization method and reliable estimation for the overdispersion observed in RNA-Seq.Genes with FDR (adjusted p-value) <0.05 and fold change >2 or <0.5 were identified as DEGs. Pathway enrichment analysis for the DEGs was performed by the R package fgsea41 with the Reactome pathway database and the Hallmark Gene Sets of Human Molecular Signatures Database (MSigDB).
Gene expression subtype
The gene expression matrix was centred on gene median and the variability of each gene was computed by median absolution deviation (MAD). The top 2500 most variable genes were selected to perform the unsupervised clustering by the R package ConsensusClusterPlus42. The clustering procedures were performed on 100 randomly selected subsets of the expression matrix by a sampling proportion of 80% and one minus Pearson correlation as the distance metric. The number of subtype (cluster) was determined based-on the Delta Area Plot and Tracking Plot (Fig. S6), which were used to observe the increase in consensus and determine k (number of subtype/cluster) at which there is no appreciable increase. SigClust43 was used to assess the statistical significance of expression subtypes using the 2500 most variable genes. All pairwise comparison was conducted using 1000 simulated samples and the default covariance estimation.
The annotation of each gene expression subtype was based on the feature genes validated in previous studies11, namely the RPA2, E2F2, FGFR3 highly expressed cluster was atypical subtype, TP63, TGFA, EGFR highly expressed cluster was basal subtype, SOX2, NFE2L2, KEAP1, PIK3CA, AKR1C1 highly expressed cluster was classical subtype and PDGFRA, PDGFRB, TWIST1 highly expressed cluster is mesenchymal subtype.
Validation of the expression subtype was performed as the previously described method12. The expression values of 2500 genes with the highest MAD were extracted in both the HPSCC dataset and the HNSCC dataset. Each expression matrix of the 2500 genes was gene median-centered and centroids of each subtype were computed by the median of the samples within a subtype for each gene. Pearson correlation between the two expression matrix centroids was computed to show the similarity between the corresponding tumour subtypes.
Proteome
Total protein was extracted and dissolved in a dissolution buffer and quantified by the Bradford assay. Trypsin and 100 mM TEAB buffer were added and mixed with protein samples, followed by digestion at 37 °C for 4 h. Trypsin and CaCl2 were mixed with digested samples overnight and formic acid was added to adjust the PH under 3. Protein samples were centrifuged, and the supernatant was collected and loaded into the C18 desalting column. The supernatant was washed, eluted, and collected for LC-MS analysis. 0.1% FA in H2O and 0.1% FA in 80% ACN were used to develop the gradient elution. Samples containing supernatant were loaded into the EASY-nLCTM 1200 UHPLC system (Thermo Fisher) with Orbitrap Q ExactiveTM HF-X mass spectrometer (Thermo Fisher) to perform in the data-independent acquisition (DIA) mode with a spray voltage of 2.1 kV, Nanospray FlexTM (ESI) and capillary temperature of 320 °C. The m/z range was ranging from 350 to 1500 during the DIA acquisition. MS1 resolution was 6000 at m/z 200 with full scan AGC target value 5 × 105 and maximum ion injection time 20 ms. Peptides were fragmented by HCD in MS2 with resolution 30,000 at 200 m/z, AGC target value 1 × 106 and 27% normalized collision energy.
The protein spectra of each fraction were searched from the protein database by Proteome Discoverer 2.2 (PD 2.2, Thermo). Peptide Spectrum Matches (PSMs) were defined as PSMs with credibility above 99%, containing at least 1 unique peptide and FDR <1%. The search results of PD2.2 were imported into Spectronaut (version 14.0, Biognosys) to build a library. The DIA results were imported, and ion-pair chromatographic peaks were extracted. The protein quantification was analyzed by t-test and significantly different proteins between tumour and NAT samples were identified with p-value < 0.05 and fold change >2 or <0.5. The pathway enrichment analysis of the differentially expressed proteins (DEP) was performed by R package fgsea [11] with the Reactome pathway database and the Hallmark Gene Sets of Human Molecular Signatures Database (MSigDB).
Survival analysis
The R package survival was used for analysing the overall survival times, generating Kaplan–Meier plots, and computing log-rank p-values for stratified groups. The survfit() function fitted survival curves using the Kaplan-Meier estimator, the non-parametric statistic, to visualize the survival probabilities over time. Choosing Kaplan–Meier estimator was justified by its simplicity and ability to handle censored data effectively, and this method assumed that censoring is unrelated to the likelihood of the event occurring.
The survival distributions of two or more groups on the Kaplan–Meier plot were compared by log-rank test using survdiff() function and the significances were showed as p values. This method was used as it was widely used for hypothesis testing in survival analysis and it assumed that the survival curves were proportional over time. Univariant and multivariant Cox regression analyses were performed by the Coxph() function, which involved the Cox Proportional Hazards Model, a semi-parametric model, estimated the hazard ratio for different covariates without assuming a specific baseline hazard function. Choosing this method was because it allows including different covariates during analysis. It assumed the ratio of the hazard functions is constant over time. The forest plot to show the hazard ratio was produced by the ggforest function of the survminer R package.
Statistics and reproducibility
Software and statistical methods involved in the data analysis were described in detail in Methods. We recruited 103 HPSCC patients, and this represents the maximum number of patients whose demographic, follow-up data and specimens are available in the two hospitals in this study. The bioinformatics analysis was processed at least twice, and results were compared to ensure the reproducibility of our findings. The analysis results were validated by comparing findings of external cohorts.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Results
Patient description
We analyzed clinical and sequencing data of 103 patients who were diagnosed with HPSCC at Beijing Tongren Hospital, Capital Medical University, Beijing, China and The Third Affiliated Hospital of Kunming Medical University (Yunnan Cancer Hospital), Kunming, Yunnan, China. All the patients were treatment-naïve, and the samples were collected at the time of surgical resection, placed in a tube to flash freeze, and stored in liquid nitrogen. Both tumours and the matched normal specimens (from normal tissues adjacent to the tumour (NAT) for transcriptome and proteome, or blood for DNA) were shipped to Novogene Co., Ltd. for nucleic acid and protein extraction.
The sequencing workflow is shown in Fig. 1a. Tumours and the matched normal specimens of 38 patients in the Beijing site and 26 patients in the Yunnan site were subject to paired-end 150 bp whole-exome sequencing (WES) on Illumina NovaSeq 6000 to obtain an average of 100-fold sequencing coverage. Specimens including 61 tumours from the Beijing site, 28 tumours and 26 matched NATs from the Yunnan site were subjected to paired-end 150 bp RNA sequencing (RNA-Seq) on Illumina NovaSeq 6000, fifty-eight tumour specimens and 19 matched NATs from Beijing site were subjected to Data-independent acquisition (DIA) proteome and phosphoproteome profiling.

a Overview of multi-omics data collected of the hypopharyngeal squamous cell carcinoma (HPSCC) cohort (n = 103 patients). This figure was created using icons81,82,83,84,85,86 from BioRender. b A Venn Plot to show patient number for whole exome sequencing (WES), transcriptome, proteome and phosphoproteome sequencing. There are 58 patients with four-type sequencing data. c CoMut plot of HPSCC cohort organized by human papillomavirus (HPV) status (n = 64 patients). d Forest plot of multivariate Cox regression results of 15 somatic short variants most associated with patient OS. MUC4, EXD3, COL6A3, SYNE1, KALRN, ZNF415, SLC22A31, CEP170B, RECK, ADGRV1, ZNF469, FCGBP, DNAH8 and MACF1 mutations reached significance (p < 0.05), while OTOG mutation was near significance (p < 0.1). e 12 most significant copy number variations (CNVs) in HPSCC patients. The CNVs in red were near-significantly associated with patient overall survival (OS) in multivariate Cox analysis (p < 0.1). f Forest plot of multivariate Cox regression results of 15 CNVs most associated with patient OS. CNV on 3q26.33 was significantly associated with patient OS (p < 0.05), while CNVs on 12q12, 21p11.2, 9q21.11 and 5p13.2 were near significance (p < 0.1).
The characteristics of the patients are shown in Table 1 (Supplementary Data 1). The majority of 103 patients underwent tumour resection only (74%), with the rest receiving resection plus chemotherapy with TPF (Docetaxel, Carboplatin and Fluorouracil) (26%). Complete clinical data included patient ID, age, gender, Body Mass Index (BMI), smoking and alcohol history, HPV status, tumour stage, lymph node spread, the latest survival status and date of last follow-up. The median age of all patients was 59 years old and 96% of them were males. The patient BMI median value was 22, which is the normal range recommended by the World Health Organization (WHO)44. Most of the patients were current smokers (82%) and drinkers (78%). 25% of patients were HPV-positive. 81% of patients were in the advanced stage, 74% with lymph node spread and 99% without metastasis during diagnosis. There were 58 patients with multi-omics data as shown in Fig. 1b.
Somatic alterations
WES identified somatically mutated genes in 64 tumour tissues compared to matched normal (Fig. 1c), including 15,599 non-synonymous somatic variants and 72,525 synonymous somatic variants (Supplementary Data 2). In comparison to the previous HNSCC study13, mutations on 2531 genes were not previously reported (Supplementary Data 3). Among non-synonymous variants, 14,236 missense mutations occurred in 8093 genes accounting for most of the variants (Fig S1). The top ten most frequently mutation genes (Fig. 1c) are TP53 (66%), TNN (52%), MUC16 (26%), SYNE1 (25%), AHNAK2 (22%), FAT1 (22%), MUC4 (20%), DNAH11 (18%), KMT2D (18%), MUC5B (18%).
We found that HPSCC has high-level CNVs, with a mean of 1020 CNVs per tumour sample (Supplementary Data 4). Forty-eight regions of copy number amplification (Fig.S3) and 38 regions of copy number deletion (Fig.S4) were identified in all tumour samples with a False Discovery Rate cutoff of 0.25 (FDR = 0.25). The most significant CNVs were defined by the highest G-score36 combined with the lowest FDR (Fig. 1e). Notable genes located in the most frequently amplified CNV regions are NFE2L2 (2q37.3), SOX2 (3q26.33), CUX1 (7q22.1), EGFR (7p11.2), FGFR1, WHSC1L1 (8p11.23), FADD/CCND1 (11q13.3), and MYC (8q24.21). Notable genes in the top deleted CNV regions are FGF9 (13q11), APO (11q23.3), and CDKN2A/2B (9p21.3).
In order to discriminate microdeletions and micro amplifications, CNVs were divided into four categories based on the discrete value between −2 and 2 assigned by GISTIC2.0 algorithm36, where −2 is a deep deletion indicating a homozygous deletion, −1 is a shallow deletion indicating a heterozygous deletion, 0 is diploid normal copy, 1 is copy gains indicating a low-level gain and 2 is amplification indicating a high-level amplification45. The CNVs frequencies in TTN and MUC12 were the highest (82% and 77% respectively) in the four-categories CNV analysis (Table S5, S6, Supplementary Data 5), and the somatic mutation frequencies of these two genes were also among the top (52% and 17% respectively).
Multivariate Cox regressions were performed to associate genomic alterations (gene mutation and CNV) with patient OS, together with other significant risk factors including smoking, HPV and lymph node spread (Fig. 1d, f). MUC4 is the most associated gene with patient prognosis (p = 0.008, HR = 3.43, CI95% (1.48, 9.80), Fig. 1d), and it is also a top mutated gene (Fig. S2) shared between this HPSCC cohort and TCGA-HNSCC cohort (Table S1)13, with substantial mutational frequency differences (Fig. S2). CNV of 3q26.33 (harbouring SOX2) and 12q12 (harbouring MUC19) are significantly associated with patient OS (p = 0.01 and p = 0.011 respectively, Fig. 1f), with 3q26.33 gain protective and 12q12 gain risky factors (Fig. 1f).
Gene expression subtypes in HPSCC
Previous HNSCC studies11,14 identified four subtypes named classical (CL), mesenchymal (MS), basal (BA) and atypical (AT) based on gene expression signatures of HNSCC (Fig. 2b) and the subtypes were validated by other independent cohorts12,13. To characterize the expression subtypes of HPSCC, a consensus clustering was performed on the 2500 most variable genes in transcriptome data of 89 tumour samples. Four gene expression subtypes were identified by ConsensusClusterPlus42 (Fig. 2a, Fig. S6, Supplementary Data 6).

a The RNA expression of 14 feature genes in four subtypes of the current hypopharyngeal squamous cell carcinoma (HPSCC) dataset (n = 89 patients). Gene expressions used to annotate each subtype were shown in four black frames. The four expression subtypes were grouped as color bar shown. b The RNA expression of 14 feature genes in four subtypes of an external head and neck squamous cell carcinoma (HNSCC) dataset (n = 149 patients). Gene expressions used to annotate each subtype were shown in four black frames. The four expression subtypes were grouped as color bar shown. c Validation heatmap of Pearson correlation between centroids of the expression subtypes in HPSCC (rows) and HNSCC (columns) datasets. d Kaplan–Meier plot and Log-Rank Test p-value comparing the overall survival in HPV (+) and HPV (−) patients. e Kaplan–Meier plot and Log-Rank Test p-value comparing the overall survival in four expression subtypes in HPV (+) patients. f Kaplan–Meier plot and Log-Rank Test p-value comparing the overall survival in four expression subtypes in HPV (−) patients.
The annotation of the current four HPSCC subtypes was based on the expression of fourteen feature genes (RPA2, E2F2, FGFR3, SOX2, NFE2L2, KEAP1, PIK3CA, AKR1C1, PDGFRA, PDGFRB, TWIST1, TP63, TGFA, EGFR) previously identified in HNSCC11 (Fig. 2a, b). To explore the similarity between the HPSCC and HNSCC subtypes11, subtype centroids of each cohort were computed12, and the Pearson correlation of the two matrices’ centroids was used to measure the distance between HPSCC subtypes and HNSCC subtypes (Fig. 2c). In general, four consensus subtypes in HPSCC correlated closely to previous reported HNSCC subtypes (Fig. 2c, Supplementary Data 6), indicating the previous four subtypes in HNSCC derived from Western patients also present in Chinese HPSCC patients.
HPV infection is one of the most significant clinical risk factors for HPSCC. To investigate whether HPV infection (Figs. 2d, 4a) may affect the function of molecular subtypes, the association between subtypes and OS were re-evaluated after HPV stratification (Fig. 2e, f). Among HPV (−) patients, there was no obvious association observed between OS and GE subtypes (Fig. 2f). Whereas in HPV (+) patients, the combination of CL and AT subtypes tended to have an unfavourable patient OS (Fig. 2g, p = 0.33). In contrast, previous studies showed that the HPV(−) AT subtype is significantly associated with worse prognosis in patients with HNSCC11.
Gene and protein pathways associated with tumourigenesis
We compared the tumours from 26 patients with paired NATs to identify differentially expressed genes (DEGs) (Fig. S7, Supplementary Data 7–8). We obtained 544 DEGs (FDR <0.05, Fold Change >2 or <0.5, Fig. S7), including 393 upregulated and 151 downregulated genes. Also, 58 tumours and 19 NAT samples were compared to identify differentially expressed proteins (DEPs) associated with tumourigenesis (Supplementary Data 9). A total of 1513 DEPs (FDR <0.05, Fold Change >2 or <0.5, Fig. S8) were detected, including 545 upregulated and 968 down-regulated proteins (Fig. S8, Supplementary Data 9).
We found 27 DEGs that were also supported by DEPs (Supplementary Data 10), of 9 genes (FUT3, SIPA1L3, CGN, SCEL, SULT2B1, CSTA, CSTB, KRT78, CNFN) were observed to elevated in RNA but decreased in the proteome. The 27 overlapped DEGs were found enriched in 14 hallmark pathways46, and of 2 pathways were down-regulated at the transcriptomic level and up-regulated at the proteomic level, such as peroxisome and bile acid metabolism pathways (Fig. 3a, Figs. S9, S10). The common up- and down-regulated pathways were shown in Fig. 3a. Most of the genes are enriched with estrogen response late, estrogen response early, interferons (IFN)-gamma response, and hypoxia pathways. Of 14 in the 27 common genes/proteins were found to be somatically mutated (Fig. 3b), and most of them are also associated with estrogen response late and estrogen response early pathways (Fig. S11).

a Hallmark gene set enrichment analysis (GSEA) of tumourigenesis-associated pathways from 27 differentially expressed proteins overlapped with differentially expressed genes. b Venn Plot of common proteins/genes in somatic mutations, differentially expressed genes and proteins. c Sequence logo of the most frequently phosphorylated motif pattern in tumours.
A total of 2707 phosphosites corresponding to 1347 differentially expressed phosphoproteins were identified (Supplementary Data 11) in tumours as compared to the paired NATs. Most phospho-modifications in tumours occurred in Serine (87.6%) and Threonine (11.3%) (Fig. S12). There are 81 phosphorylated Serine motifs, and 11 phosphorylated Threonine motifs found in tumours, and TPPP (Fig. 3c) was the most frequent motif sequence. The NetworKIN algorithm of KinomeXplorer47 was used to infer the kinase activities in tumours based on phosphorylated motifs, the Protein Kinase C (PKC) family and Cyclin-dependent kinases (CDKs) obtained the highest NetworKIN score in tumour samples when compared to the normal (Fig. S13, Supplementary Data 12).
Risk factors for prognosis of HPSCC
The clinical characteristics of 103 patients were analyzed by univariate Cox regression and significant clinical risk factors were identified, including HPV (p = 0.043, HR = −0.84, CI95%= (0.19, 0.98)), smoking status (p = 0.023, HR = 1.1, CI95%= (1.2, 7.7)), and lymph node spread (p = 0.031, HR = 0.83, CI95%= (1.1, 4.8)).
A previous study13 reported that GE subtype was significantly associated with OS in HNSCC. This factor was then included in the following multivariate analysis, along with the risk factor identified from univariate analysis. Multivariate Cox regression analysis shows that HPV (p = 0.009, HR = 0.22, CI95%= (0.071, 0.68)) and lymph node spread (p = 0.047, HR = 3.60, CI95%= (1.017, 12.76)) remained as significant factors to patient OS (Fig. 4a), with HPV infection as a favourable and lymph node spread as an unfavourable factor. Patient gender did not reach significance in the hazard ratio due to the small sample size of females (N = 3). However, the female patients’ outcome was better than the males, where all female patients survived for five years (Fig. 4b). The gender disparity in prevalence and prognosis of HPSCC may be attributed to dysregulated estrogen pathways observed in the omics analysis (Fig. S11).

a Hazard Ratio of clinical risk factors in hypopharyngeal squamous cell carcinoma (HPSCC) patients evaluated by Cox Proportional-Hazards Model. b Kaplan–Meier plot and Log-Rank Test p-value comparing the overall survival in different gender. c Kaplan–Meier plot and Log-Rank Test p-value comparing the overall survival in MUC4 mutation and wildtype groups in all patients. d Kaplan–Meier plot and Log-Rank Test p-value comparing the overall survival in MUC4 mutation and wildtype groups in HPV (−) patients. e Kaplan–Meier plot and Log-Rank Test p-value comparing the overall survival in MUC4 mutation and wildtype groups in HPV (+) patients. f Kaplan–Meier plot and Log-Rank Test p-value comparing the overall survival in SYNE1 mutation and wildtype groups in all patients. g Kaplan–Meier plot and Log-Rank Test p-value comparing the overall survival in SYNE1 mutation and wildtype groups in HPV (−) patients. h Kaplan–Meier plot and Log-Rank Test p-value comparing the overall survival in SYNE1 mutation and wildtype groups in HPV (+) patients.
The patient characteristics stratified by significant risk factors (HPV, smoking and lymph node spread) were analyzed (Tables S10–12). Smoking was significantly associated with HPSCC tumour stage (p = 0.002, Table S11). Overall survivals of patients stratified by the above significant risk factors were shown in KM plots (Figs. 2d, S14), and favourable prognoses were observed in non-smoker and no-lymph node spread groups (p = 0.018 and 0.027 respectively). High-frequency mutated genes shared by HPSCC tumours and HNSCC (Fig. S2) were grouped by these significant risk factors and compared (Figs. S15–17). In general, the HPV-negative, non-smoker and without lymph node spread groups display lower mutation frequency in these top mutated genes, whereas the HPV-positive (Fig. 2d), non-smoker and without lymph node spread groups show favourable prognoses (Fig. S14).
Multivariate Cox regressions were performed for the highly mutated genes shared by current HPSCC and TCGA-HNSCC cohorts (Fig. S2). We only found two genes (MUC4, p = 0.008 and SYNE1, p = 0.011) showing significant associations with OS in our HPSCC cohort (Supplementary Data 13, Fig. 1d, Fig. 4c, f). The K–M plot of patients stratified by MUC4 mutation exhibits a significant difference in correlation with patient OS (Fig. 4c, p = 0.0057). This difference was also observed in HPV (+) patients (Fig. 4e, p = 0.0053). SYNE1 mutation is also significantly associated with patient OS in all (Fig. 4f, p = 0.027) and HPV (−) patients (Fig. 4g, p = 0.03).
A six-gene signature-based model to predict patient survival
The expression value of each DEG was analyzed by multivariate Cox regression for their correlation with patient OS, together with co-factors including expression subtype (BA or non-BA), HPV infection status, smoking status, lymph node spread and gender. There were 21 near-significant (p < 0.2, Supplementary Data 14) genes selected for training of the Elastic regularized Cox regression model with 61 patients from the Beijing site. Six DEGs CALB1, OAS3, SOX21, ANKUB1, SENP1, and HUNK were treated as stable gene features to establish a prognostic model for patient OS (Fig. 5a).

a Nomogram of expression marker genes and clinical risk factors for patient risk of death. To read the nomograms, draw a vertical line between the feature (genes or clinical factors) and the points scale at the top of the nomogram to determine its contribution in points to the total score for each feature. Add up the points from each feature and then draw a vertical line from the total points scale at the bottom of the nomogram to the Risk of Death scale to determine the probability of death. b Kaplan–Meier plot and Log-Rank Test p-value comparing the high- and low-risk groups dichotomized by the median of prognostic score in the validation dataset (HPSCC patients in Yunnan site, N = 27) c Receiver operating characteristic (ROC) curve of five-year survival for the patients from the Yunnan site.
We further validated the model with 27 patients from the Yunnan site. The prognostic power of this survival model was evaluated by dichotomizing the prognostic scores in the median value (Fig. 5b), and the log-rank test was used to compare the survival curves of different risk groups. The OS of high- and low-risk groups reached a significant difference (p = 0.028, Fig. 5b). The ROC curve draws five-year survival for the patients from the Yunnan site (Fig. 5c, AUC = 0.657). The nomogram shows the risk of death predicted by the normalized gene expression value and clinical factors (Fig. 5a).
Discussion
Comprehensive identification of somatic mutations, dysregulated genes and proteins of HPSCC has been limited by the availability of large, well-annotated HPSCC patient cohorts with matched genomic sequencing data, particularly within the HPV (+) subgroup. Here we present a joint analysis with a relatively large HPSCC cohort, a collection of 103 Chinese patients who were diagnosed with early and advanced stage HPSCC for the first time by two hospitals in Beijing and Yunnan province, China. We comprehensively profiled WES, transcriptome and proteome on the tumour tissues and paired normal and identified key genes/proteins and pathways that are highly related to the aetiology and prognosis of HPSCC. This data provides valuable bases for identifying prognostic biomarkers with better clinical management, and for developing novel drugs that would help for patient treatment, particularly in non-Western populations.
We found the 20 most frequently mutated genes in the TCGA-HNSCC cohort are also highly mutated (mutation frequency above 5%) in our HPSCC cohort13. For example, TP53 and TTN are the top two highly mutated genes in both HPSCC and HNSCC. However, we also identified other novel non-synonymous variants in 2531 genes. Furthermore, we establish the mutation frequency of some highly mutated genes which are different in HPSCC as compared to HNSCC. Several previous studies have noted a reduced prevalence of TP53 mutation in HPV ( + ) HNSCC13,48,49,50,51, whereas in our HPSCC cohort, the TP53 mutation rate is higher at 84.62% (11/13) in the HPV (+) cases than 62.75% (32/51) in the HPV (−) cases (Table S13). The inconsistency of mutation frequency was also detected in MUC4 which regulates cellular senescence in HNSCC through the p16 pathway52 and in OBSCN, encoding obscurin protein53, a tumour suppressor gene for breast cancer54. MUC4 mutation is also significantly associated with patient OS (Fig. 4c–e), indicating its potential as a prognostic biomarker to identify patients with elevated risk of metastasis or relapse.
HRAS mutation was not detected in our study, which has been a strongly confirmed mutation in HNSCC55,56. The previous finding regarding the HRAS mutations is almost absent in HPSCC compared with other head and neck cancer subtypes in the European population56. Another study also reported that laryngeal and pharyngeal cancers rarely harbour HRAS mutations similarly compared to oral cancers in the European population55. These results are consistent with our findings that HRAS mutation is not observed in the Chinese HPSCC population.
Based on the two-category CNV analysis, namely amplification and deletion, all CNVs detected in our patients were reported in TCGA-HNSCC cohort13. All reported CNVs in an HPSCC cohort have also been detected in our data57. The reported gains of copy number in HNSCC of 3q, 7p, and 11q11 were confirmed in this cohort (Fig. 1e and Fig. S5). By contrast, the reported losses of copy number in 3p, 9p and 14q regions11 were observed except for loss in 3p (Fig. S5, Supplementary Data 14). One of the quintessential copy number alterations in squamous cell carcinomas (SCC) is the amplification of 3q11,58,59, and both HPV (+) and (−) tumour samples in this study were observed recurrent focal amplification of 3q26.33 (Fig. S5, Table S2). While 3q26.32 (harbouring oncogene PIK3CA) and 3q28 (harbouring transcription factor TP63) were not detected in this study, which is inconsistent with previous HNSCC studies11,13.
The genomic region 3q26.33 is the most frequently occurred CNV in this HPSCC cohort, and it is significantly associated with patients’ OS (Fig. 1e, f, Table S2). One of the notable genes located in this region is a transcription factor, SOX2, which plays a crucial role in the regulation of embryonic development and in the determination of cell fate. As one of the four Yamanaka transcription factors, SOX2 can reprogram a somatic cell into a pluripotent embryonic-like cell. Previous studies have shown that SOX2 was frequently overexpressed in several cancers, contributing to tumour progression, which makes it a potential target for therapeutics60. In lung SCC, SOX2 is frequently amplified and acts as an oncogene, collaborating with the transcription factor p63 to regulate genes involved in cell proliferation and survival61. SOX2 is also amplified in oesophageal SCC, playing a role in maintaining the stemness and proliferative capacity of cancer cells, contributing to tumour aggressiveness61. Several approaches target SOX2 in cancer therapy including small molecule inhibiting SOX2 activity, anti-SOX2 antibodies, RNAi technology to silence SOX2 expression and CRISPR/Cas9 gene-editing system to knock out the SOX2 in tumour cells62. Thus, our findings shed light to HPSCC therapeutic interventions.
Several data validated a GE subtype in HNSCC cohorts. However, the subtypes in HPSCC have not been established due to the small sample size used11,13. Here, we found that four consensus GE subtypes previously reported in the pan-cancer results of HNSCC also exist in our HPSCC samples. Our GE subtype proportions partially correlate with that in an HNSCC cohort12, where the MS accounts for the majority and BA for the minority of all GE subtypes (Table S7). There were both concordant and discordant patient characteristics for each subtype. The AT subtype was characterized by the enrichment of HPV (+) tumours in both HNSCC13 and HPSCC. The CL subtype was reported as a smoking history-related subtype in HNSCC13 but in the HPSCC cohort, the BA subtype was found to have 100% current smokers in HPSCC (Table S7).
The genomic alterations for the GE subtype in HPSCC differed from that in HNSCC13, particularly the high prevalence of TP53 mutation across all four subtypes (Table S8). Genes related to oxidative stress were not abundant in the CL subtype of HPSCC compared to that in the CL subtype of HNSCC13. PIK3CA and HRAS mutations were noted to be the features of AT and BA subtypes in HNSCC13, whereas in HPSCC, PIK3CA mutation was presented in each subtype and HRAS mutation was not detected in all subtypes (Table S8), and the absence of HRAS mutation in HPSCC has been explained above. The CNV of the GE subtype in HPSCC correlated not very closely to that in HNSCC11,13. Briefly, CNV in 11q, 7p, and 14q regions of each subtype was consistent with that in the HNSCC subtypes, and CNV in 3q regions was inconsistent with subtypes in HNSCC11,13 (Table S9). The different tumour sites and ethnicities in the two cohorts may contribute to this inconsistency. Moreover, this reveals that the genomic alterations pattern of HPSCC subtype is not entirely consistent with that of HNSCC and the targeted therapeutics for HNSCC may not be specific for HPSCC. Most clinical trials for targeted therapies often include a broad range of tumour sites of HNSCC, leading to a lack of robust data on the efficacy of these treatments in HPSCC63.
The observed differences (mutations, CNVs, and GE subtypes) may be because HPSCC is a subset of HNSCC, implying tumour heterogeneity among tumour sites. Indeed, different anatomical sites expose/respond differently to risk factors such as tobacco, alcohol and HPV infection4 and these different factors lead to various mutation patterns64. Such differences were also observed in other heterogenous tumour types such as cholangiocarcinoma65. In addition, these differences may also be due to population diversity, because most current HNSCC studies were done in Western populations. Our finding from the Chinese population highlights the importance of conducting more omics research in different ethnic groups and anatomic sites.
Fourteen tumourigenesis-associated genes were observed in all three datasets: mutation, transcriptome and proteome. They are enriched in three up-regulated hallmark pathways including estrogen response late, estrogen response early, and interferons (IFN)- γ response. The role of IFNγ in regulating tumour immunity has been widely documented including in HNSCC66. The up-regulated estrogen response-related pathways in tumourigenesis may explain the gender disparity in patient OS of this HPSCC cohort. Natale et al.67 provided the importance of the estrogen signalling pathway as a tumour suppressor in pancreatic cancer and the therapeutic potential of this observation remains to be studied. To our knowledge, no published studies linked estrogen signalling pathways to the HPSCC tumours. Identification of tumourigenesis related genes in estrogen signalling pathway has potential translational applications, including targeted therapy or hormonal therapy, prognostic biomarker based-on genes in this pathway and understanding the resistance mechanisms68.
The proline-rich region, TPPP, was one of the most frequent motif patterns in phosphorylation in tumours. Tandem repeats are prone to mutations, leading to genomic instability, which is hallmark of many cancers69,70. Phosphorylation of the proline-rich domain of WAVE3 was reported to be an oncogenic driver in breast cancer71. Inferred kinases from the phosphorylated modification motifs included the PKC family and CDKs. They are drug targets for cancer therapy. PKC enzymes participated in several cellular processes including cell growth, differentiation and apoptosis72. Targeting PKC isoforms and CDKs contribute to more effective cancer therapies73. Enzastaurin was developed to inhibit PKC but showed limited efficacy and severe side effects in clinical trials73. Bryostatin-1 modulates PKC activity and showed potential in lymphoma combined therapy73. CDKs are crucial regulators of cell cycle and transcription. The dysregulation induced uncontrolled cell proliferation. Palbociclib, Ribociclib, and Abemaciclib, as CDK4/6 inhibitors, have been approved for treating hormone receptor-positive breast cancer72. Both PKC and CDKs inhibitors have the potential to be used in combination therapy to overcome treatment resistance mechanisms72.
The predictive model based on the expression of 6 genes was established and validated to predict patient OS in this study, with important clinical co-factors including smoking, HPV status, lymph node spread, and gender. Among these genes, SOX family genes, including SOX21 in this model, are related to cancer development such as epithelial-mesenchymal transition (EMT), the maintenance of cancer stem cells (CSCs) and the regulation of drug resistance74. The 3q26.33 region, where SOX2 located, is significantly amplified (Fig. 1e) in the current study. Further validation of this model is expected in an independent HPSCC cohort.
Disruptive mutation in TP53 and CDKN2A tended to be potential markers in HPV (+) patients (Fig.S18-19), with TP53 mutation tending to be protective while CDKN2A mutation to be risky, though they did not reach statistical significance. Their mutation groups showed an inverse hazard in HPV (+) groups. This inverse OS association of TP53 and CDKN2A mutation groups in HPV (+) patients has not been validated because only limited number of patients have both HPV infection and TP53 mutation simultaneously13,48,75. However, at the protein level study, patients with low expression of p53 protein showed better prognosis in HPSCC HPV (+) patients6. By contrast, patients with low expression of CDKN2A (p16) had a worse prognosis6. This finding is consistent with our result: TP53 disruptive mutation is a beneficial biomarker and CDKN2A is a risk biomarker in HPV ( + ) HPSCC patients. These mutations, including MUC4, SYNE1, TP53, and CDKN2A combined with HPV infection can be used as prognostic biomarkers to identify patients with high risk of relapse disease who require more frequent monitoring for tumour progression.
There are several discrepancies with previous HNSCC studies regarding mutation and CNV in HPV-infected patients. HPV ( + ) HNSCC are usually characterized as harbouring less TP53 mutations than HPV (−) HNSCC while almost no CDKN2A deletions or CCND1 amplifications are found in HPV infection3,76,77, whereas in our results, the TP53 mutation rate was higher in HPV (+) comparing to HPV (−) cases (Table S13), CDKN2A (9p21.3) deletion and CCND1 (11q13.3) amplifications were also presented in HPV (+) cases (Table S3-S4). TP53 disruptive mutations are nonconservative mutations in the DNA-binding domain (L2–L3 region) or stop codons in any region78. A previous study12 presented that TP53 disruptive mutation was associated with lymph node spread in the HPV(−) group in HNSCC12, whereas in our data, the association was not observed (Table S14). These inconsistencies may be caused by race differences, environmental influences and the uniqueness of HPSCC.
The discrepancies in mutation and CNV frequency seen between our data and published data in the context of HPV status might be partially caused by HPV detection methods induced misclassification of HPV status. Some studies used PCR to detect HPV sequences in DNA and RNA, while we used p16 immunohistochemistry (IHC) assay as a proxy of HPV infection, which is recommended by the NCCN Clinical Practice Guidelines in Oncology to test HPV presence in oropharyngeal cancer79. p16 IHC can sometimes yield false-positive results, where p16 overexpression is detected without the presence of HPV, which may lead to misclassification of HPV-driven cancers80. Besides, p16 IHC is not HPV-specific; it may express in other conditions80. These impact the specificity of HPV detection, and consequently misclassified HPV (−) patients into the HPV (+) group. Therefore, we think HPV detection methods with higher specificity should be used to re-evaluate the prevalence of HPV in HNSCC, and a large, diverse cohort including different races and tumour sites can be more convincible to address the HPV infection rate and associated molecular features.
There are weaknesses in the current study. Findings such as GE subtypes and the enrichment of estrogen related pathways in HPSCC aetiology will need to be validated in larger, independent HPSCC cohorts, both in Western and non-Western populations. Functional validation of the roles of SOX2 and estrogen pathways in HPSCC tumourigenesis and prognosis should be experimentally carried out. Furthermore, this study cohort consists predominantly of male patients (100 males and 3 females). Even though the patients were not preselected, the proportion implies that male has over 30 times higher prevalence of HPSCC than females, which underscores the importance of gender disparity. Gender disparity in incidence has also been noticed in TCGA-HNSCC cohort (Table S1) where the male patients are almost three-fold of females. In addition, a study also mentioned the late-life incidence of hypopharyngeal cancer in males is ten-fold of females in East-Asian7. However, gender disparity may limit the generalizability of the findings, particularly in understanding gender-specific mechanisms in HPSCC. Furthermore, this study contains only Chinese patients and similar studies involving more diverse cohorts, such as more female patients and other races of people, are valuable. Moreover, the use of p16 IHC as a proxy for HPV infection may not be as specific as other methods, such as PCR and this could impact the accuracy of HPV status in the cohort. Re-evaluation of the HPV prevalence with more specific detection method and infection associated molecular alterations in tumours are essential.
In conclusion, our data presented a comprehensive profiling of mutational, transcriptomic and proteomic data in Chinese HPSCC patients. We identified novel mutations in 2531 genes and showed that a highly mutated gene, MUC4, is associated with patient prognosis. We found that estrogen response pathways, PKC and CDK pathways play significant roles in tumourigenesis of HPSCC, indicating potential therapeutic opportunities. The combination of HPV status and mutations as predictive markers was assessed, including SYNE1 in HPV (−), MUC4, TP53 and CDKN2A in HPV (+). Finally, our predictive model involving clinical factors and transcriptomic expression of six genes can be used to predict HPSCC patient prognosis.
Responses