Deep learning enhances the prediction of HLA class I-presented CD8+ T cell epitopes in foreign pathogens

Main

Cytotoxic CD8⁺ T cells have been shown to limit disease severity and provide protection against viral infections such as human immunodeficiency virus (HIV)^1,2 and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)^3,4,5,6 by recognizing short viral peptides bound to human leukocyte antigen class I (HLA-I) molecules. While T cell-based vaccines hold great promise for foreign pathogens^2,4,7,8,9,10, rapidly identifying immunogenic epitopes across viral proteomes is challenging owing to the extensive degree of HLA polymorphism in the population and the experimental burden required to validate HLA-I binding and CD8⁺ T cell reactivity. Given that only a small fraction of pathogen-derived peptides elicit an in vivo CD8⁺ T cell response^11,12, computational algorithms that can rapidly identify immunogenic epitopes within viral proteomes for a broad range of HLA-I alleles would greatly accelerate T cell-based vaccine development.

Although numerous computational methods exist to predict CD8⁺ T cell epitopes^13,14,15,16, their accuracy varies substantially across HLA-I alleles¹⁷. Moreover, the extent to which these models can identify immunogenic epitopes is not well understood, in part owing to the lack of unbiased evaluation datasets. Peptides frequently tested for immunogenicity are often conditionally selected using existing binding predictors^13,14,15,16. Although several tools can define the characteristics of immunogenic cancer neoepitopes^{18,19,20,21,22,23}, they are yet to show strong generalizability to foreign pathogens. Thus, given the remarkable success of deep learning to improve model generalization across several areas of biology, such as protein structure prediction²⁴ and CD4⁺ T cell epitope presentation^25,26, we reasoned that it could also enhance the prediction of presented and immunogenic CD8⁺ T cell epitopes.

We therefore developed MUNIS: a deep learning-based predictor of HLA-I epitopes that utilizes a bimodal architecture to jointly model both HLA-I–peptide binding and antigen processing by leveraging a well-curated and expanded training set of 651,237 unique HLA-I ligands across 205 HLA-I alleles. We validated MUNIS in silico on immunopeptidomic data and achieved superior performance compared with existing HLA-I epitope predictors. Distinct from previous work^13,14,15,16, we evaluated immunogenicity prediction using several unbiased immunopeptidomic benchmarks. We subsequently validated MUNIS experimentally using an in vitro HLA-I–peptide stability assay²⁷ and T cell immunogenicity assays on predicted peptides from Epstein–Barr virus (EBV). We explicitly omitted all peptides in EBV from the model training set to mimic prediction of a novel virus. Importantly, MUNIS identified both established and new CD8⁺ T cell epitopes in EBV that elicited effector and memory CD8⁺ T cell responses. Moreover, MUNIS was a comparable immunogenicity predictor to an HLA-I–peptide stability assay²⁷, which defined epitopes for a SARS-CoV-2 T cell vaccine²⁸, illustrating the power of deep learning to reduce experimental burden and rapidly accelerate T cell vaccine development.

Deep learning model development for CD8⁺ T cell epitope identification

To develop a robust model of immunogenic CD8⁺ T cell epitope prediction, we constructed a training dataset composed of mass spectrometry data from immunopeptidomic experiments used for the MixMHCpred2.2¹⁶, NetMHCpan4.1¹³ and MHCflurry2.0¹⁴ training sets and data obtained from the Immune Epitope Database (IEDB: iedb.org)²⁹. We supplemented these HLA-I-binding peptides with randomly sampled decoy peptides (non-binders) from Swiss-Prot³⁰ at a ratio of 1:5 HLA-I binders to non-binders. Importantly, all epitopes used in model evaluation were removed from training sets, regardless of HLA-I restriction. This is a far stricter approach to data filtering than previous methodologies where only matched HLA-I–peptide pairs were removed from the training set, resulting in substantial overlap between test set epitopes and training sets. We show the sources of these datasets (Fig. 1a), the peptide length distributions (Fig. 1b,c) and the frequencies at which peptides bind HLA-I molecules (Fig. 1d,e).

**Fig. 1: Characteristics of the deep learning model and the training and evaluation datasets for prediction of HLA-I epitopes.**

The architecture of the model is composed of two submodules (Fig. 1f). First, the binding module takes as input the sequences of the HLA-I molecule and peptide and learns a numerical vector representation, which we refer to as binding features, to produce the final prediction score. Our second module augments the binding features with a signal relevant to antigen processing. This is accomplished by encoding the peptide and five N-terminal and C-terminal flanking residues from its parent protein of origin, similar to MHCflurry2.0¹⁴. Using this approach, we train five models and combine them in an ensemble by taking the average score on any given input. The score of the model for HLA-I–peptide pairs ranges between 0 and 1 and indicates the probability of a peptide binding that particular HLA-I allele.

Previous work has frequently relied on area under the receiver operating characteristic curve (ROC-AUC) and precision (that is, positive predictive value) as evaluation metrics. However, given the abundance of negatives, it is possible that for a given threshold, the proportion of positives is dominated by false positives even with a low false-positive rate. Thus, we primarily use the area under the precision–recall curve, known as average precision score, to describe the model performance. This score ranges from 0 to 1, with a random predictor having a value equal to the percentage of positives in the data. For datasets of sufficient size, we also show the ROC-AUC.

MUNIS outperforms existing predictors in classifying HLA-I binders

We first evaluated the presentation model on a published immunopeptidomic dataset³¹ that contains 41,725 positive HLA-I–peptide pairs and 208,625 randomly sampled decoys across 24 HLA-I alleles. We calculated the average precision scores and ROC-AUC of classifying binders and non-binders on a per-allele basis and compared our scores with the existing tools MixMHCpred2.2, NetMHCpan4.1, MHCflurry2.0, TransPHLA³² and BigMHC³³, which predicts both presented peptides (BigMHC-EL) and immunogenic epitopes (BigMHC-IM). On this dataset, our model achieves a median average precision of 0.952, which corresponds to a 21% reduction in error compared with existing tools, with MixMHCpred2.2 scoring 0.924, NetMHCpan4.1 scoring 0.925, MHCflurry2.0 scoring 0.938, TransPHLA scoring 0.854 and BigMHC scoring 0.939 (Fig. 2a). The median ROC-AUC of MUNIS is 0.980, which corresponds to a 31% reduction in error compared with MHCflurry2.0 at 0.971, NetMHCpan4.1 at 0.962, MixMHCpred2.2 at 0.956, TransPHLA at 0.948 and BigMHC at 0.969 (Fig. 2b). Importantly, over 65% of positive epitopes in the evaluation dataset are also present in the training sets of existing predictors, excluding MUNIS where we ensure a 0% overlap. Despite this overlap, MUNIS still outperforms these prediction algorithms on 22/24 HLA-I alleles tested in average precision (Fig. 2c).

Evaluation of these various predictors suggests that differences in model performance could be due to each model’s ability to encode peptide length¹⁶. We therefore evaluated model performance across HLA-I alleles stratified by peptide length (Fig. 2d,e). This revealed that MUNIS outperforms for 9-mer, 10-mer and 11-mer peptides, indicating that encoding of peptide lengths is unlikely to be a key discriminator of model predictive capability.

To further evaluate predictive capabilities, we compared MUNIS with existing predictors using the identical evaluation dataset as above but with most peptides included in the tool training sets removed. In this setting, we observed a wider gap in performance for several HLA-I alleles, with MUNIS achieving a median average precision score of 0.894, MixMHCpred2.2 scoring 0.854, NetMHCpan4.1 scoring 0.868, MHCflurry2.0 scoring 0.867, TransPHLA scoring 0.795 and BigMHC scoring 0.891 (Extended Data Fig. 1a). MUNIS outperforms existing tools on 18/24 HLA-I alleles tested in average precision and 21/24 in ROC-AUC (Extended Data Fig. 1b), and a larger performance gap was observed for several HLA-I alleles on this cleaned evaluation set against all tools, except for BigMHC which retains 30% overlap with positive epitopes in the clean test set and TransPHLA which retains 14% overlap (Extended Data Fig. 1c). These data reveal that MUNIS can identify presented peptides from mass spectrometry data with greater accuracy than existing tools across several source proteins, HLA-I alleles and peptide lengths.

MUNIS predicts fewer false positives by using canonical HLA-I motifs

Given the improved performance of MUNIS in predicting HLA-I-presented peptides, we evaluated peptide-binding motifs for several alleles. Interestingly, we found that existing tools assign high individual model scores to non-HLA-I-binding peptides and thereby predict an increased number of false positives (Fig. 3a). We therefore evaluated the HLA-I-binding motifs for correctly classified peptides (true positives) versus misclassified peptides (false positives) for all HLA-I alleles in the evaluation set, with HLA-B*40:01 shown as a representative example (Fig. 3b). While all predictors were capable of correctly identifying peptides with canonical binding motifs, MixMHCpred2.2, NetMHCpan4.1 and MHCflurry2.0 also classified peptides with non-canonical anchor residue motifs as binders as well. The binding motifs of the false positives across these three predictors did not have a dominant amino acid at HLA anchor residue positions (Fig. 3b, highlighted in yellow) as observed for true positives. In addition, falsely classified binders have significantly greater entropy at HLA anchor residues compared with true binders (Fig. 3c). In contrast, MUNIS largely avoids classifying peptides without expected HLA anchor residues as positives and, consequently, false positives were extremely low for MUNIS with fewer than 25 misclassified peptides per HLA-I allele.

**Fig. 3: Motif analysis of misclassified binders reveals inconsistent reliance of existing models on canonical HLA-I-binding motifs.**

MUNIS predicts epitope immunogenicity and immunodominance

Given that only a fraction of HLA-I-presented peptides elicit CD8⁺ T cell responses, we next measured the ability of MUNIS to predict immunogenic epitopes. To avoid biases from existing prediction algorithms in peptide selection, we constructed a test set of peptides using five immunopeptidomic datasets from the influenza viruses A and B that bind to HLA-A*02:01, HLA-A*11:01 and HLA-A*24:02 (524 presented peptides across the five datasets)^34,35,36. Each of these datasets comprises a list of HLA-I-presented peptides identified via mass spectrometry that were subsequently evaluated for immunogenicity using interferon-γ (IFNγ) enzyme-linked immunospot (ELISpot) assays, where 36 of the 524 peptides were reported as immunogenic in the IEDB. For each dataset, we ranked positive (that is, immunogenic) peptides against all other peptides in the viral proteome. When compiling the set of negatives, we considered only proteins with at least one immunogenic peptide to prevent confounding by inherent levels of protein immunogenicity.

We benchmarked MUNIS against immunogenicity predictors PRIME2.0¹⁴ and BigMHC_IM and the HLA presentation prediction tools (MixMHCpred2.2, NetMHCpan4.1, MHCflurry2.0, TransPHLA and BigMHC_EL). PRIME2.0 was benchmarked on immunogenicity test sets by running each allele–peptide pair independently using ‘%Rank_bestAllele’ as a readout of prediction strength. We found that MUNIS outperforms all other prediction algorithms in identifying presented epitopes across the 5 datasets, with a median average precision of 0.289, which is a 26% relative improvement in performance compared with the next-best tool (MHCflurry2.0; Fig. 4a). However, all tools showed a similar predictive capability across HLA-I alleles when predicting immunogenic epitopes against decoy peptides (Fig. 4b). Of note, the tools with higher median average precision scores than MUNIS were the tools with substantial overlap between evaluation and training sets. To further understand this result, we measured performance with only non-immunogenic HLA-I binders as negatives (Extended Data Fig. 2a). This resulted in increased average precision over several alleles, indicating the tendency of models to rank presented, immunogenic epitopes higher than presented, non-immunogenic ones. These results highlight the intricate relationship between an increased likelihood of presentation and downstream T cell recognition and also underscore the unresolved gap to achieve high accuracy for immunogenicity when filtered on known HLA-I binders.

**Fig. 4: MUNIS outperforms existing tools in predicting epitope immunodominance hierarchies.**

While notable differences in immunogenicity prediction for individual epitopes were not observed, we explored whether MUNIS could predict immunodominance hierarchies, as effective T cell vaccines would ideally elicit immune responses across many individuals. We therefore leveraged T cell response data for HIV given known immunodominance hierarchies for multiple HLA-I alleles³⁷. Specifically, we used a dataset where 119 HIV epitopes were tested for CD8⁺ T cell responses in 527 individuals with HIV split across acute and chronic infection groups^18,24, with each peptide eliciting a CD8⁺ T cell response in up to 81% of corresponding individuals with HLA-I⁺.

We evaluated our model by computing the Spearman rank correlation coefficient between the MUNIS-predicted score and response frequency for each epitope. MUNIS had a Spearman correlation coefficient of 0.35 compared with 0.34 and 0.295 for the next-best-performing tools (BigMHC_IM and NetMHCpan4.1) for epitopes targeted in the acutely infected HIV⁺ subgroup (Fig. 4c) and 0.33 compared with 0.34 and 0.28 (BigMHC and MixMHCpred 2.2) for epitopes targeted in the chronically infected HIV⁺ subgroup (Fig. 4d). Of note, three tools used for comparison (NetMHCpan4.1, MHCflurry2.0, BigMHC) were enriched for HIV epitopes in their training dataset, whereas these were excluded from the MUNIS training set. We also stratified performance on a per-allele basis and found that MUNIS outperforms in both the acute (Extended Data Fig. 2b) and chronic (Extended Data Fig. 2c) subgroups. Overall, these data demonstrate that MUNIS is competitive in identifying immunodominant CD8⁺ T cell epitopes despite not having been explicitly trained for this task.

Because of the numerous false-positive binders predicted by other tools, we hypothesized that the superior capability of MUNIS to recapitulate immunodominance hierarchies could be due to prediction scores that reflect the targeting frequency of epitopes. When we binned epitopes by targeting frequency, we again observed that more conservative prediction scores from MUNIS discriminate between subdominant and immunodominant epitopes in HIV (Fig. 4e,f), with BigMHC being the only other tool to share this property. Collectively, these data demonstrate that MUNIS can identify epitopes with higher likelihoods of eliciting CD8⁺ T cell responses broadly across individuals.

To deconvolute which features contribute to enhanced model performance, we performed an extensive ablation analysis with competing models (Extended Data Fig. 3). We selectively ablated individual features to query the impact of the pretrained Evolutionary Scale Modeling-2 (ESM-2) Transformer protein language model, size of the language model, flanking residues, negative sampling logic and model score versus percentage rank score, which attempts to normalize model scores across HLA-I alleles (Extended Data Fig. 3b). A larger ESM-2 model (35 million parameters) improved performance slightly over a smaller model (8 million parameters), with an average precision of 0.959 compared with 0.953, respectively. Pretraining the model also improved the prediction of HLA-I–peptide presentation from an average precision of 0.946 to 0.953. Introducing flanking residues enhanced HLA-I–peptide presentation prediction from an average precision of 0.947 to 0.953. Interestingly, many of these features (namely, pretraining and incorporating flanking residues) slightly reduced the average precision on immunogenicity prediction. With regards to immunodominance, where the results are not stratified by allele, we compared the performance of MUNIS using the native score and the percentage rank for any given peptide and found that both methods perform similarly on the HIV acute and chronic immunodominance datasets. Finally, we observed that sampling decoys from Swiss-Prot was generally equivalent to sampling solely from the human proteome.

MUNIS discriminates between HLA-I-binding and non-binding EBV peptides

To evaluate the practical utility of MUNIS, we experimentally assessed predictions of epitope binding and immunogenicity within the EBV proteome for several HLA-I alleles, given its relevance to immunocompromised populations³⁸, linkage to multiple sclerosis^39,40 and >90% prevalence in the human population⁴¹. For this evaluation, we made EBV a de novo virus by excluding epitopes from the EBV proteome from the MUNIS training set. We first scanned all 8–15mers from 5 immunogenic EBV proteins (BRLF1, B2LF1, EBNA1, LMP2 and EBNA3a) and selected 337 peptides predicted to bind 17 HLA-I alleles using an earlier iteration of MUNIS. All of these peptides had scores >0.01 upon model finalization (Fig. 5a). Predicted binders for one allele were used as predicted non-binders for another allele if the MUNIS score was less than 0.01, contributing to a robust immunogenicity evaluation set. Each peptide was evaluated for HLA-I binding using an established HLA-I–peptide stability assay²⁷ and subsequently tested for T cell immunogenicity by IFNγ ELISpot assays in peripheral blood mononuclear cells (PBMCs) obtained from HLA-haplotyped human participants (Fig. 5b).

**Fig. 5: Experimental HLA-I–peptide stability assay confirms the ability of MUNIS to discriminate between binding and non-binding peptides within EBV.**

For each HLA-I allele, we evaluated the HLA-I-binding and stabilization capacity of 9–52 predicted peptide binders and 3–5 predicted non-binders. When normalized to the HLA-I stabilizing mean fluorescence intensity (MFI) of a corresponding immunodominant HLA-I-restricted HIV epitope³⁷, the predicted binders have a significantly higher MFI than predicted non-binders (Fig. 5c–e and Extended Data Figs. 4 and 5). For the selected top-337 ranked peptides used in the HLA-I–peptide stability assay, the median MUNIS score for a predicted binder was 0.96, with 318 HLA-I–peptide pairs scoring ≥0.50, 285 pairs ≥0.70, 219 pairs ≥0.90 and 187 pairs ≥0.95. This provided the opportunity to assess the relationship between quantitative MUNIS score and experimental HLA-I–peptide stabilization, revealing that higher MUNIS score thresholds for classifying peptides led to increased discrimination between binders and non-binders (Fig. 5f).

MUNIS identifies established and novel immunogenic EBV epitopes

To assess the ability of MUNIS to identify immunogenic epitopes from EBV proteins (BRLF1, B2LF1, EBNA1, LMP2 and EBNA3a), we performed IFNγ ELISpot assays on PBMCs from 30 HLA-I-typed individuals using overlapping 15mer peptide pools from each protein. This demonstrated that all individuals had non-zero IFNγ ELISpot responses to at least one of the five overlapping EBV peptide pools (Extended Data Fig. 5), providing strong rationale for testing individual predicted peptides within these individuals. We therefore assessed T cell reactivity using an IFNγ ELISpot by matching individual peptides to participants with the requisite restricting HLA-I allele (Fig. 6a). Given the HLA-I haplotypes of our cohort, we expanded the set of HLA-I–peptide pairs to 370 to include HLA-A*11:01-, B*44:02- and C*08:02-peptide pairs, which were not present in HLA-I–peptide stability assessments. Of these 370 unique HLA-I–peptide pairs tested, we identified 27 HLA-I–peptide pairs and 25 unique peptides that elicited detectable T cell responses. Interestingly, 12 immunogenic peptides predicted by our model had not previously been identified or deposited in the IEDB (Fig. 6b), illustrating the ability of MUNIS to predict novel CD8⁺ T cell epitopes, even for an extensively studied pathogen such as EBV. Using ex vivo IFNγ ELISpot and proliferation assays, we further confirmed that one of the novel HLA-A*02:01-restricted EBV epitopes (SIIPRTPDV, BZLF1: 229–237) as well as a MUNIS-predicted, known immunodominant EBV epitope (YVLDHLIVV, BRLF1: 109–117) are capable of eliciting both effector (Fig. 6c) and memory (Fig. 6d) CD8⁺ T cell responses across multiple individuals, illustrating the ability of MUNIS to identify broadly reactive epitopes.

**Fig. 6: MUNIS identifies established and novel EBV CD8⁺ T cell epitopes.**

Because our cohort was limited to 30 individuals, we supplemented our list of IFNγ ELISpot-confirmed immunogenic epitopes with CD8⁺ T cell reactivity data from the IEDB. Thus, in addition to the 27 pairs found to be immunogenic, we included any peptide with a positive frequency of response reported in the IEDB via an established T cell assay (for example, tetramer staining, IFNγ ELISpot or intracellular cytokine staining). This revealed an additional 18 HLA-I–peptide pairs, for a total of 45 pairs with 42 unique peptides. Compared with other computational epitope prediction algorithms, MUNIS showed enhanced identification of immunogenic EBV epitopes. MUNIS ranked four immunogenic epitopes in the top 5 and 20 in the top-60 ranked peptides. When compared with other tools, MUNIS and BigMHC-IM perform best, with an average precision of 0.3 (Fig. 6e). In terms of novel immunogenic epitopes (that is, those not currently in the IEDB), MUNIS outperforms all other tools (Fig. 6f). While these comparisons are performed on epitopes with lengths of 8–14 residues (the peptide length training range for several tools), one of the highly ranked immunogenic peptides for HLA-A*02:01 is a 15mer (HSDYQPLGTQDQSLY, LMP2A: 71-85), illustrating the ability of MUNIS to predict longer epitopes. Notably, without restriction on peptide length, MUNIS outperformed the HLA-I–peptide stability assay in immunogenicity prediction (Fig. 6g) and had comparable results when stratified by HLA-I allele (Fig. 6h). Collectively, these results confirm that MUNIS can identify not only binders but also immunogenic epitopes from a de novo pathogen with high accuracy.

Discussion

In this study, we report the development of MUNIS, a deep learning model for CD8⁺ T cell epitope prediction within foreign pathogens. We utilized a deep learning architecture and curated dataset for HLA-I epitope presentation and implemented the encoding of HLA-I–peptide sequences with a protein language model, utilized as a deep transformer model²⁴. These models are trained on millions of protein sequences and learn features that are broadly applicable to property-prediction tasks²⁴, making them well suited for accurate immunogenic epitope prediction. Similar to MHCflurry2.0¹⁴, we used context from the parent protein to capture epitope processing. This composite of characteristics led to improved performance in the prediction of HLA-I-presented peptides with increased probability of being targeted by multiple individuals.

Despite notable improvements in predicting epitope presentation, we did not observe substantial discrimination between immunogenic and non-immunogenic epitopes when conditioned on peptides presented by HLA-I in influenza immunopeptidomic datasets. This may be due to the ability of MUNIS to enhance prediction of peptide presentation itself and not necessarily predict immunogenicity of filtered HLA-I binders. Similarly, this may explain its improved performance ranking immunogenic epitopes in EBV when seen as a de novo virus, where all aspects of the model could be leveraged. In addition, we also found that MUNIS showed noticeable improvement compared with contemporary tools in predicting immunodominance hierarchies in acute and chronic HIV infection. Immunodominance hierarchies can be partially dictated by the dependence of HLA-I alleles on the protein tapasin⁴² by aligning hierarchical responses with HLA-I–peptide stability⁴³. Because MUNIS ranks immunogenic epitopes comparably to experimental HLA-I–peptide stability, this may explain its increased efficacy in recapitulating immunodominance hierarchies. Notably, we find that BigMHC (when using both its eluted ligand EL and immunogenicity IM models) is the only predictor comparable to MUNIS across different evaluation settings. However, MUNIS performs consistently on all metrics using a single output score, making it broadly useful for presentation and immunodominance prediction.

An extensive ablation analysis revealed that the pretrained ESM-2 model and incorporation of flanking residues contribute to the outperformance of MUNIS on HLA-I–peptide prediction. As the model pretrained on 8 million parameters performed similarly to the one pretrained on 35 million parameters, we opted for the smaller model to increase efficiency. Given that the prediction task focuses on a single protein family, the benefits of deconvoluting underlying structural features across the full protein landscape by a larger language model may be lesser. Using percentage rank and the native score of the model to predict HLA-I–peptide binding partners performed similarly on immunodominance datasets, highlighting that MUNIS has learned to compare scores across HLA-I alleles despite inherent differences in sample sizes across alleles within the training data. Finally, ablation analysis on the decoy sampling method revealed similar results whether decoys were derived from Swiss-Prot or the human proteome. Overall, this analysis showed that while modelling features selectively benefit presentation prediction, some subtly detract from immunogenicity prediction. Thus, we offer versions of MUNIS trained with and without flanking residues.

Importantly, we note limitations to this work. First, the training and evaluation data of 205 prominent HLA-I alleles only partially captures the extent of HLA polymorphism in the population. In addition, while MUNIS captures features of immunogenic peptides, it does not substantially improve immunogenicity prediction when peptides are filtered on HLA-I binding. This would benefit from conditioning predictions based on the likelihood of HLA-I–peptide complexes engaging specific T cell receptors. However, the lack of available data^29,44,45,46 for T cell receptor–peptide–HLA-I binding makes this task challenging. We also note that our model is largely trained on mass spectrometry data, which may contain potential biases such as under-representation of cysteine residues⁴⁷. In addition, while we find that our architecture decisions result in improved presentation prediction, some features resulted in weaker immunogenicity prediction. This suggests that the transfer between the two tasks, although positive, is not completely linear.

Nonetheless, the ability of MUNIS to accurately predict immunogenic peptides and identify new epitopes has several implications. Rationally designing immunogens that will elicit robust T cell responses is pivotal to vaccine design, and MUNIS could therefore greatly accelerate this process. Surprisingly, we found that MUNIS outperformed an experimental HLA-I–peptide stability assay in predicting immunogenic epitopes. Thus, it may be possible to substantially reduce the experimental burden that accompanies binding and stability assays and directly perform immunogenicity studies on MUNIS-predicted peptides. Lastly, MUNIS was able to identify several novel CD8⁺ T cell epitopes in EBV, highlighting the potential of deep learning for epitope discovery. We envision that future efforts with expanded training datasets of immunogenic peptides will further improve model performance.

Methods

Datasets

The eluted ligand (that is, presentation) training set consists of four different datasets: the eluted ligand data deposited in the IEDB²⁰ as well as data from the NetMHCpan4.1¹³, MHCflurry2.0¹⁴ and MixMHCpred2.2¹⁶ studies. The final dataset after filtering out epitopes in the evaluation dataset is composed of 651,237 positive peptide–HLA-I pairs and 3,701,209 negative decoys with peptide lengths between 8 and 15. Five-amino-acid-long N- and C-terminal flanking sequences for each peptide were fetched from the parent protein sequence annotated in the IEDB. When not available, we attempted to find a matching sequence by searching the Swiss-Prot sequences using MMSeqs2⁴⁸. The dataset was filtered to data points where the parent protein could be identified. Sequences for the HLA alleles were obtained from the IMGT database⁴⁹. Contrary to previous work that uses pseudo-sequences, that is, sequences of HLA molecules that are within peptide-binding range⁵⁰, our model takes as input the full α1 and α2 domains of the HLA sequence, specifically the 180 residues ranging from positions 27 to 207.

Model architecture

The binding module uses the ESM-2 protein language model of 6 million parameters⁵¹. Language models are trained on millions of protein sequences and have been shown to implicitly learn various structural features of proteins from sequence alone. The language model takes as input the HLA and peptide sequences in the form: [CLS] HLA [SEP] PEPTIDE [EOS], where the [CLS], [SEP] and [EOS] tokens indicate the start, separator and end of the sequence, respectively. After encoding the sequence with the language model, we use the representation of the [CLS] token as sequence representation, which is then fed to a two-layer feed-forward network. The loss is the binary cross entropy between the output scores and the ground-truth labels. As opposed to using only the output embeddings from the language model as features, we fine-tune the full ESM-2 language model during training.

The processing module uses a bidirectional long short-term memory (LSTM) recurrent neural network⁵², which is fed the peptide sequence as one-hot encoded amino acids, including its left and right flanks, corresponding to five amino acids on the N- and C-terminal ends of the peptide in its parent protein of origin. To allow the network to determine which of the amino acids belong to the peptide sequence and not to the flanks, we include a binary feature at each amino acid position. The output of the LSTM is a sequence of vectors that we pool into a single vector representation for the sequence by averaging along the sequence length dimension. Finally, we concatenate this feature vector with one from the binding module and feed it into a two-layer feed-forward network to produce the presentation score.

Model training

We train our models using the PyTorch framework on 4× A6000 graphic processing units for 75,000 steps with a total batch size of 256 (2 hours of total training time). The binding model is initialized using the pretrained ESM-2 8 million parameter model and all the weights are fine-tuned during training. The LSTM used to model flanking residues is randomly initialized at the start of training. Our models are trained using half mixed-precision (fp16) using the Adam optimizer, with a learning rate of 1 × 10⁻⁴ that is constant throughout training.

Peptide synthesis reagents

Fmoc-protected amino acids and synthesis resin, 2-chlorotrityl chloride, were purchased from Akaal Organics. Dimethylformamide (DMF), N-methyl pyrrolidone (NMP), acetonitrile and methyl tert-butyl ether (MTBE) were purchased from Fisher Bioreagents. 2-(6-Chloro-1-H-benzotriazole-1-yl)-1,1,3,3-tetramethylaminium hexafluorophosphate (HCTU) was purchased from AAPPTEC. Piperidine and dichloromethane (DCM) were from EMD-Millipore. Diisopropylethylamine (DIEA), N-methyl-morpholine (NMM), triisoprpopyl-silane, 3,6-dioxa-1,8-octanedithiol (DODT) and trifluoroacetic acid (TFA) were purchased from Sigma-Aldrich.

Peptide synthesis and analysis

Peptides were synthesized on an automated robotic peptide synthesizer (AAPPTEC, Model 396 Omega) by using Fmoc solid-phase chemistry⁵³ on 2-chlorotrityl chloride resin⁵⁴. The C-terminal amino acids were loaded using the respective Fmoc amino acids in the presence of DIEA. Unreacted sites on the resin were blocked using methanol, DIEA and DCM (15:5:80 v/v). Subsequent amino acids were coupled using optimized (to generate peptides containing more than 90% of the desired full-length peptides) cycles consisting of Fmoc removal (deprotection) with 25% piperidine in NMP followed by coupling of Fmoc amino acids using HCTU/NMM activation. Each deprotection or coupling was followed by several washes of the resin with DMF to remove excess reagents. After the peptides were assembled and the final Fmoc group removed, peptide resin was then washed with DMF, DCM and methanol three times each and air-dried. Peptides were cleaved from the solid support and deprotected using odour-free cocktail (TFA/triisopropyl silane/water/DODT; 94/2.5/2.5/1.0 v/v) for 2.5 h at room temperature⁵⁵. Peptides were precipitated using cold MTBE. The precipitate was washed 2 times in MTBE, dissolved in a solvent (0.1% TFA in 30% acetonitrile/70% water) followed by freeze-drying. Peptides were characterized by ultra-performance liquid chromatography and matrix assisted laser desorption/ionization mass spectrometry. All peptides were dissolved initially in 100% DMSO at a concentration of 40 mM, before dilution at the appropriate concentration in RPMI-1640 medium.

HLA-I–peptide stability assay

HLA-I–peptide stability assays were performed as previously described²⁷. In brief, 5 × 10⁴ Transporter Associated with Antigen Processing (TAP)-deficient monoallelic HLA-I-expressing 721.221 cells were incubated with 100 μM of peptide and 3 μg ml⁻¹ of β2-microglobulin (Sino Biological) in RPMI-1640 medium overnight at 26 °C/5% CO₂ for 18 h. Controls without peptide but with DMSO were performed in parallel. Following overnight incubation, cells were incubated at 37 °C/5% CO₂ for 2 h before staining with live/dead violet viability dye (Life Technologies) and pan-HLA-ABC-APC antibody (Clone W6/32, BioLegend, 1:100 dilution). HLA-I surface expression was analysed by flow cytometry.

Calculation of Shannon entropy

Shannon entropy (that is, sequence conservation scores) at each HLA anchor residue position was calculated using the following formula:

$$S=-{sum }_{i}^{n}Pleft({x}_{i}right)log_{2}Pleft({x}_{i}right),$$

where n is the number of unique amino acids at any given position, x_i is the ith unique amino acid, and P(x_i) is the probability of amino acid x_i at that given position, calculated by dividing the number of peptides with amino acid x_i at the position of interest by the total number of peptides in the dataset.

Study participants

Study participants were recruited from outpatient clinics at local Boston area clinics and from outside Boston. The Institutional Review Board of Massachusetts General Hospital approved the studies of cells derived from human blood samples. PBMCs from HIV⁺ individuals with viral loads below 2,000 copies either on or off suppressive anti-retroviral therapy were collected by Ficoll gradient separation from acid citrate dextrose tubes or leukapheresis samples. They were then cryopreserved and stored in liquid nitrogen for future use. High-resolution HLA-I typing was performed for all patients as described previously⁵⁶. In brief, locus-specific PCR primers were used to amplify polymorphic exons of HLA-A, HLA-B and HLA-C genes with the Fluidigm Access Array (Fluidigm). PCR amplicons were pooled and sequenced on an Illumina MiSeq platform (Illumina). HLA alleles and genotypes were called using the Omixon HLA Explore (beta version) software (Omixon). Ambiguous calls were resolved by Sanger sequencing.

Ex vivo IFNγ ELISpot assay

IFNγ ELISpot assays were performed according to the manufacturer’s instructions (Mabtech). PBMCs were incubated with individual peptides from EBV at a final concentration of 0.5 μg ml⁻¹ for 16–18 h. Positive controls were anti-human CD3 (Clone OKT3, BioLegend, 0.5 μg ml⁻¹) and anti-human CD28 (Clone CD28.2, BioLegend, 0.5 μg ml⁻¹) antibodies. The number of spot-forming units in the highest-value DMSO control well was subtracted from each experimental well. Responses were considered positive if both replicates had greater than or equal to five spot-forming units per well above background.

Ex vivo proliferation assay

PBMCs were suspended at 1 × 10⁶ cells per ml in 1 μM CellTrace Far Red dye (ThermoFisher) in PBS and incubated at 37 °C for 20 min. Cells were protected from light and mixed every 5 min during the incubation. RPMI supplemented with 10% fetal bovine serum was added to quench the reaction for 5 min, followed by centrifugation at 1,500 rpm for 3 min before resuspension in culture media. Cells were plated into 96-well U-bottom plates (Corning) (200,000 cells per well in 200 μl of culture media) and incubated with individual EBV peptides at a concentration of 0.5 μg ml⁻¹ for 5 days. Cells were washed with PBS supplemented with 2% fetal bovine serum and stained with anti-human CD3-BUV395 (Clone UCHT1, BD Biosciences, 1:100 dilution), anti-human CD4-PE-Cy7 (Clone OKT4, BioLegend, 1:100 dilution), anti-human CD8-BV605 (Clone SK1, BioLegend, 1:100 dilution) and live/dead violet viability dye (Life Technologies). Cells were washed and fixed in 2% paraformaldehyde and flow cytometric analysis was performed on a BD LSR Fortessa (BD Biosciences).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.