Network analyses of brain tumor multiomic data reveal pharmacological opportunities to alter cell state transitions

Introduction
Glioblastoma multiforme (GBM) is a highly aggressive brain cancer and little progress has been made in improving survival outcomes. Temozolomide (TMZ), a DNA alkylating agent, is the primary standard of care drug along with radiotherapy (RT) but has only resulted in a median survival of 15 months and a 5-year survival rate of 5%1,2. Further attempts at improving patient outcomes have been fruitless. Bevacizumab, a monoclonal antibody which targets VEGFA, was approved for the treatment of recurrent GBM (rGBM) but has failed to achieve significant improvements in overall survival (OS)3. Both erlotinib and gefitinib have failed to improve patient outcomes in GBM, despite being EGFR inhibitors designed for patients with mutated EGFR4,5. Immune checkpoint inhibitors have fared poorly as well, as both nivolumab and pembrolizumab were ineffective in patients with rGBM6,7. The reasons for treatment failure are multi-faceted and intratumoral heterogeneity (ITH) and cellular plasticity are two phenomena that are known to contribute to drug resistance8,9,10,11.
ITH in GBM has been explored extensively and multiple classifications have emerged for interpatient classification of GBM subtypes. Bulk transcriptomic analyses, for example, have been used to identify three main molecular subtypes of GBM across patients: proneural, mesenchymal, and classical12,13,14. Other works have focused on common genetic alterations, such as the EGFRvIII mutation, EGFR copy number amplification, TP53 deletion, and NF1 alterations11,15,16,17. Mutant isocitrate dehydrogenase 1 (mIDH1) has become a distinct marker of prolonged survival in GBM as patients with this mutation have a median survival of 31 months18,19. Additionally, the methylation status of O6-methylguanine-DNA-methyltransferase (MGMT) is a valuable prognostic for TMZ efficacy20,21. All these events may be associated with molecular subtypes and clinical outcomes11,15 and more work is needed to understand these relationships in detail.
More recent research has given insight into the epigenetic plasticity of GBM tumor cells and the inherent heterogeneity within a single tumor. A high amount of stemness increases plasticity of the tumor and is positively associated with tumor grade22. This plasticity is thought to drive tumor self-renewal and therapeutic resistance through the existence of multiple phenotypes or cell states and transitions23,24,25,26,27. A recent effort by Neftel and colleagues28 utilized single cell sequencing (scRNA-seq) to identify four primary cell states for GBM (Mesenchymal-like, Astrocyte-like, Neural-progenitor-like, and Oligodendrocyte-like) and showed how it is possible for these cell states to transition to one another. Another work29 also tracked changes in single cell populations of GBM cells in vitro in response to TMZ exposure, showing how drugs may induce transitions and alter the phenotypic landscape of GBM.
Since drugs can induce cell state transitions, it may be possible to predict and control these transitions if their mechanisms are understood. For example, one treatment strategy proposed by Prager et al.23 is to use drugs to shift the GBM cell population into one attractor or phenotypic state and dose with another drug which is highly effective against that particular phenotype. Indeed, controlling cell state transitions has been of interest since C.H. Waddington first proposed his landscape of epigenetics30. Rukhlenko et al.31 showed it is possible to develop models which can describe phenotypic changes and proposed mechanisms on how to control these changes. However, these mechanisms are not yet fully understood in GBM.
Our lab has previously explored how epigenetics and cell state transitions may result in TMZ resistance in GBM, and how it may be possible to overcome this resistance via cell state-directed (CSD) therapy32,33. In this work, we utilized a multi-step computational pipeline that incorporated multi-omics data from GBM patients to more broadly explore mechanisms for cell-state transitions and identify key genetic and epigenetic mediators. First, for each patient the dominant cell state defined by Neftel and colleagues was determined using single nucleus RNA sequencing (snRNA-seq) in a subpopulation or reference dataset and bulk RNASeq data from the remaining patients by deconvolution. Next, by incorporating phosphoproteomic data for all patients, PPINs were built for each of the four cell state groups. These four PPINs were subsequently combined to create a single Boolean network capable of describing the relationships between the four cell states. We then used in silico simulations to propose testable pharmacological interventions to either promote or prevent cell state transitions. Finally, the results from these simulations were used as input to a machine learning model capable of predicting the dominant cell state of another independent clinical GBM dataset. By combining simulation results and the predictive machine learning model, we propose plausible causal mechanisms of action of cell state transitions in GBM patients that may form the basis for pharmacological targeting and CSD therapy.
Results
Generating protein–protein interaction networks for each cell state
Although every GBM tumor has multiple cell states within it, most tumors have a dominant cell state that is highest in proportion. We hypothesized that given data for enough patients that are dominant for the same cell state, even if bulk data, we could extract sufficient information to propose protein–protein interaction networks that are enriched for cell-state specific features corresponding to this dominant cell state. As a first step towards exploring this hypothesis, a multi-step pipeline – see Fig. 1 – was used based on a suite of publicly available data (Wang data set, described next) that yielded unique PPINs for each GBM cell state (based on the Neftel definitions). The Wang data set includes 100 GBM patients consisting of a collection of multiomics measurements and clinical information from each patient. Our work utilizes 92 of these samples that are from wtIDH1 GBM patients (the remaining 8 were mutant IDH1 samples and excluded), and incorporates bulk RNAseq, phosphoproteomics, and single nucleus RNA sequencing (snRNAseq). The results of this pipeline are described below and the analytical details in the Methods section.

All omics data (red box) were accessed via the CPTAC portal (https://proteomics.cancer.gov/data-portal) and originally published by Wang et al.60. The software used in each step (packages/databases) are noted in each green box in parentheses. Seurat v434, CopyKat35, Bisque36, limma64, and Deseq263, were used for snRNA-seq analysis, deconvolution, phosphoproteomic analysis, and bulk RNAseq analysis, respectively. Activity inferences for transcription factors and kinases were estimated using DoRothEA, OmniPath, and DecoupleR, while prior knowledge was combined from three databases (STICHdb, ReCon3D, and OmniPath).
Single cell analysis
The Wang snRNAseq data was designated as the reference data to label cells based on the Neftel cell types. The 18 snRNAseq samples from wtIDH1 GBM patients were imported into the R packages Seurat34 and CopyKat35 and analyzed as described in the methods section. In brief, cells were analyzed for quality and clustered based on similarity of gene expression and resulting in forty clusters. These forty clusters were tested for enrichment (hypergeometric test) of gene modules for Mesenchymal-like (MES), Astrocyte-like (AC), Neural-progenitor-like (NPC), and Oligodendrocyte-like (OPC) cell states along with other healthy cell types such as astrocytes (Astro), neurons, oligodendrocytes (oligo). The enrichment of all cell clusters can be found in the supplementary material (Supplementary Fig. 1). CopyKat was used to analyze samples for copy number variations (CNVs). Cells were labeled with the most enriched gene module if the CNA analysis and the gene enrichment were both indicative of a malignant or non-malignant cell type. When CNA analysis and gene enrichment were at odds, cells were evaluated for chromosome 7 amplification and chromosome 10 loss. Cells which displayed these traits were considered malignant. All the clusters were visualized together via UMAP and CNV scores for chromosomes 7 and 10 are also displayed on separate UMAPs (Supplementary Fig. 2).
Deconvolution and patient classification
Once all the clusters were labeled with the most enriched cell state, each individual cell in each cluster is labeled with the same cell state as that cluster. It is therefore simple to determine the predominant cell state for those 18 patients for whom there are snRNA-seq samples available, as each cell is marked with the original patient IDs. However, there are an additional 74 patients in the Wang dataset which do not have snRNAseq samples but do have bulk RNAseq samples. To get an estimate of the cell composition of all patient samples, the R package Bisque was used to deconvolve bulk samples into individual cell type estimates (Fig. 2a)36. These analyses suggest that each patient predominantly expresses one of the four major cell types (MES, AC, NPC, OPC). We then grouped the patients together based on their most dominant cell state. Figure 2b shows the results of the patient classification with most patients predominantly expressing either the MES-like or NPC-like cell state.

a The deconvolution results of the bulk RNAseq data from Wang60. Rows are the four cell states and columns are the patient IDs (as used in Wang). Patients were subsequently binned into four groups based on the predominant cell state. b The total number of patients binned into each of the four cell states.
Cell-state specific and integrated networks
Patients were divided into the four cell state groups based on their predominant cell state following the deconvolution analysis. Four PPINs were constructed using these patient subgroups and the COSMOS pipeline (see methods section)37. In brief, bulk RNAseq and phosphoproteomics for each subpopulation were analyzed for differentially expressed entities. The activity of both transcription factors and kinases were estimated using the WMEAN algorithm from the decoupleR R package38. These activity measurements are then used to identify the most likely interactions among all proteins and create the PPINs37. On average, the networks have 65 nodes and 122 edges (S4a–d).
We were curious as to how similar or different the PPINs were to each other. Using the 50 hallmark pathways from the Human Molecular Signatures Database (MSigDB) as a reference, we found that all the cell states included a high number of G2M checkpoint pathway proteins (Table 1)39. Perhaps surprisingly, all cell state PPINs contain many genes from the allograft rejection pathway. However, this pathway contains genes from common growth and proliferative pathways including EGFR and AKT1, which may explain its presence in the networks. Additionally, the AC, MES, and NPC cell states contained many proteins from the PI3K/AKT signaling pathway, but the OPC state did not. The AC state contained the least number of proteins from the TNFA/NFKB signaling pathway. The OPC and NPC states are upregulated for the interferon gamma and JAK/STAT pathways, but the MES and AC states are not. These four networks were subsequently merged to form the final PPIN (S4, e). All unique interactions across the four networks are included for a total of 144 proteins and 380 interactions.
Steady-state analysis
The final PPIN was converted into a Boolean network to enable simulations to steady-state and to perform in silico protein knockout simulations (next section). In this construction, each node may be in one of two states at a given time: “ON” (1) or “OFF” (0). A critical component of the conversion from a static PPIN to a Boolean model is generating the logical equations of the network. To accomplish this task, we aimed to match the steady-state of each node in the network to its state as defined by the COSMOS PPIN. In other words, when run to steady-state each node in the model should be either active or inactive. An active node is equivalent to an upregulated node in COSMOS, while an inactive node is equivalent to a downregulated node. By conducting this analysis for all four cell states, we can effectively define the logical rules that govern the network. MaBoSS (Version 2.0)40 was used to perform the steady-state simulations and create logical rules capable of achieving this goal. The full list of rules can be found in the supplemental material (Supplementary Table 1).
We then sought to compare the four different steady-states (one for each cell state) of the Boolean network. Using MaBoSS, the network was run to steady-state starting from each of the predominant cell states (four times) and the results were compared (Fig. 3b). A similarity score was defined as the percentage of nodes in the same “ON” or “OFF” state as at time zero. Notably, the MES and AC state were the most similar indicating greater overlapping activity. Interestingly, there are some differences in steady-state similarity depending on which cell state is the initial starting point, indicating non-linear responses within the network.

a Workflow for the Boolean Simulations. b The similarity between each cell state at steady-state. Similarity is defined as the percentage of nodes in the same “ON” or “OFF” state at steady-state compared to its initial state in the Boolean network. c Z-normalized Hamming Distances of selected knockout simulations conducted with the Boolean Network. Positive scores (red) indicate a knockout shifted the network toward that cell state, while negative scores (blue) indicate a shift away from a cell state. Each row represents an individual protein KO, and each column describes the changes with respect to each cell state. Each heatmap represents the results from starting from each of the cell states. Starting at the top left and going clockwise: MES, NPC, OPC, AC.
In silico knockout simulations
Starting from the steady-state of each predominant cell state, nodes were knocked out one at a time and the resulting state was compared to the original steady-state of the network. This was conducted for each active node in each cell state. The Hamming distance was calculated for each simulation and the z-normalized distances are shown in Fig. 3c for the 10 protein knockouts which caused the largest changes in each phenotype. Notably, knocking out TP53 or PTEN (two commonly mutated proteins in GBM41,42) drives the network toward the MES state and may be a key cause of resistance to TMZ and RT14,43,44. Interestingly, STAT3 knockout drives the network away from the MES state, toward the AC state. In addition to assessing how individual proteins altered cell state transitions, we examined how sets of proteins influenced transitions. Specifically, we mapped proteins to their hallmark pathways from MSigDB, and determined certain functional groups of proteins either promoting or preventing cell state transitions. For example, the transcription factors E2F1 and E2F4 are predicted to be critical for maintaining the NPC-like state and display similar effects when knocked out. Utilizing the same hallmark pathways from MSigDB as before, we found that both E2F1 and E2F4 knockouts primarily affect model proteins in the G2M pathway of the cell cycle. These effects were present when knocking out both proteins from the AC and NPC-like state. We also found other patterns of knockout effects. In the MES state, knockout of one of three MAP kinases (MAP2K1, MAP3K7, and MAPK8) resulted in decreased activity in the TNFA signaling pathway, the TGF beta signaling pathway, and the PI3K signaling pathway. On the contrary, knockout of both PTEN and TP53 from the OPC state increased signaling of the TNFA pathway. Together these results depict the importance of MAP kinases in the activity of the MES state and how prominent inhibitory singling proteins may affect this activity.
Our simulations indicate that even just single knockouts of certain proteins may cause shifts in cell states suggesting that these phenotypes are relatively unstable. In particular, the NPC state was the most susceptible to the protein knockouts. Of the 146 KO simulations we conducted, 100 (68%) caused a shift away from the NPC state. Comparatively, shifts away from the MES, OPC, and AC states occurred in 61%, 50%, and 53% of the simulations, respectively. Many of these shifts may not be a “complete” transition from one state to another but may suggest the existence of hybrid states which have been observed before28. To further understand if there may be differences in stability of the cell states, we used an independent dataset from the Glioma Longitudinal Analysis (GLASS) Consortium, which contains primary and recurrent glioma tumors including clinical characteristics, bulk RNAseq and other omics data. Within this dataset, there are 95 patients with measurements from both a primary (untreated) and recurrent tumor. We determined the predominant cell state in all these tumors and found that 39 of the 95 patients underwent a transition in predominant cell state upon receiving therapy (41%). Additionally, of the 33 patients whose primary tumor state was NPC, 18 (55%) underwent a cell state transition during therapy. Comparatively, of the 56 patients whose primary tumor state was MES, 17 (30%) underwent a cell state transition during therapy. Using the hypergeometric test, we found that the GLASS dataset was significantly enriched for patients transitioning away from the NPC state (p = 0.04), but not enriched for patients transitioning away from the MES state. Given the small sample size of both the AC and OPC state, we did not analyze these patients further. Altogether, the data suggest that the NPC state is less stable than other cell states, but partial or hybrid cell states exist within the GBM patient population.
Similarly, our simulations also point to single knockouts shifting a cell toward two different states. Given the existence of hybrid states and the overlap in pathway activity between the cell states, this is expected behavior. These shifts toward more than one state provide evidence of shared hubs between two or more states. For example, E2F1 and E2F4 activity is most indicative of the NPC-state. However, the AC state also displays high E2F1 and E2F4 activity and knockout of either of these proteins will facilitate transitions away from both these states. On the contrary, neither the MES nor OPC are as reliant on E2F1 or E2F4 activity and knockout of these proteins will cause changes towards these two states in the network.
Machine learning models predict cell state of GBM Patients
Given that our Boolean model was constructed from patient data (i.e., Wang), we sought to determine if our model simulations are representative of other patient samples as well. Our simulations suggest that many different perturbations could cause a change in cell state, but it remains unclear which of these perturbations, if any, may occur in patients. We developed a workflow (Fig. 4) to help clarify these two questions. Using the KO model simulations from the Wang data as input (Fig. 3c), we trained four different machine learning models, namely: multinomial logistic regression, K-nearest neighbor (KNN), random forest (RF), and boosted trees (XGBoost). The GLASS database of primary and recurrent glioma tumors is an ideal independent dataset to validate that our model predictions are accurately capturing clinical cell states. We used bulk RNAseq data from 201 patients. Just as with the Wang dataset, only wtIDH1 samples were considered. The data was processed to match the format of the Boolean model. Briefly, bulk RNAseq were deconvoluted to determine the predominant cell state of each patient. Differentially expressed genes were identified for each sample and used to estimate transcription factor activity. In lieu of phosphoproteomic data, we derived a relationship between RNAseq and phosphoproteomic activity measurements using the Wang dataset as reference (see Methods). The activity measurements are then converted to Boolean values. These values are the input to the ML models as our test cases. We tested the models’ abilities to correctly predict the predominant cell state of the GLASS dataset. Primary and recurrent tumors were tested independently, and additional datasets adding noise into the activity measurements were also tested.

Protein KO simulations are used to train machine learning models to predict cell state. RNAseq data from the GLASS database were used to determine the predominant cell state for each GLASS sample. These data were compared to healthy brain tissue to determine differentially expressed genes and estimate protein activity. These protein activities are booleanized to match the format of the Boolean simulations and become the test cases for machine learning performance.
Most GLASS samples were identified as either the MES or NPC cell state indicating that the dataset has substantial class imbalance. We evaluated our models using two objective metrics known to still be effective when there are massive class imbalances in the dataset: Matthew’s Correlation Coefficient (MCC) and weighted F-1 scores. MCC is bound between negative one and positive one, while F1 scores are bound between zero and 1. In both cases, a value closer to one indicates a better performing model. XGBoost performed the best out of all the machine learning models for every dataset (Fig. 5a). We also evaluated the performance of each model by breaking down the datasets into each individual phenotype. Using sensitivity, specificity and balanced accuracy, we found that the MES and NPC cell states were well predicted in both the primary and recurrent tumors, but the AC and OPC cell states were not well predicted. All the results from the four different ML models can be found in the supplemental material (Supplementary Fig. 4).

a Performance of Difference ML Models. The table shows the performance of each of the four machine learning models built for classification as scored by Matthew’s Correlation Coefficient (MCC) and by the weighted F-1 Score. Red scores are for the primary tumor dataset, while blue scores are for the recurrent tumor dataset. b The Mean contribution of each Feature to all Cell State Predictions from XGBoost. Shapley values were calculated for each of the 201 tumors and the average score is shown here. Each row represents the individual gene (feature) contribution to each of the 4 cell states and shown as a sum. c Shapley Plots of a Primary and Recurrent Tumor of One Patient. Each primary and recurrent (22 months later) tumor has four Shapley plots, one for each of the four cell states. Each plot depicts how the most critical proteins affect the prediction of each cell state from the machine learning model. For this patient, the primary tumor was correctly identified as an NPC-like tumor (2nd row, left) with CDK2 the most contributing positive feature, while the recurrent tumor was correctly identified as an MES-like tumor (3rd row, right) with HGF the most contributing positive feature.
Given the MES and NPC cell states are the two most common predominant states, our model performed well overall. One possible explanation for being unable to predict the AC and OPC cell states is the heavy class imbalance within the dataset. However, the model was trained with knockout simulations which were far more balanced than the clinical data. Furthermore, Fig. 5b indicates that the state of the androgen receptor (AR) and transcription factor RFX1 are important for predicting the AC and OPC states respectively. In our Boolean simulations of the AC state, AR is consistently inactive. However, AR is active in all our AC GLASS samples, indicating a discrepancy between model simulations and our independent GLASS data set. A similar phenomenon is observed for the OPC state. RFX1 is universally inactive in our OPC Boolean simulations, but active in all but one of the GLASS OPC samples. These results indicate that the inability of our XGBoost model to predict the OPC and AC cell states stems from the Boolean model rather than being an artifact of the class imbalance. More work is needed to understand how the model may be misinformed regarding these two proteins.
XGBoost interpretations provide mechanistic hypotheses for cell-state transitions
Historically, ensemble methods of decision trees have been difficult to interpret. However recent advances in machine learning techniques have opened the door for interpreting these traditionally “black box” models. One such method is Shapley Additive Explanations (SHAP) which uses game theory to understand the contribution of each feature to the final model predictions (Fig. 5b)45. While feature contributions for all states are calculated, we focused on the MES and NPC states as our model predicted these cell states well. Importantly, CDK2, HGF, JUNB, IRF1, FGF2, and TGFB1 are identified as important predictors of the MES and NPC states.
One additional benefit of the Shapley approach is that each individual patient sample can be investigated for which features contributed to the predicted cell state. Specifically, the method could indicate what features (genes) probabilistically are involved in a predicted cell-state transition (if a transition occurs). To illustrate this approach, we will use one example from a specific patient in the GLASS dataset who had both a primary and a recurrent tumor sample. As seen in Fig. 5c, this patient’s primary tumor was correctly classified as a predominately NPC-like tumor. In this case, CDK2 and IRF1 being in the active state were the largest contributors to correctly identifying the cell state of the tumor as NPC. On the other hand, HGF being in the inactive state was a major contributor to why XGBoost did not choose the MES state as the predominant state. Subsequently, this patient underwent standard of care therapy (TMZ and RT), and another tumor sample was taken 22 months later (Fig. 5c). In this tumor, we again correctly identified the predominant cell state, MES. Using the SHAP values we can see what key proteins changed state over the course of therapy, and at the very least were associated with the shift from NPC to MES. In this case, both CDK2 and HGF changed states. CDK2 became less active while HGF became more active indicating a switch to a MES-like tumor.
We found that the activity patterns of HGF and CDK2 were consistent throughout the GLASS dataset. Of the 95 fully matched samples within GLASS (in this case meaning both a primary and recurrent measurement are available), 14 of these patients underwent a transition from an NPC-like tumor to a MES-like tumor. In all these cases, HGF switched from inactive to active and CDK2 switched from active to inactive, as predicted from our analysis. These changes were significant by the hypergeometric test (p < 0.01) (Supplementary Fig. 5a). There were an additional 12 patients of the 95 who underwent a transition from an MES-like tumor to an NPC-like tumor. In these cases, HGF switched from active to inactive in 9 of the 12 patients, while CDK2 switched from inactive to active in every patient. These changes were also predicted by our analysis and significant by the hypergeometric test (p < 0.01) Supplementary Fig. 5b.
Combining SHAP interpretations with knock out simulation results
The protein knockout simulations provided a plethora of potential perturbations which could cause a cell-state transition, but did not provide information on which of these transitions may be most likely to occur in a clinical setting. On the other hand, our machine learning model was able to correctly identify the key proteins that switched state over the course of therapy, but do not consider the perturbations or dynamics that may have led to these changes. We wanted to combine these two pieces of information to create causal hypotheses of cell-state transitions. We have already shown that high CDK2 and HGF activity are positively correlated to the NPC and MES state, respectively, and that the inactivation of CDK2 and activation of HGF in response to standard of care are clincially relevant changes indicative of a cell-state transition from the NPC to the MES state (Fig. 5c). However, the question still remains as to how CDK2 and HGF become inactive and active. In other words, what are the transcriptional or signaling changes that are capable of bringing about significant decreases in CDK2 activity or increases in HGF activity? We can use our protein KO simulations from the NPC state to search for answers to this question. Looking at Fig. 3c, we can observe that many KOs from the NPC may shift the cell state toward the MES state. However, by focusing on the HGF pathway we found that a KO of transcription factor TFAP2A is capable of causing HGF activation. TFAP2A activates PTEN, which is a known inhibitor of HGF and HGF signaling Fig. 6). We thus can connect TFAP2A silencing with HGF actviation and and a transition from the NPC-like to the MES-like state. Another study has corroborated this mechanism of action, showing that TFAP2A transcriptional binding sites are hypermethylated in recurrent GBM tumors following treatment from standard of care46. Threrefore, the epigenetic silencing of TFAP2A is a plausible hypothesis for cell states transitions from the NPC to the MES state in GBM. Decitabine, a hypomethylating agent, has been shown to increase TFAP2A expression/activity in GBM and increase PD-L1 expression47. The effect of decitabine on PTEN expression GBM has not been observed, however it has increased PTEN expression in other cell lines48,49. Increasing PTEN activity with decitabine treatment may help prevent the MES state and be an effective method of CSD therapy.

The bolded proteins show the pathway involved in causing a cell-state transition from NPC to MES, as predicted by our knockout simulations and XGBoost predictions. Kinases are depicted by pink rectangles and transcription factors are depicted as blue triangles. Activating/stimulating interactions are depicted by arrows and inactivating/inhibitory interactions are depicted with T-bars.
Discussion
There remains a critical need to improve the treatment of GBM. Due to the high degree of heterogeneity and plasticity of GBM cells, controlling cell state transitions may be a viable treatment strategy, and previously referred to as CSD therapy33. There have been a few different classifications proposed to describe the cell states of GBM12,13,50. These classifications utilize different types of omics data. We chose the Neftel et al. classifications as these phenotypes are defined at the single cell level and are robust across multiple different datasets51,52,53. While other investigations have explored the transcriptional differences between GBM cell states, this is the first work to incorporate phosphoproteomic measurements into PPINs for a relatively large number of GBM patients (92). The combined PPIN represented four GBM cell states considered relevant to maintenance of tumor growth and treatment resistance. The combination of phosphoproteomics with RNAseq data provides not only a more robust network that connects gene regulatory networks and downstream signaling pathways but also possible protein targets amenable to pharmacological intervention. This network model could be a foundation to implement CSD therapy to either promote or prevent desired cell state transitions33. To that end, a Boolean network model derived from the combined PPIN was constructed and used to conduct in-silico protein knockout simulations that serve as proxies for pharmacological intervention. Finally, our machine learning model was able to identify the relationships between the activity of specific proteins and cell states to predict phenotypes in clinical samples. Connecting these predictions with our simulations provided a more complete robust picture of cell state transitions and more nuanced hypotheses.
Although the Boolean modeling approach is advantageous for large-scale pharmacodynamic simulations, it can be appreciated that converting the current model into a biochemical mechanistic model may be rewarding. Nonetheless, the current analyses have generated viable hypotheses from large multi-omics GBM patient datasets. The model simulations and ensuing machine learning analyses led to testable hypotheses of how pharmacological intervention of protein targets could prevent or promote a cell state transition. It will be important to associate these or other quantifiable cell states with aggressive cell proliferation and treatment resistance to fully implement CSD therapy.
There are limitations to this work. First, the PPIN model only includes intra-cellular signaling networks whilst it is known that other cell types exist in the tumor microenvironment including immune cells that affect the phenotypic behavior of cancer cells54,55,56,57. Such cell-cell interactions are ripe for future exploration and the multi-omics pipeline could be adapted to develop GBM cell and microenvironment interactions and ultimately a revised PPIN58. A second limitation is that the model was built with data mostly reliant on bulk tumor samples representing a conglomerate of cell populations. Given what is known about ITH in GBM and shown in this work, it is certain that no one tumor will ever express only one phenotype and single cell multi-omics datasets may more accurately capture how a population of cells in a single tumor may respond to perturbations. Of course, at the same time, single cell phosphoproteomic measurements are not yet readily available. Another limitation to this work is the lack of experimental data to validate the model predictions, i.e., could a gene or protein knockout or pharmacological inhibition of a target protein prevent or promote the model-predicted transition? Collecting time-course RNA sequencing and/or proteomic measurements from GBM cells either in the presence of drugs (inhibitors of proposed cell state drivers) or with proteins knocked out, would serve as an appropriate next step. Indeed, some of the proteins our model has predicted to be critical for cell state transitions have been previously identified as critical for certain phenotypes. For example, STAT3 activity has been associated with the mesenchymal phenotype59. The machine learning section of our work highlighted the difficulty of our model in predicting the AC and OPC states of GBM. In both the Wang and the GLASS datasets, only a small number of patients were classified in these cell states. Certainly, the small sample sizes contributed to the inability to effectively describe these patients in the clinical GLASS sample. However, as we previously noted, there were differences in activity predictions of certain key proteins. Future work may focus on adjusting our Boolean logic to account for these differences, so our simulations better match the observations from the clinical data set.
In conclusion, we built a Boolean model from a multi-omics dataset which permitted the definition of the four Suva cell states, previously identified in GBM patients. The ensuing model simulations supported by machine learning analyses of an independent GBM patient dataset provided hypotheses of the key drivers of cell state transitions and lay the groundwork for future experimental validation in vitro as well as the expanding the scope of the model to include the effect of other cell types in the microenvironment.
Methods
Data sources
Publicly available snRNA-seq, bulk RNA-seq, and phosphoproteomic data from healthy brain and GBM tissue were downloaded from the CPTAC portal (https://proteomics.cancer.gov/data-portal) as originally published in Wang et al.60. These data – referred to as the Wang dataset – were used in different analysis streams that culminated in a protein–protein interaction network (PPIN) for each of the four Neftel cell states or phenotypes. Each analysis stream is described in detail below. Bulk RNAseq and clinical data were downloaded from the GLASS Consortium (http://www.synapse.org/glass)61. These data were used as a test set for machine learning algorithms. Human hallmark gene sets were downloaded from the MSigDB for the hallmark pathway analyses. (https://www.gsea-msigdb.org/gsea/msigdb/index.jsp).
Single cell preprocessing
Eighteen snRNA samples (filtered feature matrices) from wild type (WT) IDH1 GBM patients as part of the Wang dataset, were imported into R via the Seurat package (v4)34 for analysis. Cells in which <200 genes were identified and cells which express >10,000 genes were excluded from the analysis. Any cell with mitochondrial gene expression exceeding 10% was excluded as these cells were likely apoptotic. Genes that are expressed in <10 cells across all samples were filtered out due to low expression across the population. Cells were then split into immune and non-immune cells using CD45 (pan-leukocyte marker) expression as a biomarker. Non-immune cells were further analyzed in Seurat, while immune cells were excluded, as we are primarily concerned with cancer cell phenotype. Of the 222,567 initial cells, 27.8% (61,897) were identified as immune cells. The distribution of CD45 in immune cells can be found in the supplement (Supplementary Fig. 2b).
Data transformation
The remaining feature counts were divided by the total feature counts in each cell and multiplied by a scale factor of 10,000. The result was transformed with the natural log +1.
Variable features
The 3000 most variable genes were identified with the variance stabilization transformation (VST). VST uses polynomial regression to fit a curve to the relationship of the log of the variance and mean of the dataset. The gene values were then standardized using the fitted line. Gene variance was estimated by using the standardized curve and any values exceeding the maximum (set to the square root of the number of cells) is set to this maximum. This approach helps account for the mean-variance relationship that exists in snRNA-seq data and helps to improve the ranking of the most highly variable genes.
The 3000 most variable genes are then linear transformed in the following manner:
-
1.
The average expression of each gene between all cells is 0.
-
2.
The variance of each gene between all cells is 1.
Principal Component Analysis (PCA)
The 3000 scaled genes are used to reduce the dimension of the dataset via principal component analysis, and the first 12 principal components were used. After the 12th principal component, the variance between principal components is <0.1% which indicates subsequent PCs add limited information while increasing the complexity of downstream analysis. This is shown in an elbow plot and included in the supplemental Fig. 2.
Cell clustering
Using the 12 principal components, a K-nearest neighbor (KNN) graph is built by calculating the Euclidean distance and Jaccard similarity. Cells are then clustered using the Louvain algorithm and a resolution of 0.5. 40 clusters were identified in this analysis. We found that even at lower resolutions, obtaining just one cluster of each cell type (including non-malignant cells) was not possible. We chose a resolution of 0.5 as it gave the most robust results in the subsequent downstream analysis, and is consistent with what is recommended when using the Louvain algorithm for single cell data34. The Wilcoxon rank sum test was used to find differentially expressed genes amongst the clusters.
Analysis of copy number alterations
Single cell copy number alterations (CNAs) were analyzed using CopyKat with default parameters35. CopyKat uses a Bayesian approach to identify cells with aneuploidy, a hallmark of malignant cells. Each patient sample was analyzed individually to avoid batch effects between samples.
Cell labeling
Clusters were labeled using a combination of CNAs and gene expression meta-modules as defined by Neftel et al.28 and Wang et al.60 for the following cell types/states: MES-like, AC-like, NPC-like, OPC-like, neuron, astrocytes, oligodendrocytes, and stromal cells. Cells displaying both aneuploidy and gene expression consistent with a malignant cell type were labeled with one of the four malignant cell states. Conversely, cells which exhibit diploidy and gene expression consistent with a non-malignant cell type were labeled as such. In cases where there was a discordance between the CNAs and gene expression methods, samples were analyzed for two specific copy number alterations which are known hallmarks of GBM: (1) Chromosome 7 amplification and (2) Chromosome 10 loss/deletion28,62. Cells with these two alterations were classified as malignant, while cells without these alterations were classified as non-malignant.
Deconvolution
The Bisque36 package in R was used to deconvolve 74 bulk RNAseq samples from wtIDH1 patients and estimate cell type compositions. Read counts were normalized by counts-per-million (CPM) and relative abundances were calculated by averaging gene CPM within each cell type. A linear relationship was then defined between each gene’s single cell expression and bulk expression. Single cell expressions were taken from the eighteen patient samples previously analyzed. In this manner, cell type compositions can be estimated for both the eighteen patients with snRNA-seq data and the 74 patients with bulk RNAseq data thereby yielding a total of 92 patients.
Patient classification
Based on the proportional cell type composition estimates, each patient was classified into a dominate cell state–- one of four phenotypes as previously defined by Neftel et al.28: Mesenchymal-like (MES-like), astrocyte-like (AC-like), Neural-progenitor-like (NPC-like), and Oligodendrocyte-progenitor-like (OPC-like).
COSMOS PIPELINE
Causal Oriented Search of Multi-Omics Space (COSMOS) is a computational package utilizing multi-omics data to generate mechanistic hypotheses relevant to disease, including cancer37. The COSMOS pipeline outputs a directed PPIN as described below. The PPIN includes details on whether each protein node, including transcription factors, is up- or down–regulated and whether each interaction or edge between proteins is activating or inhibiting. Each of the four cell states was analyzed separately with this pipeline, utilizing Bulk RNAseq and phosphoproteins.
Bulk RNAseq analysis
RNAseq data were filtered for low expressed genes; any gene with read counts <50 was excluded from the data set. Genes were normalized using the median of ratios. DESeq263 was used to analyze remaining genes and a differential analysis was conducted between genes from healthy brain and genes from each of the four phenotypes. A Benjamini-Hochberg correction was applied and any gene with an adjusted p < 0.05 is considered differentially expressed.
Phosphoproteomic analysis
Phosphoproteins which were expressed in less than 25% of samples were excluded from the analysis. A differential analysis was performed using limma64. A Benjamini-Hochberg correction was applied and any phosphoprotein with an adjusted p < 0.05 is differentially expressed.
Footprint based activity estimation
The DoRothEA R package65 was used to download transcription factor-gene interactions. TF-gene relationships are scored from A-E based on the available evidence in literature, with A being the highest score. Interactions categorized with confidence levels of A-C were included in the analysis which includes all TF-gene relationships that are defined as high confidence (A), likely confidence (B), or medium confidence (C). High confidence interactions include interactions that are supported by four independent sources of evidence (curated databases, ChIP-seq experiments, transcription factor binding sites (TFBS), and genotype-tissue expression (GTEx). Likely confidence interactions include three of these lines of evidence, while medium confidence interactions include two. Protein–protein interactions were downloaded using OmniPath66. The WMEAN algorithm from the decoupleR R package38 was used to calculate enrichment scores and activity scores for each transcription factor and protein.
Prior Knowledge Network (PKN)
In concordance with COSMOS37, the prior knowledge network was constructed from three databases: STICHdb (http://stitch.embl.de/), OmniPath66, and ReCon3D67. STICHdb and Recon3D primarily contain interactions between metabolites and enzymes while OmniPath contains protein–protein interactions. Since we do not utilize metabolomic data in this study, most of the interactions in our networks come from OmniPath. The PKN includes a total of 76,638 interactions.
COSMOS
The PKN and activity scores for both proteins and transcription factors are input into COSMOS. COSMOS uses CARNIVAL68 to convert the data into an integer linear programming (ILP) problem and the IBM CPLEX ILP solver (https://www.ibm.com/products/ilog-cplex-optimization-studio/cplex-optimizer) was used to optimize the objective function. COSMOS was run in both forward and backward mode for each of the four phenotypes and the two runs were combined to generate the final PPI network for each of the four cell states.
Boolean network construction
All four COSMOS networks were then combined into one final PPIN that was used to derive a Boolean logic model. The networks were combined by incorporating all unique interactions among the four networks. To elucidate the logical formalisms (e.g., AND vs OR gates) of the final network, MaBoSS (Version 2.0)40 was used to conduct steady-state simulations. For each phenotype, input nodes (species in the network without any upstream effectors) were set to either “ON” (1) or “OFF” (0) depending on their state for each respective phenotype. Proteins which are upregulated according to COSMOS are set to “ON”, while proteins which are downregulated are set to “OFF”. The model was then run to steady-state. For a logical formalism to be considered correct, the steady-state of the simulation must fully match the desired phenotype as described by the COSMOS PPI networks. In other words, each individual node must be correctly in the “ON” or “OFF” state, as defined by our COSMOS analysis.
Knockout simulations
Knockout simulations were performed in MaBoSS utilizing the Boolean model. Starting at steady-state, each node which is in the “ON” state, is turned “OFF” one simulation at a time and the model is run until a new steady-state is reached. These simulations were performed from the steady-state of all four cell states or phenotypes resulting in 146 total simulations. Each simulation was evaluated for changes as compared to the baseline steady-state (Hamming Distance) for each phenotype and z-normalized. Each new steady-state resulting from the knockouts was classified into one of the four cell states by calculating the overall similarity between itself and the original COSMOS cell states.
Machine learning model training
Results from the knockout simulations were implemented as the training set to a machine learning model capable of predicting the predominant phenotype in a clinical cohort. Each simulation can be considered one input sample of the larger training set. Each simulation contains the state of each protein (either on (1) or off (0)) in the Boolean Model. The state of each protein is thus a feature that may be predictive of any of the four cell states. We aim to build a machine learning model that can take the state of each protein in the Boolean model and predict the cell states of an independent data set. We trained and tested four different machine learning algorithms: multinomial regression, k-nearest neighbor (KNN), random forest (RF), and XGBoost. For multinomial regression, elastic net regularization was applied with an alpha = 0.14, which was optimized with a 10-fold cross validation on the training data69. For KNN, a 5-fold cross validation was used with recursive feature elimination to determine the optimal number of features (80) and neighbors (13). For both RF and XGBoost, the Boruta feature selection algorithm was used to select the most important features with a maximum runs of 10070. The Random Forest package in R was used for the RF model with the following parameters changed from default: ntree (number of trees) = 1000 and mtry (number of features sampled at each tree split) = 35. XGBoost was implemented in Python with the following hyperparameters changed from default: n_estimators (number of trees) = 100, learning_rate = 0.1, max_depth (tree depth) = 1, and subsample (proportion of samples used per tree) = 0.1.
Machine learning model testing and interpretation
Bulk RNAseq from 201 wtIDH1 tumors from the GLASS Consortium were used to test model performance. These samples included 107 primary (untreated tumors) and 114 recurrent tumors. 95 samples were from patients that had both primary and recurrent tumors. Using the same deconvolution method as previously described, each sample was classified as one of the four cell states. Next, log-fold changes for each sample were calculated as compared to healthy brain tissue. Footprint based activity measurements were estimated in the same manner as before using the DoRothEA R package. Since phosphoproteomic data was not available for the GLASS samples, we derived a simple ratio/relationship between phosphoproteomic activity and RNAseq activity estimates using the Wang dataset:
In this way, we obtained full activity estimates for the GLASS dataset. These activity estimates were then converted to Boolean values (positive activity = 1, negative activity = 0) to match the simulation data and to test the four machine learning models. The TreeSHAP method for calculating Shapley values from decision tree models was used with default parameters for model interpretation71.
Responses