An Evaluation Benchmark for Adverse Drug Event Prediction from Clinical Trial Results

Background & Summary
The development of pharmaceuticals faces numerous challenges, particularly the high incidence of adverse drug events (ADEs), which significantly contribute to the discontinuation of drug candidates1. ADEs are injuries resulting from medical intervention related to a drug, including those caused by the drug’s pharmacological properties, improper dosage, or interactions with other medications, whether from appropriate use or misuse2. Data show that about 96% of drug candidates do not receive market approval, underscoring the inefficiencies and financial risks in drug development3. The average investment to bring a new drug to market is estimated at $1.3 billion, with costs for specific drugs varying widely depending on the therapeutic area4. A recent analysis shows that safety concerns are responsible for 17% of clinical trial (CT) failures1, underscoring the critical need for improved predictive methods for managing ADEs. Such failures not only present substantial financial risks to pharmaceutical companies but also raise ethical issues, especially considering the human costs associated with ADEs during CTs3,5. Drug candidates deemed safe in preclinical stages can exhibit toxic effects in clinical phases, leading to their failure. A notable factor contributing to this problem is the discrepancy between animal models used in preclinical screenings and human physiological reactions, indicating a significant gap in translating preclinical safety data to human contexts, which can result in severe ADEs, including fatalities3,5,6,7. In this context, in-silico models emerge as a promising approach for a safer and more accurate prediction of ADEs, potentially minimizing the differences observed between preclinical and clinical outcomes in pharmaceutical research and development.
Recent advancements in artificial intelligence and machine learning have drawn interest in this area, with research now focused on these technologies to complement existing methods in forecasting ADEs8,9,10,11,12,13,14,15,16,17,18,19. Early research efforts were centered on particular use cases, such as specific medications8,9,10,11 and organ systems or routes of administration12,13,14. These methods have provided good explainability but have a limited range of applicability. To overcome these limitations, machine learning models that consider the molecular structure of drugs have been proposed15,16,17. These models work with the chemical space of drugs and are meant to enable predictions across a larger and more diverse set of compounds20. Drugs are encoded in standard representations such as SMILES21, SELFIES22, and molecular descriptors23, and are associated with ADEs, such as those reported in public registries. Despite their sophistication, they often struggle to significantly outperform simpler approaches.
Existing benchmark datasets such as SIDER24, AEOLUS25, and OFFSIDES26 have been used to analyze and predict drug-ADE associations using data-driven approaches. SIDER is a dataset comprising 1,430 unique drugs that compile ADEs reported in public documents and package inserts. It is designed through automated text mining and manual curation to link drugs with their reported ADEs. AEOLUS comprises 4,245 unique drugs and is derived from the FDA’s adverse event reporting system (FAERS) (https://www.fda.gov/), standardizing ADE reports to facilitate analysis. This dataset focuses on post-marketing surveillance, offering a broad view of ADEs collected in real-world settings. OFFSIDES, a dataset composed of 1,332 unique drugs, identifies overlooked ADEs by analyzing data from FAERS, focusing on ADEs not listed on the official drug labels. Despite their significant contributions, these datasets are limited to approved treatment regimens and lack information from controlled environments. Specifically, they do not always account for the total number of patients treated, the precise proportion of those who experienced ADEs, or detailed patient characteristics and treatment regimens, altogether. Furthermore, no comparative cases exist where identical drugs are used under different conditions. Still, it is known that various contextual factors such as demographics, medical history, drug dosage, body weight, alcohol consumption, ethnicity, smoking habits, and pre-existing conditions influence the occurrence of ADEs27.
To address these limitations, we developed CT-ADE28, a comprehensive dataset that uniquely integrates five features not collectively available in existing resources: i) Patient data, encompassing information such as demographics, pathologies, and allergies, enabling the study of population-specific ADE risks; ii) Treatment regimen data, detailing information such as dosage, route, duration, and frequency of administration to improve regimen-specific predictions; iii) Complete enumeration (census) of ADE outcomes, systematically capturing all positive and negative cases within the study population, unlike voluntary reporting systems; iv) Controlled monotherapy data, derived from clinician-controlled trials that ensure strict adherence to treatment regimens while eliminating the confounding effects of polypharmacy; and v) Comparative analysis opportunities, allowing the study of identical drugs under varying conditions, such as patient demographics or treatment regimens. To the best of our knowledge, and as highlighted in a recent review29, CT-ADE28 is the first benchmark dataset to consider patient, drug, and treatment regimen data collectively.
CT-ADE28 was compiled from CT results available through ClinicalTrials.gov (https://clinicaltrials.gov/), offering a rich resource for advancing risk assessment in pharmaceutical research and development. The dataset is structured to support a classification task, focusing on analyzing study groups within CTs that adhere to monopharmacy, i.e., the practice of using a single drug for treatment. In the dataset, study groups describing interventions and their respective regimens are enriched with molecular structure information of the drugs being used, linked via DrugBank30, PubChem31, and ChEMBL32. This approach enables a clearer understanding of how individual drugs and regimens can lead to patient-specific ADEs, free from the confounding effects of multiple concurrent medications and lack of census data. CT-ADE28 is designed as a multilabel classification dataset to reflect that a single drug can cause multiple ADEs. This is achieved by standardizing clinician-reported ADEs from clinical trials, aligning them with the system organ class (SOC) and preferred term (PT) levels of the Medical Dictionary for Regulatory Activities (MedDRA) (https://www.meddra.org/). The dataset encompasses up to 2,497 unique drugs and 168,984 drug-ADE pairs, providing an extensive resource for predictive modeling. CT-ADE28 comprehensively covers all system organ classes and drug pharmacological groups, offering a robust foundation for ADE prediction and enabling its application across diverse therapeutic areas and drug classes.
Methods
This section discusses the methodological framework for dataset creation, including the rationale for dataset splitting and quality assessment approaches. Moreover, it covers the selection and consolidation of source materials, data acquisition from CT results, DrugBank, PubChem, ChEMBL, and MedDRA ontology, and pre-processing steps for standardization.
CT-ADE resources
CT-ADE28 dataset is based on five primary resources: ClinicalTrials.gov, DrugBank, PubChem, ChEMBL, and the MedDRA ontology.
ClinicalTrials.gov
ClinicalTrials.gov is a comprehensive registry of CTs maintained by the U.S. National Library of Medicine that provides up-to-date information on ongoing, completed, and terminated trials across a diverse range of drugs, diseases, and medical conditions. It offers transparency and access to detailed information on study objectives, design, methodology, eligibility criteria, locations, and sponsors. It describes treatment regimens, including the duration, strength, form, and dosage of interventions for each study group and their corresponding ADEs.
DrugBank, PubChem, and ChEMBL
These knowledgebases cover a wide range of compounds and drug properties, including their chemical structure information. DrugBank is maintained by the University of Alberta and The Metabolomics Innovation Center, PubChem by the National Center for Biotechnology Information, and ChEMBL by the European Bioinformatics Institute.
MedDRA
MedDRA is an internationally recognized medical terminology system used extensively by health authorities and the biopharmaceutical industry. It supports the standardized classification of adverse event data through a hierarchical system ranging from specific symptoms to broad organ system categories.
Data acquisition and pre-processing
In our pre-processing pipeline, we selected the following information from the data sources:
ClinicalTrials.gov
We selected CTs with completed or terminated status, involving at least one monopharmacy intervention, and with results reporting adverse events. The data were downloaded on April 17, 2024.
DrugBank, PubChem, and ChEMBL
We downloaded the DrugBank database version 5.1 on March 14, 2024, and two specific subsets from ChEMBL – Approved and USAN – on April 18, 2024. For PubChem, we specifically selected entries annotated as linked to ClinicalTrials.gov, ensuring relevance and integration with CT data. The PubChem data were downloaded on April 18, 2024. ATC codes and relevant synonyms were extracted from each database.
MedDRA
We selected MedDRA’s English version 25.0. The MedDRA ontology was structured as a graph based on the hierarchical relationships from broader classifications (SOC) to detailed descriptions (LLT).
CT-ADE construction
As illustrated in Table 1, CT-ADE28 provides a detailed representation of individual study group instances from CTs, including information about the i) intervention name, which indicates the name of the drug under investigation, ii) ATC codes, which classify drugs based on their areas of action, iii) SMILES notation, which provides a computational representation of the drug’s chemical structure, iv) eligibility criteria, which specify the demographic and medical characteristics required for participant inclusion, thereby defining the target population for the intervention, v) group description, which describes the treatment regimen, including dosage and administration details, and vi) ADE label, which lists the ADE events associated with the groups at the SOC and PT MedDRA levels. Additionally, CT-ADE28 includes CT-level information such as participant health statuses, gender, age group, and trial phase. The dataset is segmented into two versions, i.e., CT-ADE-SOC28 and CT-ADE-PT28. All versions share the same features, but ADE labels vary depending on the MedDRA target level, indicating the occurrence (1) or absence (0) of statistically significant ADEs.
Deconstructing clinical trials
The unique configuration of CT-ADE28 enables a single CT to generate multiple data entries. As CTs can evaluate multiple drugs or distinct treatment regimens, separate data entries for each study group are required to capture this information. Study groups are defined by their specific intervention strategies and help study how variations in drugs or regimen details – such as dosage, administration, and duration – contribute to differing ADE profiles for the same population (i.e., same eligibility criteria). We developed a preprocessing pipeline to systematically deconstruct CTs into study groups, ensuring the accurate representation of group-specific ADE data. The pipeline processes CTs sourced from ClinicalTrials.gov, focusing exclusively on monopharmacy interventions and filtering for trials with completed or terminated statuses, classified as interventional, and reporting results.
As illustrated in Fig. 1, which depicts the CT preprocessing pipeline, the protocol section of each CT outlines the eligibility criteria shared across study groups, defining the target population. The trial is then divided into study groups based on distinct intervention strategies, each represented by a triplet (“arm group title,” “arm group description,” “raw intervention name”). The result section complements this by describing ADE data for each group, forming a second triplet (“ADE group title,” “ADE group description,” “ADE group report”).

Linking protocol and result sections of clinical trials to generate raw CT-ADE28 instances. The protocol section provides eligibility criteria and intervention details, while the result section reports ADE outcomes. The pipeline matches these sections to create structured associations between interventions and their ADEs.
The primary goal of the deconstruction process is to link protocol and result triplets into a dataset that combines CT-level metadata, intervention details, and group-specific ADE outcomes. Since direct links between intervention details and ADE reports are absent in raw data, the pipeline employs string matching between protocol triplets (“arm group title,” “arm group description,” “raw intervention name”) and result triplets (“ADE group title,” “ADE group description,” “ADE group report”). This ensures accurate associations between interventions and ADEs.
In single-intervention CTs, linking protocol and result triplets is straightforward. The pipeline matches the raw intervention name with the ADE group title or description using inclusion matching. Matches are accepted only when the intervention name uniquely appears in one ADE group, ensuring unambiguous associations. For trials with multiple interventions, the pipeline first identifies unique matches between arm and ADE group triplets. If an arm group title or description corresponds to only one ADE group title or description, the two are linked. When multiple matches occur, stricter criteria are applied – both the arm group title and description must align with a single ADE group.
The pipeline excludes instances that fail to meet unique match criteria, ensuring the final dataset contains only strict matches. Each successfully matched study group inherits CT-level metadata, such as eligibility criteria, participant health status, gender, age group, and trial phase, further enriching the dataset with comprehensive contextual information. This approach reliably captures multiple- and single-intervention CTs while maintaining high data integrity. By implementing strict matching criteria and systematically linking protocol and result triplets, the CT preprocessing pipeline provides a raw dataset that reflects the nuanced effects of patient- and regimen-level information on ADEs. Among the 491,535 clinical trials available on ClinicalTrials.gov as of April 17, 2024, 61,921 met the criteria of being completed or terminated, interventional, and reporting results. Of these, we extracted and linked ADE data on 31,419 monopharmacy study groups.
Standardizing raw intervention names
Deconstructing CTs provides essential information, including raw intervention names, eligibility criteria, group descriptions, participant health status, gender, age group, trial phase, and raw ADEs. However, to achieve fully mapped instances, as illustrated in Table 1, it is crucial to convert the raw intervention names into standardized representations. This process involves mapping the raw intervention names to their respective canonical names, ATC codes, and SMILES. Initially, we consolidated information from DrugBank, PubChem, and ChEMBL, unifying these knowledge bases by grouping identical compounds by their canonical SMILES or names, and merging their synonyms. Then, we performed an exact and inclusion match of raw intervention names against this unified database. If no match was found for a given instance, we normalized the raw intervention name by removing dosage details, forms, and routes of administration, and by cleaning the text from special characters to enhance compatibility (e.g., transforming “Diprosone® Cream 0.05%“ into “diprosone”). Then, we repeated the matching process. From the 31,419 monopharmacy study groups obtained through CT deconstruction, encompassing 13,110 unique raw intervention names, 7,081 were successfully mapped to their standardized representations. This mapping resulted in 2,825 unique drugs and 21,306 study groups with mapped interventions, populating the Intervention Name, ATC, and SMILES columns of Table 1.
Standardizing adverse drug events
The MedDRA ontology provides a comprehensive framework for standardizing ADE concepts, enabling reliable comparison and aggregation of ADE data across studies. Each of the 21,306 intervention-mapped study groups provides a census of ADE events, i.e., each reported ADE includes its term, the affected organ system, the number of affected patients, and the total number of participants in that study group. To harmonize this data, we standardized the reported ADEs by mapping each ADE term to its corresponding PT concept and linking each organ system to its respective SOC concept. This standardization process employed strict string matching to maintain precision and consistency.
Building on this standardized dataset, we ensured that ADE labels were statistically significant and clinically relevant. To achieve this, we used the Wilson interval for binomial proportion confidence33. The Wilson method ensures asymmetric confidence intervals and constrains the boundaries to a valid probability range. It is particularly suitable for near-boundary estimation and small population studies34, as is common in the early phases of clinical trials35. However, the Wilson lower bound never reaches zero when there is at least one observed event, posing a challenge for clinical datasets, where any occurrence – no matter how rare – is always flagged as non-zero. To address this, we labeled an ADE occurrence as positive if at least 1% of the population was affected with 95% confidence. We applied this threshold following internationally recognized standards from the Council for International Organizations of Medical Sciences36. These guidelines define common ADEs as those occurring in at least 1% of the population, ensuring that positive labels are both statistically significant and clinically relevant. Study groups with any unmapped positive ADEs were excluded to prevent the assignment of false negatives. However, groups reporting no ADEs, i.e., indicating that no ADEs have occurred, were retained.
Among the 21,306 intervention-mapped study groups, 103 were excluded due to missing data on the number of affected patients or total participants, preventing the application of the Wilson interval for statistical evaluation. Additionally, 6,006 study groups at the SOC level and 5,563 at the PT level were excluded due to the absence of strict matches to MedDRA concepts.
Data Records
CT-ADE28 is available on Figshare (https://figshare.com/articles/dataset/28142453) and HuggingFace (https://huggingface.co/anthonyyazdaniml). The dataset is organized into two distinct versions – CT-ADE-SOC28 and CT-ADE-PT28. Both versions are divided into training, validation, and test sets, stratified to ensure no common drugs are shared between splits, thereby avoiding data leakage. This stratification is consistent across levels, ensuring that drugs in the SOC training split are not found in the validation or test splits of the PT version.
The dataset is organized into the following directory structure:
-
ct_ade/soc/: Contains files for the SOC-level dataset.
-
train.csv: SOC-level training data.
-
val.csv: SOC-level validation data.
-
test.csv: SOC-level test data.
-
train_frequencies.csv: Raw ADE frequency data for the training split.
-
val_frequencies.csv: Raw ADE frequency data for the validation split.
-
test_frequencies.csv: Raw ADE frequency data for the test split.
-
-
ct_ade/pt/: Contains files for the PT-level dataset, with a structure identical to the SOC-level directory.
Each split file (train.csv, val.csv, and test.csv) includes metadata about clinical trials, drug information, and ADE labels:
-
nctid: Identifier for the clinical trial.
-
group_id: Identifier for study groups within a trial.
-
healthy_volunteers: Indicates whether the study involves healthy volunteers.
-
gender: Participant gender category.
-
age: Participant age category.
-
phase: Clinical trial phase.
-
ade_num_at_risk: Number of participants in the study group.
-
eligibility_criteria: Eligibility criteria for participant selection.
-
group_description: Description of the treatment regimen.
-
drug_info_source: Link of the drug under study to DrugBank, PubChem, and/or ChEMBL.
-
intervention_name: Name of the drug under study.
-
smiles: SMILES representation of the drug’s chemical structure.
-
atc_code: ATC classification code for the drug.
-
label_*: Binary indicators (1 for presence, 0 for absence) of ADEs, according to MedDRA SOC or PT levels.
The frequency files (train_frequencies.csv, val_frequencies.csv, and test_frequencies.csv) provide quantitative details about the occurrence of ADEs:
-
The frequency files include the nctid and group_id columns to enable linkage with the split files (train.csv, val.csv, and test.csv) but replace label_* columns with frequency_* columns.
-
Each frequency_* column provides raw ADE frequency data, reflecting the proportion of participants experiencing a specific ADE. Frequencies are omitted if the association between an ADE concept and its frequency cannot be confidently established.
Table 2 shows the statistics of the CT-ADE28 dataset across SOC and PT levels for the training, validation, and test splits. The number of drug-ADE pairs in the dataset varies between 40,187 at the SOC level and 168,984 at the PT level, including a maximum of 2,497 unique drugs associated with 15,640 study groups at the PT level.
Technical Validation
Quality control
Accurate mapping of raw intervention names to SMILES representations available in drug knowledge bases (DrugBank, PubChem, and ChEMBL) was critical to guarantee quality. To minimize mapping errors, we only used strict matching techniques. Similarly, the mapping strategy for ADE terms reported in CTs to the MedDRA ontology was performed using an exact match. To further improve the dataset’s integrity, we excluded study groups where we could not map all positive ADEs, preventing the inclusion of incomplete data. Moreover, only ADE reports documenting all necessary fields – specifically, the adverse event, the number of affected individuals, and the total patient count in the study group – were included. To ensure that the positive ADEs were statistically significant and clinically relevant, the Wilson interval for binomial proportion confidence was used to assign classification labels to ADEs. Specifically, an ADE occurrence was assigned a value of 1 if we were 95% confident that at least 1% of the population would experience the ADE; otherwise 0.
Dataset coverage
This section evaluates the extent and diversity of the CT-ADE-SOC28 dataset by analyzing the distribution of ADEs across SOCs and ATC main pharmacological groups of the included drugs. As shown in Fig. 2A, all the 27 SOC categories of MedDRA are covered in the CT-ADE-SOC28 dataset, with the top-3 most represented SOCs being “Gastrointestinal disorders” (Gastr) (38.33%), “Nervous system disorders” (Nerv) (34.49%) and “Infections and infestations” (Infec) (26.85%), and the top-3 least represented being “Product issues” (Prod) (0.33%), “Congenital, familial and genetic disorders” (Cong) (0.28%) and “Social circumstances” (SocCi) (0.19%). Similarly, Fig. 2B shows that all ATC main pharmacological groups are included in the dataset. The top-3 most represented groups are “Nervous System” (N) with 3,152 instances, “Alimentary Tract and Metabolism” (A) with 2,590 instances, and “Cardiovascular System” (C) with 1,486 instances. On the other hand, the top-3 least represented groups are “Systemic Hormonal Preparations, Excl. Sex Hormones and Insulins” (H) with 536 instances, “Various” (V) with 459 instances, and “Antiparasitic Products, Insecticides and Repellents” (P) with 154 instances. For brevity, we present the analysis based on the full CT-ADE-SOC28 dataset, but this coverage is consistent across all SOC splits. A similar analysis for CT-ADE-PT28 is shown in Supplementary Figure 1.

(A) Distribution of SOCs in CT-ADE-SOC28. (B) Representation of ATC main pharmacological groups in CT-ADE-SOC28. Abbreviation mappings for SOCs and ATC main pharmacological groups to their full terms are available in Supplementary Tables 4 and 5, respectively.
To validate the representativeness of the dataset, we compared the SOC-level frequencies of ADEs in CT-ADE-SOC28 to global population frequencies reported in a large-scale study by Aagaard et al.37. Our analysis revealed a Spearman correlation coefficient of 0.862 (p-value < 0.001), indicating a strong and statistically significant alignment between CT-ADE28 and global ADE patterns.
Experiments
To test the hypothesis that adding patient and treatment regimen information enhances ADE prediction, we conducted experiments with discriminative and generative large language models (LLMs). These experiments established baseline performance and evaluated the added value of contextual information compared to the standard use of chemical structure data. Since CT-ADE28 emphasizes patient-specific and treatment-related data, we focused our technical validation on LLMs pre-trained on biomedical corpora. We systematically tested three feature configurations:
-
SMILES only (S) configuration, focusing solely on the SMILES notation of drug compounds.
-
SMILES and group description (SG) configuration, which incorporates group descriptions to exploit both chemical properties and treatment regimens.
-
SMILES, group description, and eligibility criteria (SGE) configuration, providing a detailed context for ADE prediction by including the target population information.
To quantify the incremental improvements achieved by progressively adding contextual features, we employed the micro-averaged McNemar’s test. This method evaluates the impact of each feature configuration (S vs. SG and SG vs. SGE) on predictive performance, highlighting the contribution of each additional layer of contextual information. Discriminative models used distinct encoders for each feature modality (Fig. 3). The S configuration used ChemBERTa-77M-MLM38 as a backbone to encode the SMILES notations. The SG configuration combined ChemBERTa-77M-MLM with PubMedBERT-base39 backbones, integrating the treatment regimen alongside the SMILES notations. The SGE configuration provided the most complete set of input features by incorporating patient eligibility criteria and used the same backbones as for SG.

Discriminative model pipeline used for ADE prediction. The model uses dedicated encoders for text (handling both eligibility criteria and group descriptions sequentially) and SMILES strings. The encoded features are concatenated and sent to a feed-forward neural network. Final output probabilities are computed using the sigmoid activation function.
Generative models were based on OpenBioLLM-8B40, an open-source biomedical instruction model, which was used with the official chat template. Similar to the discriminative approach, the models were fine-tuned to generate a list of ADEs based on S, SG, and SGE scenarios (Fig. 4). OpenBioLLM-8B was fine-tuned with bf16 precision41, using low-rank adapters42, flash attention 243, and gradient checkpointing44 in a completion-only framework.

Generative model pipeline used for ADE prediction. A prompt is created based on the input features, and OpenBioLLM-8B is tasked to generate a list of ADEs.
As shown in Fig. 5, for the S configuration, the discriminative model achieves an F1-score of 31.96%, while the generative model achieves an F1-score of 24.16%. The results establish a baseline performance, providing a reference point to assess the incremental contributions of patient- and treatment-specific information. For the SG configuration, the discriminative model improves significantly compared to the S configuration, with a micro F1-score of 46.09% (p-value < 0.001), whereas the generative model achieves a micro F1-score of 49.74% (p-value = 0.15). This demonstrates the substantial impact of integrating treatment regimen information, balancing precision and recall, and enhancing predictive performance. Finally, the SGE configuration provides the best performance for both model types. Compared to the SG configuration, the discriminative model achieves an F1-score of 53.46% (p-value = 0.11), while the generative model achieves an F1-score of 53.43% (p-value < 0.001).

Performance comparison of discriminative (ChemBERTa-77M-MLM & PubMedBERT) and generative (OpenBioLLM-8B) models on the CT-ADE-SOC28 test split using different feature sets (S, SG, SGE). The AUROC metric cannot be computed for the generative model as it does not produce raw probabilities. All metrics are micro-averaged. Tabular values are available in Supplementary Table 3.
Since only 11.32% of instances are positive in the CT-ADE-SOC28 test set, we also tested the models on the subset where ADEs were observed, focusing on their ability to identify positive cases. In this scenario, the performance differences for any comparison – S vs. SG, S vs. SGE, and SG vs. SGE – are statistically significant (p-values < 0.001) for both discriminative and generative models. These results demonstrate the improvements in ADE prediction achieved by adding contextual information. This in-silico analysis is consistent with findings from existing in-vivo models27 and highlights the critical role of incorporating patient- and treatment-specific information to improve ADE prediction. Similar conclusions hold for the CT-ADE-PT28 dataset, with detailed results using discriminative models provided in Supplementary Tables 1, 2.
Alternative evaluation scenarios
Performance results across the SOC levels and ATC main pharmacological groups for the best discriminative model (SGE) are provided in Fig. 6. At the SOC level, the discriminative SGE model demonstrates strong performance in predicting common ADEs. For example, it achieves an F1-score of 71.95% for “Gastrointestinal disorders” (Gastr) and 71.28% for “Nervous system disorders” (Nerv). However, the model’s performance is weaker for rarer SOCs, such as “Social circumstances” (SocCi), where it fails to predict any ADE in this category. Similarly, performance is weaker for “Cardiac disorders” (Card) with an F1-score of 31.50%, despite their relative frequency. This suggests that, while the model handles common ADEs, refinement is needed to improve minority class prediction performance. Performance across ATC main pharmacological groups further highlights the model’s strengths and areas for improvement. The SGE model achieves higher F1-scores in categories like “Blood and Blood Forming Organs” (B) (55.17%) and “Systemic Hormonal Preparations” (H) (58.54%). Conversely, the model performs worse in predicting ADEs for “Antiinfectives for Systemic Use” (J), with an F1-score of 32.38%. These insights emphasize the challenge of predicting ADEs in clinical research, and the importance of novel strategies to address specific ADE categories and therapeutic areas beyond LLM fine-tuning.

(A) F1-score of the SGE discriminative model on individual SOCs in CT-ADE-SOC28 test set. (B) F1-score of the SGE discriminative model on individual ATC main pharmacological groups in CT-ADE-SOC28 test set. Supplementary Tables 6, 7 provide tabular values for the results shown in A and B and additional metrics. Abbreviation mappings for SOCs and ATC main pharmacological groups to their full terms are available in Supplementary Tables 4, 5, respectively.
Effect of model scaling and domain pre-training on ADE prediction
To assess the impact of model scaling and domain-specific pre-training on ADE prediction, we additionally fine-tuned a range of generative models, including Llama-345 (8B, 70B), Meditron46 (7B, 70B), and OpenBioLLM-70B. Due to computational constraints, these experiments were restricted to the full feature set (SGE). Performance was evaluated relative to our best discriminative model (SGE) and a baseline using a majority-class prediction approach (MAJ), which assumes no ADEs occur. As shown in Table 3, the Llama-3-8B model obtains the highest performance, with an F1-score of 55.90%, which is 2.4 percentage points above the SGE discriminative model (p-value < 0.01). Interestingly, despite their substantial parameter count – 70 times larger than the discriminative model – 8B generative models achieve comparable performance. This suggests that increasing the number of parameters does not necessarily lead to proportional performance improvements. As illustrated in Table 3, the 70B models demonstrate even more diminishing returns with parameter scaling. Although biomedical LLMs have been shown to outperform general domain models in biomedical tasks47, we found that specialized domain models such as Meditron and OpenBioLLM do not provide a performance advantage compared to general domain models in CT-ADE-SOC28. Due to the imbalanced nature of ADE datasets, the majority class model (MAJ) tends to achieve strong performance in terms of accuracy. Llama-3-8B, with an accuracy of nearly 90%, improves only 0.89 percentage points upon the MAJ model. However, it can identify around 58% of ADEs, while the MAJ model does not predict any (p-value < 0.001). These findings suggest that increasing model sizes or pre-training models on domain-specific corpora does not necessarily improve ADE predictive performance for this task.
Limitations
Several limitations present opportunities for refinement of the CT-ADE28 dataset. Firstly, CT-ADE28 does not incorporate preclinical information, such as data from in vitro assays, into its drug features. However, by linking chemical databases like DrugBank, PubChem, and ChEMBL to study groups, CT-ADE28 establishes a foundation for future integration of preclinical data, which may enhance the predictive power of ADE models. Secondly, since CT-ADE28 is derived from controlled CT settings, it may not fully capture real-world variability due to strict inclusion and exclusion criteria, standardized treatment regimens, and closely monitored conditions. These characteristics, while ensuring data consistency and reliability, also limit the dataset’s ability to reflect the complexities of routine medical practice. Additionally, CT-ADE28 focuses exclusively on monopharmacy interventions, facilitating precise ADE attribution but excluding polypharmacy scenarios that are common in clinical practice. Expanding the dataset to encompass polypharmacy cases would enable models to account for drug-drug interactions and more complex treatment regimens, thereby enhancing their applicability to real-world settings. Lastly, the dataset is restricted to drugs with SMILES representations, thereby excluding compounds that lack such encodings due to their structural complexity. Incorporating alternative representations, such as amino acid sequences for biologics, could extend the applicability of CT-ADE28 to a broader range of therapeutic agents.
Responses