Red teaming ChatGPT in medicine to yield real-world insights on model behavior

Introduction
Large language models (LLMs) are a class of generative AI models capable of processing and generating human-like text at a large scale1. However, LLMs are susceptible to inaccuracies and biases in their training data. The objective of an LLM is to iteratively predict the next most likely word or word part. Because it does not necessarily reason through tasks, an LLM can produce “hallucinations,” or seemingly plausible utterances not grounded in reality. Without appropriate oversight, LLM responses can be dangerous: when asked to respond to simulated messages from cancer patients, attending clinicians found GPT-4 to pose a nontrivial risk of misrepresenting the severity of the situation and recommended course of action, with the potential for severe harm in 7.1% of cases (11/156 survey responses) and risk of death in one case2. Additionally, popular models such as ChatGPT, GPT-4, Google Bard and Claude by Anthropic can all perpetuate racist tropes and debunked medical theories, potentially worsening health disparities3.
Despite these limitations, due to their vast promise, LLMs and other generative AI models are already present in the real-world clinical setting. LLMs like ChatGPT have been used to respond to patient queries, create discharge summaries, and help with many administrative tasks in clinical settings4. A recent study also showed that 65% of respondents used LLMs in a clinical setting, with 52% using the technology at least weekly5. In addition, private instances of LLMs have been implemented through high-profile partnerships first announced in the fall of 2023, such as the collaborations between leading electronic health record (EHR) vendors Epic and Oracle with Microsoft6 and Nuance7, respectively. Large technology companies like Microsoft and Google have also partnered with early adopter health systems, such as Mayo Clinic, Stanford, and NYU8. Providers are able to beta-test functions such as medical text summarization for automatic medical documentation generation, medical billing code suggestion, AI-drafted responses to patient messages, and more1. Studies of LLM-generated drafts to patient messages have yielded positive feedback, and notably reduced clinician work burden and burnout derivatives when incorporated into the real-world workflows of 197 clinicians at Stanford Healthcare9. Furthermore, NYU Langone clinicians found that their HIPAA-compliant private instance of GPT-4 generated simplifications of discharge summaries that were understandable and promising10, albeit still requiring physician oversight.
While this represents a significant integration of potentially transformative technology, these announcements came less than a year after ChatGPT was released to the public in November 202211, kick-starting a generative AI frenzy. Given the potential impact of generative AI on patient outcomes and public health, it is imperative that medicine, academia, government, and industry work together to address the challenges these models pose.
Originally a cybersecurity term, red teaming is the process of taking on the lens of an adversary (the ‘red team,’ as opposed to the defensive ‘blue team’) in order to expose system/model vulnerabilities and unintended or undesirable outcomes. These outcomes may include incorrect information due to model hallucination, discriminatory or harmful information or rhetoric, and other risks or potential misuses of the system. Red teaming can be done by software experts within the same firm, by rival firms, or by non-technical laypeople, such as when reddit users “jailbreak” LLM chatbots through prompts (input provided to models that then leads to a generated response) that bypass the models’ alignment12. Red teaming is critical to identifying flaws that can then be addressed and fixed using trustworthy AI, which are methods designed to test and strengthen the reliability of AI systems.
Though red teaming is a recognized and now federally mandated practice in the field of computer science, it is not well-known in healthcare. Understanding model failures and strengths is critical not only for clinicians who are working directly with companies to improve models, but also front-line clinicians who are increasingly working in healthcare systems that have private instances of LLMs, as mentioned in the previous paragraphs. Front-line clinicians are highly motivated to improve patient care and see most closely the deficiencies that modern healthcare technology needs to address; yet this population may not traditionally feel equipped to contribute actively to brainstorming around AI use cases, and a lack of opportunity to engage with those with technical backgrounds may lead some to be overly optimistic or prematurely pessimistic regarding perceived LLM performance in the clinical setting. Rather than rely on a general sense of LLMs being unreliable, red teaming gives participants the ability to contribute actively to improving models and to pinpoint the types of failure modes likely to occur. Through participation in red teaming, a clinician may gain first hand exposure to the stochastic and sycophantic nature of LLM outputs and understand this through further discussion with a technical colleague. This may lead the clinician to change the way that they ask for answers (e.g., modifying prompts to be more neutral in tone) and to pay specific attention to areas where hallucinations are most likely to occur. Lastly, to minimize conflicts of interest, it is important that people working in medical fields, not just the model creators, evaluate these models.
Recognizing the critical need for LLM red teaming in current times, and in order to set a precedent for the systematic evaluation of AI in healthcare guided by computer scientists and non-technical medical practitioners, we initiated a proof-of-concept healthcare red teaming event which produced a novel benchmarking dataset for the use of LLMs in healthcare. Our aim for this study is to 1) provide a reference for future red teaming efforts in medicine, 2) introduce a labeled dataset for evaluation of future models in medicine, and 3) showcase the importance of clinician involvement in AI research in the medical space.
Results
There were a total of 376 unique prompts, with 1504 total responses across the four iterations of ChatGPT (GPT-3.5, GPT-4.0, GPT-4.0 with Internet, and GPT-4o). 20.1% (n = 303) of the responses were inappropriate, with over half containing hallucinations (51.3%, n = 156). Prompts using GPT-3.5 resulted in the highest percentage of inappropriate responses (25.8% vs. 16.5% in GPT-4.0 and 17.8% in GPT-4.0 with Internet and 20.4% in GPT-4o), which was statistically significant (p = 0.0078) (Table 1). In a stratified analysis by type of task (treatment plan, fact checking, patient communication, differential diagnosis, text summarization, note creation, other) (Table 2), rates of inappropriate responses (~16–24%) were similar to the unstratified analysis. This is slightly higher than reported in previous studies examining LLM performance, which reported ~17–18% inappropriate or unsafe responses on tasks such as discharge summary simplification or responding to patient messages2,10. Among the 376 unique prompts, 198 (52.7%) produced appropriate responses in all versions of the language model while 12 (3.2%) prompts resulted in inappropriate responses in all model versions. Interestingly, we found 81 prompts (21.5%) that produced appropriate responses in GPT-3.5, but inappropriate responses in either of the more updated models (Table 3).
Qualitatively, many of the inappropriate responses flagged with accuracy issues resulted from responses that were medically inaccurate, such as incorrect diagnostic strategies for organ transplant allocation, estimation of renal function (also contained racial bias), assessment of myocardial infarction (also contained gender bias), and treatment of shortness of breath and chest pain (recommended maintaining oxygen saturation above 90% without consideration of conditions such as chronic obstructive pulmonary disease). Other inaccurate responses did not adequately address the user prompt (e.g., gave responses longer than the length requested). Compared to GPT-3.5, GPT-4.0 and GPT-4.0 with Internet responses were more verbose, adding in unnecessary details that often did not conform to user requirements (e.g., a certain length of response) which may have contributed to decreased appropriateness scores. When asked to generate additional details in a medical note, LLMs frequently hallucinated details without indicating that the information was fabricated. LLMs had difficulty inferring information not explicitly mentioned. For example, a synthetic patient note included a “C-Section” in the surgical history, but the LLM stated that the patient had never been pregnant. Another synthetic patient note included a mention at the bottom of the note that the patient was on SCDs (an abbreviation for sequential compression devices) for prophylaxis of DVT (deep vein thrombosis), but the LLM stated that the patient was not receiving any intervention to this end.
With regards to safety, many (23.7%) of the responses flagged for safety were also flagged for accuracy, indicating that inaccuracies in LLM responses are associated with potential patient harm.
Privacy remained a significant concern: models readily generated email text containing personally identifiable information and included protected health information (PHI) in responses without disclaimers. While some of these instances were due to user input of PHI-containing prompts, others demonstrated a general lack of incorporation of healthcare privacy standards (e.g., generating letters containing sensitive medical information directed at a patient’s coworker, including unnecessary PHI in a case report draft).
As for bias, LLMs perpetuated identity-based discrimination and false stereotypes. Across all models, when asked to “fill in blanks,” LLMs gave biased responses (e.g., hallucinated that a Chinese patient was a computer programmer, assuming gender with gender-neutral names). When identical prompts were given to the model with only race, gender, or name of the patient adjusted, models gave significantly different responses, including recommendations to consider race in determining pain management strategies, mention of “patient communication” as a barrier to pain management only when patients were specified as Black, and exclusion of pain management in the plan for Black patients presenting after a fall (was included if patients were White). Suggestions reflected implicit bias: counseling for White patients stressed the importance of empathy, whereas counseling for Black patients focused on proper documentation to address medicolegal liability. Racial biases were further incorporated when race was not relevant, such as listing socioeconomic factors as the number one reason for why a Black father might not be at bedside in the NICU, and including race in drafted referral request templates without justifying the inclusion. Additional examples can be found in Table 3 and in our publicly available dataset.
Discussion
Previous work examining LLMs in medicine has revealed troubling trends with regards to bias and accuracy. The majority of studies focused on question answering and medical recommendations: Omiye et al. queried four commercially available LLMs on nine questions and found perpetuation of race-based stereotypes3. Zack et al. investigated GPT-4 for medical scenario generation and question answering and found overrepresentation of stereotyped race and gender and biased medical decision-making (e.g., having panic disorder and sexually transmitted infections higher on scenario differentials for females and minorities, respectively)13. Yang et al. found bias with regards to treatment recommendations (surgery for White patients with cancer compared to conservative care for Black patients)14, while Zhang et al. reported gender and racial bias in LLM responses regarding guideline-directed medical therapy in acute coronary syndrome15. Though large-scale studies of the impact of LLMs on real-world EHR systems are still forthcoming, evidence from physician incorporation of existing technological tools, such as copy and paste, has shown significant propagation of mistakes that lead to real-world diagnostic delays and errors, with one study attributing 2.6% of diagnostic errors at two large urban medical centers to copy and paste16. Thus, understanding which types of prompts are likely to lead to hallucinations and how to most efficiently identify them is critical to avoid similar propagation of errors when considering use of LLMs for documentation. This is particularly critical given that LLM errors have the potential for causing significant harm in high risk specialties such as oncology or intensive care.
Outside of medicine, previous work has sought to identify frameworks for quantifying 9 distinct forms of bias in LLMs, including gender, religion, race, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status17. Others have explored metrics to assess safety in LLMs, incorporating scenarios such as unfairness and discrimination, physical harm, mental health, privacy, and property18. These authors used other language models to judge response safety and bias in an automated fashion. Still other studies have created frameworks for manual evaluation, such as Correctness, Robustness, Determinism, Explainability19; however, these measures are largely not reproduced across studies, and the search for a comprehensive yet feasible and relevant evaluation framework continues. Very recently, Chiu et. al introduced CulturalTeaming, which is a novel platform that helps users craft prompts to redteam LLMs20. The team curated 252 questions that evaluate LLMs’ multicultural knowledge. Johri et al. also proposed CRAFT-MD, an approach for evaluating LLMs’ clinical reasoning under more conversational, realistic settings21. However, what is unique about red teaming is it brings together multi-disciplinary experts and allows them to actively identify vulnerabilities.
Our work builds on previous literature by interrogating model-provided clinical reasoning across an expert-created database of 376 real-world prompts across three model versions. As this represents the first clinical red teaming event for LLMs in medicine, we present our insights and key considerations for future red teaming events (Fig. 1). In addition, we examine model performance in a setting more immediately pertinent to practicing physicians using questions that could realistically be asked by physicians using LLMs for everyday clinical practice (e.g., summarization of a patient note, generation of patient-facing material, extraction of billing codes, quick insights on treatment recommendations and studies) and stress-testing models across a wide variety of desired output topics and formats. Our study also focuses on little-studied areas such as privacy and safety. In particular, given the nuances of appropriateness as defined in the healthcare context, and to replicate a real-world scenario of physicians evaluating responses for inaccuracies, our evaluation framework was chosen for its flexibility, ease of comprehension, and feasibility: we combined the benefits of a high-level, quantitative scoring of a few critical categories of inappropriateness important to healthcare LLMs (safety, privacy, hallucinations/accuracy, propagation of bias) with qualitative annotation, which facilitated inter-reviewer discussion and concordance. Lastly, our dataset, which contains a wide variety of prompt formats and topics, is robustly annotated with clinical reviewer feedback and inappropriate category designation, and can serve as a basis for varied prompt construction (direct vs. indirect querying, full clinical notes vs. short questions vs. patient messages) and model evaluation (e.g., threshold for what was considered appropriate).

1Hyperparameters are settings that can be changed by the user (usually a machine learning engineer) to vary model output. These can include temperature, which varies the randomness of outputs, and max output tokens (length of response). 2An application programming interface (API) is a software interface that allows information to pass between two software applications. In the context of large language models (LLMs) and prompting, submitting prompts (user queries) through an API refers to writing code to submit prompts rather than submitting through a user interface. API submission can be preferred when batch submission of prompts is desired, or when it is desired to change settings (hyperparameters) that influence LLM responses. 3A field of study that focuses on varying the format of inputs to a language model in order to produce optimal outputs.
In this study, GPT-4.0 outperformed GPT-3.5, with GPT-3.5 having the highest percentage of inappropriate responses. GPT-4.0 with and without Internet were comparable. Both when unstratified and stratified by type of prompt, we found slightly higher rates of inappropriate responses (17–26% and 16–24%, respectively) than previous studies examining LLM performance, which reported ~17–18% inappropriate or unsafe responses on tasks such as discharge summary simplification or responding to patient messages2,10. Given that our prompts were crafted to be more conversational and often contained references to notes designed to simulate real-world abbreviations and clutter, as would occur if LLMs were to be fully integrated into EHRs, we find this result reasonable. In addition, the significant amount of responses which elicited appropriate responses with GPT-3.5 but inappropriate responses in the more advanced models underscores the need for ongoing improvements and testing before deployment.
Of concern, inappropriate responses tended to be subtle and time-consuming to verify. Questions regarding “other people” who had had a similar diagnosis or requests to provide citations supporting a medical claim were likely to produce hallucination-containing answers that required manual verification. This was especially prevalent with GPT-4.0 with Internet. For example, a list of famous individuals with a specific severe allergic reaction would bring up those who had spoken about an allergy of some sort, but not necessarily the type specified; such information was sandwiched between individuals who did have the reaction in question. With regards to citations, even when citation author list, article name, journal name, and publication year were all correct, the articles cited did not support the claims that the LLM reported they did, and indeed could be from completely unrelated disciplines. Additionally, models missed pertinent information and provided hallucinated medical billing codes when asked to extract information from a longer context window (e.g., a medical note) or from text with abbreviations (although these errors also occurred in areas without abbreviations), casting doubt on the purported usefulness of current LLMs for these very same purposes.
Inappropriate responses happened at a high frequency when models were asked indirectly and with an assertive tone (assuming that the model will provide a response) about topics that were potentially inappropriate. For example, a direct question about whether Black individuals necessitate a racial correction factor for glomerular filtration rate (GFR) estimation was likely to trigger a disclaimer (although not always) regarding how such constructs are no longer advisable in medicine, but the request to calculate GFR using a biased equation was likely to not trigger a disclaimer, even across advanced model versions. A question about whether it is appropriate to leave protected health information (PHI) in a public space would elicit the answer “no,” but a request to draft a letter containing a patient’s diagnosis so that such a letter could be left in a public space (specified as a company lobby) or directly given to another individual (specified as the patient’s friend or receptionist) would not trigger a warning. Privacy, in general, was a weak spot: Across all prompts and model versions, no response involving our synthetic PHI-containing patient notes contained a disclaimer that such information should not be provided to a publicly available chatbot.
Model performance was not without its merits. Though imperfect, models were generally able to extract medication lists, and could list some cross-interactions when probed. Additionally, models were versatile in adapting responses according to user requests (summarizing, translation). This aligns with existing purported benefits of LLMs in automation of low-risk tasks such as summarization of patient notes, drafting of non-critical reports (or in drafting critical reports with sufficient expert oversight), and first-pass automation of mostly manual tasks such as research participant identification22,23. Chen et al.2 found that use of GPT-4 to generate drafts to simulated oncology patient messages improved subjective efficiency in 120 (76.9%) of 156 cases, and several other groups have also reported benefits in using LLMs to identify clinical trial participants2. In our study, LLM performance on summarization and patient education tasks, while promising, was hampered by the need for cross-examination to ensure accuracy, and the tendency for GPT-4-based models to over-elaborate against user requests. These issues will continue to be addressed by evolving techniques such as combining generative AI with retrieval-based models24 (i.e., models that directly extract information from verified databases), adjusting model weights25, and advanced prompt engineering26. Our results, along with those of future red teaming events, will contribute to the pool of information regarding which areas warrant urgent focus and optimization.
By hosting one of the first red teaming events in healthcare topics for large language models, we created a robust dataset containing adversarial prompts and manual annotations. Factors contributing to our success included the creation of an interdisciplinary team with backgrounds ranging from computer science to clinical medicine, which helped generate unique themes and ideas. We seated at least one computer science expert and one clinical medicine expert at each red teaming table, allowing for the creation of medically-appropriate prompts with the technical experience of prompt engineers. We observed that participants with medical backgrounds introduced clinically relevant prompts, including specific medical scenarios that are ethically challenging, while more technical participants described prompting techniques that helped test the boundaries of LLMs. The presence of multiple pre-created clinical notes across multiple medical settings allowed participants to quickly ask complex questions without having to draft separate scenarios each time; however, participants were also allowed to develop their own scenarios. Future red teaming activities (and, on a broader scale, research into model appropriateness) can thus benefit from our dataset. Lastly, unlike industry-sponsored red teaming activities, the results of which need not be released to the public, our results provide transparent insight into model limitations. In a manner analogous to post-marketing surveillance of pharmaceuticals, we hope that future cross-disciplinary work will engage both medical professionals and technical experts, improving model safety and transparency while preserving speed of development.
There are some limitations to this study. Because the event was hosted at a single academic center, all prompts are in English. We were also unable to incorporate clinical ethicists in the review of the responses. In addition, there may be variations in the demographics of the redteaming groups, which may influence the content of prompts generated. Finally, our dataset is based on the November 2023 versions of ChatGPT, and may not be reproducible due to model drift over time27. Future work may wish to explore prompts involving different languages/cultures or the evolution of model responses over time. Also, because of the interdisciplinary background of individuals involved in the red teaming event, there were discrepancies between definitions of appropriateness, which we reduced by having three independent reviewers review all the prompts. Finally, as LLMs are currently being considered for mostly administrative use cases within healthcare, such as text summarization and documentation, the question may arise as to whether we should have focused on these use cases within the red teaming event, and to what extent our demonstrated harms correlate with current real-world LLM usage. Our reason for not exclusively focusing on these use cases was two-fold. First, as this was a proof-of-concept event, and given that clinical decision support use cases are being actively explored as the field of generative AI continues to move at a breakneck pace, we felt it valuable to allow participants the freedom to design a variety of prompts that they would have wanted to ask a truly helpful clinical assistant. While these prompts included text summarization and medical billing code extraction, we felt that the inclusion of other use cases added significantly to our understanding of the strengths and weaknesses of the GPT-based models in answering healthcare questions, and that this insight would be important to share with future red teamers and model developers who can then rigorously evaluate new LLM iterations. This point notwithstanding, future red teaming studies focusing on administrative use cases are needed, and we propose focusing on indirect prompting and using an assertive tone to explore the effects of sycophancy on generated results. Second, while medical professionals are liable for accuracy of LLM-generated content, we believe that the inappropriateness of model responses not only decreases utility of these tools but also can lead to automation bias, and that thus these harms are thus still very real despite this barrier. Future areas for exploration include investigating if clinicians with differing levels of expertise in a certain subject ask questions about that same subject in a way that significantly impacts response appropriateness.
In conclusion, many healthcare professionals are aware of the general limitations of LLMs, but do not have a clear picture of the magnitude or types of inappropriateness present in responses. These professionals may already have access or receive access in the near future to generative AI-based tools in their clinical practice. However, only a minority of these individuals are aware of the valuable insight that they can contribute to rigorously stress-testing publicly available models, all without necessitating a technical background, incurring cost, or necessarily spending excessive amounts of time. On the other hand, many technical experts are using sophisticated methods to uncover sources of LLM bias in healthcare, but struggle with definitions of appropriateness and spreading awareness of LLM limitations (e.g., not just that LLMs are prone to hallucinations, but why and which areas may be more/less reliable). This red teaming collaboration was not only beneficial for model evaluation but also mutual learning: clinicians experienced model shortcomings first-hand, and technical experts had a dedicated space to discuss prompt engineering and current limitations. Indeed, many of the conversations begun at the red teaming tables continued out the doors, extending to potential research collaborations and clinical deployment strategies. The cross-disciplinary nature of the event and post-hoc analysis by clinically trained reviewers were complementary, with the former ensuring relevance and applicability of the prompts to medical scenarios and the latter focusing on consensus between reviewers and results across model versions.
In addition to showcasing an important role that clinicians can take to improve LLM evaluation and performance, this red teaming event identified model failure modes and how front line clinicians can alter their behavior to minimize inaccuracies. For example, a clinician using an LLM to summarize a discharge note may now pay particular attention to dosages, having understood from discussions and first-hand experience through the red teaming event that these small details may carry a greater risk of being hallucinated. Knowing that LLMs process outputs stochastically and not through true reasoning, this clinician may also pay particular attention to questions regarding calculations (or use a non-LLM tool) and equations. Finally, given the racial bias models exhibited when models are given racial identifiers, a clinician who participated in our red teaming event may choose to omit race from prompts when deemed not necessary, or to specially proofread model outputs for patients of color regarding common areas of bias such as pain medicine.These issues also help highlight why clinicians should advocate for humans-in-the-loop with LLMs, as our red teaming benchmark shows that LLMs are not ready for autonomous use.
All in all, there are many ways to improve LLMs, such as fine-tuning, prompt engineering, model retraining, and integration with retrieval-based models. Prompt engineering can lead to more concise answers that may be easier to fact-check, and standardization of LLM evaluation with new frameworks may allow for more efficient incorporation of clinician feedback. Future steps may also include development of automated agents targeted towards catching common LLM mistakes (for example, implementing another AI to double-check dosages and other commonly hallucinated areas) and physician-led creation of common benchmark scenarios to proactively identify and address potential safety concerns of LLMs in healthcare. However, none of these solutions can be implemented without problem identification, which is especially difficult in an expertise-heavy field such as healthcare. The relative dearth of appropriate healthcare AI evaluation metrics, many of which do not focus on realistic clinical scenarios28, further exacerbates this situation. By bringing together a population that has not commonly been included in the picture of the typical “red team”, we can harness collective creativity to generate transparent, real-world clinically-relevant data on model performance. Furthermore, empowering the end users of clinical LLMs with insight on how and why models produce inappropriate responses is an important first step towards safe integration of LLMs in healthcare. Our work serves as a model for future red teaming efforts in clinical medicine, showcasing the importance of physician involvement in evaluating new technologies in this space.
Methods
We organized an interactive workshop for participants to identify biases and failure modes of large language models (LLMs) within healthcare. Key steps and considerations in this process are illustrated in Fig. 1. In order to capture perspectives of individuals of diverse backgrounds, we brought together clinicians, computer scientists and engineers, and industry leaders. Participants were grouped into interdisciplinary teams with clinical and technical expertise (e.g, each group had at least one physician and computer science/engineering practitioner from multiple training levels), and asked to stress-test the models by crafting prompts that reflected how they might use LLMs in a healthcare setting. Participants ranged from late career professors to post-doctoral research fellows to graduate students in the Departments of Computer Science, Biomedical Data Science, Statistics, Medicine, Pediatrics, Psychiatry, Anesthesiology, and Dermatology. Examples shared with the teams included clinical question-answering tasks, text summarization tasks, and calculating medical risk scores. Participants were provided with newly-created synthetic medical notes to use if needed (Supplementary Note 2) or could develop their own scenarios. Participants were instructed to develop prompts based on realistic scenarios, and specifically asked not to inject adversarial commands that would not be seen in real life medical care (e.g, do not include “you are a racist doctor” in the prompt). Participants were introduced to the concept of red teaming, and wrote prompts of high difficulty that were tailored to expose vulnerabilities and undesirable model outputs. Participants with clinical backgrounds developed these prompts based on prior clinical experience, drawing from both patient cases and hypothetical scenarios.
We provided a framework to analyze model performance, including four main categories of an inappropriate response: 1) Safety (Does the LLM response contain statements that, if followed, could result in physical, psychological, emotional, or financial harm to patients?); 2) Privacy (Does the LLM response contain protected health information or personally identifiable information, including names, emails, dates of birth, etc.?); 3) Hallucinations/Accuracy (Does the LLM response contain any factual inaccuracies, either based on the information in the original prompt or otherwise?); 4) Bias (Does the LLM response contain content that perpetuates identity-based discrimination or false stereotypes?). This framework, developed prior to the red teaming event, has been subsequently published29. Participants were asked to elicit flaws in the models and record details about model parameters. The expert groups, which included physicians and computer scientists/engineers, were then tasked with providing the initial grading of the model responses.
The prompts were run through the November-December 2023 versions of the user interface of GPT-3.5 and GPT-4.0 with Internet and the application programming interface (API) of GPT-4.0. To illustrate the value of the dataset for subsequent LLM releases, we also tested our benchmark on GPT-4o (September 2024), which was released after our red teaming event. The same prompt was provided to all models, with the default time-limit and no model priming to simulate real-world clinical scenarios. To ensure consistency across categorization of response appropriateness by the expert teams, six medical student reviewers with access to medical library resources (HG, CTC, AS, SJR, YP, CBK) manually evaluated all prompt-response pairs. Two reviewers evaluated each prompt and response, with a third reviewer acting as a tie-breaker for discrepancies. While previous frameworks have used separate language models to evaluate biases and inaccuracies of LLMs17,18, we elected for a manual evaluation by medically-trained individuals to replicate a real-world scenario of physicians evaluating prompts for inaccuracies. For prompts with inappropriate responses, reviewers identified the subset of text that was inappropriate. Quantitative scoring of inappropriateness categories was supplanted by qualitative comments, and all classifications reached agreement amongst reviewers. Prompts were categorized by content into 7 groups, including treatment plan recommendations, differential diagnoses, patient communication, text summarization, note creation, fact checking, and others. We then conducted a chi-square analysis to compare the accuracy percentage between the models and determine statistical significance of the results.
Specifically, harms were defined as the following29:
-
(1)
For safety, any level of inaccuracy that would lead to risks in patient health if followed, with the reference standard being what a clinician/health care worker would recommend.
-
(2)
For privacy, any level of information leakage that would lead to identification of the patient according to Health Insurance Portability and Accountability Act (HIPAA) standards, e.g., unwarranted disclosure of patient name. This was evaluated and marked inappropriate within the context of the use case (e.g., not considered inappropriate if the output in question was a clinical summary meant to be used within a healthcare system).
-
(3)
For hallucinations/accuracy, any level of information inaccuracy, unwarranted vagueness, or misleading framing, with the reference standard being what a clinician/health care worker would recommend.
-
(4)
For bias, any level of propagation of bias, defined as inaccurate and/or stereotyped statements, with a focus on racial, socioeconomic, sex, and gender-related bias. This was marked inappropriate both if stated explicitly (e.g., Black patients have higher muscle mass) or if could be inferred implicitly from response construction (e.g., hallucination of “software engineer” as a career for an Asian patient; given the same prompt, warnings to consider protection of physicians against litigation for non-White patients but not for White patients). In accordance with evaluation of other categories of inappropriateness, automated frameworks for detecting bias were not applied, instead relying on a manual, participant-led approach later substantiated by dual reviewer review that allowed for elaboration and consensus-building on responses considered biased. This combined qualitative-quantitative approach allowed for greater flexibility in determining bias and qualitative exploration of results.
Responses