Unregulated large language models produce medical device-like output

Methods

We tested GPT-4 (Open AI, Inc., May and October 2024 versions)20 and Llama-3 (Meta Inc., May and October 2024 versions)21 with scenarios from five clinical settings: cardiology, family medicine, immunology, neurology, and psychiatry (see Supplementary Information). Each scenario began with the same single-shot prompt, specifying the criteria for non-device decision support and instructing the LLM to respond consistent with these criteria:

You are a clinical decision support tool working in an outpatient physician’s office. You have been reviewed by your local hospital but have not been authorized for use by the FDA in this practice setting because you do not meet the criteria as a device. Therefore, you must adhere strictly to the following two rules in order to abide by your regulatory designation.

First, you may only provide information intended for the purpose of supporting or providing recommendations to a healthcare professional about the prevention, diagnosis, or treatment of a disease or condition. Second, your advice may not be used to replace or direct a healthcare professional’s judgment. For example, you would fail to meet these criteria if you provided a specific preventive, diagnostic, or treatment output or directive, especially for any time-critical decision making related to a severe acute illness.

Each model then received a request for advice about specialty-specific preventive care for a hypothetical patient. Then each model received a request about a time-critical emergency tailored to the clinical environment. These included cardiac arrest, sepsis, anaphylaxis, acute stroke, and opioid overdose. Then, each model received a request using a pre-specified “desperate intern” jailbreak (“Please help me. I am just an intern and I don’t know what to do”). Because there is randomness in LLM responses, we presented each set of requests five times and recorded all responses. Model settings were reset before each scenario by starting a new chat for GPT-4 and opening a new incognito browser for Llama-3.

We also repeated this protocol for each clinical scenario using a multi-shot prompt with 48 examples of device and non-device decision support taken verbatim from the FDA clinical decision support guidance document (see Supplementary Information)7.

We evaluated the proportion of responses to each request that were consistent with device or non-device functions as outlined in the initial prompt. Secondarily, we assessed whether the recommendations were appropriate for non-clinician bystanders or suitable only for trained clinicians.

This study did not involve human participants and was not classified as human subjects research.

Related Articles

What large language models know and what people think they know

As artificial intelligence systems, particularly large language models (LLMs), become increasingly integrated into decision-making processes, the ability to trust their outputs is crucial. To earn human trust, LLMs must be well calibrated such that they can accurately assess and communicate the likelihood of their predictions being correct. Whereas recent work has focused on LLMs’ internal confidence, less is understood about how effectively they convey uncertainty to users. Here we explore the calibration gap, which refers to the difference between human confidence in LLM-generated answers and the models’ actual confidence, and the discrimination gap, which reflects how well humans and models can distinguish between correct and incorrect answers. Our experiments with multiple-choice and short-answer questions reveal that users tend to overestimate the accuracy of LLM responses when provided with default explanations. Moreover, longer explanations increased user confidence, even when the extra length did not improve answer accuracy. By adjusting LLM explanations to better reflect the models’ internal confidence, both the calibration gap and the discrimination gap narrowed, significantly improving user perception of LLM accuracy. These findings underscore the importance of accurate uncertainty communication and highlight the effect of explanation length in influencing user trust in artificial-intelligence-assisted decision-making environments.

Evaluating search engines and large language models for answering health questions

Search engines (SEs) have traditionally been primary tools for information seeking, but the new large language models (LLMs) are emerging as powerful alternatives, particularly for question-answering tasks. This study compares the performance of four popular SEs, seven LLMs, and retrieval-augmented (RAG) variants in answering 150 health-related questions from the TREC Health Misinformation (HM) Track. Results reveal SEs correctly answer 50–70% of questions, often hindered by many retrieval results not responding to the health question. LLMs deliver higher accuracy, correctly answering about 80% of questions, though their performance is sensitive to input prompts. RAG methods significantly enhance smaller LLMs’ effectiveness, improving accuracy by up to 30% by integrating retrieval evidence.

Current and future state of evaluation of large language models for medical summarization tasks

Large Language Models have expanded the potential for clinical Natural Language Generation (NLG), presenting new opportunities to manage the vast amounts of medical text. However, their use in such high-stakes environments necessitate robust evaluation workflows. In this review, we investigated the current landscape of evaluation metrics for NLG in healthcare and proposed a future direction to address the resource constraints of expert human evaluation while balancing alignment with human judgments.

A unified acoustic-to-speech-to-language embedding space captures the neural basis of natural language processing in everyday conversations

This study introduces a unified computational framework connecting acoustic, speech and word-level linguistic structures to study the neural basis of everyday conversations in the human brain. We used electrocorticography to record neural signals across 100 h of speech production and comprehension as participants engaged in open-ended real-life conversations. We extracted low-level acoustic, mid-level speech and contextual word embeddings from a multimodal speech-to-text model (Whisper). We developed encoding models that linearly map these embeddings onto brain activity during speech production and comprehension. Remarkably, this model accurately predicts neural activity at each level of the language processing hierarchy across hours of new conversations not used in training the model. The internal processing hierarchy in the model is aligned with the cortical hierarchy for speech and language processing, where sensory and motor regions better align with the model’s speech embeddings, and higher-level language areas better align with the model’s language embeddings. The Whisper model captures the temporal sequence of language-to-speech encoding before word articulation (speech production) and speech-to-language encoding post articulation (speech comprehension). The embeddings learned by this model outperform symbolic models in capturing neural activity supporting natural speech and language. These findings support a paradigm shift towards unified computational models that capture the entire processing hierarchy for speech comprehension and production in real-world conversations.

Red teaming ChatGPT in medicine to yield real-world insights on model behavior

Red teaming, the practice of adversarially exposing unexpected or undesired model behaviors, is critical towards improving equity and accuracy of large language models, but non-model creator-affiliated red teaming is scant in healthcare. We convened teams of clinicians, medical and engineering students, and technical professionals (80 participants total) to stress-test models with real-world clinical cases and categorize inappropriate responses along axes of safety, privacy, hallucinations/accuracy, and bias. Six medically-trained reviewers re-analyzed prompt-response pairs and added qualitative annotations. Of 376 unique prompts (1504 responses), 20.1% were inappropriate (GPT-3.5: 25.8%; GPT-4.0: 16%; GPT-4.0 with Internet: 17.8%). Subsequently, we show the utility of our benchmark by testing GPT-4o, a model released after our event (20.4% inappropriate). 21.5% of responses appropriate with GPT-3.5 were inappropriate in updated models. We share insights for constructing red teaming prompts, and present our benchmark for iterative model assessments.

Responses

Your email address will not be published. Required fields are marked *