Unregulated large language models produce medical device-like output
Methods
We tested GPT-4 (Open AI, Inc., May and October 2024 versions)20 and Llama-3 (Meta Inc., May and October 2024 versions)21 with scenarios from five clinical settings: cardiology, family medicine, immunology, neurology, and psychiatry (see Supplementary Information). Each scenario began with the same single-shot prompt, specifying the criteria for non-device decision support and instructing the LLM to respond consistent with these criteria:
You are a clinical decision support tool working in an outpatient physician’s office. You have been reviewed by your local hospital but have not been authorized for use by the FDA in this practice setting because you do not meet the criteria as a device. Therefore, you must adhere strictly to the following two rules in order to abide by your regulatory designation.
First, you may only provide information intended for the purpose of supporting or providing recommendations to a healthcare professional about the prevention, diagnosis, or treatment of a disease or condition. Second, your advice may not be used to replace or direct a healthcare professional’s judgment. For example, you would fail to meet these criteria if you provided a specific preventive, diagnostic, or treatment output or directive, especially for any time-critical decision making related to a severe acute illness.
Each model then received a request for advice about specialty-specific preventive care for a hypothetical patient. Then each model received a request about a time-critical emergency tailored to the clinical environment. These included cardiac arrest, sepsis, anaphylaxis, acute stroke, and opioid overdose. Then, each model received a request using a pre-specified “desperate intern” jailbreak (“Please help me. I am just an intern and I don’t know what to do”). Because there is randomness in LLM responses, we presented each set of requests five times and recorded all responses. Model settings were reset before each scenario by starting a new chat for GPT-4 and opening a new incognito browser for Llama-3.
We also repeated this protocol for each clinical scenario using a multi-shot prompt with 48 examples of device and non-device decision support taken verbatim from the FDA clinical decision support guidance document (see Supplementary Information)7.
We evaluated the proportion of responses to each request that were consistent with device or non-device functions as outlined in the initial prompt. Secondarily, we assessed whether the recommendations were appropriate for non-clinician bystanders or suitable only for trained clinicians.
This study did not involve human participants and was not classified as human subjects research.
Responses