Advantages and limitations of large language models for antibiotic prescribing and antimicrobial stewardship
Introduction
Imagine you are a hospital-based infectious diseases specialist receiving a consultation request from another ward. When you first read the request for consultation on your computer screen, an intelligent artificial assistant, leveraging large language model (LLM) technology, has already prepared a coherent summary of the patient’s medical and microbiological history, relevant laboratory and instrumental test results, and the evolution of their acute-phase conditions over the last few days1,2,3. This summary immediately gives you an initial idea of what to do, without laboriously spending many minutes searching for information across the clinical notes in the patient’s chart.
Subsequently, you go to the other ward to visit the patient and gather additional information from the patient and their treating physicians. During the consultation, your intelligent artificial assistant can (i) directly record and summarize the additional information provided by the patient and their treating physicians and (ii) suggest additional relevant questions to pose. After coherently merging the information already known from the patient’s history and test results with the new information collected during the consultation, your intelligent artificial assistant can explicitly offer suggestions for your review (e.g., prescribing a given antibiotic at a certain dosage and for a certain duration), supported by reasonable, summarized explanations.
This is only a hypothetical example of how LLMs could aid physicians in prescribing antibiotics in the near future, and it is likely not exhaustive of all potential applications of LLMs for this purpose3,4,5,6,7,8. The advantages of introducing LLMs into daily clinical practice (above all, a dramatic reduction of repetitive tasks for clinicians, freeing time for more sophisticated clinical reasoning) could be transformative for healthcare, and deep integration of LLMs within electronic health records has already been announced9. A thorough understanding of both the potential advantages and the relevant limitations is therefore essential for current and future clinicians, who will very likely deal with this emerging technology in their daily clinical practice1,10.
In this perspective, we focus on the potential advantages and limitations of introducing LLMs to support antibiotic prescribing, both in terms of improving the efficacy and safety of the therapeutic approach to the individual patient and in terms of the appropriate use of antibiotics in line with antimicrobial stewardship principles (i.e., responsible and appropriate antibiotic prescribing at both the patient and the global level, to ensure availability in the present and preserve efficacy for future populations11). Notably, these are complex medical tasks, requiring dedicated medical expertise and involving a multi-component, dynamic clinical reasoning process (Box 1)12,13,14,15,16.
Brief history of LLMs and how they work
Natural language processing (NLP) studies how computers can analyze and produce natural human language, including healthcare-related text17,18,19. NLP is considered part of the domain of artificial intelligence (AI), since it tries to reproduce tasks typically performed by humans; consequently, the evolution of language models has progressed alongside the implementation of dedicated AI algorithms. The first NLP models were rule-based systems relying on pre-written rules defined by domain experts. These models performed well on specific, simple tasks but poorly on unseen data20.
This limitation was overcome with the application of neural networks (NNs) to the task. NNs are machine learning algorithms designed to emulate the biological architecture of the human brain, i.e., networks of interconnected ‘nodes’ capable of transferring information21. NNs are considered “black box” models because the composition and computation of features in the layers between the initial (input) layer and the final (output) layer may be partly, or sometimes totally, unclear both to the data scientists building and testing the model and to the physicians assessing how the model arrived at a given output, e.g., suggesting a certain antibiotic prescription22,23. The need to better understand how such models arrive at their outputs/predictions, fundamental in healthcare to reduce the risk of overlooking biases and misinformation possibly perpetuated by black box models, has led to the expansion of research on explainable AI24,25.
Regarding the task of processing human language, recurrent NNs (RNNs), which are directed graphs that process sequential inputs, and long short-term memory (LSTM) NNs, which can store past information, improved predictive performance in text decoding. However, RNNs and LSTMs proved unable to make accurate predictions over extended sequences of text20.
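To make the sequential processing described above concrete, the following is a minimal, illustrative sketch (in Python, with toy dimensions and random weights, not any specific published architecture) of the basic RNN recurrence: each token updates a hidden state that carries information from earlier tokens forward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 8-dimensional token embeddings, 16-dimensional hidden state.
d_in, d_hidden = 8, 16
W_x = rng.normal(scale=0.1, size=(d_hidden, d_in))      # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # hidden-to-hidden (recurrent) weights
b = np.zeros(d_hidden)

def rnn_forward(token_embeddings):
    """Process a sequence token by token, carrying a hidden state forward."""
    h = np.zeros(d_hidden)                               # initial hidden state
    for x_t in token_embeddings:                         # strictly sequential processing
        h = np.tanh(W_x @ x_t + W_h @ h + b)             # new state mixes current input and past state
    return h                                             # a summary of the whole sequence

sequence = rng.normal(size=(5, d_in))                    # e.g., embeddings of a 5-token sentence
print(rnn_forward(sequence).shape)                       # (16,)
```

Because information from early tokens must survive many successive updates of the same hidden state, signals from distant words tend to fade, which is one intuition for why such models struggle with long sequences.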
In late 2017, Vaswani and colleagues introduced transformers, deep NNs whose architecture is able to handle long-range dependencies26. Transformers rely on attention mechanisms, which assign “weights” to each token, determining the importance of a word in its context to achieve better predictions and decisions; attention had also previously been applied to RNNs. Transformers exploit self-attention, calculating the weights over all the words in a sentence, whereas RNNs consider only a selected context window. This mechanism, which allows dependencies between words to be modeled independently of their distance or order, enabled NLP models to act both as encoders and as decoders of textual information. With transformers it became possible to design architectures capable of both receiving and producing natural language text, e.g., conversational transformers, a function anticipated by task-oriented dialog systems relying on encoders such as bidirectional encoder representations from transformers (BERT)27.
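As an illustration of the self-attention weighting just described, below is a minimal sketch (assuming toy dimensions and random projection matrices, purely for illustration) of scaled dot-product self-attention, in which every token’s representation is recomputed as a weighted combination of all the tokens in the sentence, regardless of their distance.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of token embeddings X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # similarity of every token to every other token
    weights = softmax(scores, axis=-1)           # attention "weights": each row sums to 1
    return weights @ V, weights                  # context-aware representation of each token

rng = np.random.default_rng(0)
n_tokens, d_model = 6, 16                        # toy sizes
X = rng.normal(size=(n_tokens, d_model))         # e.g., embeddings of a 6-token sentence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
print(out.shape, attn.shape)                     # (6, 16) (6, 6): every token attends to all tokens
```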
Transformers showed better generalization and prediction ability than previous NLP models but were initially limited by the lack of large-scale datasets and adequate computational resources26,28. Together with the introduction of graphics processing units, which increased the performance of mathematical calculations and allowed huge quantities of data to be processed, they laid the basis for the advent of LLMs. Public awareness of LLMs surged after the release of OpenAI’s GPT-3.5 in 2022. Technically, LLMs are AI algorithms that work by predicting the next tokens in a sentence and are able to extract, summarize, and generate human-like text based on patterns and relationships learned from vast amounts of data. LLMs are considered “large” because they are trained on massive amounts of data and comprise a huge number of learnable parameters, with popular LLMs reaching hundreds of billions of parameters. This allows them to produce better outputs and generalize responses better than previously proposed NLP models. Currently, the main companies releasing LLMs include OpenAI, NVIDIA/Microsoft, Meta, Google, Cohere, Anthropic (a public benefit company), and EleutherAI (a non-profit organization). For interested readers, more details on the components and functioning of LLM architectures are available in Box 220,29,30,31.
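As a toy illustration of the next-token prediction mentioned above, the following sketch (with a hypothetical four-word vocabulary and hand-picked logits standing in for the output of a trained model) shows how raw model scores are turned into a probability distribution from which the next token is chosen; real LLMs repeat this step over vocabularies of tens of thousands of tokens to generate whole passages of text.

```python
import numpy as np

# Hypothetical vocabulary and hand-picked logits standing in for a trained model's output.
vocabulary = ["amoxicillin", "meropenem", "vancomycin", "observation"]
logits = np.array([2.1, 0.3, -0.5, 1.2])    # higher logit = token considered more likely

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

probs = softmax(logits)                      # probability distribution over the next token
for token, p in zip(vocabulary, probs):
    print(f"{token:12s} {p:.2f}")

# Greedy decoding picks the most probable token; generation proceeds by appending
# the chosen token to the context and predicting again, one token at a time.
next_token = vocabulary[int(np.argmax(probs))]
print("next token:", next_token)
```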
Current literature on the use of LLMs for supporting antibiotic prescribing
The scientific literature already includes examples of using general-purpose or domain-specific LLMs, as well as chatbots powered by LLMs, to support antibiotic prescribing. For example, Maillard and colleagues recently assessed the performance of a generative pretrained transformer (GPT)-4-based chatbot (ChatGPT-4) in providing appropriate antimicrobial therapy recommendations for 44 retrospective cases of bloodstream infection (BSI)32. In this study, ChatGPT-4 was provided with all the (anonymized) information available to the clinicians who had performed the consultations (without the aid of LLMs, as per standard clinical practice), and the chatbot’s recommendations were classified as appropriate or inappropriate by infectious diseases specialists not involved in the care of the given patient. Standardized prompts were provided (once for each case) to ChatGPT-4, contextualizing the need for a comprehensive response regarding the management of a specific case of bloodstream infection in a French hospital, to be provided as if ChatGPT were the infectious diseases specialist consulting on that patient. Appropriateness was measured according to local and international guidelines. Furthermore, recommendations provided by ChatGPT-4 were also classified in terms of their harmfulness (potentially harmful for patients vs. not harmful). Overall, the appropriateness of suggestions for empirical and targeted therapy was 64% and 36%, respectively, whereas 2% of empirical and 5% of targeted prescriptions were considered potentially harmful. For example, a potentially harmful suggestion for empirical therapy was narrowing the spectrum of antibiotic therapy to a regimen not covering Gram-negative bacteria in a patient with febrile neutropenia while awaiting culture results, whereas a potentially harmful suggestion for targeted therapy was de-escalating from cefepime and vancomycin to cloxacillin in a neutropenic patient with a non-bacteremic Staphylococcus aureus infection and concomitant ongoing sepsis of suspected unrelated origin32.
In another study, Fisch and colleagues evaluated LLMs’ adherence to good clinical practice principles and to guidelines from the Infectious Diseases Society of America (IDSA) and the European Society of Clinical Microbiology and Infectious Diseases (ESCMID) when providing management indications for a hypothetical clinical case of pneumococcal meningitis originating from mastoiditis33. No definite diagnosis was provided to the LLMs. Several LLMs (Llama, Bard, Claude-2, PaLM, Bing, GPT-3.5, GPT-4) were presented with the same case three times, and, besides the appropriateness of recommendations, the heterogeneity of the management suggested by the same LLM across the three sessions was evaluated. Regarding prompting, the LLMs were asked to act as expert medical assistants advising a junior doctor on how to manage a 52-year-old patient with headache and confusion, with subsequent conversation on the case providing a more specific description of signs and symptoms. Among the questions inherent to antibiotic prescribing, LLMs were evaluated on: (i) whether or not antibiotic prescribing was necessary; and (ii) if antibiotic administration was suggested, whether the type and dosages of the suggested empirical antibiotics were in line with IDSA and ESCMID guidelines. Overall, a total of 21 responses were collected for each question, from the three sessions for each of the seven evaluated LLMs. The need for rapid antibiotic administration was correctly recognized in 81% of cases. The correct type of empirical antibiotic (in line with IDSA and ESCMID guidelines) was suggested in 38% of cases, and, whenever correct antibiotics were suggested, correct dosages were provided in almost 90% of cases. Some misleading statements were also identified: for example, hallucinations included the presence of Kernig’s sign and a stiff neck (not described in the presented case), and misleading interpretations included recognizing herpes ophthalmicus instead of bacterial meningitis. Heterogeneity was observed for all models across the three sessions, impacting the rate of adherence to guidelines. Among the evaluated LLMs, ChatGPT-4 provided the most consistent responses across the three sessions33.
In another study, the performance of the LLM-based chatbots ChatGPT-3.5 and ChatGPT-4 in replying to different questions regarding antibiotic prophylaxis in patients undergoing spine surgery was evaluated against the North American Spine Society (NASS) guidelines, which served as the reference standard for assessing the accuracy of responses34. Prompts were formulated exactly as the 16 original questions of the NASS guidelines (with a reference to spine surgery added whenever it was not included in the question, to provide the context present in the guidelines but absent from isolated prompts). The accuracy of responses was 63% (10/16) for ChatGPT-4 and 81% (13/16) for ChatGPT-3.5. ChatGPT-3.5 showed a tendency towards overconfident but potentially erroneous or contradictory responses, whereas ChatGPT-4 showed an increased tendency to support its statements with references, including the NASS guidelines34.
Recently, Lai and colleagues assessed the accuracy and repeatability of responses provided by ChatGPT-3.5 to queries about Helicobacter pylori, including queries regarding the treatment of H. pylori infection (six of a total of 22 queries)35. To assess repeatability, the same question (prompt) was presented to ChatGPT-3.5 two weeks after the first session. Responses provided by ChatGPT-3.5 were independently assessed by two expert gastroenterologists using the following scoring system: (i) comprehensive (four points); (ii) correct but inadequate (three points); (iii) mixed correct/incorrect or outdated (two points); (iv) completely incorrect (one point). Confirmation vs. rejection of repeatability between the two responses to the same query provided two weeks apart was also based on the independent judgment of the two expert gastroenterologists (another expert with >20 years of experience in H. pylori infection was involved in the final decision in case of disagreement). Notably, responses regarding treatment showed the lowest score (mean 3.25, standard deviation ±0.48). Over 80% of these responses were rated as comprehensive (four points) or correct but inadequate (three points), but 16.6% were rated as mixed correct/incorrect or outdated (two points). Regarding repeatability, ChatGPT-3.5 provided similar responses across the two sessions in 95.2% of cases35.
In another recent paper, Chakraborty and colleagues posed two questions to ChatGPT (version not reported) regarding the management of antibiotic-resistant infections36. For the first question, ChatGPT was provided with susceptibility test results for several antibiotics without clinical context or the bacterial genus and species. While ChatGPT appropriately suggested a thorough evaluation of the patient’s condition and consultation with an infectious diseases specialist, it also recommended meropenem, which could be inappropriate without further information. The second question was similar but included resistance to carbapenems. Again, ChatGPT emphasized the need for more context and specialist consultation, but it recommended colistin, which does not align with recent guidelines for managing carbapenem-resistant Gram-negative infections, in which colistin is no longer a first-line therapy37,38. No other sessions were performed to assess the consistency of responses to the same prompt36.
Finally, De Vito and colleagues recently evaluated ChatGPT-4’s performance in responding to true/false and open-ended questions regarding clinical cases of bacterial infections with available susceptibility test results, totaling 96 questions39. Experts in antibiotic prescribing formulated the questions, which were also posed to four senior residents and four infectious diseases specialists. Responses from humans and from ChatGPT-4 were assessed for accuracy and completeness by the experts, who were blinded to whether responses came from humans or from ChatGPT-4. ChatGPT-4 showed accuracy similar to that of humans on true/false questions (approximately 70% correct) and provided more complete and accurate responses to open-ended questions than the human participants. However, ChatGPT-4 struggled to recognize resistance mechanisms and tended not to prescribe recently approved antibiotics for multidrug-resistant Gram-negative infections, favoring older, more toxic antibiotics such as polymyxins. ChatGPT-4 also tended to suggest longer-than-necessary antibiotic treatment durations compared with the human participants39.
Discussion and perspective
With the emergence of LLMs in healthcare decision-making, researchers have also started investigating their potential to support antibiotic prescribing (Box 3)32,33,34,35,36,39,40,41,42. Several fundamental points should be considered, based on the initial literature on this topic.
The first point is the lack of standardization in research on the use of LLMs to support antibiotic prescribing. Standardization is likely needed for building prompts, for the number of sessions in which the same prompt should be presented to a given LLM or LLM-based chatbot, for how subsequent questions should be prepared and posed, and for how to measure the accuracy and consistency of responses. Even the term used to describe the comparison of responses to the same prompt varies across studies (e.g., consistency vs. repeatability). This heterogeneity is not unique to the evaluation of LLMs for antibiotic prescribing, but also affects research on LLMs supporting healthcare decisions more generally. Initiatives such as the Chatbot Assessment Reporting Tool (CHART) project aim to improve the standardization of research methodology on the use of LLMs to support healthcare decisions, which could be fundamental in improving the generalizability and comparability of research findings on LLMs supporting antibiotic prescribing43.
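As one possible illustration of how repeated-session consistency could be quantified, the sketch below presents the same standardized prompt to a model several times and computes the mean pairwise similarity of the responses. The query_llm function is a hypothetical placeholder, and simple string similarity is only one of many possible metrics (expert adjudication, as used in the studies above, being another).

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def query_llm(prompt: str, session: int) -> str:
    """Hypothetical placeholder for a call to an LLM or LLM-based chatbot."""
    raise NotImplementedError("replace with a real API call")

def session_consistency(prompt: str, n_sessions: int = 3) -> float:
    """Mean pairwise string similarity of responses to the same prompt across sessions.

    Returns 1.0 for identical responses and lower values for more heterogeneous ones.
    """
    responses = [query_llm(prompt, session=i) for i in range(n_sessions)]
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in combinations(responses, 2))

# Example of a standardized prompt presented identically in every session:
prompt = (
    "Acting as an infectious diseases consultant, suggest empirical antibiotic "
    "therapy for the following anonymized case: ..."
)
# score = session_consistency(prompt, n_sessions=3)
```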
The second point is the need to improve the human ability to identify biases or misinformation in confident and convincing outputs from black box models, which may theoretically mislead even expert antibiotic prescribers when subtle errors or biases are perpetuated. Some authors advocate relying on interpretable models only, arguing that the assumed superiority in accuracy of black box models over interpretable models should not be taken for granted22. Nonetheless, while interpretable models should be preferred when their accuracy is similar to that of black box models, this may not always be the case for machine learning models working on unstructured data, such as LLMs23,24,25. This highlights the issue of LLM explainability or, more specifically, the question of what level of explanation accuracy and correctness can be deemed acceptable for healthcare decisions. Establishing such a standard would require transparency about the datasets used for training, clarity regarding model architectures, and acknowledgment of potential biases in the training data to avoid their perpetuation. Complicating this picture, however, is the fact that, in the absence of sufficiently reliable explanations, the need to consider explainability an absolute requirement for AI models has also been questioned, with some authors suggesting a focus on rigorous internal and external validation of AI models instead44,45. Independent of the final trajectory (favoring either explanations or validation), preprocessing of data will need to ensure privacy preservation, correction of grammatical errors, and proper recognition of medical terms and abbreviations.
Other techniques, alternative or potentially synergistic to improving explainability, have emerged to reduce hallucinations and biases when supporting healthcare decisions. Retrieval-augmented generation (RAG) combines the pretrained parametric memory of an LLM with an external non-parametric memory (e.g., one can imagine a link to the most updated guidelines on antibiotic prescribing for different infectious diseases), possibly in the form of knowledge graphs, leading to LLM responses more grounded in factual knowledge46,47. Chain-of-thought prompting has been shown to elicit multi-step reasoning behavior in LLMs, which could improve their performance independently of fine-tuning48. Dedicated evaluation frameworks to assess the reliability of LLMs as assistants in the biomedical domain have also been developed, prioritizing prompt robustness, high recall, and absence of hallucinations as necessary criteria for this use case49, although specific evaluation frameworks will likely be needed for antibiotic prescribing, considering the peculiar need to adequately balance benefits for individual patients against global benefits in terms of reducing the development and selection of antimicrobial resistance. In this light, while very high recall would likely be essential when supporting the selection of effective empirical antibiotics in patients with a severe clinical presentation and reduced survival in case of delayed initiation of effective therapy, the same may not always hold true for less severe clinical presentations with no immediate prognostic repercussions, which could require a different balance between recall and other performance metrics from an antimicrobial stewardship perspective.
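As a minimal illustration of the retrieval-augmented generation idea, the sketch below retrieves the guideline snippets most relevant to a question from a small, hypothetical corpus and prepends them to the prompt so that the model’s answer can be grounded in that external text. The corpus, the word-overlap retrieval step, and the prompt wording are deliberate simplifications (real systems typically use dense vector retrieval over curated, regularly updated sources).

```python
# Hypothetical guideline snippets standing in for an external, regularly updated corpus.
GUIDELINE_SNIPPETS = [
    "Guideline A (2024): recommended empirical regimen for condition X is drug Y ...",
    "Guideline B (2023): de-escalation is advised once susceptibility results are available ...",
    "Guideline C (2024): colistin is no longer first-line for infection Z ...",
]

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retrieval step: rank snippets by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(corpus, key=lambda s: len(q_words & set(s.lower().split())), reverse=True)
    return ranked[:k]

def build_rag_prompt(question: str) -> str:
    """Prepend retrieved guideline text so the answer can be grounded in it."""
    context = "\n".join(retrieve(question, GUIDELINE_SNIPPETS))
    return (
        "Answer using only the guideline excerpts below and cite them explicitly.\n\n"
        f"Guideline excerpts:\n{context}\n\nQuestion: {question}"
    )

print(build_rag_prompt("Which empirical regimen is recommended for condition X?"))
# The assembled prompt would then be sent to the LLM (e.g., via a chat API).
```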
Expert human evaluation of responses provided by LLMs, during development and before/during implementation in clinical practice, would also be crucial. Against this background, initiatives such as the Translational Evaluation of Healthcare AI (TEHAI) have been launched with the aim of developing and standardizing a comprehensive, multi-stage evaluation of AI models, including LLMs50,51. Scaling human evaluation through crowdsourcing and developing dedicated benchmarks to assess the alignment of AI with human preferences are also being explored as ways to expand and possibly improve the global evaluation of model performance52. Finally, in our opinion, exploring dedicated designs and metrics for randomized studies assessing LLMs’ performance in clinical practice may also prove essential for evaluating, with the highest certainty of evidence, the efficacy and safety of LLMs or LLM-based chatbots in improving appropriate antibiotic prescribing.
A third point, specific to antibiotic prescribing and introduced in the previous paragraphs, involves balancing the best possible treatment for the individual patient with reducing antimicrobial resistance at a more global level, in line with the core objectives of antimicrobial stewardship. This peculiar challenge will require dedicated and standardized guidance for measuring the appropriateness of LLMs’ suggestions for antibiotic prescribing. Overall, all the above considerations necessitate a multidisciplinary approach to LLM development, approval, and clinical use for antibiotic prescribing and antimicrobial stewardship. Collaborations between clinicians and data scientists should be supported by structured governance, regulatory, and ethical frameworks that can keep pace with the rapid development of LLMs and their application in healthcare. Furthermore, the complex and nuanced nature of antibiotic prescribing will likely require complex platforms that connect and coordinate different models dedicated to the various dynamic phases of prescribing, and that act globally as moral agents balancing the needs of the individual with those of the broader society53. Notably, no LLM tools are currently approved by regulators for use in antibiotic-prescribing settings. Transparency of LLM models (data openness, data quality, and model explainability) and clear accountability for LLM-supported decisions could be crucial from a regulatory standpoint, intended across all phases, from model development to the evaluation of the trustworthiness of responses/suggestions provided by fully developed models. Educating future medical professionals on these aspects will also play a fundamental role in improving the proper use of LLM-based support for antibiotic prescribing by ensuring a deeper understanding of its strengths and limitations (Box 4)54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79. Despite all these current limitations and areas needing improvement, the theoretical advantages of LLM-based support for antibiotic prescribing are undeniable, for example: (i) reduced cognitive load from routine tasks in busy hospitals, giving clinicians more time to focus on complex, high-value tasks related to antibiotic prescribing and the management of complex infections; (ii) enhanced efficiency, with reduced time spent manually searching antibiotic guidelines; (iii) improved and rapid provision of contextual insights for clinical reasoning, based on elaborating and combining medical record information with external knowledge from the most updated research findings and guidelines; and (iv) integration within healthcare records with automated monitoring for early identification of adverse events, clinical events, or new laboratory results requiring changes in antimicrobial therapy, aiming both to improve patients’ outcomes and to reduce the selection and dissemination of antimicrobial resistance. Recent estimates indicate that more than 8 million deaths annually could be associated with antimicrobial resistance by 2050, surpassing deaths from other widespread diseases80. Achieving a balanced and safe use of LLM support in antibiotic prescribing and antimicrobial stewardship initiatives is thus an opportunity not to be missed.