Current and future state of evaluation of large language models for medical summarization tasks


Introduction

The rapid development of Large Language Models (LLMs) has led to significant advancements in the field of Natural Language Generation (NLG). In the medical domain, LLMs have shown promise in reducing documentation-based cognitive burden for healthcare providers, particularly in NLG tasks such as summarization and question answering. Summarizing clinical documentation has emerged as a critical NLG task as the volume of medical text in Electronic Health Records (EHRs) continues to expand1.

Recent advancements, like the introduction of larger context windows in LLMs (e.g., Google’s Gemini 1.5 Pro with a 1 million-token capacity2), allow for the processing of extensive textual data, making it possible to summarize entire patient histories in a single input. However, a major challenge in applying LLMs to high-stakes environments like medicine is ensuring the reliable evaluation of their performance. Unlike traditional approaches, generative AI (GenAI) offers greater flexibility by generating natural language narratives that use language dynamically to fulfill tasks. Yet, this flexibility introduces added complexity in assessing the accuracy, reliability, and quality of the generated output where the desired response is not as static.

The evaluation of clinical summarization by LLMs must address the intricacies of complex medical texts and tackle LLM-specific challenges such as relevancy, hallucinations, omissions, and ensuring factual accuracy3. Healthcare data can further complicate the LLM-specific challenges because they can contain conflicting or incorrect information. Current metrics, like n-gram overlap and semantic scores, used in summarization tasks are insufficient for the nuanced needs of the medical domain4. While these metrics may perform adequately for simple extractive summarization, they fall short when applied to abstractive summarization5, where complex reasoning and in-depth medical knowledge are required. They are also unable to differentiate the needs of various users or provide evaluations that account for the relevancy of generations.

In the era of GenAI, automation bias further complicates the potential risks posed by LLMs, particularly in clinical settings where the consequences of inaccuracies can be severe. Therefore, efficient and automated evaluation methods are essential. In this review, we examine the current state of LLM evaluation in summarization tasks, highlighting both its applications and limitations in the medical domain. We also propose a future direction to overcome the labor-intensive process of expert human evaluation, which is time-consuming, costly, and requires specialized training.

Search strategy and selection criteria

Comprehensive literature searches were conducted across multiple databases for summarization and question-answering tasks, with a special focus on clinical applications (Fig. 1). From April 20, 2023 through August 3, 2023, searches were conducted across the Association for Computational Linguistics (ACL) Anthology, Medline, and Scopus databases for literature that employed human frameworks or pre-LLM automated metrics for evaluative efforts related to these tasks. This search resulted in 262 abstracts for review. From April 16, 2024 through June 6, 2024, searches were conducted across the Association for Computational Linguistics (ACL) Anthology, Association for Computing Machinery (ACM) Digital Library, Web of Science, Institute of Electrical and Electronics Engineers (IEEE) Xplore, and Scopus databases for literature that utilized large language models in evaluative processes related to these tasks. This search resulted in 95 abstracts for review. The free text, filters, and queries by database for each search can be found in Supplementary Tables 1 and 2. Materials were selected for inclusion in this review if they (1) introduced novel human evaluation frameworks or automated metrics, (2) centered on a clinically relevant summarization task, or (3) demonstrated improvements over Recall-Oriented Understudy for Gisting Evaluation (ROUGE). Following abstract review against these criteria, 82 and 48 papers, respectively, underwent full-text review, for a total of 130. We also chose to include any materials that were referenced by multiple other articles when presenting potential improvement comparisons or fundamental knowledge pertaining to the task of evaluation.

Fig. 1: Literature search overview.

A high-level view of the two literature searches conducted for this review and their results.


Human evaluations in electronic health record documentation

The current human evaluation frameworks for human-authored clinical notes are largely based on pre-GenAI rubrics that assess clinical documentation quality. These frameworks vary depending on the type of evaluators, content, and the analysis required to generate evaluative scores. Such flexibility allows for tailored evaluation methods, capturing task-specific aspects that ensure quality generation. Expert evaluators, with their field-specific knowledge, play a crucial role in maintaining high standards of assessment.

Some commonly used pre-GenAI rubrics include the SaferDx6, Physician Documentation Quality Instrument (PDQI-9)7, and Revised-IDEA8 rubrics. The SaferDx rubric focuses on identifying diagnostic errors and analyzing missed opportunities in EHR documentation through a 12-question retrospective survey aimed at improving diagnostic decision-making and patient safety. The PDQI-9 evaluates physician note quality across nine criteria questions, ensuring continuous improvement in clinical documentation and patient care. The Revised-IDEA tool offers feedback on clinical reasoning documentation through a 4-item assessment. All three of these rubrics place emphasis on the omission of relevant diagnoses throughout the differential diagnosis process and the relevant objective data, processes, and conclusions associated with those diagnoses. They also require clinical documentation to be free of incorrect, inappropriate, or incomplete information, emphasizing the importance of the quality of evidence and reasoning present in clinical documentation. Each rubric includes additional questions based on the origin and usage of specific clinical documentation, such as the PDQI-9’s assessment of organization to ensure a reader can understand the clinical course of a patient. Each of the three also uses a different assessment style based on the granularity of the questions and the intention behind the assessment. For instance, the Revised-IDEA tool uses a count-style assessment for three of its four items to guarantee the inclusion of a minimum number of objective data points and of the features required for high-quality diagnostic reasoning documentation. In recent publications, the SaferDx tool has been used for retrospective analysis of the use of GenAI in clinical practice9, whereas the PDQI-9 and Revised-IDEA tools have been utilized to compare the quality of clinical documentation written by clinicians versus GenAI methods10,11,12. Although none of these rubrics was originally designed to evaluate LLM-generated content, they offer valuable insights into the essential criteria for evaluating text generated in the medical domain.

Human evaluations remain the gold standard for LLM outputs13. However, because these rubrics were initially developed for evaluating clinician-generated notes, they may need to be adapted for the specific purpose of evaluating LLM-generated output. Several new and modified evaluation rubrics have emerged to address the unique challenges posed by LLM-generated content, including evaluating the consistency and factual accuracy (i.e., hallucinations) of the generated text. Common themes in these adapted rubrics include safety14, modality15,16, and correctness17,18.

Criteria for human evaluations

In general, the criteria that make up evaluation rubrics for LLM output fall into seven broad categories: (1) Hallucination4,17,18,19,20,21,22, (2) Omission14,19, (3) Revision23, (4) Faithfulness/Confidence15,16,23, (5) Bias/Harm14,16,22, (6) Groundedness14,15, and (7) Fluency15,17,20,23. Hallucination encompasses any evaluative questions that intend to capture when information in a generated text does not follow from the source material. Unsupported claims, nonsensical statements, improbable scenarios, and incorrect or contradictory facts would be flagged by questions in this category. Omission-based questions are used to identify missing information in a generated text. Medical facts, important information, and critical diagnostic decisions can all be considered omitted when not included in generated text, if those items would have been included by a medical professional. When an evaluator is asked to make revisions or estimate the number of revisions needed for a generated text, the evaluative question falls under Revision. Generated texts are revised until they meet the standards set forth by a researcher, hospital system, or larger government body. Faithfulness/Confidence is generally characterized by questions that capture whether a generated text has preserved the content of the source text and presented conclusions that reflect the confidence and specificity present in the source text. Questions about Bias/Harm evaluate whether generated text introduces potential harm to a patient or reflects bias in the response. Information that is inaccurate, inapplicable, or poorly applied would be captured by questions in this category. Groundedness refers to evaluative questions that grade the quality of the source-based evidence for a generated text. Any evidence that reflects poor reading comprehension, recall of knowledge, or reasoning steps, or that is antithetical to scientific consensus, would result in a poor groundedness score. In addition to the content of a generated text, the Fluency of a generated text is also included in evaluations. Coherency, readability, grammatical correctness, and lexical correctness fall under this category. In many cases, Fluency is assumed to be adequate in favor of focusing on content-based evaluative criteria.

Analysis of human evaluations

The method of analysis for evaluation rubrics can also vary based on the setting and task. Evaluative scores can be calculated using binary/Likert categorizations14,15, counts/proportions of pre-specified instances22, edit distance23, or penalty/reward schemes similar to those used for medical exams24. Binary categorizations answer evaluative questions using a True/False or Yes/No response format. This set-up allows complex evaluations to be broken down into simpler and potentially more objective decisions. A binary categorization places more penalization on smaller errors by pushing responses to be either acceptable or unacceptable. Likert-scaled categorizations allow for a higher level of specificity in the score by providing an ordinal scale. These scales can consist of as many levels as necessary, and in many cases there are between 3 and 9 levels including a neutral option for unclear responses. Scales with a larger number of levels introduce more problems with meeting normality assumptions in an analysis, along with added complexity and disagreement among reviewers. Count/proportion-based evaluations require an evaluator to identify pre-specified instances of correct or incorrect key phrases related to a particular evaluative criterion. A precision, recall, F-score, or rate can then be computed from an evaluator’s annotations to establish a numerical score for a generated text. Edit distance evaluations also require an evaluator to make annotations on the generated text that is being evaluated. In these cases, an evaluator makes edits to the generated text until it is satisfactory or no longer contains critical errors. These edits can be corrections of factual errors, inclusion of omissions, or removal of irrelevant items. The evaluative score is the distance between the original generated text and the edited version, based on the number of characters, words, etc. that required editing. The Levenshtein distance25 is an example of an algorithm used to calculate the distance between the generated text and its edited version. This distance is calculated as the minimum number of substitutions, insertions, and deletions of individual characters required to change the original to the edited version. Finally, one of the more complex ways to compute evaluative scores is to use a Penalty/Reward scheme. These schemes award points for positive outcomes on evaluative questions and penalize negative outcomes. Such schemes are similar to those seen on national exams, which account for positive and negative scores using the importance and difficulty associated with different questions. For example, the scheme used to evaluate LLMs on the Med-HALT dataset averages points over the correct and incorrect answers, which are assigned +1 and −0.25 points, respectively24. This evaluation scheme provides a high level of specificity for assigning weights representative of the trade-off between false positives and false negatives.
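
To make two of these analysis styles concrete, the sketch below (plain Python; the example texts and evaluator judgements are hypothetical) computes a character-level Levenshtein edit distance between a generated note and its clinician-edited version, and a Med-HALT-style penalty/reward average that assigns +1 to correct and −0.25 to incorrect answers.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character substitutions, insertions, and deletions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def penalty_reward_score(judgements: list[bool],
                         reward: float = 1.0,
                         penalty: float = -0.25) -> float:
    """Med-HALT-style score: average of +1 for correct and -0.25 for
    incorrect answers over all evaluative questions."""
    points = [reward if ok else penalty for ok in judgements]
    return sum(points) / len(points)


# Hypothetical example: an evaluator edited a generated sentence and
# answered four binary evaluative questions.
generated = "Patient denies chest pain and dyspnea."
edited = "Patient denies chest pain but reports dyspnea on exertion."
print(levenshtein(generated, edited))                   # edit-distance score
print(penalty_reward_score([True, True, False, True]))  # (1 + 1 - 0.25 + 1) / 4 = 0.6875
```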

Drawbacks of human evaluations

While human evaluations provide nuanced assessments, they are resource-intensive and heavily reliant on the recruitment of evaluators with clinical domain knowledge. The experience and background of an evaluator can significantly influence how they interpret and evaluate generated text. Additionally, the level of guidance and specificity in evaluative instructions determines how much of the assessment is shaped by the evaluators’ personal interpretations and beliefs about the task. Although increasing the number of evaluators could mitigate some of these biases, resources—both time and financial—often limit the scale of human evaluations. These evaluations also require substantial manual effort, and without clear guidelines and training, inter-rater agreement may suffer. Ensuring that human evaluators align with the evaluation rubric’s intent requires training, much like annotation guidelines for NLP shared tasks26,27,28. In the clinical domain, medical professionals are typically used as expert evaluators, but their time constraints limit their availability for large-scale evaluations. The difficulty of recruiting more medical professionals, compounded by the time needed for thorough assessments, makes frequent, rapid evaluations impractical.

Another concern is the validity of the evaluation rubric itself. A robust human evaluation framework must possess strong psychometric properties, including construct validity, criterion validity, content validity, and inter-rater reliability, to ensure reproducibility and generalizability. Unfortunately, many frameworks used in clinical evaluations do not provide sufficient details about their creation, making it difficult to assess their validity15,24. Often, human evaluation frameworks are developed for specific projects with only one evaluator, and while metrics like inter-rater reliability are crucial to establish validity, they are not always reported18,23. Moreover, clinically relevant evaluation rubrics have not been specifically designed to assess LLM-generated summaries. Most existing evaluation rubrics focus on assessing human-authored note quality, and they do not encompass all the elements required to evaluate the unique aspects of LLM-generated outputs6,7,8.

Pre-LLM automated evaluations

Automated metrics offer a practical solution to the resource constraints of human evaluations, particularly in fields like Natural Language Processing (NLP), where tasks such as question answering, translation, and summarization have long relied on these methods. Automated evaluations employ algorithms, models, or heuristic techniques to assess the quality of generated text without the need for continuous human intervention, making them far more efficient in terms of time and labor. These metrics, however, depend heavily on the availability of high-quality reference texts, often referred to as “gold standards.” The generated text is compared against these gold standard reference texts to evaluate its accuracy and how well it meets the task’s requirements. Despite their efficiency, automated metrics may struggle to capture the nuance and contextual understanding required in more complex domains, such as clinical diagnosis, where subtle differences in phrasing or reasoning can have significant implications. Therefore, while automated evaluations are valuable for their scalability, their effectiveness is closely tied to the quality and relevance of the reference texts used in the evaluation.

Categories of automated evaluation

Automated evaluations in the clinical domain can be categorized into five primary types (Fig. 2), each tailored to specific evaluation goals and dependent on the availability of reference and source material for the generated text: (1) Word/Character-based, (2) Embedding-based, (3) Learned metrics, (4) Probability-based, and (5) Pre-Defined Knowledge Base.

Fig. 2: Pre-LLM automated evaluation metric taxonomy.

A structured organization of pre-LLM automated evaluation metrics categorized by their bases and the need for ground truth references. Those metrics that were built for or have been applied in the clinical domain are in bold. The taxonomy includes Recall-Oriented Understudy for Gisting Evaluation (ROUGE)29, Metric for Evaluation of Translation with Explicit Ordering (METEOR)66, Jensen-Shannon (JS) Divergence67, Consensus-based Image Description Evaluation (CIDEr)68, PyrEval69, Standardized Bilingual Evaluation Understudy (sacreBLEU)70, Summarization Evaluation by Relevance Analysis (SERA)71, POURPRE72, Basic Elements (BE)73, Bilingual Evaluation Understudy (BLEU)70, General Text Matcher (GTM)74, Word Error Rate (WER)75/ Translation Edit Rate (TER)76, Improving Translation Edit Rate (ITER)77/CDER (Cover-Disjoint Error Rate)78, chrF (character n-gram F-score)79, characTER (Character Level Translation Edit Rate)80, Extended Edit Distance (EED)81, YiSi82, Q-metrics83, Concept Unique Identifier (CUI) F-Score37, Crosslingual Optimized Metric for Evaluation of Translation (COMET)32, Bilingual Evaluation Understudy with Representations from Transformers (BLEURT)84, Combined Regression Model for Evaluating Responsiveness (CREMER)85, Better Evaluation as Ranking (BEER)86, BLEND87, Composite88, Neural Network Based Evaluation Metric (NNEval)88, Enhanced Sequential Inference Model (ESIM)89, Regressor Using Sentence Embeddings (RUSE)90, Bidirectional Encoder Representations from Transformers for Machine Translation Evaluation (BERT for MTE)91, ClinicalBLEURT19, Conditional Bilingual Mutual Information (CBMI)92, NIST93, BERTScore30, MoverScore94, AUTOmatic SUMMary Evaluation based on N-gram Graphs (AutoSumm ENG)95, Merge Model Graph (MeMoG)95, Semantic Propositional Image Caption Evaluation (SPICE)96, BERTr97, Word Embedding-based automatic MT evaluation using Word Position Information (WEWPI)98, Word Mover-Distance (WMD)99, SIMILE100, NeUral Based Interchangeability Assessor (NUBIA)101, SapBERTScore102, ClinicalBERTScore103, PubMedBERTScore104, UMLSScorer38, MIST19, Summary-Input Similarity Metrics (SIMetrix)67, BARTScore33, Hallucination Risk Measurement+ (HARiM+)105, ClinicalBARTScore33, MedBARTScore19, Semantic Normalized Cumulative Gain (SEM-nCG)106, Intrinsic Knowledge Graph107.


Word/Character-based evaluations rely on comparisons between a reference text and the generated text to compute an evaluative score. These evaluations can be based on character, word, or sub-sequence overlaps, depending on the needs of the evaluation and the nuance that may be present in the text. Recall-Oriented Understudy for Gisting Evaluation (ROUGE)29 is a prime example of a word/character-based metric. The many variants of ROUGE, including N-gram Co-Occurrence (ROUGE-N), Longest Common Sub-sequence (ROUGE-L), Weighted Longest Common Sub-sequence (ROUGE-W), and Skip-Bigram Co-Occurrence (ROUGE-S), represent the level of comparison between the reference and generated texts. ROUGE-L is the current gold standard for automated evaluation, especially in summarization, and relies on the longest common subsequence between the reference and generated texts. The evaluative score is computed as the fraction of words in the text that are in the longest common subsequence. Edit distance metrics25 also fall under this category, as they are based on the number of words or characters that would need to be changed to match the reference and generated texts. Edits can be classified as insertions, deletions, substitutions, or transpositions of the words/characters in the generated text.
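
A minimal sketch of the ROUGE-L computation (word-level longest common subsequence, reported as recall, precision, and F1) is shown below; in practice an established implementation such as the rouge-score package would typically be used, and the example texts are illustrative only.

```python
def lcs_length(ref: list[str], gen: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(gen) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, start=1):
        for j, g in enumerate(gen, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == g else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(reference: str, generated: str) -> dict:
    """ROUGE-L: fraction of reference (recall) and generated (precision)
    tokens covered by the longest common subsequence, plus their F1."""
    ref, gen = reference.lower().split(), generated.lower().split()
    lcs = lcs_length(ref, gen)
    recall = lcs / len(ref)
    precision = lcs / len(gen)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}


print(rouge_l("the patient was discharged home in stable condition",
              "patient discharged home in stable condition"))
```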

Embedding-based evaluations create contextualized or static embeddings of the reference and generated texts for comparison, rather than relying on exact matches between words or characters. These embedding-based metrics are able to capture semantic similarities between two texts because the embedding for a word or phrase is based on the surrounding text as well as the word or phrase itself. BERTScore30 is a commonly used metric that falls under this category. For this metric, a Bidirectional Encoder Representations from Transformers (BERT) model31 is used to generate the contextualized embeddings before computing a greedy cosine similarity score based on those embeddings.
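
The greedy matching at the heart of BERTScore can be sketched with generic token embeddings; in the real metric these come from a BERT model, whereas the arrays below are random placeholders.

```python
import numpy as np


def greedy_bertscore(ref_emb: np.ndarray, gen_emb: np.ndarray) -> dict:
    """BERTScore-style greedy matching: each token embedding is matched to
    its most similar counterpart and the similarities are averaged.
    ref_emb and gen_emb are (num_tokens, dim) arrays of contextual embeddings."""
    # Normalize rows so dot products are cosine similarities.
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    gen = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    sim = gen @ ref.T                   # (gen_tokens, ref_tokens)
    precision = sim.max(axis=1).mean()  # each generated token vs. its best reference token
    recall = sim.max(axis=0).mean()     # each reference token vs. its best generated token
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": float(precision), "recall": float(recall), "f1": float(f1)}


# Placeholder embeddings standing in for BERT outputs for two short texts.
rng = np.random.default_rng(0)
print(greedy_bertscore(rng.normal(size=(7, 768)), rng.normal(size=(5, 768))))
```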

Learned metric-based evaluations rely on training a model to compute the evaluations. These metrics can be trained on example evaluation scores or directly on the reference and generated text pairs. Regression and neural network models are the foundation of these metrics, providing varying degrees of complexity for the learnable parameters. The Crosslingual Optimized Metric for Evaluation of Translation (COMET)32 falls under this category, as it is a neural model trained for evaluation. It was originally created for the evaluation of machine translations but has since been applied to other generative tasks. COMET uses a neural network that takes the generated text as input and produces an evaluative score. This metric can be applied to datasets that are reference-less as well as those with reference texts.
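
As an illustration of the learned-metric category only (not COMET itself, which trains a neural network over encoder embeddings), the sketch below fits a simple regression from hand-crafted features of reference/generated pairs to hypothetical human quality scores.

```python
import numpy as np
from sklearn.linear_model import Ridge


def pair_features(reference: str, generated: str) -> list[float]:
    """Simple placeholder features for a (reference, generated) pair."""
    ref, gen = set(reference.lower().split()), set(generated.lower().split())
    overlap = len(ref & gen) / max(len(ref | gen), 1)
    length_ratio = len(generated.split()) / max(len(reference.split()), 1)
    return [overlap, length_ratio]


# Hypothetical training data: text pairs with human-assigned quality scores.
pairs = [("no acute distress noted", "patient in no acute distress", 0.9),
         ("no acute distress noted", "patient reports severe distress", 0.2),
         ("continue metformin 500 mg", "continue metformin 500 mg daily", 0.95),
         ("continue metformin 500 mg", "stop all medications", 0.1)]
X = np.array([pair_features(r, g) for r, g, _ in pairs])
y = np.array([score for _, _, score in pairs])

metric = Ridge(alpha=1.0).fit(X, y)  # the "learned" evaluation metric
print(metric.predict([pair_features("no acute distress noted",
                                    "patient comfortable, no distress")]))
```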

Probability evaluations rely on calculating the likelihood of a generated text based on domain knowledge, reference texts, or source material. These metrics equate high-quality generations with those that have a high probability of being coherent or relevant to the reference or source text. They also penalize the inclusion of off-topic or unrelated information. An example is BARTScore33, which calculates the sum of log probabilities for the generated output based on the reference text. In this case, the log probabilities are computed using the Bidirectional and Auto-Regressive Transformer (BART) model, which assesses how well the generated text aligns with the expected content34.
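
A rough sketch of the BARTScore idea, computing the summed token log-probability of the generated text conditioned on the reference under a BART model via the Hugging Face transformers library, is shown below; the checkpoint name is an illustrative choice.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Rough BARTScore-style computation: sum of token log-probabilities of the
# generated text conditioned on the reference, under a BART model.
model_name = "facebook/bart-large-cnn"  # illustrative checkpoint choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()


def bart_score(reference: str, generated: str) -> float:
    src = tokenizer(reference, return_tensors="pt", truncation=True)
    tgt = tokenizer(generated, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # The returned loss is the mean token-level negative log-likelihood,
        # so the summed log-probability is -loss times the target length.
        out = model(**src, labels=tgt["input_ids"])
    return float(-out.loss * tgt["input_ids"].shape[1])


print(bart_score("Chest X-ray shows no acute cardiopulmonary process.",
                 "No acute findings on chest imaging."))
```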

Pre-Defined Knowledge Base metrics rely on established databases of domain-specific knowledge to inform the evaluation of generated text. These metrics are particularly valuable in specialized fields like healthcare, where general language models may lack the necessary depth of knowledge. By incorporating domain-specific knowledge bases, such as the National Library of Medicine’s Unified Medical Language System (UMLS)35, these metrics provide more accurate and contextually relevant evaluations. Pre-defined knowledge bases can enhance other evaluation methods, such as contextual embedding, machine learning, or probability-based metrics, by grounding them in the specialized terminology and relationships unique to the domain. This combination ensures that evaluations account for both linguistic accuracy and the specialized knowledge required in fields like clinical medicine. BERTScore has a variant that was trained on the UMLS called the SapBERTScore36. The score functions similarly to the general domain BERTScore but leverages a BERT model fine-tuned using UMLS data to generate more domain-specific embeddings. Other metrics based on the UMLS include the CUI F-Score37 and UMLS Scorer38. The UMLS Scorer utilizes UMLS-based knowledge graph embeddings to assess the semantic quality of the text19, providing a more structured approach to evaluating clinical content. Meanwhile, the CUI F-Score represents text using Concept Unique Identifiers (CUIs) from the UMLS, calculating F-scores that reflect how well the generated text aligns with key medical concepts. This enables a more granular evaluation of the relevance and accuracy of medical terminology within the generated content.
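
The CUI F-Score reduces to set arithmetic over the concept identifiers extracted from each text; the sketch below assumes the CUI sets have already been produced by an external UMLS concept extractor, and the example identifiers are shown for illustration only.

```python
def cui_f_score(reference_cuis: set[str], generated_cuis: set[str]) -> dict:
    """F-score over UMLS Concept Unique Identifiers (CUIs) found in the
    reference and generated texts. The CUI sets are assumed to come from an
    external UMLS concept extractor, which is not shown here."""
    true_positives = len(reference_cuis & generated_cuis)
    precision = true_positives / len(generated_cuis) if generated_cuis else 0.0
    recall = true_positives / len(reference_cuis) if reference_cuis else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative CUI sets for a reference note and a generated summary.
reference = {"C0008031", "C0013404", "C0020538"}
generated = {"C0008031", "C0020538"}
print(cui_f_score(reference, generated))  # recall drops for the omitted concept
```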

Drawbacks of automated metrics

Prior to the advent of LLMs, automated metrics would generate a single score meant to represent the quality of a generated text, regardless of its length or complexity. This single-score approach can make it difficult to pinpoint specific issues in the text, and in the case of LLMs, it is nearly impossible to understand the precise factors contributing to a particular score13. While automated metrics offer the benefit of speed, this comes at the cost of relying on surface-level heuristics, such as lexicographic and structural measures, that fail to capture more abstract summarization challenges in medical text. Abstractive summarization introduces unique evaluative challenges because the generated text may not directly correspond to any part of the original documentation. This contrasts with extractive summarization, where generated content is explicitly drawn from the source text, making quality assessments more straightforward. Consequently, automated metrics developed prior to the advent of LLMs are typically optimized for extractive approaches, limiting their ability to fully capture the inferences and new language generated by abstractive summarization. Furthermore, the subjective nature of assessing clinical reasoning and coherence in abstractive summaries presents additional challenges. Automated metrics often fail to account for the alignment of generated content with clinical logic or decision-making pathways, which are critical in the medical domain. This underscores the importance of complementing automated metrics with strong human evaluation processes. Specifically, ensuring alignment with subject matter experts and achieving high inter-rater reliability are essential to mitigate subjectivity and provide robust evaluations.

Future directions: LLMs as evaluators to complement human expert evaluators: prompt engineering LLMs as judges

LLMs are versatile tools capable of performing a wide range of tasks, including evaluating the outputs of other LLMs (Fig. 3). This concept, where an LLM acts as a model of a human expert evaluator, has gained traction with the advent of instruction tuning and reinforcement learning with human feedback (RLHF)39. These advancements have significantly improved the ability of LLMs to align their outputs with human preferences, as seen in the transition from GPT-3 to GPT-4, which marked a paradigm shift in LLM accuracy and performance40.

Fig. 3: Stages of prompt engineering LLMs as judges.

The three aspects of prompt engineering expanded upon in this section. The three sections – Zero-Shot and In-Context Learning (ICL), Parameter Efficient Fine Tuning (PEFT), and PEFT with Human Aware Loss Function (HALO) – fit together into a larger schema for training and prompting an LLM to serve as an evaluator to complement human expert evaluators.


LLMs have the potential to bridge critical gaps in evaluation methodologies for generative clinical text. Human evaluation frameworks, while reliable, demand significant time and effort from expert reviewers, creating a paradoxical bottleneck: LLMs designed to reduce the cognitive burden on clinicians inadvertently introduce additional workload in their evaluation. Automated metrics, as they currently exist, are often insufficient for assessing the abstractive nature of generative outputs in clinical contexts. LLMs, when aligned with expert human preferences, offer an opportunity to augment evaluation processes, reducing the reliance on manual review while maintaining accuracy and relevance to clinical needs.

An effective LLM evaluator would be able to respond to evaluative questions with precision and accuracy comparable to that of human experts, following frameworks like those used in human evaluation rubrics. LLM-based evaluations could provide many of the same advantages as traditional automated metrics, such as speed and consistency, while potentially overcoming the reliance on high-quality reference texts. Moreover, LLMs could evaluate complex tasks by directly engaging with the content, bypassing the need for simplistic heuristics and offering more insight into factual accuracy, hallucinations, and omissions.

Although the use of LLMs as evaluators is still emerging in research, early studies have demonstrated their utility as an alternative to human evaluations, offering a scalable solution to the limitations of manual assessment41. As the methodology continues to develop, LLM-based evaluations hold promise for addressing the shortcomings of both traditional automated metrics and human evaluations, particularly in complex, context-rich domains such as clinical text generation.

Zero-shot and in-context learning

One method for designing LLMs to perform evaluations is through the use of manually curated prompts (Fig. 4). A prompt consists of the task description and instructions provided to an LLM to guide its responses. Two primary prompting strategies are employed in this context: Zero-Shot and Few-Shot3. In Zero-Shot prompting, the LLM is given only the task description without any examples before being asked to perform evaluations. Few-Shot prompting provides the task description alongside a few examples to help guide the LLM in generating output. The number of examples varies based on the LLM’s architecture, input window limitations, and the point at which the model performs optimally. Typically, between one and five few-shot examples are used. Prompt engineering, through both Zero-Shot and Few-Shot (“in-context learning”) approaches (collectively referred to as “hard prompting”), enables an LLM to perform tasks that it was not explicitly trained to do. However, performance can vary significantly depending on the model’s pre-training and its relevance to the new task.
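
To make the two hard-prompting strategies concrete, the sketch below assembles a Zero-Shot and a Few-Shot evaluator prompt for a hallucination criterion; the task wording, rubric scale, and few-shot examples are all hypothetical.

```python
TASK = ("You are evaluating a clinical summary against its source note. "
        "Rate hallucination on a 1-5 Likert scale, where 1 means the summary "
        "contains unsupported or contradictory statements and 5 means every "
        "statement is supported by the source. Reply with the number only.")

# Hypothetical few-shot examples pairing inputs with the desired evaluation.
EXAMPLES = [
    {"source": "BP 142/90, started lisinopril 10 mg.",
     "summary": "Hypertension noted; lisinopril 10 mg initiated.",
     "rating": "5"},
    {"source": "BP 142/90, started lisinopril 10 mg.",
     "summary": "Patient started on insulin for new diabetes.",
     "rating": "1"},
]


def zero_shot_prompt(source: str, summary: str) -> str:
    """Task description only, no examples."""
    return f"{TASK}\n\nSource note:\n{source}\n\nSummary:\n{summary}\n\nRating:"


def few_shot_prompt(source: str, summary: str) -> str:
    """Task description plus worked examples (in-context learning)."""
    shots = "\n\n".join(
        f"Source note:\n{ex['source']}\nSummary:\n{ex['summary']}\nRating: {ex['rating']}"
        for ex in EXAMPLES)
    return f"{TASK}\n\n{shots}\n\nSource note:\n{source}\n\nSummary:\n{summary}\nRating:"


print(few_shot_prompt("Afebrile, wounds healing well.",
                      "Patient febrile with wound infection."))
```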

Fig. 4: Anatomy of an evaluator prompt.

An evaluator prompt consists of three sections: prompt, information, and evaluation. All three components are essential for an LLM serving as an evaluator. The evaluator prompt needs to instruct the LLM on the task (Prompt), provide the LLM with all the information necessary to make an evaluation (Information), and define the guidelines and formatting of the evaluation (Evaluation).


Beyond these manual approaches, a more adaptive strategy involves “soft prompting,” also known as machine-learned prompts, which includes techniques like prompt tuning and p-tuning42. Soft prompts are learnable parameters added as virtual tokens to a model’s input to signal task-specific instructions. Unlike hard prompts, soft prompts are trained and incorporated into the model’s input layer, enabling the model to handle a broader range of specialized tasks. Soft prompting has been shown to outperform Few-Shot prompting, especially in large-scale models, as it fine-tunes the model’s behavior without altering the core weights.
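
A minimal sketch of soft prompting with the Hugging Face peft library is shown below, assuming a prompt-tuning setup in which a small number of virtual tokens are learned while the base model stays frozen; the base model and initialization text are illustrative choices.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

# Illustrative base model; any causal LM available to the user would work.
base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,                     # length of the learnable "soft prompt"
    prompt_tuning_init=PromptTuningInit.TEXT,  # initialize from natural language
    prompt_tuning_init_text="Evaluate this clinical summary for hallucinations:",
    tokenizer_name_or_path=base,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the virtual tokens are trainable
```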

Through these methods, LLMs can be instructed to serve as evaluators, with instructions specifying the dimensions and scale needed for a thorough evaluation. Promising results for such methods have already been seen in general-domain applications like LLM-EVAL43 and TALEC44. LLM-EVAL is a single-prompt approach to employing LLM-based evaluators that can cover multiple dimensions. This approach reported correlation coefficients between human and LLM evaluators that averaged nearly 30 points higher than those of ROUGE-L. TALEC is a GPT-4-based method that incorporates in-context learning for establishing evaluation criteria and has shown correlation coefficients of nearly 0.9. Even though applications in the clinical domain are significantly more complex, there have also been positive reports of LLM-based evaluators on clinically relevant tasks. Models like Llama-2, ChatGPT-4o, and Claude-3 have been applied for evaluations of medical question answering and clinical note generation. Brake et al.45 experimented with model size, quantization, and multiple in-context learning varieties on Llama-2, with final reports showing a Cohen’s kappa of 0.79 with their human evaluators. Krolik et al.46 prompted ChatGPT-4o to perform evaluations of medical responses generated for a question-answering system on dimensions such as hallucinations, completeness, and coherence.

Parameter efficient fine-tuning

When prompting alone does not achieve the desired performance, fine-tuning the LLM may be necessary for optimal task execution. Even though an LLM may be pre-trained on a vast corpus, it can struggle with tasks requiring domain-specific knowledge or handling nuanced inputs. To address these challenges, supervised fine-tuning (SFT) with Parameter Efficient Fine-Tuning (PEFT) methods, using quantization and low-rank adaptors, can be employed, where the model is trained on a specialized dataset of prompt/response pairs tailored to the task at hand. Fine-tuning every weight in an LLM can require a large amount of time and computational resources. In these instances, quantization and low-rank adaptors are added to the fine-tuning process for PEFT. Quantization reduces the time and memory costs of training by using lower-precision data types, generally 4-bit and 8-bit, for the LLM’s weights47. Low-rank adaptors (LoRA) freeze the weights of an LLM and decompose the weight updates into a smaller number of trainable parameters, ultimately also reducing the costs of SFT48. PEFT helps refine an LLM by embedding task-specific knowledge, ensuring the model can respond accurately in specialized contexts. The creation of these datasets is critical: performance improvements are directly tied to the quality and relevance of the prompt/response pairs used for fine-tuning. The goal is to adjust the LLM to perform better in specific use cases, such as medical diagnosis or legal reasoning, by narrowing its focus to task-specific behaviors through PEFT.
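
The sketch below illustrates one common PEFT recipe, 4-bit quantization combined with LoRA (often called QLoRA), using the transformers, bitsandbytes, and peft libraries; the model name, target modules, and hyperparameters are illustrative and would be tuned for a given evaluator-training dataset.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit precision to reduce memory costs.
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             quantization_config=bnb)

# Attach low-rank adaptors to the attention projections; only these small
# adaptor matrices are trained, while the quantized base weights stay frozen.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a small fraction of the full model

# Supervised fine-tuning would then proceed on prompt/response pairs drawn
# from the evaluation rubric (e.g., with a Trainer or SFT-style training loop).
```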

Training an LLM to serve as an evaluator could require task-specific training, especially in very specialized domains like healthcare, where the evaluation rubrics, scales, or other required definitions are part of the training dataset. PHUDGE49 and FENCE50 are examples of PEFT methods applied in general-domain tasks to train an LLM to serve as the evaluator. PHUDGE is fine-tuned from Phi-3 as a cost-efficient alternative to closed-source prompting methods for models like GPT-4. Therefore, performance comparisons were done against human and GPT-4 evaluations, both of which had high reported correlations with PHUDGE. FENCE is an example of a framework developed specifically for evaluating factuality. This methodology focuses on using synthetic data to augment public datasets and provide feedback to language generation models. When applied to Llama3-8b-chat, factuality was reported to increase by more than 14%.

Extensions to traditional PEFT methods have continued to be introduced as research progresses towards specialized domains with specific needs. Methodologies such as preference-based learning, probability calibration, text reprocessing, or some combination have emerged to refine the capabilities of LLMs51. Preference-based learning is focused on the adaptation of LLMs using human preference data. In this style of training, human preference datasets are curated to train LLMs for specialized evaluations. Probability calibration and text reprocessing are post-processing methodologies employed to refine LLMs through targeted adjustments following analysis of initial outputs. Probability calibration quantifies discrepancies between LLM generations and ground-truth texts through mathematical derivations that inform adjustments. Text reprocessing hinges on integrating various iterations of evaluative outputs to improve accuracy. Because of the nuanced nature of the clinical domain, this review will focus on preference-based learning methods in which clinicians’ preferences are incorporated into training the evaluator. This allows the LLM serving as the evaluator to be guided by feedback toward understanding clinical relevancy.

Parameter efficient fine-tuning with human-aware loss function

In certain applications, the focus of fine-tuning is to align the LLM with human values and preferences, especially when the model risks generating biased, incorrect, or harmful content. This alignment, known as Human Alignment training, is driven by high-quality human feedback integrated into the training process. A widely recognized approach in this domain is Reinforcement Learning with Human Feedback (RLHF)52, which updates the LLM to guide it toward outputs that score higher on a reward scale. In the reward-model stage, a dataset annotated with human feedback is used to establish the reward, typically scalar in nature, of a particular response. The LLM is then trained to produce responses that receive higher rewards through a process known as Proximal Policy Optimization (PPO)53. This iterative process ensures the model aligns with human expectations, but it can be resource-intensive, requiring significant memory, time, and computational power.

To address these computational challenges, newer paradigms such as Direct Preference Optimization (DPO)54 have emerged that streamline Human Alignment training by directly optimizing the LLM based on human preferences, without the need for a reward model. DPO reformulates the alignment process into a human-aware loss function (HALO), optimized on a dataset of human preferences where prompts are paired with preferred and dis-preferred responses (Fig. 5). This method is particularly promising for aligning LLMs with human preferences and can be applied to ordinal responses, such as the Likert scales commonly seen in human evaluation rubrics. While PPO improves LLM performance by aligning outputs with human preferences, it is often sample-inefficient and can suffer from reward hacking55. DPO, in contrast, directly optimizes model outputs based on human preferences without needing an explicit reward model, making it more sample-efficient and better aligned with human values. DPO simplifies the training process by focusing directly on the desired outcomes, leading to more stable and interpretable alignment. While these methods have been successfully applied in other domains56,57,58, their use in the medical field is under-explored. Training data collected with a human evaluation rubric, at a much smaller scale to overcome labor constraints, can be incorporated into a loss function designed for human alignment using DPO, as sketched below.
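
The DPO objective itself is compact enough to sketch directly; the function below implements the standard DPO loss over a batch of preference pairs, with the log-probabilities supplied as hypothetical example values.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss for a batch of preference pairs.
    Each argument is the summed log-probability of the preferred (chosen) or
    dis-preferred (rejected) response under the policy being trained or the
    frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Preferred responses should gain probability relative to the reference
    # model faster than dis-preferred ones.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


# Hypothetical log-probabilities for a batch of two preference pairs, e.g.,
# evaluator responses a clinician preferred versus rejected for the same summary.
loss = dpo_loss(torch.tensor([-12.3, -8.1]), torch.tensor([-15.0, -9.4]),
                torch.tensor([-13.0, -8.5]), torch.tensor([-14.2, -9.0]))
print(loss)
```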

Fig. 5: Alignment workflow: PPO v. DPO.

An overview of the processes for aligning an LLM through Reinforcement Learning with Human Feedback (RLHF) using Proximal Policy Optimization (PPO) and through Direct Preference Optimization (DPO).


In the last year, many variants of DPO have emerged as alignment training methods that prevent over-fitting and circumvent DPO’s modeling assumptions through modifications to the underlying model and loss function (Fig. 6). Alternative methods such as Joint Preference Optimization (JPO)59 and Simple Preference Optimization (SimPO)60 were derived from DPO. These methods introduce regularization terms and modifications to the loss function to prevent premature convergence and ensure more robust alignment over a broader range of inputs. Other alternative methods such as Kahneman-Tversky Optimization (KTO)61 and Pluralistic Alignment Framework (PAL)62 use alternatives to the Bradley-Terry preference model that underlies DPO. The alternative modeling assumptions used in these methods can prevent the breakdown of DPO’s alignment in situations without direct preference data and with heterogeneous human preferences.

Fig. 6: Human aware loss functions (HALOs) from PPO to present.

The development timeline for HALOs from the advent of Proximal Policy Optimization (PPO) in 2017 through 2024. Each HALO is connected to its precursor (either DPO or PPO) by a dotted line. If a HALO has an algorithmic basis in reinforcement learning, it is presented as white text on a solid color background. If a HALO has an algorithmic basis that is reinforcement learning free, it is presented as colored text on a white background. Each color, either text or background, corresponds to the data requirements for that HALO. Blue corresponds to HALOs that only use prompt/response pair data. Orange corresponds to HALOs that use response preference pairs in addition to the prompt. Finally, green corresponds to HALOs that use binary judgement data in addition to the prompt/response pair. The figure includes PPO Proximal Policy Optimization53, DPO Direct Preference Optimization54, RSO Statistical Rejection Sampling108, IPO Identity Preference Optimization109, cDPO Conservative DPO110, KTO Kahneman Tversky Optimization61, JPO Joint Preference Optimization59, ORPO Odds Ratio Preference Optimization111, rDPO Robust DPO112, BCO Binary Classifier Optimization113, DNO Direct Nash Optimization62, TR-DPO Trust Region DPO114, CPO Contrastive Preference Optimization115, SPPO Self-Play Preference Optimization116, PAL Pluralistic Alignment Framework62, EXO Efficient Exact Optimization117, AOT Alignment via Optimal Transport118, RPO Iterative Reasoning Preference Optimization119, NCA Noise Contrastive Alignment120, RTO Reinforced Token Optimization121, SimPO Simple Preference Optimization60.


Drawbacks of LLMs as evaluators

LLMs hold promise for automating evaluation, but as with other automated evaluation methods, there are significant challenges to consider. One major issue is the rapid pace at which LLMs and their associated training strategies have evolved. This rapid development often outpaces the ability to thoroughly validate LLM-based evaluators before they are used in practice. In some cases, new optimization techniques are introduced before their predecessors have undergone peer review, and these advancements may lack sufficient mathematical justification. The speed of LLM evolution can make it difficult to allocate time and resources for proper validation, which can compromise their reliability. The specific method of validation for LLM-based evaluators is another open area of research. In the case of multiple human evaluators, inter-rater reliability metrics have been utilized to identify when different evaluators diverge. LLM-based evaluation output can be compared against that of expert human evaluators, but the standard to which LLM-based evaluators must be held has yet to be determined. In cases of ordinal, count, or other numerical evaluation scoring outputs, validation metrics like root mean squared error are also a possibility. One remaining gap in the reliable validation of LLM-based evaluators is the lack of datasets tailored for this task, in which the entire evaluative rubric is present and a highly reliable ground truth exists.
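
As one illustration of how such a comparison might be run, the sketch below scores agreement between hypothetical clinician and LLM-evaluator Likert ratings using a quadratically weighted Cohen's kappa and a root mean squared error.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical validation of an LLM-based evaluator against a clinician on the
# same ten generated summaries, each rated on a 1-5 Likert scale.
clinician = np.array([5, 4, 2, 5, 3, 1, 4, 4, 2, 5])
llm_judge = np.array([5, 4, 3, 5, 3, 2, 4, 5, 2, 4])

# Chance-corrected agreement, weighting larger disagreements more heavily.
kappa = cohen_kappa_score(clinician, llm_judge, weights="quadratic")

# Root mean squared error treats the Likert ratings as numeric scores.
rmse = float(np.sqrt(np.mean((clinician - llm_judge) ** 2)))

print(f"weighted kappa = {kappa:.2f}, RMSE = {rmse:.2f}")
```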

Moreover, despite their advancements, LLMs remain sensitive to the prompts and inputs they receive. As LLMs continue to update and change their internal knowledge representations and as their prompts also change, the output can be highly variable. The exact LLM, or model version, that is used can also add another layer of variability. The same prompts and inputs can produce different results based on the LLM’s internal structure and pre-training schema. LLMs have also been noted for egocentric bias, which could affect evaluations as more and more LLM-generated text appears in source texts63. As a result, the use of LLMs as evaluators must be accompanied by stringent testing and safety checks to mitigate risks. Ensuring fairness in their responses is also critical, particularly in sensitive domains like healthcare, where biased or stigmatizing language could have serious consequences. These challenges highlight the need for continuous evaluation, testing, and refinement to make LLM-based evaluators both reliable and safe for medical evaluations.

Evaluation needs for the clinical domain

The development of reliable evaluation strategies is becoming increasingly important as the pace of innovation in GenAI outstrips the speed at which these technologies are validated. In health systems, the focus on clinical safety must also contend with the time constraints placed on healthcare professionals. While human evaluation rubrics offer a high degree of reliability and accuracy, they are significantly limited by the time commitment required from medical professionals serving as evaluators. Ironically, the technologies being evaluated often aim to reduce the cognitive load on these same professionals, yet they demand further time investment for their performance evaluation.

Automated evaluations, if properly designed for the clinical domain, present a promising alternative to human evaluations. However, traditional non-LLM automated evaluations have thus far fallen short, failing to consistently match the rigor of human evaluation rubrics5,13. These metrics frequently overlook hallucinations, fail to assess reasoning quality, and struggle to determine the relevance of generated texts. As LLMs are introduced as potential alternatives for human evaluators, it is critical to consider the unique requirements of the clinical domain. Systematic reviews of LLM-based applications for healthcare reveal that evaluation dimensions like safety, bias, and information quality are of particular importance64,65. Since patient safety is at the forefront of many clinical NLG tasks, clinically deployed LLMs must be evaluated for their potential to produce incorrect information or lead to negative patient outcomes. They are also susceptible to adopting biased behavior based on non-objective or non-comprehensive training data. This could infuse LLM generations with stereotypes and biased results that are harmful to patients. The quality of information in an LLM generation is also important in clinical applications. Aspects like factuality, relevancy, usefulness, consistency, and completeness are employed to capture the extent to which clinical text is representative of a patient’s clinical course. These factors can be significantly more complex to evaluate using heuristics and require some level of clinical knowledge to judge clinical impact. Evaluation frameworks must incorporate assessments along these dimensions in addition to those generally associated with high-quality text generations. These considerations require evaluation methodology designed specifically for health system applications that will prioritize such clinically relevant concerns over exact string matching or structural similarities that have been the mainstay of general domain evaluation metrics. A well-designed LLM evaluator—an “LLM-as-a-judge”—could potentially combine the high reliability of human evaluations with the efficiency of automated methods, while avoiding the pitfalls that have limited existing automated metrics. If executed effectively, such LLM-based evaluations could offer the best of both worlds, ensuring clinical safety without sacrificing the quality of assessments.
