Evaluating large language models for criterion-based grading from agreement to consistency

Methods

Essay

This study used sample answers drawn from the IELTS book series, spanning IELTS 8 to IELTS 18. The chosen essays met two criteria: (1) they were written for IELTS Academic Writing Task 2 and (2) they had been assigned a score by an official IELTS examiner. In total, 31 essays were included in the study, with a mean score of 6.0 and a standard deviation of 1.1, ranging from 3.5 to 8.0. These essays and their corresponding writing prompts were extracted for subsequent analysis.

ChatGPT prompt

To systematically assess the impact of criterion knowledge on ChatGPT’s grading, we employed a three-stage incremental prompt design. Prompt 1 adhered to best practices for interacting with ChatGPT by instructing it to simulate an IELTS examiner using zero-shot reasoning. Building upon this initial setup, Prompt 2 introduced the official IELTS grading criteria – namely “task response,” “coherence and cohesion,” “lexical resource,” and “grammatical range and accuracy.” Prompt 3 expanded on Prompt 2 by incorporating the complete band descriptors for each criterion. This progressive design allowed us to assess how varying levels of criterion knowledge influence the alignment of LLMs with criterion-based grading. Detailed descriptions of each prompt are available in Supplementary Notes 1–3.
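The full prompt wording is given in Supplementary Notes 1–3. Purely as an illustration of the incremental design, a minimal sketch of how the three stages could be assembled programmatically might look as follows; all strings and the build_prompt helper are illustrative placeholders, not the published prompts.

```python
# Illustrative sketch of the three-stage incremental prompt design.
# The actual wording is in Supplementary Notes 1-3; the strings below
# are placeholders, not the published prompts.

CRITERIA = [
    "task response",
    "coherence and cohesion",
    "lexical resource",
    "grammatical range and accuracy",
]

def build_prompt(stage: int, question: str, essay: str, band_descriptors: str = "") -> str:
    """Assemble the grading prompt for a given stage (1, 2, or 3)."""
    base = (
        "Act as an IELTS examiner. Grade the following Academic Writing "
        f"Task 2 essay.\n\nQuestion:\n{question}\n\nEssay:\n{essay}\n"
    )
    if stage == 1:
        return base                                  # Prompt 1: examiner role only (zero-shot)
    criteria = "Grade the essay on these criteria: " + "; ".join(CRITERIA) + ".\n"
    if stage == 2:
        return base + criteria                       # Prompt 2: + criterion names
    return base + criteria + band_descriptors        # Prompt 3: + full band descriptors
```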

Procedure

Initial assessments were conducted using ChatGPT 3.5. Each essay was evaluated in a new chat session to prevent potential influence from chat history. Because Prompt 3 exceeded the maximum length of a single chat input, the ChatGPT PROMPTs Splitter (https://chatgpt-prompt-splitter.jjdiaz.dev/) was used to segment the prompt. If ChatGPT’s response did not conform to the IELTS scoring guidelines (e.g., the score was not rounded to the nearest whole or half band), the essay was re-evaluated until a compliant score was provided.
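As a rough sketch of this re-evaluation rule, the check below enforces that an accepted score falls on a whole or half band between 0 and 9; grade_essay is a hypothetical helper standing in for a fresh chat session and is not part of the published procedure.

```python
# Sketch of the per-essay grading loop with the compliance check.
# `grade_essay(prompt)` is a hypothetical helper that opens a fresh chat
# session and returns the numeric overall band score parsed from the reply.

def is_valid_band(score: float) -> bool:
    """IELTS band scores must fall on whole or half bands between 0 and 9."""
    return 0.0 <= score <= 9.0 and score * 2 == int(score * 2)

def score_until_compliant(prompt: str, max_attempts: int = 5) -> float:
    """Re-evaluate the essay until the returned score is a valid band score."""
    for _ in range(max_attempts):
        score = grade_essay(prompt)   # hypothetical call to a new chat session
        if is_valid_band(score):
            return score
    raise ValueError("No compliant score obtained after repeated attempts")
```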

After identifying the most effective prompt condition, we reran it with ChatGPT 4 to assess whether a more advanced model (e.g., one with more parameters) could improve grading. To test the generalizability of the results to other LLMs, the most effective prompt was also run with Claude 3 Haiku, another publicly available LLM with fewer parameters but a more recent knowledge base.
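The study was run through the chat interfaces; for readers who wish to reproduce the model comparison programmatically, a hedged sketch using the OpenAI and Anthropic Python clients might look like the following. The model identifiers and helper names are assumptions for illustration only.

```python
# Hypothetical programmatic reproduction of the model comparison.
# The study itself used the chat interfaces, not the APIs.
from openai import OpenAI
import anthropic

def ask_gpt(prompt: str, model: str = "gpt-4") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def ask_claude(prompt: str, model: str = "claude-3-haiku-20240307") -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model=model, max_tokens=512, messages=[{"role": "user", "content": prompt}]
    )
    return resp.content[0].text
```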

Because the preceding analyses focused primarily on the efficacy of LLM-generated grading, a preliminary assessment of its effectiveness was conducted by having a high school English teacher from China evaluate the same set of essays. The invited teacher held general qualifications in English education but had no IELTS-specific training or teaching experience, so their grading reflects the typical feedback an English learner might receive from a general educational resource. This grading provides an initial reference point for contextualizing our results.

Statistical analysis

Three pairwise ICC analyses were first conducted, each comparing the ChatGPT ratings from one prompt with the official ratings. Both the two-way random model (also known as ICC2) and the two-way mixed model (also known as ICC3) were computed. Unlike ICC2, ICC3 treats systematic rater differences (biases) as fixed and excludes them from the error term, which gives rise to the distinction – ICC2 gauges absolute agreement, while ICC3 primarily assesses consistency17. Evaluating ICC2 and ICC3 together therefore offers insight into potential biases17. ICCs were computed using the single-score formula. The point estimates and their 95% confidence intervals were reported.
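As a sketch of how these pairwise analyses could be reproduced, the snippet below computes the single-score ICC2 and ICC3 with their 95% confidence intervals using the pingouin package; the input arrays are hypothetical and the helper name is ours.

```python
# Sketch of one pairwise ICC analysis (official examiner vs. one prompt condition).
import pandas as pd
import pingouin as pg

def pairwise_icc(official, model_scores):
    """Return single-score ICC2 (absolute agreement) and ICC3 (consistency)."""
    n = len(official)
    long = pd.DataFrame({
        "essay": list(range(n)) * 2,
        "rater": ["official"] * n + ["model"] * n,
        "score": list(official) + list(model_scores),
    })
    icc = pg.intraclass_corr(data=long, targets="essay",
                             raters="rater", ratings="score")
    # Keep the single-score two-way random (ICC2) and mixed (ICC3) rows,
    # including their 95% confidence intervals.
    return icc.set_index("Type").loc[["ICC2", "ICC3"], ["ICC", "CI95%"]]
```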

We initially examined whether each prompt demonstrated significant absolute agreement, defined as a 95% confidence interval that excludes 0. The prompt that yielded significant absolute agreement was subjected to follow-up analyses. The values of ICC2 and ICC3 were first inspected. Where ICC2 and ICC3 diverged, indicating potential bias, a Bland-Altman plot and t-tests were used to examine the distribution and direction of the bias19.
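A minimal sketch of this bias check, assuming paired arrays of official and model scores for the 31 essays, might look like:

```python
# Bland-Altman summary plus a paired t-test on the model-official differences.
import numpy as np
from scipy import stats

def bias_check(official, model_scores):
    official, model_scores = np.asarray(official), np.asarray(model_scores)
    diff = model_scores - official
    mean_diff = diff.mean()                        # systematic bias
    loa = (mean_diff - 1.96 * diff.std(ddof=1),    # Bland-Altman 95% limits
           mean_diff + 1.96 * diff.std(ddof=1))    # of agreement
    t, p = stats.ttest_rel(model_scores, official) # paired t-test on the bias
    return {"mean_diff": mean_diff, "limits_of_agreement": loa, "t": t, "p": p}
```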

After identifying the most effective prompt, we assessed its test-retest reliability (i.e., intrarater agreement) by rerunning the prompt with ChatGPT 3.5 and applying the average-score ICC2 formula. Subsequently, we extended the analysis by rerunning this prompt with ChatGPT 4 and Claude 3 Haiku, as well as by including the ratings from the high school English teacher described above. ICCs were calculated for each condition. To determine whether these ICCs differed significantly from those of the initial ChatGPT 3.5 grading, we applied Fisher’s Z transformation and conducted statistical comparisons.
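The comparison can be sketched as follows, treating each ICC like a correlation coefficient under the Fisher’s Z approximation; the sample sizes default to the 31 essays, and this is an illustrative implementation rather than the authors’ exact code.

```python
# Fisher's Z comparison between two ICCs from independent grading conditions.
import numpy as np
from scipy import stats

def compare_iccs(icc_a: float, icc_b: float, n_a: int = 31, n_b: int = 31):
    """Two-sided z-test for the difference between two ICCs."""
    z_a, z_b = np.arctanh(icc_a), np.arctanh(icc_b)   # Fisher's Z transform
    se = np.sqrt(1 / (n_a - 3) + 1 / (n_b - 3))       # standard error of the difference
    z = (z_a - z_b) / se
    p = 2 * (1 - stats.norm.cdf(abs(z)))              # two-tailed p value
    return z, p
```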

ICC qualitative interpretation was guided by Koo & Li20: values below 0.50 are considered poor, 0.50–0.75 moderate, 0.75–0.90 good, and above 0.90 excellent.
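For completeness, these thresholds can be expressed as a small helper:

```python
# Qualitative labels for ICC values following the Koo & Li thresholds above.
def interpret_icc(icc: float) -> str:
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"
```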
