Evaluating large language models for criterion-based grading from agreement to consistency
Methods
Essay
This study used sample answers derived from the IELTS book series, spanning IELTS 8 to IELTS 18. The chosen essays met two criteria: (1) they originated from the IELTS Academic Writing Task 2 and (2) they had been assigned a score by an official IELTS examiner. In total, 31 essays were included, with a mean score of 6.0 and a standard deviation of 1.1 (range 3.5 to 8.0). These essays and their corresponding writing prompts were extracted for subsequent analysis.
ChatGPT prompt
To systematically assess the impact of criterion knowledge on ChatGPT’s grading, we employed a three-stage incremental prompt design. Prompt 1 adhered to best practices for interacting with ChatGPT by instructing it to simulate an IELTS examiner using zero-shot reasoning. Building on this initial setup, Prompt 2 introduced the official IELTS grading criteria, namely “task response,” “coherence and cohesion,” “lexical resource,” and “grammatical range and accuracy.” Prompt 3 expanded on Prompt 2 by incorporating the comprehensive band descriptors for each criterion. This progressive design allowed us to assess how varying levels of criterion knowledge influence the alignment of LLM-generated grading with criterion-based grading. Detailed descriptions of each prompt are available in Supplementary Notes 1-3.
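For illustration, the three prompt conditions can be thought of as incremental text templates along the lines of the minimal sketch below. The wording here is abbreviated and hypothetical; the exact prompts used in the study are given in Supplementary Notes 1-3.

```python
# Illustrative sketch only: abbreviated, hypothetical wording, not the study's actual prompts.

BASE = (
    "Act as an official IELTS examiner. Grade the following "
    "Academic Writing Task 2 essay on the 9-band scale."
)

CRITERIA = (
    "Score each of the four criteria: task response, coherence and cohesion, "
    "lexical resource, and grammatical range and accuracy."
)

BAND_DESCRIPTORS = "<full official band descriptors for each criterion>"

PROMPTS = {
    "prompt_1": BASE,                                        # zero-shot examiner role only
    "prompt_2": f"{BASE}\n{CRITERIA}",                       # adds the four criterion names
    "prompt_3": f"{BASE}\n{CRITERIA}\n{BAND_DESCRIPTORS}",   # adds the full band descriptors
}
```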
Procedure
Initial assessments were conducted using ChatGPT 3.5. Each essay was evaluated in a new chat session to prevent potential influence from chat history. Because Prompt 3 exceeded the maximum length of a single chat input, the ChatGPT PROMPTs Splitter (https://chatgpt-prompt-splitter.jjdiaz.dev/) was used to segment the prompt. If ChatGPT’s response did not conform to the IELTS scoring guidelines (e.g., the score was not reported as a whole or half band), the essay was re-evaluated until a compliant score was provided.
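The compliance check amounts to verifying that a returned score lies on the official 9-band scale in half-band steps. A minimal sketch of such a check is shown below; the `get_chatgpt_score` wrapper is hypothetical and stands in for whatever interface is used to query the model.

```python
def is_valid_band(score: float) -> bool:
    """Return True if the score lies on the official IELTS scale,
    i.e., a whole or half band between 0 and 9."""
    return 0.0 <= score <= 9.0 and float(score * 2).is_integer()

# Hypothetical usage: re-prompt until a compliant overall band is returned.
# score = get_chatgpt_score(essay, prompt)
# while not is_valid_band(score):
#     score = get_chatgpt_score(essay, prompt)

assert is_valid_band(6.5) and not is_valid_band(6.3)
```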
After identifying the most effective prompt condition, we reran it using ChatGPT 4 to assess whether a more advanced model (e.g., one with more parameters) could improve the grading. To test the generalizability of the results to other LLMs, the most effective prompt was also run with Claude 3 Haiku, another publicly available LLM with fewer model parameters but a more recent knowledge base.
Because the preceding analyses primarily addressed the efficacy of LLM-generated grading, a preliminary assessment of its effectiveness was conducted by having a high school English teacher from China evaluate the same set of essays. The invited teacher held general qualifications in English education but had no specific IELTS training or teaching experience, so their grading reflects the typical feedback an English learner might receive from a general educational resource. This grading provides an initial reference for contextualizing the effectiveness of our results.
Statistical analysis
Three pairwise intraclass correlation coefficient (ICC) analyses were first conducted, each comparing the ChatGPT rating produced under one of the prompts with the official rating. The two-way random-effects model (ICC2) and the two-way mixed-effects model (ICC3) were both computed. Unlike ICC3, which treats rater effects as fixed, ICC2 treats them as random and penalizes systematic differences between raters; this gives rise to the distinction that ICC2 gauges absolute agreement, while ICC3 primarily assesses consistency17. Evaluating ICC2 and ICC3 together therefore offers insights into potential biases17. ICCs were calculated using the single-score formula, and point estimates with their 95% confidence intervals were reported.
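As an illustration, single-score ICC2 and ICC3 estimates with 95% confidence intervals can be obtained with the pingouin package in Python. The sketch below uses placeholder ratings and hypothetical variable names; it does not reproduce the study’s data, and the paper does not specify which software was used.

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
n = 31
# Placeholder band scores standing in for the real ratings (whole/half bands, 0-9).
official = np.round(rng.uniform(3.5, 8.0, n) * 2) / 2
chatgpt = np.clip(official + rng.choice([-0.5, 0.0, 0.5], n), 0, 9)

# Long format: one row per (essay, rater) pair.
df = pd.DataFrame({
    "essay": np.tile(np.arange(n), 2),
    "rater": ["official"] * n + ["chatgpt"] * n,
    "score": np.concatenate([official, chatgpt]),
})

icc = pg.intraclass_corr(data=df, targets="essay", raters="rater", ratings="score")
# Single-score forms: ICC2 (two-way random, absolute agreement) and ICC3 (two-way mixed, consistency).
print(icc.set_index("Type").loc[["ICC2", "ICC3"], ["ICC", "CI95%"]])
```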
We initially examined whether each prompt demonstrated significant absolute agreement, defined as a 95% confidence interval that excludes 0. The prompt that yielded significant absolute agreement was subjected to follow-up analyses. The values of ICC2 and ICC3 were first inspected. Where ICC2 and ICC3 diverged, indicating potential biases, a Bland-Altman plot and t-tests were used to examine the distribution and direction of the biases19.
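A corresponding bias check could look like the sketch below, again with placeholder data and assuming pingouin’s Bland-Altman plot and paired t-test utilities rather than the authors’ exact tooling.

```python
import numpy as np
import pingouin as pg
import matplotlib.pyplot as plt

# Placeholder paired ratings (same construction as in the previous sketch).
rng = np.random.default_rng(0)
official = np.round(rng.uniform(3.5, 8.0, 31) * 2) / 2
chatgpt = np.clip(official + rng.choice([-0.5, 0.0, 0.5], 31), 0, 9)

# Bland-Altman plot: mean difference and limits of agreement between the two raters.
ax = pg.plot_blandaltman(chatgpt, official)
ax.set_title("ChatGPT vs. official examiner")
plt.show()

# A paired t-test on the differences indicates whether the bias is systematic and in which direction.
print(pg.ttest(chatgpt, official, paired=True)[["T", "dof", "p-val", "cohen-d"]])
```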
After identifying the most effective prompt, we assessed its test-retest reliability (i.e., intrarater agreement) by rerunning the prompt with ChatGPT 3.5 and applying the average-score ICC2 formula. We then extended the analysis by rerunning this prompt with ChatGPT 4 and Claude 3 Haiku, as well as having the high school English teacher described above evaluate the same set of essays. ICCs were calculated for each condition. To determine whether these ICCs differed significantly from those of the initial ChatGPT 3.5 grading, we applied Fisher’s Z transformation and conducted statistical comparisons.
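One common way to compare two such coefficients is to treat them like correlations from independent samples after Fisher’s Z transformation, as sketched below. This is an assumption: the paper does not state the exact variance formula used, and because all conditions grade the same 31 essays, the independent-samples approximation is itself a simplification. The ICC values in the example are hypothetical.

```python
import math
from scipy.stats import norm

def compare_iccs(icc1: float, icc2: float, n1: int, n2: int):
    """Compare two ICCs via Fisher's Z transformation, treating them like
    correlations from independent samples (an assumption; see lead-in)."""
    z1, z2 = math.atanh(icc1), math.atanh(icc2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * (1 - norm.cdf(abs(z)))  # two-tailed p-value
    return z, p

# Hypothetical example: ChatGPT 3.5 vs. ChatGPT 4, each graded against the official
# scores of the 31 essays.
z, p = compare_iccs(0.75, 0.85, 31, 31)
print(f"z = {z:.2f}, p = {p:.3f}")
```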
Qualitative interpretation of the ICCs was guided by Koo & Li20: values below 0.5 are considered poor, 0.5-0.75 moderate, 0.75-0.9 good, and above 0.90 excellent.
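These cut-offs translate directly into a small labelling helper such as the one below. The handling of values exactly at 0.5, 0.75, and 0.9 is an assumption, as Koo & Li describe the ranges only verbally.

```python
def interpret_icc(icc: float) -> str:
    """Qualitative label for an ICC point estimate, following Koo & Li (2016)."""
    if icc < 0.5:
        return "poor"
    elif icc < 0.75:
        return "moderate"
    elif icc < 0.9:
        return "good"
    return "excellent"

assert interpret_icc(0.82) == "good"
```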
Responses