Technology-supported self-triage decision making


Introduction

Symptom-Assessment Applications (SAAs) are digital tools that help medical laypeople assess their symptoms and decide on appropriate care pathways1. Their primary goal is to empower individuals to make better decisions and to guide them to the most suitable care settings. On a systemic level, this may ultimately alleviate pressure on overcrowded healthcare systems2,3,4,5,6. By optimizing care pathways, SAAs can not only save time for patients but also free up vital resources for urgent healthcare needs. The financial costs that could be avoided by redirecting patients to more appropriate care settings are estimated at more than $4 billion annually in the US7,8,9.

To achieve this goal, SAAs/LLMs must ensure (1) that the advice provided is accurate and (2) that it actually improves users’ decisions. The accuracy of these applications has been extensively tested: in a seminal study, Semigran et al. highlighted variability in performance – some symptom checkers were highly accurate while others performed poorly5. This initial study has been followed by numerous others that have replicated these findings, examined how accuracy evolves over time, and integrated real patient cases into their evaluations3,10,11,12,13,14. Further research has addressed methodological concerns to increase the validity and reliability of accuracy evaluation studies, such as disparities in testing procedures and the use of case vignettes that do not accurately reflect real-world scenarios15,16,17,18.

In addition to examining the accuracy of SAAs, several studies have explored human self-triage decision-making capabilities. For instance, Schmieding et al. found that, on average, SAAs and laypeople have similar accuracy in making self-triage decisions, although the best-performing SAAs outperformed laypeople19. Levine et al. expanded this comparison to include the large language model (LLM) GPT-3 as an alternative to SAAs and found comparable accuracy levels between laypeople and GPT-320. Additionally, Kopka et al. compared SAAs, LLMs, and laypeople and found that both SAAs and LLMs – although not all of them – achieved slightly higher accuracy in self-triage decisions than laypeople18. Thus, if the best-performing SAAs and LLMs are chosen, they have the potential to improve laypeople’s self-triage decisions18.

However, users may choose to ignore even the best-performing SAAs/LLMs and thus not benefit from these systems. In other words, although users may partly outsource their decision-making and integrate the advice from SAAs/LLMs to varying extents, they ultimately make the final decision about which action to take. In a human-technology interaction context, two key concepts of human involvement can be distinguished: (1) human-out-of-the-loop, where the system operates entirely autonomously without human interference, and (2) human-in-the-loop, in which humans are actively involved and make the decisions21. For SAAs and LLMs used in self-triage, the setup is human-in-the-loop, as users ultimately make the final decision and take the corresponding action. This concept has, however, not yet received much attention in SAA research. One study that used real-world data to assess patient care-seeking after using an SAA found that these tools might decrease perceived urgency among users22. However, in this study it was not possible to re-examine the cases to determine whether the decisions influenced by the SAA were correct, so no conclusions can be drawn regarding performance. Another study indicated that many users tend to offload their decision-making heavily to SAAs, yet it also did not evaluate the performance of SAAs when being used by humans23. Thus, to empower individuals to make better self-triage decisions and reduce the burden on healthcare systems, it is essential to better understand laypeople’s self-triage decision-making with SAAs/LLMs in the loop and to evaluate the combined accuracy of SAAs/LLMs and laypeople.

As no model for understanding these self-triage decisions currently exists, we build on and adapt a stage model of diagnostic decision-making for physicians (see Fig. 1), which Kämmer et al. used to examine advice-taking in physician teams24.

Fig. 1: Stage model of diagnostic decision-making. Adapted from Kämmer et al.24 and the Committee on Diagnostic Error in Health Care (2015)53.

This model includes three main tasks: information gathering, information analysis, and information integration. When observing a patient’s symptoms, clinicians first seek to obtain more information about the symptoms and possible causes, then analyze this information to identify potential causes, and finally, integrate the collected evidence. This cycle may be repeated by gathering additional information after the initial decision, or it may conclude with a final decision that is subsequently implemented. Although the authors originally proposed this model for diagnostic decisions among medical professionals, it could also serve as a useful framework for examining self-triage decisions among medical laypeople.

In summary, it is currently unclear how users include these systems in their decision-making and whether doing so makes their decisions more accurate. Based on prior research, decision-making might follow a path similar to physicians’ diagnostic decisions, and based on studies evaluating the accuracy of SAAs and LLMs in isolation, well-performing systems should enhance laypeople’s self-triage decisions. However, since these studies have not considered the human in the loop, it is unclear whether this translates into better ‘human-SAA-team’ performance – which more closely resembles real-world decisions – compared to a single human decision without SAA (or LLM) assistance. We aim to fill this research gap by addressing the following two research questions: How does the ‘human-in-the-loop’ team come to a decision? Do laypeople improve their self-triage decisions when using a well-performing SAA or LLM?

The first research question is exploratory and aims to understand processes within the human-SAA-team and their impact on decision-making. The second research question seeks to assess laypeople’s decision accuracy with SAAs/LLMs in the loop. Based on previous results from Kopka et al.18 we hypothesize the following:

H1: Laypeople have a higher self-triage accuracy with assistance of an SAA compared to making decisions on their own.

H2: Laypeople have a higher self-triage accuracy with assistance of an LLM compared to making decisions on their own.

Methods

Study design

We used a convergent parallel mixed-methods design with semi-structured interviews to explore how humans integrate these systems into a human-in-the-loop decision and a Randomized Controlled Trial (RCT) to evaluate whether SAAs and LLMs can enhance the accuracy of participants’ self-triage decisions.

To gain insights into how SAAs and LLMs are used in users’ self-triage decision-making, we conducted narrative research and semi-structured interviews with participants before and after using these systems. These insights can help explore whether users merely adopt the recommendations they receive or if the decision-making process is more complex. This understanding could indicate whether isolated evaluation studies can predict real-world accuracy or if and why decisions from human-SAA teams deviate from the performance of SAAs alone.

To quantify the effect of using SAAs/LLMs in self-triage decisions, we conducted an RCT with a mixed parallel design (allocation ratio: 1:1). Participants were randomly assigned to receive advice either from the LLM ChatGPT or from the SAA Ada Health (between-subjects factor) and assessed the appropriate self-triage level before and after receiving advice from the system (within-subjects factor). The allocation sequence was automatically generated using a simple randomization algorithm by the online survey tool SoSciSurvey, which also concealed the sequence by automatically assigning participants to their respective intervention groups. Participants did not receive specific information about the intervention (e.g., how exactly the advice was generated) and were not aware of the other possible intervention groups. Thus, they were blinded to their group assignment. The outcome assessor was blinded to group allocation until the statistical analysis was complete.
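
For illustration only, the following R sketch shows what simple 1:1 randomization of this kind can look like; it is not the SoSciSurvey implementation, and the seed and sample size are placeholders:

# Illustrative sketch, not the SoSciSurvey randomization code.
# Simple randomization draws each allocation independently, so the two
# arms end up approximately, but not exactly, equal in size.
set.seed(2024)                                  # placeholder seed
n <- 600                                        # planned sample size
group <- sample(c("SAA (Ada)", "LLM (ChatGPT)"),
                size = n, replace = TRUE, prob = c(0.5, 0.5))
table(group)                                    # roughly 300 per arm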

Ethical considerations

The study complied with all ethically relevant regulations, including the Declaration of Helsinki. The ethics committee of the Department of Psychology and Ergonomics (IPA) at the Berlin Institute of Technology (tracking numbers AWB_KOP_3_230915 and 2225676) granted ethical approval for this study, and informed consent was obtained from every participant prior to participation. The study was preregistered in the German Clinical Trials Register (DRKS00033775). The manuscript was prepared in accordance with the SRQR25 (for qualitative research) and CONSORT26 (for RCTs) reporting guidelines.

Participants

The participants were required to be proficient in German and reside in Germany, as the questions and interviews pertained to specifics of the German healthcare system (e.g., the distinction between doctors in private practice and emergency medical services). Therefore, a certain degree of familiarity with the system was a prerequisite. They were excluded if they had professional medical training or participated in similar research from our department.

We recruited 24 participants for the interviews between November 2023 and January 2024 using a convenience sampling approach from our network, placing special emphasis on participant diversity to ensure different perspectives. We therefore aimed to include 12 people with prior experience using digital triage tools (including telephone triage) and 12 people without prior experience. Participants were either family, friends, or part of the extended network of the interviewers. To accommodate varying schedules, interviews were conducted in the participant’s home, the interviewer’s home, or a university room, offering flexible settings intended to increase participants’ comfort and openness.

For the RCT, participants were sampled between 14 and 16 March 2024 as a random sample drawn through the sampling provider Prolific and were asked to fill out an online survey. Prolific has been shown to produce higher-quality data than comparable providers27. Participants who opted to participate after seeing the study on Prolific were redirected to our online survey, which was hosted on SoSciSurvey. Upon completing the survey, participants were redirected back to Prolific, where they were compensated for their participation. To determine the required sample size, we conducted a simulation-based a priori power analysis in R. Based on a previous study with a similar setup and the same vignettes18, we estimated laypeople’s accuracy at about 59%. Since the SAA Ada solved 19 out of 27 cases correctly, we estimated its accuracy at about 67%. Using these values, we constructed a simulated dataset, specified a generalized linear mixed-effects model with a logit link function, and conducted the simulation-based power analysis with a desired power of 1-β = 0.80 and a significance level of α = 0.05. This resulted in a targeted sample size of 540 participants. Since we expected some users to answer inattentively, we oversampled by 10% and targeted a sample size of 600 participants. Participants received 0.53€ as compensation for taking part in the study, which took about 4 min. As a motivation to answer correctly and to increase data quality, they received an additional 0.06€ if they made the correct final decision28.
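
To make the simulation-based approach more concrete, the following R sketch estimates power for the pre–post accuracy contrast (59% vs. 67%) with a binomial mixed-effects model and a participant-level random intercept; the random-intercept SD, the number of simulations, and the model structure are assumptions, and the authors’ actual simulation code may differ:

# Hedged sketch of a simulation-based power analysis, not the study's code.
library(lme4)

simulate_power <- function(n, n_sims = 200, p_pre = 0.59, p_post = 0.67) {
  p_values <- replicate(n_sims, {
    id   <- factor(rep(seq_len(n), each = 2))                      # one pre and one post rating per participant
    time <- factor(rep(c("pre", "post"), times = n), levels = c("pre", "post"))
    re   <- rnorm(n, sd = 0.5)[as.integer(id)]                     # random intercepts (SD is an assumption)
    eta  <- ifelse(time == "pre", qlogis(p_pre), qlogis(p_post)) + re
    correct <- rbinom(2 * n, size = 1, prob = plogis(eta))
    fit <- glmer(correct ~ time + (1 | id), family = binomial)
    summary(fit)$coefficients["timepost", "Pr(>|z|)"]
  })
  mean(p_values < 0.05)                                            # estimated power
}

# simulate_power(540)  # should land near 0.80 under these assumptions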

Materials & procedure

The interviews followed an interview guide developed by SMW and SK, refined with input from all authors, and pilot-tested with three participants. Once finalized, the interview guide remained unchanged throughout data collection. Interviews were audio-recorded using smartphones. At the beginning, participants provided sociodemographic information. They were then randomized to use either the LLM ChatGPT or the SAA Ada. Participants then received one of 27 validated case vignettes (stratified for each intervention group to ensure a similar distribution across both groups). The vignette set was taken from a previous study in which it was validated and constructed according to the RepVig Framework to ensure external validity18,29. The vignettes describe real cases from patients who (a) experienced the described symptoms, (b) prospectively wrote them down while deciding if and where to seek care, and (c) consulted the internet for decision support. Thus, the vignettes generalize well to the typical use case for which SAAs are approached. An overview of the symptom distribution across these vignettes can be found in Table 1.

Table 1 Symptom distribution of included case vignettes.

Participants were then asked how they would respond if they or someone close to them experienced these symptoms. They could choose from the following options, based on Levine et al.20: (1) ‘Call an ambulance or go directly to the emergency room. The symptoms must be treated immediately’, (2) ‘See or call a (family) doctor. The problem requires medical clarification, but is not a life-threatening emergency’, or (3) ‘The symptoms can be treated by yourself. It is probably not necessary to see a doctor’. Participants then used either ChatGPT or the SAA Ada to get a recommendation on what they should do. Throughout the interaction, they verbalized their thoughts in real time using the think-aloud method30. In cases in which thinking aloud did not work properly, participants were prompted to tell us what they were currently thinking. Following this interaction, participants reassessed the self-triage level of the case and explained the reasoning behind their final decision. The interviews were transcribed verbatim using MAXQDA with an adapted GAT 2 system.

In the RCT, participants followed the same approach: they were asked about their sociodemographic characteristics, previous experience with SAAs and LLMs, their health using the Minimum European Health Module31, their self-efficacy using the Allgemeine Selbstwirksamkeit Kurzskala32, and their technological affinity using the Affinity for Technology Scale Short33. Afterwards, they were presented with the case vignette and asked how they would respond if they or someone close to them experienced these symptoms. Participants were then randomized into one of two intervention groups: receiving advice either from the LLM ChatGPT or from the SAA Ada. Because they did not know about other possible intervention groups, they were blinded to their group assignment. Participants were shown the result of ChatGPT (obtained using the prompt “Dear ChatGPT, I have a medical problem and hope you can give me some advice. The following are my symptoms: [Case Vignette] How urgent do you think it is and do I need to see a doctor or take other action?”, which represents a synthesis of how participants in the qualitative part asked their questions) or of the SAA Ada, which was selected because it was one of the best-performing SAAs in previous studies10,18. Recommendations from the SAA were obtained by the lead author and two research assistants, who entered the symptoms independently of each other using standardized entry forms10,34. The included recommendation was determined by the majority result34. We decided to show the corresponding results screens rather than have participants interact with the systems directly in order to maintain internal validity. After viewing the results, participants reassessed the self-triage level they thought was most appropriate. The primary outcome was self-triage accuracy and the secondary outcome was the change in urgency.
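
As a concrete illustration of the majority-vote step, the following hypothetical R helper (not the study’s code) selects the modal recommendation from the three independently obtained SAA results; with three raters and three triage levels a three-way tie is possible and would require an explicit tie-breaking rule:

# Hypothetical helper: derive the recommendation shown to participants
# from three independently entered SAA results.
majority_recommendation <- function(entries) {
  counts <- sort(table(entries), decreasing = TRUE)
  names(counts)[1]   # most frequent triage level; ties resolve arbitrarily here
}

majority_recommendation(c("see a doctor", "see a doctor", "self-care"))
#> [1] "see a doctor"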

Data analysis

For the qualitative analysis, we applied reflexive thematic analysis as outlined by Braun & Clarke35 using MAXQDA. We chose this method and an inductive approach due to its suitability for identifying emergent themes in studies where complex decision-making processes are examined without predefined hypotheses35. The analysis began with an initial familiarization with the data, followed by the generation of initial codes. These were grouped into categories and themes, which were iteratively revised with the whole research team until they reflected the data and addressed the research question.

The quantitative data were analyzed using the symptomcheckR package, which is designed for analyzing self-triage data36. We used a mixed-effects regression model with the participant as a random effect and both the time point and system as fixed effects. The primary outcome, accuracy, was analyzed using binomial logistic regression. The secondary outcome, participants’ perceived urgency, was analyzed descriptively. To identify differences between the systems for the primary outcome, we conducted contrast tests that controlled for participants’ initial decisions. The p-values were adjusted for multiple comparisons by controlling the false discovery rate using the Benjamini-Hochberg procedure37.
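
The exact modeling code is not reproduced here; as a hedged sketch, the model described above could be specified with the widely used lme4 and emmeans packages as follows, assuming a long-format data frame df with hypothetical column names (participant, time, system, correct) and assuming a time-by-system interaction so that pre–post contrasts can be formed per system:

# Sketch of the analysis model, not the study's exact code.
library(lme4)      # mixed-effects models
library(emmeans)   # contrasts on the fitted model

# df: one row per participant and time point; correct coded 0/1 (assumed format)
fit <- glmer(correct ~ time * system + (1 | participant),
             data = df, family = binomial)

# Pre- vs. post-advice contrast within each system, Benjamini-Hochberg adjusted
emm  <- emmeans(fit, ~ time | system)
cons <- as.data.frame(summary(contrast(emm, method = "revpairwise")))
cons$p.adj <- p.adjust(cons$p.value, method = "BH")   # false discovery rate control
cons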

Results

Participants

Twenty-four people participated in the interviews. Their characteristics are shown in Table 2. In the quantitative study, 631 people started the survey, of whom 16 were excluded because they were medical professionals and 10 because they did not finish the questionnaire. Subsequently, we excluded two people because they failed the attention check (before randomization), one because they did not read the vignette (in the ChatGPT group), one because of self-reported technical problems (in the SAA group), and one for stating that their data should be excluded for other reasons (in the SAA group). Thus, the final dataset included data from 600 participants (301 in the SAA group and 299 in the ChatGPT group). Their characteristics are shown in Table 3.

Table 2 Description of participants in the interviews.
Table 3 Description of participants in the RCT.

Identified themes

We identified three themes relating to decisional influences ‘before the interaction’, ‘during the interaction’, and ‘after the interaction’, comprising eight categories: (1) Certainty in own appraisal, (2) Expectations, (3) Data basis, (4) Perceived personalization, (5) Information gathering and information analysis, (6) Explainability, (7) Information integration, and (8) Difficulties in information integration. These themes and a corresponding summary are shown in Table 4.

Table 4 Identified themes and categories

Before interacting with the system, participants expressed different levels of certainty in their own symptom assessments, which influenced whether they sought additional information. Participants entered the interaction with varying – both positive and negative – expectations and mentioned the data basis as a specific factor that influenced their expectations.

Participants often had an initial idea of whether symptoms require care. If they were very confident in their assessment, they decided directly and were unlikely to consult additional sources of information. However, if they were unsure about the symptoms or uncertain in their own assessment, they were prepared to seek additional information:

Then I just think to myself: ‘from a common sense point of view, it is probably water retention if it happens often and she’s not in pain’. (…) I would then not consult the internet and no AI either. (P16)

If I had [these symptoms] now, I would google it – put symptoms together or think of something else so that I can rule out certain things. Or get a hint: Please go to the doctor. (P13)

Participants frequently based their initial decision on personal experiences or those of others. If they had experienced similar symptoms before, they relied on that experience and made a similar decision:

I stay with my first thought (…). Probably because I have had it myself and I have just taken this past experience and imposed it on the person or I just relied on my past experience for this decision. (P14)

Participants had both negative and positive expectations about SAAs/LLMs, based on their prior experiences of interacting with them. One concern that participants expressed was a fear of misdiagnosis and a general mistrust of SAAs/LLMs. For example, one participant stated:

I think a big problem is that misdiagnoses can lead to major psychological stress. Or even in the case of misdiagnoses that say there is nothing wrong, it can of course have a negative impact on health. (P17)

Conversely, some participants had a high initial trust in the systems and strongly believed that the system is more knowledgeable than themselves. This was often attributed to their lack of medical expertise:

I would think that ChatGPT has more background knowledge than I do (…) and could therefore answer the question better if in doubt. (P24)

This quote also points to a key factor influencing participants’ initial trust and expectations of the systems: how they perceive the system’s database in relation to their own knowledge.

Participants highlighted the data basis of both systems. For the SAA, they assumed that – because it was an approved medical device – the data basis must be reliable. In contrast, they viewed the data basis of ChatGPT as too general, untrustworthy, and thus did not consider ChatGPT reliable:

Because I know it’s a database that doesn’t lie. There are causes and symptoms that are linked and there’s just (…) a combination of multiple causes or multiple symptoms that cause (…) certain conditions. And these models don’t lie if they’re fed with the right data. (P13, used the SAA)

I think you would have to design a special machine learning model or something or link it to a database of medical facts, because as it is now, I don’t think it would bring sufficient plausibility or verifiability. (P5, used ChatGPT)

Whereas the previous themes refer to aspects that emerged before participants used SAAs or LLMs, the following quotes describe aspects during their use.

When interacting with the system, participants required a high degree of personalization to trust it. They used the system to gather more information and/or to narrow down their decision options and to analyze the available information. Additionally, a high level of explainability helped them make informed decisions and assess the uncertainty of a specific course of action.

When using the SAA and ChatGPT, participants used the tools both to explore symptoms and get more information, and to integrate information to get closer to a solution. Those who felt uninformed beforehand used the tool to find additional information on their symptoms. Conversely, participants who felt well-informed based on their existing knowledge skipped this step. Both groups used the tools to analyze the information, reduce the decision space, and ultimately get closer to a decision.

For example, a participant used it to obtain more information and said:

Well, the app as a tool is quite influential. Especially because you get a lot of information again, about things you have not considered before. (P21)

Another participant used it to analyze the symptoms and reported the following:

So it’s really like a selection, simply that you’re given certain things and then (…) the choices are narrowed down and then you get closer and closer to the diagnosis. (P3)

Whereas this theme focuses on the content of the responses, the following two themes – perceived personalization and explainability – relate more to the form and format of the responses provided.

Participants expressed a clear desire for personalized results from the symptom assessment process. They showed higher trust and were more likely to rely on the recommendations when these were personalized with respect to the specific situation they had described. They disregarded information from the system in instances where it gave unspecific answers or overlooked information they had entered:

I honestly don’t feel like it advises me particularly well because the answer is very generic. So for example, the first sentence is: ‘it’s advisable to make a doctor’s appointment as soon as possible, especially if your symptoms are new, unexpected, (…) or worsening’. That’s the kind of answer you would write in a guide. But I already described my symptoms in the beginning. So (…) I would expect the program to skip the general instructions and respond personally to what I have written. (P10, used ChatGPT)

It didn’t just respond superficially, but it also went a bit into detail from the description I gave, which I thought was good. (…) Just always this going back to what I said: it’s been like that for months; it was a lump. Yes, this can mean different things with different implications, so all of this was trustworthy. (P5, used ChatGPT)

The systems’ ability to provide personalized responses and – in the case of ChatGPT – to hold social conversations that feel close to human contact led to high trust and made the system convincing:

This direct approach to the specific question, so not just this keyword search ‘And here are 50 suggestions that could be an answer’, but you get a direct, personal, trustworthy answer, as if you were talking to a real person. And that’s what creates this trust, this direct chat. (P24)

Participants found it helpful when the SAA provided quantifiable estimates of uncertainty alongside its recommendations, such as stating “x out of 10”. Conversely, they noted that ChatGPT did not offer any quantifiable uncertainty, which they found less helpful:

And then on the fifth place, ‘lateral malleolar fracture’. 4 out of 100 people. Oh wow. Well then, I’ll go with the most likely (P6, used the SAA)

Just the statistics. So that’s missing. Well, I say, the diagnoses that ChatGPT throws out are very intensive and not very quantified. So, I’ll also say he throws around technical terms without knowing who he’s talking to (…) or how seriously I take it or how many people have actually got these diseases. (P24, used ChatGPT)

Both systems were criticized for their lack of explainability, as participants would have wished to understand how the systems arrived at their recommendations:

I don’t feel like it’s explained enough here (…) how ChatGPT arrives at something else. So, the explanation for a specific recommendation is not presented. It’s not rule-based enough for me, let’s put it that way. It doesn’t say: ‘okay, it’s this [symptom], so I would say with a greater probability this [disease], because this was like that in the past as a result of this and that’. (P11)

After completing the interaction, participants attempted to integrate the new information and recommendations they received with what they knew before. Based on their prior experience, expectations, and knowledge, they evaluated the recommendations and either accepted them, combined them with their own understanding, or sought additional information and thereby started another iteration of gathering and analyzing information. If participants faced difficulties in information integration, they decided to see a physician and often concluded that the situation was urgent.

Participants tried to integrate the information they received and critically appraised the recommendations. They generally did so by verifying the advice with their own information and previous experience. If the recommendation was easily integrable into their previous knowledge, they readily accepted the system’s recommendation:

I think the self-care measures that are presented are good and would be enough for me. I would also follow them, also because I can say in comparison to past experiences that similar things have helped and it works, so in my head it makes sense. (P4)

During the information integration, some participants still felt an informational need and tried to cross-verify the recommendation with other sources of information. Therefore, they started the cycle of gathering and analyzing information again:

And yet, I would still seek out a few more sources of information. In a similar way, where, if I go to a doctor and get a diagnosis, I usually come home and then read up a bit more about it. (…) With ChatGPT, it might be more about checking if what it told me is correct. (P4)

In some instances, participants faced difficulties integrating information and instead decided to terminate information seeking and sought care quickly to reduce this conflict:

[I would advise to] See a doctor very quickly (…) because there could be so many different diagnoses, which you wouldn’t think of before, I would simply, yes, see a doctor, because I can’t diagnose it myself at all. I have no idea whatsoever and before I go crazy, I would see a doctor as soon as possible. (P3)

If the advice drastically contradicted their prior beliefs, integrating it with their existing knowledge was difficult. As a result, they opted for a more urgent level of care:

Because I feel so confused now? Because initially, before I consulted ChatGPT, I was pretty sure about my decision that it wasn’t urgent. And (…) I assumed that it was not urgent and somehow these answers have confused me now because of course he first listed the worst [diagnoses], so to speak, which I would somehow still rule out now. (P16)

They are somehow very different things. (…) Now that I see it, I would go to the doctor really quickly because it could be different things and serious things. (P5)

These exploratory qualitative findings suggest that translating advice from SAAs/LLMs into action is not a linear process but an interactive and often iterative one. A better understanding of the underlying dynamics is therefore essential to identify points of intervention that help improve self-triage decisions. Based on these insights, improving the accuracy of these tools alone does not necessarily result in a medically correct action, as the performance of human-in-the-loop systems depends as much on the system as on the users and the ambiguity of the experienced symptom patterns.

Decision improvement

In the RCT, participants increased their self-triage accuracy from 53.2% to 64.5% when using the SAA (OR = 2.52 [1.50–3.55], z = 3.75, p < 0.001) but did not show such an increase when using ChatGPT (54.8% pre vs. 54.2% post usage, z = −0.27, p = 0.79). The difference in accuracy when using the SAA versus ChatGPT was statistically significant (OR = 2.24 [1.50–3.89], z = 3.59, p < 0.001). Participants’ accuracy in detecting emergencies (SAA: 63.6% pre vs. 81.8% post, z = 1.34, p = 0.25; ChatGPT: 68.2% pre vs. 90.9% post, z = 1.72, p = 0.13) and non-emergency cases did not increase statistically significantly with either system (SAA: 82.8% pre vs. 83.4% post, z = 0.27, p = 0.79; ChatGPT: 83.8% pre vs. 85.7% post, z = 0.92, p = 0.45). However, they detected more self-care cases correctly when using the SAA (13.1% pre vs. 36.9% post, OR = 8.59 [3.47–14.2], z = 3.80, p < 0.001) and fewer when using ChatGPT (16.3% pre vs. 8.1% post, OR = 0.00005 [0.0000003–0.000002], z = −5.28, p < 0.001). These differences are shown in Fig. 2.

Fig. 2: Change in self-triage accuracy when using the SAA Ada and ChatGPT.

In cases in which participants were initially correct but received incorrect advice from the SAA, they remained at a correct solution in 72% of cases (21/29). Conversely, if they were incorrect but received correct advice, they changed their appraisal to a correct decision in 37% (63/170) of all cases. The same was observed for ChatGPT: in cases in which participants were correct but received incorrect advice, they remained at the correct solution in 61% (22/36) of all cases. If they were incorrect but received correct advice, they changed their appraisal to a correct decision in 57% (38/52) of all cases.

Among participants seeing results from the SAA, most participants remained at their initial appraisal (73%, 221/301). If they changed it, 16% (48/301) decreased their urgency, whereas 11% (32/301) increased it. Among participants seeing results from ChatGPT, most participants remained at their initial appraisal as well (83%, 249/299). However, if they changed it, most participants increased their urgency (13%, 39/299) and only a minority decreased it (4%, 11/299). Urgency changes are shown in Fig. 3.

Fig. 3: Change in urgency when using the SAA Ada and ChatGPT.

Discussion

Our quantitative results demonstrate that laypeople make more accurate decisions when using a well-performing SAA compared to making decisions on their own, thus supporting our first hypothesis. However, this improvement was not observed when participants used ChatGPT for advice, leading us to reject our second hypothesis. Although laypeople did not achieve the high accuracy levels of the tested SAA, they approached these levels. Notably, they frequently ignored incorrect recommendations, especially when using ChatGPT. This observation suggests that whereas users benefit from correct advice, they are less frequently misled by incorrect suggestions. On a broader scale – since most interactions with SAAs in the real world refer to non-emergency or self-care cases38 and laypeople tend to be very risk-averse when unassisted39 – our findings indicate that using self-triage decision support systems like SAAs leads to a shift towards lower urgency. This result aligns with previous research, which highlighted the usefulness of SAAs in encouraging users to treat their conditions at home (when appropriate) and reducing unnecessary visits to healthcare facilities22,40,41. The results also demonstrate the importance of considering humans in the loop when evaluating SAAs and LLMs, as the performance of the team is different from isolated SAA accuracy.

Our qualitative study identified factors before, during, and after the interaction that influence the decision-making process in a human-SAA team. Before the interaction, participants’ certainty in their own assessments determined whether they would seek additional information in the first place. When they did, their expectations of the system and its data basis influenced whether and how they accepted advice from SAAs and LLMs. During the interaction, personalization and explainability played important roles. If users perceived a high degree of personalization – particularly with LLMs as conversational agents – they experienced more trust and were more likely to rely on the recommendations. These findings align with a systematic review on advice-taking42, which suggests that decision makers prefer advisors with high (perceived) expertise and relatability, both of which may be conveyed by these systems. In the case of ChatGPT, not communicating uncertainty might have given the impression of high certainty, and writing like a human might have increased relatability through being perceived as a social agent. For the SAA, its use of highly professionalized language and the provision of extensive information may have led to its being perceived as an expert system.

Our qualitative results can also be understood in relation to the stage model of diagnostic decision-making24. Factors before the interaction correspond to the influence of existing knowledge on the information gathering phase; factors during the interaction align with information gathering and analysis; and factors after the interaction correspond to information integration. From this perspective, laypeople’s decision-making follows an approach similar to that of physicians: their search for information is comparable to physicians’ inductive foraging, seeking more information on symptoms43, and their information analysis is similar to deductive inquiry, narrowing down the decision space to approach a decision44. This corresponds to information gathering and information analysis in the stage model of diagnostic decision-making, and users appear to use SAAs and LLMs specifically for these steps. Afterward, and in line with the model, they integrate this information24. At this stage, most users critically evaluate the recommendation before making a final decision. This indicates that SAAs are often used to complement rather than replace individual decision-making45.

In summary, our findings can be used to adapt the stage-model of diagnostic decision-making24 for application in self-triage decisions. The resulting stage model of technology-assisted self-triage decision-making can be found in Fig. 4.

Fig. 4: Stage model of technology-assisted self-triage decision-making.

The decision-making process begins when laypeople experience symptoms themselves or try to advise others. If they have encountered similar symptoms before or are highly confident in an initial assessment of the appropriate care pathway, they make a decision directly based on their previous experience and knowledge. However, if they are uncertain about their initial assessment or have no idea what to do, they seek additional information, e.g., from technological systems46. They then input their symptoms into the system (information gathering), which analyzes the information to provide a recommendation (information analysis). During the information integration phase, users try to integrate the recommendation and any other new information they received (e.g., on potential diagnoses) with their prior experience, expectations, and existing knowledge. If users believe they have identified the correct solution, they may conclude the process and determine a final self-triage level. In cases in which the recommendation and information are compatible with their previous information, they simply accept the received recommendation. Conversely, if the recommendation conflicts with their previous information, they weigh all informational cues to arrive at a decision. If they still face an informational need, they may gather additional information and restart the cycle of information gathering and information analysis. Alternatively, if users feel overwhelmed by too much information or face highly conflicting information, they may abort the self-triage decision, see a physician instead, and opt for a care pathway with high urgency. This approach allows them to reduce perceived uncertainty by deferring the integration of the wealth of complex information to a medical professional47.

In contrast to the original model, our model suggests that users may not engage in the full decision-making cycle. Instead, they might use the recognition heuristic to make a decision before any additional information is sought48: That is, if they recognize the symptoms from previous experiences, people may directly choose the course of action that was successful in those situations. Additionally, our model differs from the original model in terms of task allocation. In Kämmer et al.’s study24, information analysis and integration were important collaborative tasks in physician pairs, while information gathering was not a relevant part of the process49. In contrast, for human-in-the-loop teams comprising a layperson and an SAA, both information gathering and analysis seem to be mainly allocated to the SAA, whereas the human user is the predominant agent during information integration. The layperson, being responsible for the information integration phase, thus typically has the role of a supervisor, making the final decision50.

On a systemic level, the burden on the healthcare system might increase if information integration fails for too many users, who then seek urgent care to reduce their perceived uncertainty. Our quantitative data suggest that this concern is valid, as we observed an increase in users planning to seek emergency care after using the SAA/LLM. To weigh the positive against the negative consequences of this pattern, we use concepts from Signal Detection Theory (SDT)51. Based on SDT, potential positive consequences relate to users who experience severe symptoms and thus seek emergency care (Hits) and users with minor symptoms who actually opt for self-care (Correct Rejections). Potential negative consequences relate to users who decide on self-care even though they experience severe symptoms (Misses) or who seek emergency care for minor symptoms (False Alarms). Fortunately, our data show that users tend to be risk-averse and defer their decision to a medical professional if in doubt, which reduces the risk of Misses. Thus, based on our data, the primary concern is that SAAs/LLMs may increase the potential burden on emergency services due to False Alarms. Future research should focus on designing SAAs/LLMs that minimize these errors by supporting users who face difficulties in information integration, e.g., by explaining why a recommendation was given so that users can cross-check the reasoning against their symptoms. A potential system design improvement might also include referral to a follow-up hotline, where users who perceive a high level of uncertainty can clarify their SAA recommendation with a medical professional.
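
Purely as an illustration of this SDT framing (hypothetical data and labels, not study data), the mapping from case urgency and final decision to the four SDT categories can be written as:

# Hypothetical illustration of the SDT categories described above.
classify_sdt <- function(truly_urgent, sought_urgent_care) {
  ifelse(truly_urgent & sought_urgent_care,  "Hit",
  ifelse(truly_urgent & !sought_urgent_care, "Miss",
  ifelse(!truly_urgent & sought_urgent_care, "False Alarm",
                                             "Correct Rejection")))
}

classify_sdt(truly_urgent       = c(TRUE, TRUE, FALSE, FALSE),
             sought_urgent_care = c(TRUE, FALSE, TRUE, FALSE))
#> [1] "Hit"  "Miss"  "False Alarm"  "Correct Rejection"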

This study is not without limitations. First, we used the think-aloud method to gain insights into participants’ decision-making when interacting with SAAs/LLMs using representative stimuli. Although this provides in-depth insights into cognitive processes, it limits control of intraindividual versus environmental factors. For example, some individuals may have had a general tendency toward risk-averse decisions, whereas in other situations the symptoms provided in the vignettes could have triggered similar, situation-specific risk assessments in individuals who are generally more risk-prone. Thus, our model does not allow us to disentangle the influence of intrapersonal versus situation-specific differences in risk perception on decision making. Second, although we used a validated set of case vignettes with greater external validity than traditional vignettes18,29, the nature of vignette-based studies still poses limitations compared to real-world evidence. Participants did not experience the symptoms themselves but only read about others’ symptoms, which could alter decision-making processes compared to experiencing symptoms directly. Nevertheless, SAAs are frequently used to make decisions for other people5. Another limitation concerns our experimental setup. Whereas participants entered the symptoms into the SAAs/LLMs themselves during the interviews, in the RCT they were entered by the study team and participants only saw the results screen. The lack of direct interaction with the tools may have influenced outcomes, given the high individual variability in symptom entry and the fact that even minor input differences can lead to different recommendations34,52. Although this setup limits generalizability, it increases internal validity by allowing us to examine the decision-making processes and the impacts of SAAs and LLMs in a controlled environment without symptom entry variability. To ensure the reliability of the recommendations shown in this study, we had three individuals enter the symptoms independently and based the final recommendation we gave to participants on the majority result. This approach was proposed by Meczner et al.34 after specifically examining variability in symptom entry and how to deal with it. Future studies should include symptom entry variability in the experimental setup by allowing participants to input symptoms themselves, replicating our results with higher external validity. An additional limitation of our quantitative study is that we did not account for participants’ trust in the SAA and LLM. Future studies should explore whether trust levels differ between SAAs and LLMs and, if so, how they influence decision making.

Finally, we selected one of the best-performing SAAs for this study, as prior research suggests that only high-performing SAAs should be implemented10,18. Thus, it remains unclear how decisions might be influenced by a low-performing SAA and whether participants could counteract incorrect recommendations if they make up the majority of received recommendations.

In conclusion, laypeople seem to use SAAs and LLMs as decision aids rather than as replacements. When working alongside SAAs as a human-SAA team, they make more accurate decisions than they would on their own, especially because they are able to compensate for incorrect recommendations given by SAAs/LLMs. Given laypeople’s risk-averse approach, SAAs can be particularly effective in the real world in helping to identify self-care cases correctly – a decision in which these systems outperform laypeople’s independent decisions. Because self-triage decision-making is complex, and the user’s role involves integrating all available information with their previous experience and knowledge to arrive at a decision, our findings highlight the importance of studying SAAs and laypeople not in isolation, but as integrated human-SAA teams in which humans play an active role in the decision-making process.

