Visual cognition in multimodal large language models

Main

People are quick to anthropomorphize, attributing human characteristics to non-human agents1. The tendency to anthropomorphize has only intensified with the advent of large language models (LLMs)2. LLMs apply deep learning techniques to generate text3, learning from vast datasets to produce responses that can be startlingly human-like4. Astonishingly, these models can not only generate text. When scaled up to larger training datasets and architectures, other, so-called ‘emergent abilities’ appear5,6. Current models can, for example, pass the bar exam7, write poems8, compose music9 and assist in programming and data analysis tasks10. As a result, the line between human and machine capabilities is increasingly blurred11,12. People not only interact with these systems as if they were humans13, but they also start to rely on them for complex decision-making14, artistic creation15 and personal interactions16. It is, therefore, natural to ask: Have we built machines that think like people?

Judging whether or not artificial agents can mimic human thought is at the core of cognitive science17,18. Therein, researchers have long debated the capabilities of artificially intelligent agents19,20,21. In a seminal paper, Lake and colleagues22 proposed core domains to consider when making such judgements. Published during the height of the deep learning revolution23, the authors focused on domains that were easy for people but difficult for deep learning models: intuitive physics, causal reasoning and intuitive psychology.

Research on intuitive physics has studied how people perceive and interpret physical phenomena24,25,26. Past work on this topic has emphasized that humans possess an innate ability to predict and understand the physical properties of objects and their interactions27, even from a young age28, a notion sometimes summarized as a ‘physics engine’ in people’s heads29. This understanding includes concepts such as gravity30, inertia31 and momentum32. Some of the most canonical tasks in this domain involve testing people’s judgements about the stability of block towers33,34. These tasks have made their way into machine learning benchmarks35,36, where they are used to test the intuitive physical understanding of neural networks (see ref. 37 for an overview of previous work on building models with human-like physical knowledge).

Research on causal reasoning has studied how individuals infer and think about cause–effect relationships38,39,40. Past work on this topic has proposed that humans possess an intuitive capacity to infer, understand and predict causal relationships in their environment41,42,43,44, oftentimes described using Bayesian models of causal learning45,46. This cognitive ability encompasses recognizing patterns47,48, inferring causes from interventions49,50, and predicting future events based on hypothetical events51. Canonical tasks in this domain often involve assessing individuals’ ability to infer causal relationships, for example, when judging the responsibility of one object causing other objects’ movement52,53. Causal reasoning remains a challenge, even for current machine learning approaches54,55.

Research on intuitive psychology has explored how individuals infer, understand and interpret social phenomena and mental states of other agents56,57. Past work on this topic has emphasized the concept that humans possess an inherent ability to infer and reason about the mental states58,59, intentions and emotions of others, often referred to as a ‘theory of mind’60,61. This ability has been modelled as a Bayesian inference problem62,63,64. Canonical tasks in this domain often involve assessing individuals’ capacity to predict actions based on understanding others’ perspectives or intentions, such as determining agents’ utility functions based on their actions in a given environment65,66. It is the subject of ongoing debates whether modern algorithms show any form of intuitive psychology67,68,69.

Lake and colleagues argued that some of these abilities act as ‘start-up software’, because they constitute cognitive capabilities present early in development. Moreover, they proposed that these so-called ‘intuitive theories’70,71 need to be expressed explicitly using the calculus of Bayesian inference72, as opposed to being learned from scratch, for example, via gradient descent. However, with the increase in abilities of current neural networks, in particular LLMs, we pondered: Can LLMs, in particular vision LLMs, sufficiently solve problems from these core domains?

To address this question, we took canonical tasks from the domains of intuitive physics, causal reasoning and intuitive psychology that could be studied by providing images and language-based questions, and submitted them to some of the most advanced LLMs currently available. To evaluate whether the LLMs show human-like performance in these domains, we followed the approach outlined in ref. 73: we treated the models as participants in psychological experiments. This allows us to draw direct comparisons with human behaviour on these tasks. Because the tasks are designed to test abilities in specific cognitive domains, this comparison allows us to investigate in which domains multimodal LLMs perform similarly to humans and in which they do not. Our results showed that these models can, at least partially, solve these tasks. In particular, two of the largest currently available models, OpenAI’s Generative Pre-trained Transformer (GPT-4) and Anthropic’s Claude-3 Opus, managed to perform robustly above chance in two of the three domains. Yet crucial differences emerged. First, none of the models matched human-level performance in any of the domains. Second, none of the models fully captured human behaviour, leaving room for domain-specific models of cognition such as the Bayesian models originally proposed for the tasks.

Related work

There have been a large number of studies on reasoning abilities in LLMs74,75,76. Previous studies have focused, among others, on testing LLMs’ cognitive abilities in model-based planning73, analogical reasoning tests77, exploration tasks78, systematic reasoning tests79,80, psycholinguistic completion studies81 and affordance understanding problems82. In this sense, our contribution can be seen as part of a larger movement in which researchers use methods from the behavioural sciences to understand black box machine learning models83,84,85. However, most of the previous studies did not investigate multimodal LLMs but rather remained in the pure language domain. Although there have been recent attempts to investigate vision LLMs’ cognitive features, including their reaction to visual illusions86 as well as how they solve simple intelligence tasks87, we instead investigate the proposed core components of cognition in these models.

Previous work has also examined how LLMs solve cognitive tasks taken from the same domains we study here. In intuitive physics, Zečević and colleagues88 found that LLMs performed poorly in a task using language descriptions of physical scenarios. Zhang and colleagues89 extracted programs from text produced by LLMs to improve their physical reasoning abilities. Finally, Jassim and colleagues90 proposed a new benchmark for evaluating multimodal LLMs’ understanding of situated physics. In causal reasoning, Binz and Schulz73 showed that GPT-3 failed at simple causal reasoning experiments, while Kosoy and colleagues91 showed that LLMs cannot learn human-like causal over-hypotheses. In research on intuitive psychology, Kosinski argued that theory of mind might have emerged in LLMs68, which has been criticized by other researchers69. Akata and colleagues showed that GPT-4 plays repeated games very selfishly and could not pick up on simple conventions such as alternating between options16. Finally, Gandhi and colleagues92 proposed a framework for procedurally generating theory of mind evaluations and found that GPT-4’s abilities mirror human inference patterns, although less reliably, while all other LLMs struggled.

Many past studies on LLMs have run the risk of their materials appearing in new models’ training sets. Recent work has recognized this issue and, in turn, evaluated language models on many problem variations to minimize training set effects93. Our work differs from these approaches in that current models could not simply have memorized solutions to the given problems, because these problems require higher-level reasoning. Furthermore, the human data and ground truth are most commonly stored in additional data files, which first have to be extracted and matched to the respective images to be used for model training. Since this requires data wrangling that cannot easily be automated, and the number of stimuli to be gained is so small, it is extremely unlikely that these stimuli together with the ground truth entered the training set of any of the investigated models.

Results

We tested five different models on three core components of human-like intelligence as outlined in ref. 22 (Fig. 1a). The models we used are vision LLMs, which are multimodal models that integrate image processing capabilities into LLMs94,95 (Fig. 1c). These models allow users to perform visual question answering96,97: users can upload an image and ask questions about it, which the model interprets and responds to accordingly.

Fig. 1: Overview of domains, tasks, approach and models.

a, Example images for the different experiments. Each experiment was taken from one of three cognitive domains: intuitive physics, causal reasoning and intuitive psychology. b, General approach. For every query, an image was submitted to the model, and different questions were asked about the image, that is, we performed visual question answering. c, The multimodal LLMs used and their sizes. MLLM, multimodal LLM.


To test the three core components, we used tasks from the cognitive science literature that could be studied in vision LLMs via visual question answering. For every task, we queried the visual reasoning abilities of the LLMs with tasks of increasing complexity. First, we asked about simple features of the shown images such as the background colour or the number of objects shown. Afterwards, we submitted questions taken from the cognitive science experiments. We report results based on comparisons with the ground truth as well as the different models’ matches to human data.
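The evaluation procedure can be summarized as a loop over images and questions of increasing complexity. The sketch below is a minimal illustration of this setup; the query_model wrapper is hypothetical (a stand-in for the concrete open- and closed-source interfaces described in the Methods) and the question strings are illustrative rather than the exact prompts used.

```python
# Minimal sketch of the visual question answering loop described above.
# `query_model` is a hypothetical wrapper around whichever vision LLM is being
# evaluated; it is not part of any real library.
from pathlib import Path


def query_model(image_path: str, question: str) -> str:
    """Placeholder: send one image plus one question to a vision LLM."""
    raise NotImplementedError("plug in an open- or closed-source model here")


def run_task(image_dir: str, questions: list[str]) -> dict[str, list[str]]:
    """Ask every question (from simple scene queries to the cognitive task)
    about every image and collect the raw text answers."""
    answers: dict[str, list[str]] = {}
    for image_path in sorted(Path(image_dir).glob("*.png")):
        answers[image_path.name] = [
            query_model(str(image_path), q) for q in questions
        ]
    return answers


# Illustrative question set, ordered from simple scene features to the task itself.
questions = [
    "What is the background colour of the image?",
    "How many blocks are shown in the image?",
    "Is the depicted block tower stable or unstable?",
]
```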

Intuitive physics with block towers

To test the intuitive physics capabilities of the different LLMs, we used photographs depicting wooden block towers from ref. 98 (see Supplementary Fig. 1 for an example). We first asked models to determine the background colour of the image. All models achieved almost perfect accuracy (Fig. 2a). We then asked models to state the colour of blocks from top to bottom. Here, the performance of all models except GPT-4V and Claude-3 deteriorated (Fig. 2b). Note that the first two tasks are fairly trivial for humans, and we would expect human performance to be at 100% (the background colour is always white and images featured two, three or four blocks in primary colours).

Fig. 2: Results for five vision LLMs for tasks of increasing complexity given images of real block towers.

a–c, We first ask for the background colour in the image (a) (images were taken from ref. 98), then the colour of blocks from top to bottom (b) and finally a binary stability rating for the block towers (c). d, The last plot shows the square root of the R2 value for the Bayesian logistic mixed effects regression between models and human participants. Bars in plots a–c show percentage of correct answers with error bars given by the standard deviation of a binomial distribution (n = 100). Bars in plot d show the square root of the R2 values for Bayesian logistic mixed effects regressions with error bars given by the square root of the 95% percentiles for this R2 value (n = 10,700, number of images times number of human participants).


To test the models’ physical reasoning abilities, we asked them to give a binary stability judgement of the depicted block towers. Here, only GPT-4V and Claude-3 performed numerically above chance (Fig. 2c), and only GPT-4V did so significantly (Fisher’s exact test yielded an odds ratio of 2.597 with a one-sided P value of 0.028). None of the other models performed significantly above chance (the second best performing model, Claude-3, had an odds ratio of 2.016, with a one-sided P value of 0.078). Human participants were also not perfect but showed an average accuracy of 65.608%.
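As an illustration, the above-chance comparison can be sketched as follows: cross-tabulate the binary stability predictions against the ground truth and run a one-sided Fisher’s exact test. The exact construction of the contingency table is an assumption here; only the test itself is taken from the text.

```python
# Hedged sketch of the above-chance test for binary stability judgements.
import numpy as np
from scipy.stats import fisher_exact


def above_chance_test(predictions: np.ndarray, ground_truth: np.ndarray):
    """predictions, ground_truth: boolean arrays ('stable' = True), one entry per image."""
    # 2x2 contingency table of predicted vs actual stability.
    table = np.array([
        [np.sum(predictions & ground_truth), np.sum(predictions & ~ground_truth)],
        [np.sum(~predictions & ground_truth), np.sum(~predictions & ~ground_truth)],
    ])
    # One-sided test: odds ratio greater than 1 indicates above-chance agreement.
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return odds_ratio, p_value
```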

Finally, we determined the relationship between models’ and humans’ stability judgements using a Bayesian logistic mixed effects regression. We compute a Bayesian R2 for each regression model based on draws from the modelled residual variances99. We then take the square root of this Bayesian R2 and multiply it by the sign of the main regression coefficient to arrive at a pseudo r value. Around this pseudo r value we plot the square root of the 95% percentiles for the R2 value (Fig. 2d). We found that GPT-4V was the only model that showed a relation to human judgements, with a regression coefficient of 1.15 (95% credible interval (95% CI) 1.04, 1.27) and an R2 value of 0.066. However, the regression coefficient between individual humans and the mean over humans was still larger, with a coefficient of 1.46 (95% CI 1.41, 1.52) and an R2 value of 0.354.
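A minimal sketch of this pseudo r computation, assuming the posterior draws of the Bayesian R2 and of the main regression coefficient have already been extracted from the fitted mixed effects model (brms in the actual analysis):

```python
# Sketch of the pseudo r value: signed square root of the Bayesian R^2,
# with error-bar endpoints from the 95% percentiles of the R^2 draws.
import numpy as np


def pseudo_r(r2_draws: np.ndarray, coef_draws: np.ndarray):
    sign = np.sign(np.mean(coef_draws))        # sign of the main regression coefficient
    r = sign * np.sqrt(np.mean(r2_draws))      # pseudo r value
    lower, upper = np.sqrt(np.percentile(r2_draws, [2.5, 97.5]))
    return r, lower, upper
```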

Causal reasoning with Jenga

To test the models’ causal reasoning capabilities, we used synthetic images from refs. 100,101, which depicted block towers that were stable but might collapse if one of the blocks was removed (see Supplementary Fig. 2 for an example). We started by asking the models to count the blocks in the image. The images in this task displayed a larger number of blocks (ranging from 6 to 19), which made the basic counting task significantly more challenging than in the previous section. Models’ responses approximated the ground truth, albeit rarely matching it exactly. Therefore, we report the mean absolute distance to the ground truth instead of the percentage of correct answers (Fig. 3a). The models’ performance highlighted the challenging nature of this task, with the best performing model (Claude-3) still being on average more than one block off.

Fig. 3: Results for Jenga causal reasoning experiment.

a–d, We first ask for the number of blocks in the image (a), then we ask for the number of blocks that would fall if a specific block is removed and compute the absolute distance to the ground truth (b) as well as the absolute distance to human judgements (c), and finally a rating between 0 and 100 for how responsible a specific block is for the stability of the tower (d). The causal reasoning experiment was taken from ref. 100. For the responsibility ratings, all LLMs except for GPT-4V give constant ratings: Fuyu and Claude-3 always respond with 100, while Otter and LLaMA-Adapter V2 always respond with 50. Bars in plots a and b show absolute distance to ground truth with error bars given by the standard error of the mean (n = 42). Bars in plot c show the distance to human answers with error bars again given by the standard error of the mean (n = 41). Bars in plot d show the square root of the R2 values for Bayesian logistic mixed effects regressions with error bars given by the square root of the 95% percentiles for this R2 value (n = 1,470).


We continued by querying the models for the number of blocks that would fall if a specific block was removed from the scene (Fig. 3b,c). We established a baseline performance, represented by a horizontal line in Fig. 3b,c, which corresponds to a random agent that gave the mean between 0 and the number of blocks in each image as its prediction, essentially behaving like a uniform distribution over the possible number of blocks that could fall. Notably, both GPT-4V and Fuyu-8B surpassed the random baseline, their performance levels being close to the human results reported in ref. 100, which are depicted by the rightmost bar in the plot. However, GPT-4V still diverged significantly from the average over human participants (t(42) = 2.59, P < 0.05).
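The random baseline and the error metric described above amount to a few lines of NumPy; the following sketch assumes arrays of per-image predictions, ground-truth counts and total block counts:

```python
# Sketch of the Fig. 3 error metric and random baseline: the baseline predicts
# the mean of a uniform distribution over 0..n_blocks for each image, and
# performance is the mean absolute distance to the ground truth.
import numpy as np


def mean_absolute_distance(predictions: np.ndarray, ground_truth: np.ndarray) -> float:
    return float(np.mean(np.abs(predictions - ground_truth)))


def random_baseline(n_blocks_per_image: np.ndarray, ground_truth: np.ndarray) -> float:
    baseline_predictions = n_blocks_per_image / 2.0   # mean between 0 and n_blocks
    return mean_absolute_distance(baseline_predictions, ground_truth)
```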

Finally, we asked the models to rate the responsibility of a specific block for the stability of the other blocks (Fig. 3d). Notably, all models except for GPT-4V gave constant ratings for this task (Fuyu and Claude-3 always responded with 100, while Otter and LLaMA-Adapter V2 always responded with 50). The regression coefficient for GPT-4V with human values is 0.16 (95% CI 0.10, 0.21) with an R2 value of 0.027. The human-to-human regression has a coefficient of 0.54 (95% CI 0.45, 0.63) and an R2 value of 0.268.

Causal reasoning with Michotte

For the second test of causal reasoning abilities, we ran an experiment from ref. 52 that is based on the classic Michotte launching paradigm102. It uses simple synthetic two-dimensional (2D) depictions of two balls labelled ‘A’ and ‘B’ with arrows showing their trajectories in front of a white background (see Supplementary Fig. 3 for an example). We started by asking the models to determine the background colour of the image (Fig. 4a). Most models performed well, with the exception of Fuyu, which always answered ‘pink’ (probably because pink is mentioned as the colour of the gate in the prompt). Then, we asked models to infer the trajectory of ball movement. This proved challenging for most models (Fig. 4b), which is surprising given that the prompt explicitly mentions that the arrows in the stimuli depict the trajectory of the balls and the balls always move from right to left.

Fig. 4: Results for Michotte causal reasoning experiment.

a–d, We first ask for the background colour in the image (a), then the direction of ball movement (b), a judgement between 0 and 100 on whether ball ‘B’ goes through the gate (c) and finally a counterfactual judgement between 0 and 100 on whether ball ‘B’ would have gone through the gate, had ball ‘A’ not been present in the scene (d). The causal reasoning experiment was taken from ref. 52. Bars in plots a and b show percentage of correct answers with error bars given by the standard deviation of a binomial distribution (n = 18). Bars in plots c and d show the square root of the R2 values for Bayesian logistic mixed effects regressions with error bars given by the square root of the 95% percentiles for this R2 value (n = 252 and 234, respectively).


We then queried the models for their agreement on a scale from 0 to 100 with the following questions: either ‘Ball B went through the middle of the gate’ (if ball B entered the gate) or ‘Ball B completely missed the gate’ (if ball B missed the gate) (Fig. 4c). No model performs close to the human results reported in ref. 52. The best performing model is Fuyu with a regression coefficient of 0.26 (95% CI −0.08, 0.61) and an R2 value of 0.067. Interestingly, Claude-3 shows a negative relationship with human judgements, with a regression coefficient of −0.22 (95% CI −0.39, −0.06) and an R2 value of 0.076. The human-to-human regression coefficient is 0.85 (95% CI 0.69, 1.03) with an R2 value of 0.556.

Finally, we asked the models for their agreement on a scale from 0 to 100 with the counterfactual question of whether ‘Ball B would have gone through the gate had Ball A not been present in the scene’ (Fig. 4d). Notably, the closed-source models perform worse than some open-source models for both tasks. Here, Fuyu is again the best performing model with a regression coefficient of 0.42 (95% CI 0.28, 0.57) and an R2 value of 0.185. Pseudo r values for LLaMA-Adapter V2 and GPT-4V are missing, since the former gave only non-valid answers and the latter always responded with 100. The human-to-human regression coefficient is 0.85 (95% CI 0.76, 0.93) with an R2 value of 0.698.

Intuitive psychology with the astronaut task

As a first test for the intuitive psychology understanding of the different LLMs, we used synthetic images depicting an astronaut on a coloured background from ref. 103 (see Supplementary Figs. 4 and 5 for an example). The images featured different terrains and care packages. Depending on which terrain the astronaut crossed or which care package they chose to pick up, it was possible to infer the costs associated with the terrains and rewards associated with the care packages.

Again, we first tasked models with determining the background colour of the images. Here, the performance of the models was worse compared with the intuitive physics dataset (Fig. 5a), which might be due to the fact that the background colour here was not uniform (Supplementary Fig. 5). We then asked models to count the number of care packages in the scene. Most models except for GPT-4V struggled here (Fig. 5b).

Fig. 5: Results on astronaut task for intuitive psychology.

a,b, Again, we first ask for the background colour (a) and the number of boxes in the scene (b). c,d, Models are then asked to make inferences about the costs (c) and rewards (d) in an environment depending on the path an agent has taken. The tasks for intuitive psychology were taken from ref. 103. Regression coefficients for Fuyu and LLaMA-Adapter V2 are missing as they always responded with constant ratings for either cost or reward questions. Bars in plots a and b show percentage of correct answers with error bars given by the standard deviation of a binomial distribution (n = 16). Bars in plots c and d show the square root of the R2 values for Bayesian logistic mixed effects regressions with error bars given by the square root of the 95% percentiles for this R2 value (n = 81 and 70, respectively).


Afterwards, we asked them to infer the costs associated with the different terrains (Fig. 5c) and the rewards associated with different care packages (Fig. 5d). All models showed only weak relations with the average over human participants in their judgements about the costs and rewards associated with the environment. The regression coefficients of the models with the z-scaled mean over human participants ranged from −0.24 to 0.16 with R2 values between 0.025 and 0.04 for cost questions, and from −0.02 to 0.39 (Claude-3, 95% CI 0.11, 0.66) with R2 values between 0.015 and 0.110 for reward questions.

Intuitive psychology with the help or hinder task

The second intuitive psychology dataset is taken from ref. 104. This task shows a simple 2D depiction of two agents in a grid environment (see Supplementary Fig. 6 for an example). On each time step, the agents can move up, down, left or right, or stay in place, but cannot move through walls or boxes. The red agent has the objective of reaching a star within ten time steps; if it runs out of time, it fails. The blue agent has the objective of either helping or hindering the red agent by pushing or pulling boxes around.

We first asked models to determine the background colour in the scene and to determine the number of boxes in the scene (Fig. 6a,b). The closed-source models are able to perfectly determine the background colour (always white) but they nonetheless struggle with determining the number of boxes in the scene (always 1, 2 or 3). Model answers for the counting task ranged from 1 to 4, with only LLaMA-Adapter V2 giving constant responses of 2.

Fig. 6: Results on help or hinder task for intuitive psychology.

a–d, We first ask for the background colour in the image (a), then the number of boxes in the scene (b), a judgement between 0 and 100 on whether an agent in the scene tried to hinder the other agent (c) and finally a counterfactual judgement between 0 and 100 on whether an agent in the scene would have successfully reached the goal, had the other agent not been present (d). The intuitive psychology dataset was taken from ref. 104. Bars in plots a and b show percentage of correct answers with error bars given by the standard deviation of a binomial distribution (n = 24). Bars in plots c and d show the square root of the R2 values for Bayesian logistic mixed effects regressions with error bars given by the square root of the 95% percentiles for this R2 value (n = 1,200).


We then asked the models whether the blue agent tried to help or hinder the red agent (Fig. 6c). Here, Otter shows the highest regression coefficient with human answers with 0.19 (95% CI 0.13, 0.25) and an R2 value of 0.038. Claude-3 shows a negative relationship with human answers with a coefficient of −0.25 (95% CI −0.31, −0.20) and an R2 value of 0.066. No model showed coefficients even close to the human-to-human coefficient of 0.93 (95% CI 0.90, 0.96) with an R2 value of 0.858.

Finally, we asked the models whether the red agent would have succeeded in reaching the star, had the blue agent not been there. We show the square root of the R2 for the Bayesian linear mixed effects regression with 95% percentiles in Fig. 6d. Interestingly, the results here flip, with Otter now showing a stronger negative relationship with a coefficient of −0.40 (95% CI −0.47, −0.33) and an R2 value of 0.161 (this makes sense, since this task is essentially a counterfactual simulation question similar to Fig. 4d, where Otter already showed a negative relation to human judgements). GPT-4V and Claude-3 both show small positive regression coefficients with human answers: 0.15 (95% CI 0.09, 0.21) with an R2 value of 0.025, and 0.17 (95% CI 0.11, 0.23) with an R2 value of 0.032, respectively. Again, no model coefficient is close to the human-to-human coefficient of 0.83 (95% CI 0.80, 0.87) with an R2 value of 0.688.

Discussion

We started by asking whether, with the rise of modern LLMs, researchers have created machines that—at least to some degree—think like people. To address this question, we took five recent multimodal LLMs and probed their abilities in three core cognitive domains: intuitive physics, causal reasoning and intuitive psychology.

In intuitive physics and causal reasoning, the models managed to solve some of the given tasks and GPT-4V showed a slight match with human data. However, while they performed well in some tasks, the models did not show a conclusive match with human data for the causal reasoning experiments. Finally, in the intuitive psychology tasks, none of the models showed a strong match with human data. Thus, an appropriate answer to the question motivating our work would be ‘No’, or—perhaps more optimistically—‘Not quite’.

Although we have tried our best to give all models a fair chance and set up the experiments in a clean and replicable fashion, some shortcomings remain that should be addressed in future work. First of all, we have tested only a handful of multimodal models on just three cognitive domains. While we believe that the used models and tasks provide good insights into the state of the science of LLMs’ cognitive abilities, future studies should look at more domains and different models to further tease apart when and why LLMs can mimic human reasoning. For example, it would be interesting to see whether scale is the only important feature influencing model performance105,106. Currently, our evidence suggests that even smaller models, for example, Fuyu, with its 8 billion parameters, can perform as well as GPT-4V in some tasks. Additionally, we applied all models out of the box and without further fine-tuning. Future studies could attempt to fine-tune multimodal LLMs to better align with cognitive data107 and assess whether this improves their reasoning abilities more generally. Similar to other recent work108, we found that many models were already constrained in their basic visual processing. While the more powerful closed-source models performed more robustly on simple scene understanding tasks, we found that they still failed simple questions that would be trivial for human observers. Thus, we think that the models’ weak performance in some domains can partially be explained by their poor basic visual processing capabilities.

Another shortcoming of the current work is the simplicity of the used stimuli. While the block towers used in our first study were deliberately designed to be more realistic98 than commonly used psychological stimuli33, this was not true for the experiments in the other two domains. For the intuitive psychology experiments, in particular, we would expect the models to perform better if the stimuli contained more realistic images of people, which has been shown to work better in previous studies109. Interestingly, using more realistic stimuli can also change people’s causal judgements110; how realistic the stimuli used in cognitive experiments should be remains an open question111.

On a related point, we used only static images in our current experiments, which severely limits the breadth and level of detail of the questions we could ask. For example, some of the most canonical tasks investigating people’s causal reasoning abilities involve videos of colliding billiard balls52. As future LLMs will probably be able to answer questions about videos112, these tasks represent the next frontier of cognitively inspired benchmarks.

For the comparisons with human data, we used the participant data collected in the original studies for all experiments, except for the intuitive physics task, and assessed the correspondence between models and these data via a Bayesian mixed effects regression and R2 values. Future work could expand on this approach by collecting new data from human participants choosing which of the model’s judgements they prefer. This could lead to a more detailed comparison, similar to what has been proposed to discriminate among deep learning models for human vision113 and language114.

A crucial weakness of most studies using LLMs is that they can be sensitive to specific prompts115,116,117. While we have attempted to use prompts that elicited good behaviour, thereby giving LLMs a chance to perform well, future work could try to further optimize these prompts using available methods118,119,120, while also assessing how the models respond to paraphrased versions of the same tasks. We present an exploratory analysis of the effects of response constraints and context complexity on model behaviour in the intuitive psychology astronaut task in Supplementary Fig. 7. While response constraints and context complexity both influence model outputs, we also find that small variations to prompts on a character level can impact model behaviour, probably due to tokenization. Taken together, this shows that evaluations of visual LLMs are not only dependent on the specific models and experiments used, but also on the prompts and probably even on how these prompts are tokenized. While it could be possible to further engineer the used prompts, we believe that our current approach was sufficient to showcase these models’ abilities.

Our work has shown that multimodal LLMs have come a long way, showing some correspondence to human behaviour and often performing above chance. Moreover, machine learning researchers have put forward various ideas about how to close the remaining gap between humans and machines121, including self-supervised learning122, translating from natural into probabilistic languages123 or grounding LLMs in realistic environments124. This continuous evolution in models’ capabilities necessitates a re-evaluation of the metaphors and tools we use to understand them. We believe that cognitive science can offer tools, theories and benchmarks to evaluate how close we have come to ‘building machines that learn and think like people’.

Methods

Code

The open-source models were installed per the instructions in their respective GitHub or Huggingface repositories and evaluated on a Slurm-based cluster with a single A100. For the results reported as GPT-4V, we used the public ChatGPT interface and the OpenAI application programming interface (API), specifically the November 2023 release of the gpt-4-vision-preview model, which is available via the completions endpoint. For Claude-3, we used the Anthropic API. Code for replicating our results is available on GitHub (github.com/lsbuschoff/multimodal). All models were evaluated in Python using PyTorch125. Additional analyses were carried out using NumPy126, Pandas127 and SciPy128. Matplotlib129 and Seaborn130 were used for plotting. Bayesian mixed effects models were computed using brms131 in R132.

Models

Open-source

Fuyu is an 8 billion parameter multimodal text and image decoder-only transformer. We used the Huggingface implementation with standard settings and without further fine-tuning (available at https://huggingface.co/adept/fuyu-8b). The maximum number of generated tokens was set to 8 and responses were parsed by hand. Otter is a multimodal LLM that supports in-context instruction tuning and is based on the OpenFlamingo model. We used the Huggingface implementation of OTTER-Image-MPT7B (available at https://huggingface.co/luodian/OTTER-Image-MPT7B), again with standard settings and without fine-tuning. The maximum number of generated tokens was left at 512 and responses were parsed by hand. For LLaMA-Adapter V2, which adds adapters into LLaMA’s transformer to turn it into an instruction-following model, we used the GitHub implementation of llama-adapter-v2-multimodal7b with standard settings and again without further fine-tuning (available at https://github.com/OpenGVLab/LLaMA-Adapter/tree/main/llama_adapter_v2_multimodal7b). The maximum number of generated tokens was left at 512 and responses were parsed by hand.
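For illustration, querying Fuyu-8B through the Huggingface transformers library looks roughly as follows; the prompt and file name are placeholders, and the decoding details may differ slightly from the exact evaluation code.

```python
# Hedged sketch of a single Fuyu-8B query, roughly following the model card.
import torch
from PIL import Image
from transformers import FuyuForCausalLM, FuyuProcessor

processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
model = FuyuForCausalLM.from_pretrained(
    "adept/fuyu-8b", device_map="cuda:0", torch_dtype=torch.bfloat16
)

image = Image.open("tower.png")                        # placeholder stimulus image
prompt = "Is the depicted block tower stable or unstable?\n"  # illustrative prompt

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=8)    # 8 tokens, as described above
answer = processor.batch_decode(
    output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # responses were then parsed by hand
```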

Closed-source

We initially queried GPT-4V through the ChatGPT interface, since the OpenAI API was not publicly available at the outset of this project. The intuitive psychology task responses were collected using the gpt-4-vision-preview model variant after its November 2023 release in the API. We set the maximum number of generated tokens for a given prompt to 1 to get single numerical responses. All other parameters were set to their default values. Note that this model does not currently feature an option for manually setting the temperature, and the provided documentation does not specify what the default temperature is. We queried Claude-3 using the Anthropic API, using the model version claude-3-opus-20240229 with a temperature of zero and a maximum number of new tokens between 3 and 6 depending on the task.
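A sketch of the closed-source queries is shown below: a stimulus image is base64-encoded and sent together with a question to the OpenAI and Anthropic APIs. The prompt and file name are illustrative, and the request shapes follow the providers’ Python SDKs; they are an assumption rather than the verbatim evaluation code.

```python
# Hedged sketch of one GPT-4V and one Claude-3 Opus query via the official SDKs.
import base64

from anthropic import Anthropic
from openai import OpenAI


def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


image_b64 = encode_image("tower.png")                       # placeholder stimulus
prompt = "On a scale from 0 to 100, how stable is the depicted block tower?"

# GPT-4V: max_tokens=1 yields a single numerical response token.
gpt_response = OpenAI().chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=1,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(gpt_response.choices[0].message.content)

# Claude-3 Opus: temperature 0 and a small token budget, as described above.
claude_response = Anthropic().messages.create(
    model="claude-3-opus-20240229",
    max_tokens=6,
    temperature=0,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png",
                        "data": image_b64}},
            {"type": "text", "text": prompt},
        ],
    }],
)
print(claude_response.content[0].text)
```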

Datasets

Intuitive physics with block towers

We tested the intuitive physical understanding of the models using images from ref. 98. The photos depict a block tower consisting of coloured wooden blocks in front of a white fabric (see Supplementary Fig. 1 for an example). The images are of size 224 × 224. In the dataset, there are a total of 516 images of block towers. We tested the models on 100 randomly drawn images. We first tested the models on their high-level visual understanding of the scenes: we tasked them with determining the background colour and the colours of the blocks from top to bottom. To test their physical understanding, we tested them on the same task as the original study: we asked them to give a binary rating on the stability of the depicted block towers. For the first two tasks, we calculated the percentage of correct answers for each of the models. For the third task, we calculated a Bayesian logistic mixed effects regression between human and model answers.
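For the first two tasks, the reported numbers reduce to a proportion of correct answers with binomial error bars. The sketch below assumes that the error bars correspond to the standard deviation of the estimated binomial proportion (n = 100 images):

```python
# Sketch of the accuracy metric with binomial error bars, under the assumption
# stated in the lead-in.
import numpy as np


def accuracy_with_error(correct: np.ndarray):
    """correct: boolean array with one entry per image."""
    n = len(correct)
    p = float(np.mean(correct))               # proportion of correct answers
    sd = float(np.sqrt(p * (1.0 - p) / n))    # binomial standard deviation of the proportion
    return p, sd
```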

Due to the limited sample size of the original human experiment, we reran the human experiment from ref. 98 on Prolific with 107 participants (55 female and 52 male native English speakers with a mean age of 27.73 (s.d. = 4.21)). All participants agreed to take part in the study and were informed about the general purpose of the experiment. Experiments were performed in accordance with the relevant guidelines and regulations approved by the ethics committee of the University of Tübingen. Participants first saw an example trial, followed by 100 test images. In a two-alternative forced choice paradigm, participants were asked whether the block tower in a given image was stable or not stable. They were paid £1.50 and the median time they took to complete the experiment was 08:08 min, making the average base reward £11.07 per hour. Additionally, they received a bonus payment of up to £1 depending on their performance (1 penny for each correct answer).

Causal reasoning with Jenga

For the first causal reasoning experiment, we used images from ref. 100. The images show artificial block stacks of red and grey blocks on a black table (see Supplementary Fig. 2 for an example). The dataset consists of 42 images on which we tested all models. We again first tested the models on their high-level visual understanding of the scene and therefore tasked them with determining the number of blocks in the scene. The ground truth number of blocks in the scenes ranged from 6 to 19. Since this task is rather challenging due to the increased number of blocks, we do not report the percentage correct as for the intuitive physics dataset, but the mean over the absolute distance between model predictions and the ground truth for each image (Fig. 3a).

To test the causal reasoning of the models, we adopted the tasks performed in the original study100,101. We asked models to infer how many red blocks would fall if the grey block was removed. For this condition, Zhou and colleagues100 collected data from 42 participants. We again report the absolute distance between model predictions and the ground truth for each image (Fig. 3b). We calculate a random baseline which uses the mean between 0 and the number of blocks for each specific image as the prediction. We also ask the models for a rating between 0 and 100 for how responsible the grey block is for the stability of the tower. Here, data for 41 human participants were publicly available. For the number of blocks that would fall if the grey block was removed, we report the mean absolute distance to human judgements (Fig. 3c); for the responsibility ratings, we report a Bayesian mixed effects regression between model and human answers (Fig. 3d).

Causal reasoning with Michotte

For the second test for causal reasoning abilities, we used a task from ref. 52. It features 18 images which show a 2D view of two balls and their trajectories on a flat surface (see Supplementary Fig. 3 for an example). This experiment is a variation of the classic Michotte launching paradigm102, used to test visual causal perception. We again first tested the models on their high-level visual understanding of the scene: we first asked them to determine the background colour (Fig. 4a) and then the direction of ball movement (Fig. 4b) from the two options ‘left to right’ or ‘right to left’ (the balls always moved from right to left).

To test the causal reasoning of the models, we adopted the tasks performed in the original study. We asked models about the actual outcome of the scene: ‘Did ball B go through the gate?’ As in the original experiments, models had to indicate their agreement with this statement on a scale from 0 (not at all) to 100 (completely). We then also asked the counterfactual question: ‘Would ball B have gone through the gate had ball A not been present in the scene?’ The original authors52 collected the responses of 14 participants in the ‘outcome’ condition and 13 participants in the ‘counterfactual’ condition. We here report the regression between model and human responses (Fig. 4c,d).

Intuitive psychology with astronaut images task

To test the intuitive psychology of the different LLMs, we used stimuli from ref. 103. This part consisted of three different experiments, comprising 16, 17 and 14 images, respectively, showing a 2D depiction of an astronaut and care packages in different terrains (see Supplementary Figs. 4 and 5 for an example). To check their high-level understanding of the images, we again asked the models to determine the background colour of the images. Since this background colour is not uniform, we counted both ‘Pink’ and ‘Purple’ as correct answers. We report the percentage of correct answers for the background colour in Fig. 5a.

In accordance with the original study, analyses for the intuitive psychological capabilities of the models are split into cost questions (passing through a terrain is associated with a cost for the agent) and reward questions (collecting a care package yields some sort of reward for the agent). We pooled cost and reward questions over all three experiments and compared model answers with the data of 90 human participants collected in ref. 103 using Bayesian mixed effects regressions (Fig. 5c,d), as well as with a simple heuristic. This heuristic calculates the costs and rewards associated with the environment from the amount of time an agent spends in each terrain and which care package the agent collects.

Intuitive psychology with the help or hinder task

The second intuitive psychology experiment is taken from ref. 104. It consists of 24 images showing a 2D depiction of two agents in a grid world (see Supplementary Fig. 6 for an example). To check the models’ basic understanding of the images, we again asked the models to determine the background colour of the images and the number of boxes in the scene. We report the percentage of correct answers for both tasks in Fig. 6a,b.

We then asked the models whether the blue agent tried to help or hinder the red agent on a scale from ‘definitely hinder RED’ (0) to ‘definitely help RED’ (100), with the midpoint ‘unsure’ (50). We show the regression to human judgements in Fig. 6c. Finally, we asked the models the counterfactual question of whether the red agent would have succeeded in reaching the star had the blue agent not been there, on a scale from ‘not at all’ (0) to ‘very much’ (100). The original authors collected the responses of 50 participants for each of the two conditions (‘intention’ and ‘counterfactual’). We show the mixed linear regression coefficients between model and human answers for all models with 95% credible intervals in Fig. 6d.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
