Visual cognition in multimodal large language models

Main

People are quick to anthropomorphize, attributing human characteristics to non-human agents1. The tendency to anthropomorphize has only intensified with the advent of large language models (LLMs)2. LLMs apply deep learning techniques to generate text3, learning from vast datasets to produce responses that can be startlingly human-like4. Astonishingly, these models can not only generate text. When scaled up to larger training datasets and architectures, other, so-called ‘emergent abilities’ appear5,6. Current models can, for example, pass the bar exam7, write poems8, compose music9 and assist in programming and data analysis tasks10. As a result, the line between human and machine capabilities is increasingly blurred11,12. People not only interact with these systems as if they were humans13, but they also start to rely on them for complex decision-making14, artistic creation15 and personal interactions16. It is, therefore, natural to ask: Have we built machines that think like people?

Judging whether or not artificial agents can mimic human thought is at the core of cognitive science17,18. Therein, researchers have long debated the capabilities of artificially intelligent agents19,20,21. In a seminal paper, Lake and colleagues22 proposed core domains to consider when making such judgements. Published during the height of the deep learning revolution23, the authors focused on domains that were easy for people but difficult for deep learning models: intuitive physics, causal reasoning and intuitive psychology.

Research on intuitive physics has studied how people perceive and interpret physical phenomena24,25,26. Past work on this topic has emphasized that humans possess an innate ability to predict and understand the physical properties of objects and their interactions27, even from a young age28, a notion sometimes summarized as a ‘physics engine’ in people’s heads29. This understanding includes concepts such as gravity30, inertia31 and momentum32. Some of the most canonical tasks in this domain involve testing people’s judgements about the stability of block towers33,34. These tasks have made their way into machine learning benchmarks35,36, where they are used to test the intuitive physical understanding of neural networks (see ref. 37 for an overview of previous work on building models with human-like physical knowledge).

Research on causal reasoning has studied how individuals infer and think about cause–effect relationships38,39,40. Past work on this topic has proposed that humans possess an intuitive capacity to infer, understand and predict causal relationships in their environment41,42,43,44, oftentimes described using Bayesian models of causal learning45,46. This cognitive ability encompasses recognizing patterns47,48, inferring causes from interventions49,50, and predicting future events based on hypothetical events51. Canonical tasks in this domain often involve assessing individuals’ ability to infer causal relationships, for example, when judging the responsibility of one object causing other objects’ movement52,53. Causal reasoning remains a challenge, even for current machine learning approaches54,55.

Research on intuitive psychology has explored how individuals infer, understand and interpret social phenomena and mental states of other agents56,57. Past work on this topic has emphasized the concept that humans possess an inherent ability to infer and reason about the mental states58,59, intentions and emotions of others, often referred to as a ‘theory of mind’60,61. This ability has been modelled as a Bayesian inference problem62,63,64. Canonical tasks in this domain often involve assessing individuals’ capacity to predict actions based on understanding others’ perspectives or intentions, such as determining agents’ utility functions based on their actions in a given environment65,66. It is the subject of ongoing debates whether modern algorithms show any form of intuitive psychology67,68,69.

Lake and colleagues argued that some of these abilities act as ‘start-up software’, because they constitute cognitive capabilities present early in development. Moreover, they proposed that these so-called ‘intuitive theories’70,71 need to be expressed explicitly using the calculus of Bayesian inference72, as opposed to being learned from scratch, for example, via gradient descent. However, with the increase in abilities of current neural networks, in particular LLMs, we pondered: Can LLMs, in particular vision LLMs, sufficiently solve problems from these core domains?

To address this question, we took canonical tasks from the domains of intuitive physics, causal reasoning and intuitive psychology that could be studied by providing images and language-based questions, and submitted them to some of the most advanced LLMs currently available. To evaluate whether the LLMs show human-like performance in these domains, we followed the approach outlined in ref. 73: we treated the models as participants in psychological experiments. This allows us to draw direct comparisons with human behaviour on these tasks. Because the tasks are designed to test abilities in specific cognitive domains, this comparison allows us to investigate in which domains multimodal LLMs perform similarly to humans and in which they do not. Our results showed that these models can, at least partially, solve these tasks. In particular, two of the largest currently available models, OpenAI’s Generative Pre-trained Transformer (GPT-4) and Anthropic’s Claude-3 Opus, managed to perform robustly above chance in two of the three domains. Yet crucial differences emerged. First, none of the models matched human-level performance in any of the domains. Second, none of the models fully captured human behaviour, leaving room for domain-specific models of cognition such as the Bayesian models originally proposed for the tasks.

Related work

There have been a large number of studies on reasoning abilities in LLMs74,75,76. Previous studies have focused, among others, on testing LLMs’ cognitive abilities in model-based planning73, analogical reasoning tests77, exploration tasks78, systematic reasoning tests79,80, psycholinguistic completion studies81 and affordance understanding problems82. In this sense, our contribution can be seen as part of a larger movement in which researchers use methods from the behavioural sciences to understand black box machine learning models83,84,85. However, most of the previous studies did not investigate multimodal LLMs but rather remained in the pure language domain. Although there have been recent attempts to investigate vision LLMs’ cognitive features, including their reaction to visual illusions86 as well as how they solve simple intelligence tasks87, we instead investigate the proposed core components of cognition in these models.

Previous work has also examined how LLMs solve cognitive tasks taken from the same domains we study here. In intuitive physics, Zečević and colleagues88 found that LLMs performed poorly in a task using language descriptions of physical scenarios. Zhang and colleagues89 extracted programs from text produced by LLMs to improve their physical reasoning abilities. Finally, Jassim and colleagues90 proposed a new benchmark for evaluating multimodal LLMs’ understanding of situated physics. In causal reasoning, Binz and Schulz73 showed that GPT-3 failed at simple causal reasoning experiments, while Kosoy and colleagues91 showed that LLMs cannot learn human-like causal over-hypotheses. In research on intuitive psychology, Kosinski argued that theory of mind might have emerged in LLMs68, which has been criticized by other researchers69. Akata and colleagues showed that GPT-4 plays repeated games very selfishly and could not pick up on simple conventions such as alternating between options16. Finally, Gandhi and colleagues92 proposed a framework for procedurally generating theory of mind evaluations and found that GPT-4’s abilities mirror human inference patterns, although less reliably, while all other LLMs struggled.

Many past studies on LLMs have run the risk of their materials appearing in new models’ training sets. Recent work has recognized this issue and, in turn, evaluated language models on many problem variations to minimize training set effects93. Our work differs from these approaches in that current models could not simply have memorized solutions to the given problems, because these problems require higher-level reasoning. Furthermore, the human data and ground truth are most commonly stored in additional data files, which first have to be extracted and matched to the respective images to be used for model training. Since this requires data wrangling that cannot easily be automated, and the number of stimuli to be gained is so small, it is extremely unlikely that these stimuli together with the ground truth entered the training set of any of the investigated models.

Results

We tested five different models on three core components of human-like intelligence as outlined in ref. 22 (Fig. 1a). The models we used are vision LLMs, which are multimodal models that integrate image processing capabilities into LLMs94,95 (Fig. 1c). These models allow users to perform visual question answering96,97: users can upload an image and ask questions about it, which the model interprets and responds to accordingly.

Fig. 1: Overview of domains, tasks, approach and models.

a, Example images for the different experiments. Each experiment was taken from one of three cognitive domains: intuitive physics, causal reasoning and intuitive psychology. b, General approach. For every query, an image was submitted to the model, and different questions were asked about the image, that is, we performed visual question answering. c, The multimodal LLMs used and their sizes. MLLM, multimodal LLM.


To test the three core components, we used tasks from the cognitive science literature that could be studied in vision LLMs via visual question answering. For every task, we queried the visual reasoning abilities of the LLMs with tasks of increasing complexity. First, we asked about simple features of the shown images such as the background colour or the number of objects shown. Afterwards, we submitted questions taken from the cognitive science experiments. We report results based on comparisons with the ground truth as well as the different models’ matches to human data.
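The evaluation procedure can be summarized as a loop over images and questions of increasing complexity. The sketch below is a minimal illustration of this setup; the query_model wrapper is hypothetical (a stand-in for the concrete open- and closed-source interfaces described in the Methods) and the question strings are illustrative rather than the exact prompts used.

```python
# Minimal sketch of the visual question answering loop described above.
# `query_model` is a hypothetical wrapper around whichever vision LLM is being
# evaluated; it is not part of any real library.
from pathlib import Path


def query_model(image_path: str, question: str) -> str:
    """Placeholder: send one image plus one question to a vision LLM."""
    raise NotImplementedError("plug in an open- or closed-source model here")


def run_task(image_dir: str, questions: list[str]) -> dict[str, list[str]]:
    """Ask every question (from simple scene queries to the cognitive task)
    about every image and collect the raw text answers."""
    answers: dict[str, list[str]] = {}
    for image_path in sorted(Path(image_dir).glob("*.png")):
        answers[image_path.name] = [
            query_model(str(image_path), q) for q in questions
        ]
    return answers


# Illustrative question set, ordered from simple scene features to the task itself.
questions = [
    "What is the background colour of the image?",
    "How many blocks are shown in the image?",
    "Is the depicted block tower stable or unstable?",
]
```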

Intuitive physics with block towers

To test the intuitive physics capabilities of the different LLMs, we used photographs depicting wooden block towers from ref. 98 (see Supplementary Fig. 1 for an example). We first asked models to determine the background colour of the image. All models achieved almost perfect accuracy (Fig. 2a). We then asked models to state the colour of blocks from top to bottom. Here, the performance of all models except GPT-4V and Claude-3 deteriorated (Fig. 2b). Note that the first two tasks are fairly trivial for humans, and we would expect human performance to be at 100% (the background colour is always white and images featured two, three or four blocks in primary colours).

Fig. 2: Results for five vision LLMs for tasks of increasing complexity given images of real block towers.

a–c, We first ask for the background colour in the image (a) (images were taken from ref. 98), then the colour of blocks from top to bottom (b) and finally a binary stability rating for the block towers (c). d, The last plot shows the square root of the R2 value for the Bayesian logistic mixed effects regression between models and human participants. Bars in plots a–c show percentage of correct answers with error bars given by the standard deviation of a binomial distribution (n = 100). Bars in plot d show the square root of the R2 values for Bayesian logistic mixed effects regressions with error bars given by the square root of the 95% percentiles for this R2 value (n = 10,700, number of images times number of human participants).


To test the models’ physical reasoning abilities, we asked them to give a binary stability judgement of the depicted block towers. Here, only GPT-4V and Claude-3 performed numerically above chance (Fig. 2c), and only GPT-4V did so significantly (Fisher’s exact test yielded an odds ratio of 2.597 with a one-sided P value of 0.028). None of the other models performed significantly above chance (the second best performing model, Claude-3, had an odds ratio of 2.016, with a one-sided P value of 0.078). Human participants were also not perfect but showed an average accuracy of 65.608%.
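As an illustration, the above-chance comparison can be sketched as follows: cross-tabulate the binary stability predictions against the ground truth and run a one-sided Fisher’s exact test. The exact construction of the contingency table is an assumption here; only the test itself is taken from the text.

```python
# Hedged sketch of the above-chance test for binary stability judgements.
import numpy as np
from scipy.stats import fisher_exact


def above_chance_test(predictions: np.ndarray, ground_truth: np.ndarray):
    """predictions, ground_truth: boolean arrays ('stable' = True), one entry per image."""
    # 2x2 contingency table of predicted vs actual stability.
    table = np.array([
        [np.sum(predictions & ground_truth), np.sum(predictions & ~ground_truth)],
        [np.sum(~predictions & ground_truth), np.sum(~predictions & ~ground_truth)],
    ])
    # One-sided test: odds ratio greater than 1 indicates above-chance agreement.
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return odds_ratio, p_value
```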

Finally, we determined the relationship between models’ and humans’ stability judgements using a Bayesian logistic mixed effects regression. We compute a Bayesian R2 for each regression model based on draws from the modelled residual variances99. We then take the square root of this Bayesian R2 and multiply it by the sign of the main regression coefficient to arrive at a pseudo r value. Around this pseudo r value we plot the square root of the 95% percentiles for the R2 value (Fig. 2d). We found that GPT-4V was the only model that showed a relation to human judgements, with a regression coefficient of 1.15 (95% credible interval (95% CI) 1.04, 1.27) and an R2 value of 0.066. However, the regression coefficient between individual humans and the mean over humans was still larger, with a coefficient of 1.46 (95% CI 1.41, 1.52) and an R2 value of 0.354.
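A minimal sketch of this pseudo r computation, assuming the posterior draws of the Bayesian R2 and of the main regression coefficient have already been extracted from the fitted mixed effects model (brms in the actual analysis):

```python
# Sketch of the pseudo r value: signed square root of the Bayesian R^2,
# with error-bar endpoints from the 95% percentiles of the R^2 draws.
import numpy as np


def pseudo_r(r2_draws: np.ndarray, coef_draws: np.ndarray):
    sign = np.sign(np.mean(coef_draws))        # sign of the main regression coefficient
    r = sign * np.sqrt(np.mean(r2_draws))      # pseudo r value
    lower, upper = np.sqrt(np.percentile(r2_draws, [2.5, 97.5]))
    return r, lower, upper
```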

Causal reasoning with Jenga

To test the models’ causal reasoning capabilities, we used synthetic images from refs. 100,101, which depicted block towers that were stable but might collapse if one of the blocks was removed (see Supplementary Fig. 2 for an example). We started by asking the models to count the blocks in the image. The images in this task displayed a larger number of blocks (ranging from 6 to 19), which made the basic counting task significantly more challenging than in the previous section. Models’ responses approximated the ground truth, albeit rarely matching it exactly. Therefore, we report the mean absolute distance to the ground truth instead of the percentage of correct answers (Fig. 3a). The models’ performance highlighted the challenging nature of this task, with the best performing model (Claude-3) still being on average more than one block off.

Fig. 3: Results for Jenga causal reasoning experiment.

a–d, We first ask for the number of blocks in the image (a), then we ask for the number of blocks that would fall if a specific block is removed and compute the absolute distance to the ground truth (b) as well as the absolute distance to human judgements (c), and finally a rating between 0 and 100 for how responsible a specific block is for the stability of the tower (d). The causal reasoning experiment was taken from ref. 100. For the responsibility ratings, all LLMs except for GPT-4V give constant ratings: Fuyu and Claude-3 always respond with 100, while Otter and LLaMA-Adapter V2 always respond with 50. Bars in plots a and b show absolute distance to ground truth with error bars given by the standard error of the mean (n = 42). Bars in plot c show the distance to human answers with error bars again given by the standard error of the mean (n = 41). Bars in plot d show the square root of the R2 values for Bayesian logistic mixed effects regressions with error bars given by the square root of the 95% percentiles for this R2 value (n = 1,470).


We continued by querying the models for the number of blocks that would fall if a specific block was removed from the scene (Fig. 3b,c). We established a baseline performance, represented by a horizontal line in Fig. 3b,c, which corresponds to a random agent that gave the mean between 0 and the number of blocks in each image as its prediction, essentially behaving like a uniform distribution over the possible number of blocks that could fall. Notably, both GPT-4V and Fuyu-8B surpassed the random baseline, their performance levels being close to the human results reported in ref. 100, which are depicted by the rightmost bar in the plot. However, GPT-4V still diverged significantly from the average over human participants (t(42) = 2.59, P < 0.05).
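The random baseline and the error metric described above amount to a few lines of NumPy; the following sketch assumes arrays of per-image predictions, ground-truth counts and total block counts:

```python
# Sketch of the Fig. 3 error metric and random baseline: the baseline predicts
# the mean of a uniform distribution over 0..n_blocks for each image, and
# performance is the mean absolute distance to the ground truth.
import numpy as np


def mean_absolute_distance(predictions: np.ndarray, ground_truth: np.ndarray) -> float:
    return float(np.mean(np.abs(predictions - ground_truth)))


def random_baseline(n_blocks_per_image: np.ndarray, ground_truth: np.ndarray) -> float:
    baseline_predictions = n_blocks_per_image / 2.0   # mean between 0 and n_blocks
    return mean_absolute_distance(baseline_predictions, ground_truth)
```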

Finally, we asked the models to rate the responsibility of a specific block for the stability of the other blocks (Fig. 3d). Notably, all models except for GPT-4V gave constant ratings for this task (Fuyu and Claude-3 always responded with 100, while Otter and LLaMA-Adapter V2 always responded with 50). The regression coefficient for GPT-4V with human values is 0.16 (95% CI 0.10, 0.21) with an R2 value of 0.027. The human-to-human regression has a coefficient of 0.54 (95% CI 0.45, 0.63) and an R2 value of 0.268.

Causal reasoning with Michotte

For the second test of causal reasoning abilities, we ran an experiment from ref. 52 that is based on the classic Michotte launching paradigm102. It uses simple synthetic two-dimensional (2D) depictions of two balls labelled ‘A’ and ‘B’ with arrows showing their trajectories in front of a white background (see Supplementary Fig. 3 for an example). We started by asking the models to determine the background colour of the image (Fig. 4a). Most models performed well, with the exception of Fuyu, which always answered ‘pink’ (probably because pink is mentioned as the colour of the gate in the prompt). Then, we asked models to infer the trajectory of ball movement. This proved challenging for most models (Fig. 4b), which is surprising given that the prompt explicitly mentions that the arrows in the stimuli depict the trajectory of the balls and the balls always move from right to left.

Fig. 4: Results for Michotte causal reasoning experiment.

a–d, We first ask for the background colour in the image (a), then the direction of ball movement (b), a judgement between 0 and 100 on whether ball ‘B’ goes through the gate (c) and finally a counterfactual judgement between 0 and 100 on whether ball ‘B’ would have gone through the gate, had ball ‘A’ not been present in the scene (d). The causal reasoning experiment was taken from ref. 52. Bars in plots a and b show percentage of correct answers with error bars given by the standard deviation of a binomial distribution (n = 18). Bars in plots c and d show the square root of the R2 values for Bayesian logistic mixed effects regressions with error bars given by the square root of the 95% percentiles for this R2 value (n = 252 and 234, respectively).


We then queried the models for their agreement on a scale from 0 to 100 with the following questions: either ‘Ball B went through the middle of the gate’ (if ball B entered the gate) or ‘Ball B completely missed the gate’ (if ball B missed the gate) (Fig. 4c). No model performs close to the human results reported in ref. 52. The best performing model is Fuyu with a regression coefficient of 0.26 (95% CI −0.08, 0.61) and an R2 value of 0.067. Interestingly, Claude-3 shows a negative relationship with human judgements, with a regression coefficient of −0.22 (95% CI −0.39, −0.06) and an R2 value of 0.076. The human-to-human regression coefficient is 0.85 (95% CI 0.69, 1.03) with an R2 value of 0.556.

Finally, we asked the models for their agreement on a scale from 0 to 100 with the counterfactual question of whether ‘Ball B would have gone through the gate had Ball A not been present in the scene’ (Fig. 4d). Notably, the closed-source models perform worse than some open-source models for both tasks. Here, Fuyu is again the best performing model with a regression coefficient of 0.42 (95% CI 0.28, 0.57) and an R2 value of 0.185. Pseudo r values for LLaMA-Adapter V2 and GPT-4V are missing, since the former gave only non-valid answers and the latter always responded with 100. The human-to-human regression coefficient is 0.85 (95% CI 0.76, 0.93) with an R2 value of 0.698.

Intuitive psychology with the astronaut task

As a first test for the intuitive psychology understanding of the different LLMs, we used synthetic images depicting an astronaut on a coloured background from ref. 103 (see Supplementary Figs. 4 and 5 for an example). The images featured different terrains and care packages. Depending on which terrain the astronaut crossed or which care package they chose to pick up, it was possible to infer the costs associated with the terrains and rewards associated with the care packages.

Again, we first tasked models with determining the background colour of the images. Here, the performance of the models was worse compared with the intuitive physics dataset (Fig. 5a), which might be due to the fact that the background colour here was not uniform (Supplementary Fig. 5). We then asked models to count the number of care packages in the scene. Most models except for GPT-4V struggled here (Fig. 5b).

Fig. 5: Results on astronaut task for intuitive psychology.

a,b, Again, we first ask for the background colour (a) and the number of boxes in the scene (b). c,d, Models are then asked to make inferences about the costs (c) and rewards (d) in an environment depending on the path an agent has taken. The tasks for intuitive psychology were taken from ref. 103. Regression coefficients for Fuyu and LLaMA-Adapter V2 are missing as they always responded with constant ratings for either cost or reward questions. Bars in plots a and b show percentage of correct answers with error bars given by the standard deviation of a binomial distribution (n = 16). Bars in plots c and d show the square root of the R2 values for Bayesian logistic mixed effects regressions with error bars given by the square root of the 95% percentiles for this R2 value (n = 81 and 70, respectively).


Afterwards, we asked them to infer the costs associated with the different terrains (Fig. 5c) and the rewards associated with different care packages (Fig. 5d). All models showed only weak relations with the average over human participants in their judgements about the costs and rewards associated with the environment. The regression coefficients of the models with the z-scaled mean over human participants ranged from −0.24 to 0.16 with R2 values between 0.025 and 0.04 for cost questions, and from −0.02 to 0.39 (Claude-3, 95% CI 0.11, 0.66) with R2 values between 0.015 and 0.110 for reward questions.

Intuitive psychology with the help or hinder task

The second intuitive psychology dataset is taken from ref. 104. This task shows a simple 2D depiction of two agents in a grid environment (see Supplementary Fig. 6 for an example). On each time step, the agents can move up, down, left or right, or stay in place, but cannot move through walls or boxes. The red agent has the objective of reaching a star within ten time steps; if it runs out of time, it fails. The blue agent has the objective of either helping or hindering the red agent by pushing or pulling boxes around.

We first asked models to determine the background colour in the scene and to determine the number of boxes in the scene (Fig. 6a,b). The closed-source models are able to perfectly determine the background colour (always white) but they nonetheless struggle with determining the number of boxes in the scene (always 1, 2 or 3). Model answers for the counting task ranged from 1 to 4, with only LLaMA-Adapter V2 giving constant responses of 2.

Fig. 6: Results on help or hinder task for intuitive psychology.

a–d, We first ask for the background colour in the image (a), then the number of boxes in the scene (b), a judgement between 0 and 100 on whether an agent in the scene tried to hinder the other agent (c) and finally a counterfactual judgement between 0 and 100 on whether an agent in the scene would have successfully reached the goal, had the other agent not been present (d). The intuitive psychology dataset was taken from ref. 104. Bars in plots a and b show percentage of correct answers with error bars given by the standard deviation of a binomial distribution (n = 24). Bars in plots c and d show the square root of the R2 values for Bayesian logistic mixed effects regressions with error bars given by the square root of the 95% percentiles for this R2 value (n = 1,200).


We then asked the models whether the blue agent tried to help or hinder the red agent (Fig. 6c). Here, Otter shows the highest regression coefficient with human answers with 0.19 (95% CI 0.13, 0.25) and an R2 value of 0.038. Claude-3 shows a negative relationship with human answers with a coefficient of −0.25 (95% CI −0.31, −0.20) and an R2 value of 0.066. No model showed coefficients even close to the human-to-human coefficient of 0.93 (95% CI 0.90, 0.96) with an R2 value of 0.858.

Finally, we asked the models whether the red agent would have succeeded in reaching the star, had the blue agent not been there. We show the square root of the R2 for the Bayesian linear mixed effects regression with 95% percentiles in Fig. 6d. Interestingly, the results here flip, with Otter now showing a stronger negative relationship with a coefficient of −0.40 (95% CI −0.47, −0.33) and an R2 value of 0.161 (this makes sense, since this task is essentially a counterfactual simulation question similar to Fig. 4d, where Otter already showed a negative relation to human judgements). GPT-4V and Claude-3 both show small positive regression coefficients with human answers: 0.15 (95% CI 0.09, 0.21) with an R2 value of 0.025, and 0.17 (95% CI 0.11, 0.23) with an R2 value of 0.032, respectively. Again, no model coefficient is close to the human-to-human coefficient of 0.83 (95% CI 0.80, 0.87) with an R2 value of 0.688.

Discussion

We started by asking whether, with the rise of modern LLMs, researchers have created machines that—at least to some degree—think like people. To address this question, we took five recent multimodal LLMs and probed their abilities in three core cognitive domains: intuitive physics, causal reasoning and intuitive psychology.

In intuitive physics and causal reasoning, the models managed to solve some of the given tasks and GPT-4V showed a slight match with human data. However, while they performed well in some tasks, the models did not show a conclusive match with human data for the causal reasoning experiments. Finally, in the intuitive psychology tasks, none of the models showed a strong match with human data. Thus, an appropriate answer to the question motivating our work would be ‘No’, or—perhaps more optimistically—‘Not quite’.

Although we have tried our best to give all models a fair chance and set up the experiments in a clean and replicable fashion, some shortcomings remain that should be addressed in future work. First of all, we have tested only a handful of multimodal models on just three cognitive domains. While we believe that the used models and tasks provide good insights into the state of the science of LLMs’ cognitive abilities, future studies should look at more domains and different models to further tease apart when and why LLMs can mimic human reasoning. For example, it would be interesting to see whether scale is the only important feature influencing model performance105,106. Currently, our evidence suggests that even smaller models, for example, Fuyu, with its 8 billion parameters, can perform as well as GPT-4V in some tasks. Additionally, we applied all models out of the box and without further fine-tuning. Future studies could attempt to fine-tune multimodal LLMs to better align with cognitive data107 and assess whether this improves their reasoning abilities more generally. Similar to other recent work108, we found that many models were already constrained in their basic visual processing. While the more powerful closed-source models performed more robustly on simple scene understanding tasks, we found that they still failed simple questions that would be trivial for human observers. Thus, we think that the models’ weak performance in some domains can partially be explained by their poor basic visual processing capabilities.

Another shortcoming of the current work is the simplicity of the used stimuli. While the block towers used in our first study were deliberately designed to be more realistic98 than commonly used psychological stimuli33, this was not true for the experiments in the other two domains. For the intuitive psychology experiments, in particular, we would expect the models to perform better if the stimuli contained more realistic images of people, which has been shown to work better in previous studies109. Interestingly, using more realistic stimuli can also change people’s causal judgements110; how realistic the stimuli used in cognitive experiments should be remains an open question111.

On a related point, we used only static images in our current experiments, which severely limits the breadth and level of detail of the questions we could ask. For example, some of the most canonical tasks investigating people’s causal reasoning abilities involve videos of colliding billiard balls52. As future LLMs will probably be able to answer questions about videos112, these tasks represent the next frontier of cognitively inspired benchmarks.

For the comparisons with human data, we used the participant data collected in the original studies for all experiments, except for the intuitive physics task, and assessed the correspondence between models and these data via a Bayesian mixed effects regression and R2 values. Future work could expand on this approach by collecting new data from human participants choosing which of the model’s judgements they prefer. This could lead to a more detailed comparison, similar to what has been proposed to discriminate among deep learning models for human vision113 and language114.

A crucial weakness of most studies using LLMs is that they can be sensitive to specific prompts115,116,117. While we have attempted to use prompts that elicited good behaviour, thereby giving LLMs a chance to perform well, future work could try to further optimize these prompts using available methods118,119,120, while also assessing how the models respond to paraphrased versions of the same tasks. We present an exploratory analysis of the effects of response constraints and context complexity on model behaviour in the intuitive psychology astronaut task in Supplementary Fig. 7. While response constraints and context complexity both influence model outputs, we also find that small variations to prompts on a character level can impact model behaviour, probably due to tokenization. Taken together, this shows that evaluations of visual LLMs are not only dependent on the specific models and experiments used, but also on the prompts and probably even on how these prompts are tokenized. While it could be possible to further engineer the used prompts, we believe that our current approach was sufficient to showcase these models’ abilities.

Our work has shown that multimodal LLMs have come a long way, showing some correspondence to human behaviour and often performing above chance. Moreover, machine learning researchers have put forward various ideas about how to close the remaining gap between humans and machines121, including self-supervised learning122, translating from natural into probabilistic languages123 or grounding LLMs in realistic environments124. This continuous evolution in models’ capabilities necessitates a re-evaluation of the metaphors and tools we use to understand them. We believe that cognitive science can offer tools, theories and benchmarks to evaluate how close we have come to ‘building machines that learn and think like people’.

Methods

Code

The open-source models were installed per the instructions in their respective GitHub or Huggingface repositories and evaluated on a Slurm-based cluster with a single A100. For the results reported as GPT-4V, we used the public ChatGPT interface and the OpenAI application programming interface (API), specifically the November 2023 release of the gpt-4-vision-preview model, which is available via the completions endpoint. For Claude-3, we used the Anthropic API. Code for replicating our results is available on GitHub (github.com/lsbuschoff/multimodal). All models were evaluated in Python using PyTorch125. Additional analyses were carried out using NumPy126, Pandas127 and SciPy128. Matplotlib129 and Seaborn130 were used for plotting. Bayesian mixed effects models were computed using brms131 in R132.

Models

Open-source

Fuyu is an 8 billion parameter multimodal text and image decoder-only transformer. We used the Huggingface implementation with standard settings and without further fine-tuning (available at https://huggingface.co/adept/fuyu-8b). The maximum number of generated tokens was set to 8 and responses were parsed by hand. Otter is a multimodal LLM that supports in-context instruction tuning and is based on the OpenFlamingo model. We used the Huggingface implementation of OTTER-Image-MPT7B (available at https://huggingface.co/luodian/OTTER-Image-MPT7B), again with standard settings and without fine-tuning. The maximum number of generated tokens was left at 512 and responses were parsed by hand. For LLaMA-Adapter V2, which adds adapters into LLaMA’s transformer to turn it into an instruction-following model, we used the GitHub implementation of llama-adapter-v2-multimodal7b with standard settings and again without further fine-tuning (available at https://github.com/OpenGVLab/LLaMA-Adapter/tree/main/llama_adapter_v2_multimodal7b). The maximum number of generated tokens was left at 512 and responses were parsed by hand.
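For illustration, querying Fuyu-8B through the Huggingface transformers library looks roughly as follows; the prompt and file name are placeholders, and the decoding details may differ slightly from the exact evaluation code.

```python
# Hedged sketch of a single Fuyu-8B query, roughly following the model card.
import torch
from PIL import Image
from transformers import FuyuForCausalLM, FuyuProcessor

processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
model = FuyuForCausalLM.from_pretrained(
    "adept/fuyu-8b", device_map="cuda:0", torch_dtype=torch.bfloat16
)

image = Image.open("tower.png")                        # placeholder stimulus image
prompt = "Is the depicted block tower stable or unstable?\n"  # illustrative prompt

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=8)    # 8 tokens, as described above
answer = processor.batch_decode(
    output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # responses were then parsed by hand
```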

Closed-source

We initially queried GPT-4V through the ChatGPT interface, since the OpenAI API was not publicly available at the outset of this project. The intuitive psychology task responses were collected using the gpt-4-vision-preview model variant after its November 2023 release in the API. We set the maximum number of generated tokens for a given prompt to 1 to get single numerical responses. All other parameters were set to their default values. Note that this model does not currently feature an option for manually setting the temperature, and the provided documentation does not specify what the default temperature is. We queried Claude-3 using the Anthropic API, using the model version claude-3-opus-20240229 with a temperature of zero and a maximum number of new tokens between 3 and 6 depending on the task.
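A sketch of the closed-source queries is shown below: a stimulus image is base64-encoded and sent together with a question to the OpenAI and Anthropic APIs. The prompt and file name are illustrative, and the request shapes follow the providers’ Python SDKs; they are an assumption rather than the verbatim evaluation code.

```python
# Hedged sketch of one GPT-4V and one Claude-3 Opus query via the official SDKs.
import base64

from anthropic import Anthropic
from openai import OpenAI


def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


image_b64 = encode_image("tower.png")                       # placeholder stimulus
prompt = "On a scale from 0 to 100, how stable is the depicted block tower?"

# GPT-4V: max_tokens=1 yields a single numerical response token.
gpt_response = OpenAI().chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=1,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(gpt_response.choices[0].message.content)

# Claude-3 Opus: temperature 0 and a small token budget, as described above.
claude_response = Anthropic().messages.create(
    model="claude-3-opus-20240229",
    max_tokens=6,
    temperature=0,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png",
                        "data": image_b64}},
            {"type": "text", "text": prompt},
        ],
    }],
)
print(claude_response.content[0].text)
```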

Datasets

Intuitive physics with block towers

We tested the intuitive physical understanding of the models using images from ref. 98. The photos depict a block tower consisting of coloured wooden blocks in front of a white fabric (see Supplementary Fig. 1 for an example). The images are of size 224 × 224. In the dataset, there are a total of 516 images of block towers. We tested the models on 100 randomly drawn images. We first tested the models on their high-level visual understanding of the scenes: we tasked them with determining the background colour and the colours of the blocks from top to bottom. To test their physical understanding, we tested them on the same task as the original study: we asked them to give a binary rating on the stability of the depicted block towers. For the first two tasks, we calculated the percentage of correct answers for each of the models. For the third task, we calculated a Bayesian logistic mixed effects regression between human and model answers.
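For the first two tasks, the reported numbers reduce to a proportion of correct answers with binomial error bars. The sketch below assumes that the error bars correspond to the standard deviation of the estimated binomial proportion (n = 100 images):

```python
# Sketch of the accuracy metric with binomial error bars, under the assumption
# stated in the lead-in.
import numpy as np


def accuracy_with_error(correct: np.ndarray):
    """correct: boolean array with one entry per image."""
    n = len(correct)
    p = float(np.mean(correct))               # proportion of correct answers
    sd = float(np.sqrt(p * (1.0 - p) / n))    # binomial standard deviation of the proportion
    return p, sd
```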

Due to the limited sample size of the original human experiment, we reran the human experiment from ref. 98 on Prolific with 107 participants (55 female and 52 male native English speakers with a mean age of 27.73 (s.d. = 4.21)). All participants agreed to take part in the study and were informed about the general purpose of the experiment. Experiments were performed in accordance with the relevant guidelines and regulations approved by the ethics committee of the University of Tübingen. Participants first saw an example trial, followed by 100 test images. In a two-alternative forced choice paradigm, participants were asked whether the block tower in a given image was stable or not stable. They were paid £1.50 and the median time they took to complete the experiment was 08:08 min, making the average base reward £11.07 per hour. Additionally, they received a bonus payment of up to £1 depending on their performance (1 penny for each correct answer).

Causal reasoning with Jenga

For the first causal reasoning experiment, we used images from ref. 100. The images show artificial block stacks of red and grey blocks on a black table (see Supplementary Fig. 2 for an example). The dataset consists of 42 images on which we tested all models. We again first tested the models on their high-level visual understanding of the scene and therefore tasked them with determining the number of blocks in the scene. The ground truth number of blocks in the scenes ranged from 6 to 19. Since this task is rather challenging due to the increased number of blocks, we do not report the percentage correct as for the intuitive physics dataset, but the mean over the absolute distance between model predictions and the ground truth for each image (Fig. 3a).

To test the causal reasoning of the models, we adopted the tasks performed in the original study100,101. We asked models to infer how many red blocks would fall if the grey block was removed. For this condition, Zhou and colleagues100 collected data from 42 participants. We again report the absolute distance between model predictions and the ground truth for each image (Fig. 3b). We calculate a random baseline which uses the mean between 0 and the number of blocks for each specific image as the prediction. We also ask the models for a rating between 0 and 100 for how responsible the grey block is for the stability of the tower. Here, data for 41 human participants were publicly available. For the number of blocks that would fall if the grey block was removed, we report the mean absolute distance to human judgements (Fig. 3c); for the responsibility ratings, we report a Bayesian mixed effects regression between model and human answers (Fig. 3d).

Causal reasoning with Michotte

For the second test for causal reasoning abilities, we used a task from ref. 52. It features 18 images which show a 2D view of two balls and their trajectories on a flat surface (see Supplementary Fig. 3 for an example). This experiment is a variation of the classic Michotte launching paradigm102, used to test visual causal perception. We again first tested the models on their high-level visual understanding of the scene: we first asked them to determine the background colour (Fig. 4a) and then the direction of ball movement (Fig. 4b) from the two options ‘left to right’ or ‘right to left’ (the balls always moved from right to left).

To test the causal reasoning of the models, we adopted the tasks performed in the original study. We asked models about the actual outcome of the scene: ‘Did ball B go through the gate?’ As in the original experiments, models had to indicate their agreement with this statement on a scale from 0 (not at all) to 100 (completely). We then also asked the counterfactual question: ‘Would ball B have gone through the gate had ball A not been present in the scene?’ The original authors52 collected the responses of 14 participants in the ‘outcome’ condition and 13 participants in the ‘counterfactual’ condition. We here report the regression between model and human responses (Fig. 4c,d).

Intuitive psychology with astronaut images task

To test the intuitive psychology of the different LLMs, we used stimuli from ref. 103. This part consisted of three different experiments, comprising 16, 17 and 14 images, respectively, showing a 2D depiction of an astronaut and care packages in different terrains (see Supplementary Figs. 4 and 5 for an example). To check their high-level understanding of the images, we again asked the models to determine the background colour of the images. Since this background colour is not uniform, we counted both ‘Pink’ and ‘Purple’ as correct answers. We report the percentage of correct answers for the background colour in Fig. 5a.

In accordance with the original study, analyses for the intuitive psychological capabilities of the models are split into cost questions (passing through a terrain is associated with a cost for the agent) and reward questions (collecting a care package yields some sort of reward for the agent). We pooled cost and reward questions over all three experiments and compared model answers with the data of 90 human participants collected in ref. 103 using Bayesian mixed effects regressions (Fig. 5c,d), as well as with a simple heuristic. This heuristic calculates the costs and rewards associated with the environment from the amount of time an agent spends in each terrain and which care package the agent collects.

Intuitive psychology with the help or hinder task

The second intuitive psychology experiment is taken from ref. 104. It consists of 24 images showing a 2D depiction of two agents in a grid world (see Supplementary Fig. 6 for an example). To check the models’ basic understanding of the images, we again asked the models to determine the background colour of the images and the number of boxes in the scene. We report the percentage of correct answers for both tasks in Fig. 6a,b.

We then asked the models whether the blue agent tried to help or hinder the red agent on a scale from ‘definitely hinder RED’ (0) to ‘definitely help RED’ (100), with the midpoint ‘unsure’ (50). We show the regression to human judgements in Fig. 6c. Finally, we asked the models the counterfactual question of whether the red agent would have succeeded in reaching the star had the blue agent not been there, on a scale from ‘not at all’ (0) to ‘very much’ (100). The original authors collected the responses of 50 participants for each of the two conditions (‘intention’ and ‘counterfactual’). We show the mixed linear regression coefficients between model and human answers for all models with 95% credible intervals in Fig. 6d.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
