Causal chambers as a real-world physical testbed for AI methodology

Main

Methodological research in artificial intelligence (AI), machine learning and statistics often develops without a concrete application in mind. Many impactful advances in these fields have been made in this way, and there are important theoretical questions that are studied outside the context of a particular application. Crucially, progress also relies on access to high-quality, real-world datasets, which help methodological and theoretical researchers steer research in meaningful directions, relax assumptions that are unlikely to hold in practice and develop methodologies that work well on a variety of real-world problems.

However, for some research areas, particularly nascent ones, it can be difficult to find real-world datasets that provide a ground truth suitable to validate new methods and check foundational assumptions that underlie theoretical work. This is because new fields come with new requirements in terms of ground truth, and few or no datasets may have been collected that already satisfy them. For example, for most sub-fields of causal inference1,2,3, we require data from phenomena in which the underlying causal relationships are already exquisitely understood or for which carefully designed intervention experiments are available. For symbolic regression4,5, the data must follow a known, closed-form mathematical expression, for example, a natural law in a controlled experimental environment. For the different types of representation learning6,7, we may need data for which there are some latent ‘generating factors’ that we can measure directly. Such datasets can be difficult to obtain in practice, and few exist for these tasks. As a result, researchers are often limited to synthetic data produced by computer simulations, which may fall short of answering how well a particular method works in practice.

This is where we believe our work can contribute. We have constructed two physical devices that allow the inexpensive and automated collection of data from two well-understood, real physical systems (Fig. 1). The devices, which we call causal chambers, consist of a light tunnel and a wind tunnel (Fig. 2). They are, in essence, computer-controlled laboratories to manipulate and measure different variables of the physical system they contain.

Fig. 1: Data collection workflow.

The user provides an experiment protocol consisting of step-by-step instructions describing the data collection procedure, which the chamber then carries out without human supervision. The instructions specify when and to which values the actuators and sensor parameters should be set. They also specify when the measurements of all the variables should be taken and at which frequency, at a maximum of 10 Hz for the light tunnel and 7 Hz for the wind tunnel. Actuators and sensor parameters can also be set automatically by the chamber as a function of other variables in the system, such as sensor measurements. This allows introducing additional complexity for some validation tasks, as described in the ‘A testbed for algorithms’ section.


We believe that the chambers are well suited to substantially improve the validation of methodological advancements across machine learning and statistics, by providing real datasets with a ground truth for fields in which such datasets are otherwise scarce or non-existent. This is accomplished through two key properties of the chambers. First, the underlying physical systems are well understood, in the sense that the relationships between most variables are described by first principles and natural laws involving linear, nonlinear and differential equations (Supplementary Sections III and IV provide a detailed description with carefully designed experiments). This allows us to provide ground truths for various tasks, including a causal model of each chamber. Second, we can manipulate the systems in a controlled and automated way, quickly producing vast amounts of data. Furthermore, the chambers produce data of different modalities, including independent and identically distributed, time-series and image data, allowing us to provide validation tasks for a wide range of methodologies.

To illustrate the practical use of the chambers, we perform case studies in causal discovery, out-of-distribution generalization, change point detection, independent component analysis (ICA) and symbolic regression (see the ‘Case studies’ section and Figs. 5 and 6). Our choice constitutes only an initial selection, and we believe many other possibilities exist.

Our work complements existing datasets from more complex real-world systems for which a ground truth is not or only partially available8, as well as efforts to produce synthetic data that mimics such systems5,9,10,11,12,13,14. Although good performance on the chambers is not guaranteed to carry over to more complex systems, we believe that the chambers can serve as a sanity check for foundational assumptions and algorithms that are intended to work in a variety of settings.

A list of all the datasets we currently provide can be found at https://causalchamber.org, together with a description of the experimental procedures used to collect them. To allow other researchers to build their own chambers, blueprints, component lists and source code are available in ref. 15.

The causal chambers

Each chamber is a machine that contains a simple physical system and allows us to measure and manipulate some of its variables. The chambers contain a variety of sensors, for example, to measure light intensity or barometric pressure. To manipulate the physical system, actuators allow us to control, for example, the brightness of a light source or the speed at which fans turn. Each sensor can also be manipulated by modifying some of its parameters, such as the oversampling rate or reference voltage.

Throughout this paper, we refer to the actuators and sensor parameters as the manipulable variables of the chamber. A programmable onboard computer controls all the sensor parameters and actuators, enabling the chambers to conduct experiments and collect data without human supervision (Fig. 1). As a result, the chambers can quickly produce vast amounts of data, up to millions of observations or tens of thousands of images per day.
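The experiment protocols themselves are plain-text instruction lists (Fig. 1). As a purely hypothetical sketch, assuming a SET/MSR-style instruction set (the actual instruction names, variable identifiers and syntax of the chambers' protocol language may differ):

```text
# Hypothetical light-tunnel protocol sketch:
# set the red channel of the light source and the first polarizer angle,
# then record 100 measurements of all variables at 10 Hz.
SET, red, 128
SET, theta_1, 45
MSR, 100, 10
```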

In the remainder of this section, we give an overview of each chamber, its physical system and some of the measured variables. Figure 2 provides diagrams of the chambers and their main components, and a detailed description of all the variables can be found in Supplementary Section II.

Fig. 2: The causal chambers.

a, The wind tunnel. b, The light tunnel with the front panel removed to show its inner components. c,d, Diagrams of the chambers and their main components, including the amplification circuit that drives the speaker of the wind tunnel and the variables for the light tunnel camera. The variables measured by the chambers are also displayed. Sensor measurements are denoted by a tilde ‘~’. Manipulable variables, namely, actuators and sensor parameters, are shown in bold symbols (they appear as non-bold text elsewhere in the text). A detailed description of each variable is given in Supplementary Section II.


The wind tunnel

The wind tunnel (Fig. 2a,c) is a chamber with two controllable fans that push air through it and barometers that measure air pressure at different locations. A hatch at the back of the chamber controls an additional opening to the outside. A microphone measures the noise level of the fans, and a speaker allows for an independent effect on its reading.

The tunnel provides data from 32 numerical and categorical variables (Fig. 4a shows some examples), of which 11 are sensor measurements and 21 correspond to actuators and sensor parameters that can be manipulated. For example, we can control the load of the two fans ($L_{\text{in}}, L_{\text{out}}$) and measure their speed ($\tilde{\omega}_{\text{in}}, \tilde{\omega}_{\text{out}}$), the current they draw ($\tilde{C}_{\text{in}}, \tilde{C}_{\text{out}}$) and the resulting air pressure inside the chamber ($\tilde{P}_{\text{dw}}, \tilde{P}_{\text{up}}$) or at its intake ($\tilde{P}_{\text{int}}$). We can manipulate the sensor parameters like the oversampling rate of the barometers ($O_{\text{dw}}, O_{\text{up}}, O_{\text{int}}, O_{\text{amb}}$) or the resolution of the speed sensors ($T_{\text{in}}, T_{\text{out}}$), further affecting their measurements. In the circuit that drives the speaker, we can manipulate the potentiometers ($A_1, A_2$) that control the amplification, monitoring the resulting signal at different points of the circuit ($\tilde{S}_1, \tilde{S}_2$) and through the microphone output ($\tilde{M}$).

The light tunnel

The light tunnel (Fig. 2b,d) is a chamber with a controllable light source at one end and two linear polarizers mounted on rotating frames. The relative angle between the polarizers dictates how much light passes through them (Fig. 4c,e), and sensors measure the light intensity before, between and after the polarizers. A camera on the side opposite the light source allows images to be taken of the inside of the tunnel.

The tunnel provides image data (Fig. 4e) and 41 numerical and categorical variables (Fig. 4b–d), out of which 32 can be manipulated. For example, we can control the intensity of the light source at three different wavelengths ($R, G, B$) and measure the drawn electric current ($\tilde{C}$). Using motors, we can rotate the polarizer frames to the desired angles ($\theta_1, \theta_2$) and measure the effect on light intensity at different wavelengths ($\tilde{I}_1, \tilde{I}_2, \tilde{I}_3, \tilde{V}_1, \tilde{V}_2, \tilde{V}_3$). We can manipulate sensor parameters like the exposure time of the camera ($T_{\text{Im}}$) or the photodiode used by the light sensors ($D_1^I, D_2^I, D_3^I$), further affecting the readings of these sensors.

A testbed for algorithms

The chambers are designed to provide a testbed for a variety of algorithms from AI, machine learning and statistics. To set up validation tasks, we rely on two key properties of the chambers: that the encapsulated physical system is well understood and that we can manipulate it. For example, by manipulating actuators, we can evaluate a learned causal model in its prediction of interventional distributions. By contrast, when the relationships between actuators and sensors are well described by a natural law, we can set up a symbolic regression task in which we try to recover it from data. These are some examples of the tasks we set up in the ‘Case studies’ section, but many other possibilities exist.

In Fig. 3, we provide a graphical representation of the physical system in each chamber under different configurations, in the form of a directed graph relating its variables. In their most basic form, the chambers operate in the standard configuration, where the values of all the actuators and sensor parameters are explicitly given by the user in the experiment protocol (Fig. 1). The light tunnel operates without the camera to allow for the fastest measurement rate.

Fig. 3: Representation of the known effects for different chamber configurations.

The bold symbols correspond to manipulable variables, such as actuators and sensor parameters (shown as non-bold text elsewhere in the text). Sensor measurements are denoted by a tilde. a,c, Standard configurations of the chambers. b, Camera configuration of the light tunnel, including images from the light tunnel ($\tilde{I}^m$) and the camera parameters ($A_p$, ISO, $T_{\text{Im}}$). d, Pressure control configuration of the wind tunnel, where the fan loads $L_{\text{in}}, L_{\text{out}}$ are set by a control mechanism to maintain the chamber pressure $\tilde{P}_{\text{dw}}$ at a given level. Each effect (edge in the graph) is described in detail with targeted experiments in Supplementary Section III.

Fig. 4: Examples of data produced by the chambers.

a, Numeric time-series data produced by the wind tunnel under an impulse on the intake fan load ($L_{\text{in}}$, red), affecting other variables in the system. b,c, Numerical data from the light tunnel illustrating the effect of LED brightness ($L_{11}, L_{12}$) and polarizer angles ($\theta_1, \theta_2$) on the light-intensity readings ($\tilde{I}_1, \tilde{I}_2, \tilde{I}_3$). d, Effect of the light-source setting ($R, G, B$) on the light-intensity reading of the first sensor ($\tilde{I}_1$) and drawn current ($\tilde{C}$). e, Examples of images from the light tunnel for a fixed light-source setting (reference) and interventions on other variables that affect the resulting image.


For additional flexibility in setting up validation tasks, the chambers can also operate in extended configurations. For example, these can include additional variables, such as those from the camera in the light tunnel (Fig. 3b) or additional sensors added in the future. Furthermore, the extended configurations allow us to assign the value of actuators and sensor parameters as a function of other variables in the system, such as sensor measurements. The assignment is carried out automatically by the computer onboard the chamber and allows us to introduce additional complexity into the system. For example, the pressure control configuration of the wind tunnel (Fig. 3d) implements a control mechanism that continuously updates the fan loads ($L_{\text{in}}, L_{\text{out}}$) to keep the chamber pressure ($\tilde{P}_{\text{dw}}$) constant. The assignment functions can be any stochastic or deterministic function expressible in the Turing-complete language that controls the chamber computer. This allows us to modify the causal structure underlying the chambers by introducing additional effects of varying strength between variables. Although this yields a vast space of possible configurations, for the moment we only provide datasets from the four configurations shown in Fig. 3.
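As an illustration of such an assignment function, a pressure control loop can be caricatured as a simple proportional–integral controller. This is a hypothetical sketch with a toy stand-in for the chamber dynamics; the actual control logic onboard the wind tunnel may differ.

```python
import numpy as np

def simulate_pressure_control(target, steps=200, kp=0.8, ki=0.2, dt=0.1):
    """Toy proportional-integral loop that adjusts a fan load to hold a
    simulated chamber pressure at `target` (arbitrary units). The first-order
    plant below is a stand-in for the real chamber dynamics."""
    pressure, integral = 0.0, 0.0
    history = []
    for _ in range(steps):
        error = target - pressure
        integral += error * dt
        load = np.clip(kp * error + ki * integral, 0.0, 1.0)  # actuator limits
        # toy plant: pressure relaxes towards a value proportional to the load
        pressure += dt * (2.0 * load - pressure)
        history.append(pressure)
    return np.array(history)

trace = simulate_pressure_control(target=1.0)
print(trace[-1])  # settles close to the target pressure of 1.0
```

The same skeleton (read sensors, compute an error, set actuators) covers any assignment function the extended configurations allow, including stochastic ones.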

In Supplementary Section III, we provide a detailed description of all the effects (that is, edges) in Fig. 3, based on the background knowledge and carefully designed experiments (Supplementary Figs. 2–9). Furthermore, Supplementary Section IV evaluates mechanistic models that describe some of the effects, ranging from simple natural laws to more complex models involving the technical specifications of the actual components. For the more complex processes in the chambers, such as image capture in the light tunnel or the effects on wind tunnel pressure, we provide approximate models with increasing degrees of fidelity. In Fig. 6c, we compare the output of some of these models to measurements gathered from the chambers.

Causal ground truth

For readers with a background in causal inference, the graphs in Fig. 3 may be reminiscent of causal graphical models2,3,16. In Supplementary Section V, we formalize a causal interpretation of the graphs and validate them with additional randomized experiments. In short, an edge $X \to Y$ signifies that an intervention on $X$ will change the distribution of the subsequent measurements of $Y$. This interpretation allows us to treat the graphs shown in Fig. 3 as causal ground truths for a variety of causal inference tasks.
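Under this interpretation, an intervention on a variable may change the distribution of all its descendants in the graph. A minimal sketch, using a toy fragment of the light-tunnel graph (the edges below are illustrative, not the full ground truth):

```python
# Toy fragment of a causal ground-truth graph: an edge X -> Y means an
# intervention on X changes the distribution of subsequent measurements of Y.
graph = {
    "R": ["I1", "C"],    # red channel affects light sensor I1 and current C
    "theta_1": ["I3"],   # polarizer angle affects intensity after the polarizers
    "I1": [],
    "I3": [],
    "C": [],
}

def affected_by(graph, intervened):
    """All variables whose distribution may change under an intervention on
    `intervened`: its descendants, found by depth-first traversal."""
    seen, stack = set(), list(graph[intervened])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen

print(affected_by(graph, "R"))  # {'I1', 'C'} (order may vary)
```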

Under our interpretation, the absence of an edge between two variables does not preclude the existence of a causal effect between them. As with most real systems, effects between observed variables may exist beyond what we know or can validate through the procedures described in this paper, due to a lack of statistical power. Furthermore, there are confounding effects in which unmeasured variables simultaneously affect some of the variables in the chambers. For example, variations in the atmospheric pressure outside the chambers simultaneously affect all the barometric measurements. Supplementary Section V provides more details.

Case studies

We now show, through practical examples, how the chambers can be used to validate algorithms from a variety of fields. As a starting point, we provide a collection of datasets and set up tasks from a selection of research areas. Our choice is by no means exhaustive, and these case studies are intended as illustrations rather than comprehensive benchmarks. We describe each field and the corresponding tasks below, evaluate the performance of different algorithms and show the results in Figs. 5 and 6.

Fig. 5: Validating algorithms using the chambers (part 1).

a, Causal discovery. The tasks consist of recovering the causal graph using the observational data and interventional data from the light tunnel (tasks a1 and a2), as well as using the time-series data from the wind tunnel (task a3). We run a suitable method for each task (GES27, UT-IGSP with hyperparameter tuning28, and PCMCI+29, respectively), and evaluate their performance in the recovery of the causal structure of the corresponding ground truth (see the ‘Causal ground truth’ section). GES and PCMCI+ return a set of 12 and 5 plausible graphs, respectively, encoded by a graph with undirected edges (see, for example, section 2.4 in ref. 27). For these methods, we show the precision and recall in the recovery of the directed ground-truth edges for the best-scoring (bold) and worst-scoring graph in each set. All the graphs returned by PCMCI+ attain the same scores, performing similarly to random guessing. b, Evaluating the out-of-distribution performance of regression methods. For each task, we try to predict a sensor measurement or actuator value (Y) from predictors (X) such as numeric measurements (task b1), images (task b2) or impulse–response curves (task b3). We evaluate the predictive performance of each method in terms of its MAE on a separate validation set from the training distribution and shifted distributions arising from manipulating the chamber variables. We display the MAE with spider charts, where each axis corresponds to a different setting. As a baseline, we show the MAE incurred when using the average of Y in the training set as prediction (black, dashed). For tasks b2 and b3, the MAE is averaged over 16 random initializations of the model, with error bands corresponding to ±1 standard deviation. c, Change point detection in sensor measurements. We change the intake fan load (Lin) at random time points and keep all other actuators and sensor parameters constant. 
Because the load affects all the displayed sensors, we take these time points as the ground truth (vertical dotted lines) and compare them with the output of the change point detection algorithm (black crosses). MLP, multilayer perceptron.

Fig. 6: Validating algorithms using the chambers (part 2).

a, Applying ICA to disentangle the actuator inputs from sensor readings (tasks d1 and d2) and image data (task d3). For each task, we show the actuator values (top), the resulting images and measurements (middle) and the sources recovered by the FastICA algorithm41 (bottom). For each actuator, we show the recovered source with the highest Pearson correlation coefficient. b, Applying symbolic regression to recover (top) Bernoulli’s principle from the difference in pressure at the upwind and downwind barometers, and (bottom) Malus’s law from light-intensity measurements. We show the output of the method described elsewhere43 for five runs with different random initializations. In the top panel, the colours correspond to the residual of the observation with respect to the red line and in the bottom panel, to the value of θ2. c, Mechanistic models of the chambers to simulate sensor measurements from actuator inputs with varying degrees of fidelity. For a given set of inputs, we show the outputs of models describing the fan speeds and air pressure in the wind tunnel (top), and the image generation process of the light tunnel (bottom). We compare the model outputs with the images and measurements collected from the chambers (black lines). Higher model numbers imply more complex models with increased fidelity. The models are defined in Supplementary Section IV.

Full size image

For each case study, we provide a detailed description of the experimental procedure in the Methods, together with well-documented code to reproduce the experiments in the paper repository at ref. 17. See the ‘Data availability statement’ for details of how to access the datasets.

Causal discovery

By offering a causal ground truth and the ability to carry out interventions, the chambers provide an opportunity to validate causal discovery algorithms3,18,19,20,21, which aim to recover the cause-and-effect relationships from the data. The chambers provide data suited to validate a wide range of approaches, including those that rely on independent and identically distributed or time-series data22 with and without instantaneous or lagged causal effects, and causal structures with and without cycles23,24. We consider the task of recovering the complete causal graph describing the effects in the system1,25,26, and evaluate algorithms that take different types of data as the input: greedy equivalence search (GES)27 for purely observational data, unknown-target interventional greedy sparsest permutation (UT-IGSP)28 for interventional data with unknown targets and Peter-Clark momentary conditional independence (PCMCI+)29 for time-series data. This constitutes an example selection of methods that is not exhaustive. Performance is measured by the recovery of the ground-truth graph (see the ‘Causal ground truth’ section). The results are shown in Fig. 5a. In line with their underlying assumptions, both GES and UT-IGSP recover the strong, linear effects from the light-source setting ($R, G, B$) to the light-sensor readings and drawn current. However, both methods struggle with the nonlinear effects of the polarizer angles ($\theta_1, \theta_2$) and the weak effects of the additional light-emitting diodes (LEDs; $L_{11}, \ldots, L_{32}$), which are apparent only when the light-source brightness is low or the polarizers are crossed (Supplementary Fig. 10). For the time-series data from the wind tunnel (task a3), PCMCI+ displays low recall and performs similarly to random guessing, despite the data matching the settings it is intended for.
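The structure-recovery metrics reported in Fig. 5a reduce to precision and recall over the sets of directed edges. A minimal sketch of that computation, with graphs represented as edge sets (the example edges are hypothetical):

```python
def edge_metrics(estimated, truth):
    """Precision and recall of a recovered causal graph, with graphs
    represented as sets of directed edges (parent, child)."""
    estimated, truth = set(estimated), set(truth)
    tp = len(estimated & truth)  # correctly recovered directed edges
    precision = tp / len(estimated) if estimated else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Toy example: the estimate misses one true edge and adds a spurious one.
truth = {("R", "I1"), ("G", "I1"), ("B", "I1")}
estimated = {("R", "I1"), ("G", "I1"), ("I1", "C")}
print(edge_metrics(estimated, truth))  # prints (0.6666666666666666, 0.6666666666666666)
```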

Out-of-distribution generalization

By manipulating the chamber actuators and sensor parameters, we can induce distribution shifts in a controlled manner. This enables us not only to test the performance of the prediction and inference algorithms on datasets with a distribution that differs from the training distribution but also to investigate under which assumptions on the shifts such methods perform well30,31,32. As an illustration, we set up three simple tasks with different data modalities (Fig. 5b). The first consists of predicting the light-intensity reading $\tilde{I}_1$ from the other numeric variables of the light tunnel. We fit a simple linear regression with an increasing number of predictors and evaluate its predictive performance on data arising from interventions on the light-source intensity ($R, G, B$), sensor parameters ($T_2^I, T_1^V, T_2^V, T_3^V$) and polarizer alignment ($\theta_1$). For the second task, we predict the colour setting ($R, G, B$) of the light source from the images captured by the camera. We use a small convolutional neural network33, which we evaluate on shifts induced by changing the distribution of colours, polarizer angles and camera parameters. The goal of the last task is to predict the hatch position $H$ from the pressure curve ($\tilde{P}_{\text{dw}}$) that results from applying a short impulse to the load $L_{\text{in}}$ of the intake fan. We fit a simple feed-forward neural network and validate its performance on curves collected under different loads of the exhaust fan ($L_{\text{out}}$), different barometer precision ($O_{\text{dw}}$) and from a barometer in a different position ($\tilde{P}_{\text{up}}$). As expected, the performance of the methods degrades under distribution shifts. Even minute changes to the distribution of inputs, for example, due to an increase in the barometer oversampling rate ($O_{\text{dw}} \leftarrow 8$; Extended Data Fig. 1), can make the multilayer perceptron in task b3 fail. Interestingly, the notion of causal invariance34 predicts the drop in performance of some models. For example, the mean absolute error (MAE) incurred by predicting the training-set mean (that is, the empty model) remains constant across environments, except in those where the causal parents of the response (Fig. 3a) receive an intervention (that is, $R, G, B$ in tasks b1 and b2). In task b1, the model that includes only causal parents ($\tilde{I}_1 \approx R, G, B$) is the most stable across all the environments, whereas models that include additional (non-causal) variables achieve a better MAE in the training distribution but perform worse in environments in which these variables directly or indirectly receive an intervention.
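The empty-model baseline is simply the MAE of predicting the training-set mean, which stays flat across environments unless the response itself shifts. A minimal sketch with synthetic numbers (illustrative only, not chamber data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for a training environment and a shifted one in which
# a causal parent of the response received an intervention.
y_train = rng.normal(loc=5.0, scale=1.0, size=1000)
y_shifted = rng.normal(loc=8.0, scale=1.0, size=1000)

baseline = y_train.mean()  # the 'empty model': always predict the training mean
mae_train = np.mean(np.abs(y_train - baseline))
mae_shifted = np.mean(np.abs(y_shifted - baseline))

# The baseline degrades only when the shift moves the response itself.
print(mae_train, mae_shifted)
```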

Change point detection

Change point detection aims to identify abrupt changes or transitions in time-series data or its underlying data-generating process35. By manipulating actuators and sensor parameters, we can induce changes in the measurements of the affected sensors, providing real datasets with a known ground truth in terms of change points. To validate offline change point detection algorithms35,36, we generate time-series data with smooth and abrupt changes of increasing difficulty. We evaluate the non-parametric change point detection algorithm changeforest37, and the results are shown in Fig. 5c. As expected, the method correctly recovers all the change points in the deterministic time-series data of the actuator input $L_{\text{in}}$. For the affected sensors, the method successfully detects abrupt changes in the signal or its regime, but fails to detect more subtle changes, such as those with only a slight effect on the variance (for example, $\tilde{C}_{\text{in}}$ or $\tilde{M}$ in Fig. 5c).
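The core of offline change point detection can be sketched in a few lines; the toy detector below is far simpler than changeforest and only handles a single mean shift in a one-dimensional signal, but it illustrates the cost-minimization idea (synthetic signal, for illustration only):

```python
import numpy as np

def best_mean_split(x):
    """Return the index that minimises the summed within-segment squared
    error when the signal is split into two constant-mean segments."""
    x = np.asarray(x, dtype=float)
    best_cost, best_k = np.inf, None
    for k in range(2, len(x) - 2):  # leave a few points on each side
        left, right = x[:k], x[k:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_cost, best_k = cost, k
    return best_k

rng = np.random.default_rng(1)
signal = np.concatenate([rng.normal(0.0, 0.1, 100), rng.normal(1.0, 0.1, 100)])
print(best_mean_split(signal))  # close to the true change point at index 100
```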

ICA

ICA is a family of techniques that treat data as a mixture of latent components and aim to discover a demixing transformation that can accurately recover them38,39. The linear variants of ICA40,41 are well established, and recent developments in nonlinear ICA have cast it as a framework that holds potential for effectively tackling the challenge of disentanglement in complex data6,38. We propose tasks that consist of recovering (up to indeterminacies such as scaling) the values of independently set actuators from the measurements of the sensors they affect. As a starting point, we set up three tasks (Fig. 6a): recovering the light-source setting ($R, G, B$) from the light-intensity measurements ($\tilde{I}_1, \tilde{I}_2, \tilde{I}_3, \tilde{V}_1, \tilde{V}_2, \tilde{V}_3$), recovering the fan loads ($L_{\text{in}}, L_{\text{out}}$) and hatch position ($H$) from the barometric readings ($\tilde{P}_{\text{dw}}, \tilde{P}_{\text{up}}, \tilde{P}_{\text{amb}}, \tilde{P}_{\text{int}}$) and recovering the configuration of the light source and polarizers ($R, G, B, \theta_1, \theta_2$) from the image data of the light tunnel. The tasks display increasing difficulty in terms of the complexity and dimensionality of the mixing transformation. As a first baseline, we apply FastICA41, which assumes a linear mixing function. Indeed, the method succeeds in estimating the actuator inputs for task d1 (Fig. 6), where the mixing function is approximately linear (Supplementary Section III.2). For the second task (d2), in which the effect of the actuators on the sensors is nonlinear (Supplementary Section IV.1.2), the method produces a distorted estimate of the actuators. For the third task (d3), in which the mixing function is both nonlinear and high dimensional, the method produces estimates in seemingly little agreement with the ground-truth signals.
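The linear setting of task d1 can be caricatured end to end: independent, non-Gaussian sources pass through a linear mixing matrix, and a FastICA-style fixed-point iteration recovers them up to permutation, sign and scale. The sketch below is a minimal NumPy implementation under these assumptions, not the implementation used in the paper; the sources and mixing matrix are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent, non-Gaussian sources (stand-ins for actuator inputs R, G, B).
n = 5000
S = rng.uniform(-1, 1, size=(3, n))
A = np.array([[1.0, 0.5, 0.2],   # hypothetical linear mixing, as in task d1
              [0.3, 1.0, 0.4],
              [0.2, 0.6, 1.0]])
X = A @ S                        # 'sensor readings' = mixed sources

# Whiten the observations.
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / n)
Xw = E @ np.diag(d ** -0.5) @ E.T @ X

# Symmetric FastICA iterations with a tanh contrast function.
W = rng.normal(size=(3, 3))
for _ in range(200):
    G = np.tanh(W @ Xw)
    W_new = G @ Xw.T / n - np.diag((1 - G ** 2).mean(axis=1)) @ W
    u, _, vt = np.linalg.svd(W_new)  # symmetric decorrelation: W <- (WW^T)^{-1/2} W
    W = u @ vt

S_hat = W @ Xw  # recovered sources, up to permutation, sign and scale
```

Each row of `S_hat` should be strongly correlated with one of the true sources, which is how Fig. 6a matches each actuator to its best-correlated recovered source.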

Symbolic regression

Symbolic regression12,42 aims to discover mathematical equations or expressions that best describe the underlying relationships in data, enabling interpretable and compact model representations. A common motivation is the automatic discovery of natural laws from the data4. Because simple natural laws effectively describe some of the relationships in the chambers, it is possible to provide symbolic regression tasks from real data, and evaluate the performance of such algorithms. As an example, we set up two tasks: recovering Bernoulli’s principle, which relates the barometric measurements of the upwind and downwind barometers ($\tilde{P}_{\text{up}}, \tilde{P}_{\text{dw}}$); and Malus’s law, which describes the effect of linear polarizers ($\theta_1, \theta_2$) on the light-intensity readings of the third sensor ($\tilde{I}_3, \tilde{V}_3$); more details can be found in Supplementary Sections IV.1.3 and IV.2.1, respectively. Bernoulli’s principle provides a task with a simple ground-truth function but a low signal-to-noise ratio, whereas Malus’s law provides a more complex function with weaker noise, representing two common challenges for symbolic regression algorithms. We apply the method described in ref. 43 and show the results of five runs in Fig. 6b. The estimated expressions depend strongly on the random initialization of the method, although all of them attain a similar $R^2$ score on the data (Extended Data Fig. 2). When we apply the method to synthetic data following Malus’s law with added Gaussian noise, the dependence on random initialization disappears and the method returns the correct ground-truth expression in every run (Extended Data Fig. 2). This highlights a scenario in which synthetic benchmarks may be unreliable for estimating a method’s performance in the real world.
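Malus's law states that the intensity transmitted through two linear polarizers scales with $\cos^2(\theta_1 - \theta_2)$. A sketch of a synthetic variant of the task, with Gaussian noise added as in the comparison above; all constants are hypothetical, and where a symbolic regressor must discover the $\cos^2$ term itself, here the known functional form lets the amplitude follow from least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Malus's law: transmitted intensity I = I0 * cos^2(theta_1 - theta_2).
I0 = 2.0                                  # hypothetical source intensity
theta1 = rng.uniform(0, np.pi, 500)
theta2 = rng.uniform(0, np.pi, 500)
I = I0 * np.cos(theta1 - theta2) ** 2 + rng.normal(0, 0.05, 500)  # weak noise

# With the functional form fixed, the amplitude is a one-parameter
# least-squares problem.
phi = np.cos(theta1 - theta2) ** 2
I0_hat = (phi @ I) / (phi @ phi)
print(I0_hat)  # close to the true amplitude of 2.0
```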

Physics-informed machine learning

Physics-informed machine learning integrates physical laws or domain-specific knowledge into machine learning models to enhance their accuracy and generalizability44. To validate such approaches, in Supplementary Section IV, we provide mechanistic models of several processes in the chambers, derived from first principles. For each process, we consider models of increasing complexity, allowing us to simulate sensor measurements with varying degrees of fidelity. This provides a testbed for simulation-based inference45 and approaches that exploit potentially mis-specified models for inference or generation46,47,48. As an illustration, in Fig. 6c, we compare measurements gathered from the chambers with the output of some of these models. In particular, we show the models describing the image capture process of the light tunnel, and the effects of the fan loads ($L_{\text{in}}, L_{\text{out}}$) and hatch position ($H$) on other wind tunnel variables ($\tilde{P}_{\text{dw}}, \tilde{\omega}_{\text{in}}, \tilde{\omega}_{\text{out}}$). Their description, together with additional models and their outputs, is available in Supplementary Section IV. To facilitate building additional models and simulators, we provide the datasheets for every chamber component in Supplementary Section VI, detailing their technical specifications and physical properties.
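As a caricature of such a fidelity hierarchy (these are toy models for illustration, not the actual models of Supplementary Section IV), consider two models of a fan-speed response to a load step: a steady-state model and one that adds first-order dynamics.

```python
import numpy as np

def model_1(load, steps=100):
    """Model 1: steady state only -- the speed jumps instantly to k * load."""
    k = 10.0  # hypothetical gain
    return np.full(steps, k * load)

def model_2(load, steps=100, dt=0.1, tau=1.0):
    """Model 2: first-order dynamics -- the speed relaxes towards k * load
    with time constant tau, as a fan with rotational inertia would."""
    k, omega = 10.0, 0.0
    out = []
    for _ in range(steps):
        omega += dt / tau * (k * load - omega)
        out.append(omega)
    return np.array(out)

# Both agree in steady state; only model 2 captures the transient.
print(model_1(0.5)[-1], model_2(0.5)[-1])  # both close to 5.0
```

Comparing each model's output against measured curves, as in Fig. 6c, shows which aspects of the data the extra complexity actually buys.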

Discussion

We have constructed two devices to collect real-world datasets from well-understood but non-trivial physical systems. The devices provide a testbed beyond simulated data for a variety of empirical inference algorithms in the broad field of AI. To illustrate their use, we have gathered an initial collection of datasets and used them to perform small case studies in different fields.

The case studies are intended to showcase the flexibility of the chambers in setting up validation tasks; providing exhaustive benchmarks is beyond the scope of this work. However, the mixed performance of algorithms in the case studies suggests that, although limited in scope, the chambers can already serve as useful benchmarks for these fields. In some cases, the shortcomings of the methods can be attributed to their underlying assumptions (for example, tasks a1, a2, d2 and d3). In others, such as tasks e and a3, they highlight a mismatch between performance on synthetic and real data, which can lead to an overconfident assessment of a method’s capabilities. Task b shows that the chambers provide a principled environment for studying phenomena such as causal invariance or the sensitivity of neural networks to small shifts in the distribution of their inputs.

We believe the presented chambers can be used for applications that go beyond the ones we have considered. In particular, the digital control of the chambers makes it possible to validate a variety of active learning, reinforcement learning and control algorithms.

The chambers are complementary to well-motivated, complex simulators of real phenomena. On one hand, such simulators allow us to approximate complex systems that are the intended application targets, such as mechanisms of the global climate or gene regulatory networks with hundreds of variables and interactions. On the other hand, it can be difficult (or impossible) to judge whether the assumptions used to build these simulators will hold in the real world, and—more importantly—how their violation will affect an algorithm when we use it on real rather than simulated data. Well-understood systems like the chambers provide real-world data without relying on computer simulations and the modelling and simplifying assumptions behind them. However, the requirement of providing a reliable ground truth necessarily limits the chambers' complexity and size. Therefore, the success of an algorithm on the chambers may not necessarily transfer to larger and more complex systems.

Our aim is that the chambers become a sanity check for algorithms designed to work in a variety of situations. Failures on these testbeds can indicate potential shortcomings in applications to more complex systems, allowing researchers to test and refine their algorithms and to reconsider fundamental assumptions.

We make all the datasets collected from the chambers publicly available, including those used in the ‘Case studies’ section. Researchers can access them at https://causalchamber.org and through the Python causalchamber package available at https://pypi.org/project/causalchamber/ (see the ‘Data availability statement’ for more details). We will continue to expand this dataset repository, and we are open to suggestions of additional experiments that may prove interesting—please reach out to the corresponding author.

In ref. 15, we provide the blueprints and code to allow other researchers to build their own chambers. We hope these resources can be used as a starting point to build chambers around other well-understood systems that prove valuable for the validation of AI methodology.

Methods

Here we provide a brief description of the experimental setup for each case study described in the ‘Case studies’ section, together with a link to the corresponding datasets at https://causalchamber.org/, and to the corresponding code in the paper repository in ref. 17.

Case study: causal discovery

All the methods we evaluate in this case study return a directed acyclic graph (or a set of them) as an estimate. Given a single directed acyclic graph estimate $\hat{G} := (V, \hat{E})$ and a ground-truth graph $G^{\star} = (V, E^{\star})$, we compute the precision P and recall R in terms of directed edge recovery as

$$P := \frac{|\hat{E} \cap E^{\star}|}{|\hat{E}|}, \quad \text{and} \quad R := \frac{|\hat{E} \cap E^{\star}|}{|E^{\star}|},$$
(1)

where $\hat{E}$ and $E^{\star}$ are the sets of directed edges in $\hat{G}$ and $G^{\star}$, respectively. If a method outputs several directed acyclic graphs, we compute P and R for each element in this set.
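
In code, the metrics of equation (1) amount to set operations on directed edge sets. A minimal sketch (our own illustration, with a hypothetical three-variable example):

```python
def edge_precision_recall(estimated_edges, true_edges):
    """Precision and recall of directed-edge recovery, as in equation (1)."""
    est, true = set(estimated_edges), set(true_edges)
    correct = len(est & true)  # correctly recovered directed edges
    precision = correct / len(est) if est else 1.0  # convention for an empty estimate
    recall = correct / len(true) if true else 1.0
    return precision, recall

# Hypothetical example: ground truth X -> Y -> Z; the estimate recovers
# X -> Y but misses Y -> Z and adds a spurious edge Z -> X.
truth = {("X", "Y"), ("Y", "Z")}
estimate = {("X", "Y"), ("Z", "X")}
p, r = edge_precision_recall(estimate, truth)  # p = 0.5, r = 0.5
```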

Task a1: observational data

As input for GES, we take 10,000 observations from a subset of the variables (Fig. 5) in the uniform_reference experiment of the lt_interventions_standard_v1 dataset (https://github.com/juangamella/causal-chamber/tree/master/datasets/lt_interventions_standard_v1) and use the accompanying code (https://github.com/juangamella/causal-chamber-paper/blob/main/case_studies/causal_discovery_iid.ipynb). As the score for the algorithm, we use the Bayesian information criterion (BIC) with a Gaussian likelihood. GES returns the Markov equivalence class of the estimated data-generating graph, and for each graph in this class, we compute the corresponding precision and recall in the recovery of the edges in the ground-truth graph.

Task a2: interventional data

We consider the same subset of variables as for task a1, taking data from several experiments in the lt_interventions_standard_v1 dataset (https://github.com/juangamella/causal-chamber/tree/master/datasets/lt_interventions_standard_v1) as input for UT-IGSP28. As ‘observational data’, we take the 10,000 observations from the uniform_reference experiment. As ‘interventional data’, we take 1,000 observations from each experiment in which the considered variables receive an intervention; see the accompanying code (https://github.com/juangamella/causal-chamber-paper/blob/main/case_studies/causal_discovery_iid.ipynb) for the experiment names. For the conditional independence and invariance tests, we use the default Gaussian tests implemented in the Python package of UT-IGSP, and run the algorithm at significance levels (α, β) ∈ [10⁻⁴, 10⁻²]². We show the result for α = 0.008 and β = 0.009, which performs best in terms of both precision and recall (equation (1)).

Task a3: time-series data

As input to PCMCI+29, we take 10,000 observations from a subset of the variables in the actuators_random_walk_1 experiment of the wt_walks_v1 dataset (https://github.com/juangamella/causal-chamber/tree/master/datasets/wt_walks_v1) and use the accompanying code (https://github.com/juangamella/causal-chamber-paper/blob/main/case_studies/causal_discovery_time.ipynb). We run the method with partial correlation tests at significance level α = 1 × 10⁻² and a maximum of ten lags. From the resulting estimate, we drop edges from a variable to itself and edges for which orientation conflicts arise. We compute the precision and recall (equation (1)) for each of the two graphs in the resulting equivalence class.
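
The post-processing step can be sketched as plain set operations on directed (cause, effect) pairs; this representation is our own simplification of the summary graph, not tigramite's output format:

```python
def postprocess(edges):
    """Drop self-edges and edge pairs with conflicting orientations.

    `edges` is a set of (cause, effect) pairs summarizing the estimated
    time-series graph over variable names.
    """
    edges = {(a, b) for a, b in edges if a != b}                # drop self-edges
    conflicted = {(a, b) for a, b in edges if (b, a) in edges}  # both directions present
    return edges - conflicted

# Hypothetical estimate with one self-edge and one orientation conflict
est = {("L_in", "L_in"), ("L_in", "w_in"), ("w_in", "P_dw"), ("P_dw", "w_in")}
clean = postprocess(est)  # {("L_in", "w_in")}
```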

Case study: out-of-distribution generalization

Task b1: regression from sensor measurements

We use the data from several experiments in the lt_interventions_standard_v1 dataset (https://github.com/juangamella/causal-chamber/tree/main/datasets/lt_interventions_standard_v1) and use the accompanying code (https://github.com/juangamella/causal-chamber-paper/blob/main/case_studies/ood_sensors.ipynb). We begin by splitting the observations from the uniform_reference experiment into a training set (100 observations) and a validation set (1,000 observations, shown with Ø in the spider plot for task b1 (Fig. 5b)). As additional validation sets (1,000 observations each), we select experiments in which the variables $R, G, B, T_1^V, T_2^V, T_3^V, T_2^I, \theta_1$ receive an intervention; the accompanying code provides the experiment names. These validation sets correspond to the additional axes in the spider plot of task b1 in Fig. 5b. On the training set, we fit linear models with an intercept using ordinary least squares, with response $\tilde{I}_1$ and different sets of predictors: $\{R\}$, $\{R, G\}$, $\{R, G, B\}$, $\{R, G, B, \tilde{V}_1\}$ and $\{\tilde{I}_2, \tilde{I}_3, \tilde{V}_1, \tilde{V}_2, \tilde{V}_3\}$. As a baseline, we consider the model that predicts the average of $\tilde{I}_1$ in the training set. For each resulting model, we compute the MAE on each of the validation sets. The additional scatter plot for task b1 in Fig. 5b corresponds to the pooled data across all the validation sets.

Task b2: regression from images

We use the images from the lt_color_regression_v1 dataset (https://github.com/juangamella/causal-chamber/tree/main/datasets/lt_color_regression_v1) and use the accompanying code (https://github.com/juangamella/causal-chamber-paper/blob/main/case_studies/ood_images.ipynb), at a size of 100 × 100 pixels. We split the data from the reference experiment into a training and a validation set (9,000 and 500 observations, respectively). As additional validation sets, we take those arising from shifts in the distribution of the response R, G, B (bright_colors experiment) and from interventions on the parameters of the camera; the accompanying code provides the experiment names. We subsample each of the additional validation sets to a size of 500 observations. As a regression model, we use a small LeNet-like convolutional neural network49; the code provides more details. As a loss function, we use the mean-squared error in predicting the light-source settings R, G, B, which we minimize using stochastic gradient descent. We fit the model a total of 16 times, each with a different random initialization of the network weights. For each resulting model, we compute the MAE on each validation set, and plot the results for task b2 in Fig. 5b. As a baseline, we consider the model that predicts the average of R, G, B in the training set.

Task b3: regression from impulse–response curves

We use the data from several experiments in the wt_intake_impulse_v1 dataset (https://github.com/juangamella/causal-chamber/tree/main/datasets/wt_intake_impulse_v1), corresponding to different settings of the exhaust load Lout and oversampling rates Odw of the downwind barometer; the accompanying code (https://github.com/juangamella/causal-chamber-paper/blob/main/case_studies/ood_impulses.ipynb) provides the experiment names. We split the data from the load_out_0.5_osr_downwind_4 experiment into a training and a validation set (4,000 and 900 observations, respectively). As a regression model, we use a multilayer perceptron with an input layer of size 50 (the impulse length), an output layer of size 1 and two additional hidden layers with 200 neurons each and rectified linear unit activations. As a loss function, we use the mean-squared error in predicting the hatch position H, and train the model using stochastic gradient descent. We fit the model a total of 16 times, each with a different random initialization of the network weights. For each resulting model, we compute the MAE on validation sets from the training distribution and the additional experiments; each validation set corresponds to a different axis in the spider plot for task b3 in Fig. 5b. As a baseline, we consider the model that predicts the average of H in the training set.
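
For reference, the forward pass of the 50-200-200-1 rectified-linear multilayer perceptron can be sketched in plain Python. This illustrates the architecture only; the actual models are trained with stochastic gradient descent, which is not shown, and the random weights below are placeholders:

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, W, b):
    """Affine map with weight matrix W of shape (n_out, n_in) and bias b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def mlp_forward(x, layers):
    h = x
    for W, b, hidden in layers:
        h = linear(h, W, b)
        if hidden:  # ReLU on the hidden layers only
            h = relu(h)
    return h

rng = random.Random(0)

def rand_layer(n_out, n_in, hidden):
    W = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return (W, [0.0] * n_out, hidden)

# Input of size 50 (the impulse), two hidden layers of 200 neurons,
# and a single output: the predicted hatch position H.
layers = [rand_layer(200, 50, True), rand_layer(200, 200, True), rand_layer(1, 200, False)]
prediction = mlp_forward([0.1] * 50, layers)  # list with one float
```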

Case study: change point detection

We take the data from the load_in_seed_9 experiment in the wt_changepoints_v1 dataset (https://github.com/juangamella/causal-chamber/tree/main/datasets/wt_changepoints_v1), and apply the changeforest algorithm37 to each of the time series, namely, Lin, $\tilde{\omega}_{\mathrm{in}}$, $\tilde{P}_{\mathrm{dw}} - \tilde{P}_{\mathrm{amb}}$, $\tilde{C}_{\mathrm{in}}$ and $\tilde{M}$. For the algorithm, we use the ‘random_forest’ method and default hyperparameters; the accompanying code (https://github.com/juangamella/causal-chamber-paper/blob/main/case_studies/changepoints.ipynb) provides the details. As the ground truth for the change points (Fig. 5c, vertical grey lines), we take the time points at which Lin is set to a new level. In all the datasets collected from the chambers, the column intervention takes a value of 1 for the first measurement after an intervention on any of the chamber variables.
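
Extracting the ground-truth change points from the intervention column thus amounts to collecting the indices at which the column equals 1; a minimal sketch with a hypothetical excerpt of the column:

```python
def ground_truth_changepoints(intervention_column):
    """Indices of the first measurement after each intervention (flag == 1)."""
    return [i for i, flag in enumerate(intervention_column) if flag == 1]

flags = [0, 0, 1, 0, 0, 0, 1, 0]        # hypothetical excerpt of the column
cps = ground_truth_changepoints(flags)  # [2, 6]
```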

Case study: ICA

Task d1: recovering light-source colour

We use the color_mix experiment from the lt_walks_v1 dataset (https://github.com/juangamella/causal-chamber/tree/main/datasets/lt_walks_v1) and use the accompanying code (https://github.com/juangamella/causal-chamber-paper/blob/main/case_studies/ica.ipynb). As input to the FastICA algorithm41, we take the light-intensity measurements $\tilde{I}_1, \tilde{I}_2, \tilde{I}_3, \tilde{V}_1, \tilde{V}_2, \tilde{V}_3$, to which we first apply a whitening transformation. We run the algorithm with six components (sources). For each ground-truth source (R, G, B), we show the recovered signal with the highest Pearson correlation coefficient (in absolute value).

Task d2: recovering fan loads and hatch position

We use the loads_hatch_mix_slow experiment from the wt_walks_v1 dataset (https://github.com/juangamella/causal-chamber/tree/main/datasets/wt_walks_v1) and use the accompanying code (https://github.com/juangamella/causal-chamber-paper/blob/main/case_studies/ica.ipynb). As input to the FastICA algorithm41, we take the barometric pressure measurements $\tilde{P}_{\mathrm{dw}}, \tilde{P}_{\mathrm{up}}, \tilde{P}_{\mathrm{amb}}, \tilde{P}_{\mathrm{int}}$, to which we first apply a whitening transformation. We run the algorithm with four components (sources). For each ground-truth source (Lin, Lout, H), we show the recovered signal with the highest Pearson correlation coefficient (in absolute value).

Task d3: recovering actuators from images

As input to the FastICA algorithm41, we use the images from the actuator_mix experiment in the lt_camera_walks_v1 dataset (https://github.com/juangamella/causal-chamber/tree/main/datasets/lt_camera_walks_v1), at a size of 50 × 50 pixels, and use the accompanying code (https://github.com/juangamella/causal-chamber-paper/blob/main/case_studies/ica.ipynb). We first flatten the images, such that each pixel becomes an input variable, applying a whitening transformation as an additional preprocessing step. We run the algorithm with five components (sources). For each ground-truth source (R, G, B, θ1, θ2), we show the recovered signal with the highest Pearson correlation coefficient (in absolute value).

Case study: symbolic regression

We use a pretrained version of the model described in ref. 43; the code (https://github.com/juangamella/causal-chamber-paper/blob/main/case_studies/symbolic_regression.ipynb) provides more details. For both tasks, we use the same hyperparameters as in the demonstration provided at https://github.com/facebookresearch/symbolicregression/blob/main/Example.ipynb and run the algorithm with five different random initializations. As input for the first task, we randomly sample 1,000 observations from the random_loads_intake experiment in the wt_bernoulli_v1 dataset (https://github.com/juangamella/causal-chamber/tree/main/datasets/wt_bernoulli_v1); for the second task, we use 1,000 observations from the white_255 experiment in the lt_malus_v1 dataset (https://github.com/juangamella/causal-chamber/tree/main/datasets/lt_malus_v1). We show the estimated expressions in Fig. 6b, rounding the constants to one decimal place.

Case study: mechanistic models

We compare the output of some of the models defined in Supplementary Section IV with actual measurements collected from the chambers. For the wind tunnel models, we use the steps experiment from the wt_test_v1 dataset (https://github.com/juangamella/causal-chamber/tree/main/datasets/wt_test_v1), setting their parameters to the values suggested in Supplementary Section IV.1. For the models of the image capture process of the light tunnel, we take the images from the palette experiment in the lt_camera_test_v1 dataset (https://github.com/juangamella/causal-chamber/tree/main/datasets/lt_camera_test_v1) and use the parameters suggested in Supplementary Section IV.2.2. All the models are implemented in the Python package available at https://github.com/juangamella/causal-chamber#mechanistic-models; the accompanying code (https://github.com/juangamella/causal-chamber-paper/blob/main/case_studies/mechanistic_models.ipynb) provides examples.
