A scalable synergy-first backbone decomposition of higher-order structures in complex systems

Introduction

To what degree can a “whole” complex system be said to be “greater than the sum of its parts?” This is one of the core questions in modern complexity theory, as the emergence of higher-order phenomena from the interactions between lower-order elements is a defining feature of complex systems throughout the natural and artificial worlds1. Some of the most promising tools for interrogating these emergent phenomena come from information theory, which provides a mathematically rigorous formal language with which to explore part-whole relationships in multivariate systems2. In the context of information theory, information that is present in the joint state of the whole, but none of the parts is known as “synergy”, and has been proposed as a formal statistic of emergent behaviours in complex systems3,4,5. Synergy has also been found to be ubiquitous in natural and artificial systems, having been observed in networks of cortical neurons6,7, whole-brain fMRI dynamics8,9,10,11, climatological systems12, interactions between social identities and life-outcomes (often called “intersectionality”)13, and heart-rate dynamics14. Furthermore, synergy has been proposed to inform about clinically-meaningful processes such as ageing15, traumatic brain injury16,17, and the actions of anaesthetic17 and psychedelic drugs18. This list is non-exhaustive.

Despite the clear significance of synergistic information in a variety of fields and contexts, the mathematical tools for exploring synergistic dependencies in empirical data remain limited. There are currently two broad families of methods: those based on the partial information decomposition (PID)19,20, and those based on the O-information21 and other multivariate extensions of mutual information22. While both families of methods are powerful, and have proven to be useful in many contexts, they suffer from limitations that restrict their applicability in many cases. The PID-like approaches (such as the partial entropy decomposition23,24, the integrated information decomposition25, and the generalized information decomposition26) provide a “complete” picture of the entire structure of multivariate information, in the form of a partially ordered lattice of “atomic” interactions between elements. Unfortunately, the size of the lattice (and consequently, the number of operations required to compute the synergy) balloons super-exponentially: the number of information atoms in a system with k elements is the kth Dedekind number minus two. It is simply not feasible to apply any PID-like analysis to a system with more than five elements. As the system grows, the atoms themselves also become largely uninterpretable: it’s hard to imagine a study where we care deeply about that information that could be learned by observing (X1 and X2) or (X1 and X3 and X7) or X4 or (X5 and X6 and X7), let alone how it differs from the information that could be learned by observing (X3 and X2) or (X1 and X3 and X7) or X4 or (X1 and X6 and X7). Finally, and perhaps most problematically for the study of synergy, almost all PIDs take a redundancy-first approach, and synergy is only implicitly defined as “that information left over when all the redundancies have been partialed out” (although a notable exception is the synergistic disclosure framework, discussed below27).

The O-information family of measures is more heuristic. Rather than provide a complete picture of the structure of multivariate information, these measures provide insight into whether a system is dominated by redundancy or synergy. The O-information and related measures scale gracefully with system size, but do not provide much of a map of how synergies of different orders are present in the system. Furthermore, the O-information generally takes a very conservative definition of synergy and a very liberal definition of redundancy: for a k-element system, the synergy is only that information that is in all k elements, and any information in any set of elements of size less than k gets classified as “redundancy”, even if it is information only accessible in the joint state of many elements26. A more liberal definition of synergy that allows researchers to consider different orders of synergy (beyond just the topmost) would be useful in many contexts.

In this work, we introduce a heuristic approach to analysing synergistic information in complex systems. This approach is synergy-first (beginning with an intuitive definition of what it means for information to be synergistic), scales much more manageably than the PID, but also provides insights into the presence (or absence) of synergy at every scale. This approach is localizable28, and can be applied to a variety of information-theoretic quantities, including (but not limited to) the entropy (in which case the decomposition is strictly non-negative), the single-target mutual information, the Kullback-Leibler divergence, and the total correlation. Finally, we show that the logic of this decomposition can be applied outside the specific world of information theory to interrogate the structure of complex systems using other methods entirely, by decomposing the higher-order contribution of edges to the communicability of a graph.

What is synergistic information?

Above, we defined “synergy” intuitively as “that information that is in the whole but none of the parts.” How can we make this definition formal? Consider an information source X comprised of k channels (for conceptual simplicity, we will assume everything is discrete): X = {X1, …, Xk}. At every moment, we observe some specific realization X = x, drawn from a joint probability distribution $\mathbb{P}(\mathbf{X})$ on the support set $\mathcal{X}$. The total Shannon information content of x is given by the local entropy: $h(\mathbf{x}) = -\log \mathbb{P}(\mathbf{x})$. While there has been considerable prior work on redundancy-based approaches for decomposing h(x) (for three options see refs. 23,24,10), here we will take a synergy-first approach and say:

The synergistic information in x = {x1, …, xk} is that information which is only disclosed if X1 = x1 … Xk = xk are observed simultaneously, and cannot be disclosed by any subset of x of size k − 1 or less.

What are the implications of this definition? Suppose that our goal was to infer the state of X. If all the channels were functioning correctly, this would be trivial: we would just have to look and see that X = x. However, if one of the channels were to fail, the problem of inferring x becomes necessarily harder. Assuming we have access to the joint distribution $\mathbb{P}(\mathbf{X})$, it may be possible to use correlations between the various Xi (or higher-order correlations between various subsets of X) to recover x, fully or partially. In the context of the PID, this might be called the “redundancy”. We are interested in assessing the synergy directly though, and in our context the synergy is that information about x that cannot be recovered without knowing the state of every channel simultaneously, and which is destroyed when access to any one channel is lost.

Consider: if the channel between the source and the observer were to break and stop returning a value for X1, then presumably the synergistic information that could have been learned from {x1, …, xk} would be destroyed, since only information that could be learned from {x2, …, xk} remains accessible. In fact, this synergistic information would be destroyed by the failure of any single Xi, as that would be enough to disrupt the requirement that X1 = x1 … Xk = xk.

Inspired by Makkeh et al.'s work linking redundant and synergistic information to logical connectives29, we can then rephrase the definition of synergy as:

The synergistic information in x = {x1, …, xk} is that information that would be destroyed if X1 alone failed or X2 alone failed or … or Xk alone failed.

It doesn’t matter which Xi fails: the failure of any one alone is enough to destroy the synergy. This leads us to the straightforward formal definition:

$$h^{syn}(\mathbf{x}) = \min_{i}\left[h(\mathbf{x}) - h(\mathbf{x}^{-i})\right]$$
(1)
$$= \min_{i}\left[h(x_{i} \mid \mathbf{x}^{-i})\right]$$
(2)

Where $\mathbf{x}^{-i}$ refers to the joint state of every xj in x excluding xi. This is the amount of information that you would lose if any element failed, regardless of which one it is. The value of hsyn(x) is that portion of the information in x that depends on the joint state of all elements and is fragile to the failure of any one. It is implicitly defined as a local value, but the expected synergy can be computed in the usual way:

$$H^{syn}(\mathbf{X}) = \mathbb{E}_{\mathbb{P}(\mathbf{X})}\left[h^{syn}(\mathbf{x})\right]$$
(3)
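As a concrete illustration, here is a minimal Python sketch (not taken from any published implementation; the function names and the dictionary-based data layout are our own choices) that computes the local and expected 1-synergy of Eqs. (1)-(3) for a discrete joint distribution:

```python
import numpy as np
from itertools import product

def local_entropy(p):
    """Local entropy (surprisal) of a single probability, in bits."""
    return -np.log2(p)

def marginalize(joint, keep):
    """Marginal distribution over the positions listed in `keep`."""
    marg = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in keep)
        marg[key] = marg.get(key, 0.0) + p
    return marg

def local_synergy(joint, x):
    """h^syn(x) = min_i [ h(x) - h(x^{-i}) ], Eq. (1)."""
    k = len(x)
    h_full = local_entropy(joint[x])
    losses = []
    for i in range(k):
        keep = tuple(j for j in range(k) if j != i)
        marg = marginalize(joint, keep)
        losses.append(h_full - local_entropy(marg[tuple(x[j] for j in keep)]))
    return min(losses)

def expected_synergy(joint):
    """H^syn(X) = E_P[ h^syn(x) ], Eq. (3)."""
    return sum(p * local_synergy(joint, x) for x, p in joint.items() if p > 0)

# Example: logical XOR with uniform inputs (Table 1).
xor = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}
print(expected_synergy(xor))  # 0.0 bit: the XOR entropy is robust to any single failure
```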

The α-synergy backbone

The information that would be destroyed by any single failure is clearly the most fragile information. All the remaining information must be robust to at least one failure. We can define the dual measure of robustness as:

$$h^{rbst}(\mathbf{x}) = h(\mathbf{x}) - h^{syn}(\mathbf{x})$$
(4)

This notion of robustness has some parallels to the notion of redundancy in the partial information decomposition, although we will not dwell on it here. One might naturally ask: what if more than one element fails? What information is robust to the failure of any two elements? What information is synergistic, not in the joint-state of all k elements, but any subset of k − 1 elements?

This is the basic logic of the synergy “backbone” decomposition. The first thing we will do is rename the measure hsyn(x) to $h_{1}^{syn}(\mathbf{x})$ to indicate that it is the information that is destroyed by a single failure. One could easily extend this logic to the 2-synergy:

$$h_{2}^{syn}(\mathbf{x}) = \min_{ij}\left[h(\mathbf{x}) - h(\mathbf{x}^{-ij})\right]$$
(5)
$$= \min_{ij}\left[h(\mathbf{x}^{ij} \mid \mathbf{x}^{-ij})\right]$$
(6)

This is the information in x that would be lost if two channels, Xi and Xj, failed, regardless of which two failed specifically. Returning to the link with logical connectives, it is the information that would be lost if X1 and X2 failed or X1 and X3 failed or … and so on. In general, we can define the α-synergy as the information destroyed when any set of α elements fails:

$$h_{\alpha}^{syn}(\mathbf{x}) = \min_{\mathbf{a}\subset\{k\},\,|\mathbf{a}|=\alpha}\left[h(\mathbf{x}) - h(\mathbf{x}^{-\mathbf{a}})\right]$$
(7)
$$= \min_{\mathbf{a}\subset\{k\},\,|\mathbf{a}|=\alpha}\left[h(\mathbf{x}^{\mathbf{a}} \mid \mathbf{x}^{-\mathbf{a}})\right]$$
(8)

Where {k} is shorthand for the set {1, …, k}. For a visualization of the logic of the decomposition, see Fig. 1, which details the serial bipartitioning and visualizes the α-synergy and ∂α-synergy backbones for a randomly generated, seven-node boolean network.
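In code, Eq. (8) amounts to enumerating every failure set of size α and taking the minimum conditional entropy of the failed channels given the survivors. A minimal sketch, building on the helper functions from the previous listing (the function names are again our own):

```python
from itertools import combinations

def alpha_losses(joint, x, alpha):
    """All values of h(x) - h(x^{-a}) over failure sets a of size alpha."""
    k = len(x)
    h_full = local_entropy(joint[x])
    losses = []
    for a in combinations(range(k), alpha):
        keep = tuple(j for j in range(k) if j not in a)
        if keep:  # alpha < k: condition on the surviving channels
            marg = marginalize(joint, keep)
            losses.append(h_full - local_entropy(marg[tuple(x[j] for j in keep)]))
        else:     # alpha == k: every channel fails, so all of h(x) is lost
            losses.append(h_full)
    return losses

def local_alpha_synergy(joint, x, alpha):
    """h_alpha^syn(x), Eq. (8): the loss guaranteed under any alpha failures."""
    return min(alpha_losses(joint, x, alpha))
```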

Fig. 1: The bipartition approach to local α-synergistic entropy. Top row: for a seven-element system X in a particular state (x = (1,0,0,1,0,1,1)), the α-synergistic entropy decomposition requires partitioning x into all possible subsets of size α, and finding the minimum value of $h(\mathbf{x}^{\mathbf{a}} \mid \mathbf{x}^{-\mathbf{a}})$. Bottom row: by sweeping all scales 1…k, it is possible to construct a hierarchical picture of how fragile synergy is distributed over scales. Here, the system is a randomly generated seven-node boolean network with joint-state probabilities generated at random.


It is clear that the α-synergistic entropy function is non-negative (due to the non-negativity of local conditional entropy), and it is monotonically increasing in α (for proof, see Supplementary Note 1, Proof 1). Non-negativity and monotonicity together imply that the α-synergy function can decompose h(x) into a set of non-negative partial synergies that account for the information intrinsic to each scale. Recall that the 1-synergy is the information in x that would be lost if any one element failed, regardless of which specific element it actually is. Similarly, the 2-synergy is the information in x that would be lost if any two elements failed. However, the 2-synergy must, definitionally, contain the 1-synergy, since any two elements failing simultaneously must destroy at least as much fragile information as any single failure. In the same vein as the classic partial information decomposition, we can propose a bootstrapping approach to partial out the synergistic information intrinsic to each scale by defining the α-partial synergy function:

$$\partial h_{\alpha}(\mathbf{x}) = h_{\alpha}^{syn}(\mathbf{x}) - \sum_{\beta < \alpha}\partial h_{\beta}(\mathbf{x})$$
(9)

We relax the use of the syn superscript for the partial entropy function, as it is assumed that any partial term is being computed with the α-synergy function. This completes the decomposition of the local entropy:

$$h(\mathbf{x}) = \sum_{i=1}^{k}\partial h_{i}(\mathbf{x})$$
(10)

As with the α-synergy function, an expected value over all realizations can be computed in the usual way:

$$\partial H_{\alpha}(\mathbf{X}) = \mathbb{E}_{\mathbb{P}(\mathbf{X})}\left[\partial h_{\alpha}(\mathbf{x})\right]$$
(11)

This spectrum of α-partial synergies forms the “backbone” decomposition of the entropy of X. In contrast to the redundancy-based partial entropy decomposition, which will produce a large lattice of different “atomic” entropies, this backbone decomposition only produces k values, arranged in order of increasingly robust synergies from 1 to k.
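A sketch of the backbone itself, reusing the functions above: compute the α-synergy at every scale, recursively subtract per Eq. (9), and average over realizations for Eq. (11).

```python
def local_backbone(joint, x):
    """[∂h_1(x), ..., ∂h_k(x)], built from Eq. (9)."""
    k = len(x)
    partials = []
    for alpha in range(1, k + 1):
        partials.append(local_alpha_synergy(joint, x, alpha) - sum(partials))
    return partials

def expected_backbone(joint):
    """[∂H_1(X), ..., ∂H_k(X)], Eq. (11)."""
    k = len(next(iter(joint)))
    backbone = np.zeros(k)
    for x, p in joint.items():
        if p > 0:
            backbone += p * np.array(local_backbone(joint, x))
    return backbone

print(expected_backbone(xor))  # [0. 1. 1.]: no 1-synergy, consistent with the discussion of Table 2 below
```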

Examples

Here we will explore the α-synergistic entropy decomposition for four different example systems, and discuss how the decomposition represents the structure of the systems. The four distributions are:

Logical XOR

The logical XOR system (Table 1) is generally considered to be the ur-example of statistical synergy21, as all the pairwise mutual informations I(Xi; Y) = 0 bit, while the joint mutual information I(X1, X2; Y) = 1 bit. When we apply the α-synergy decomposition to the joint entropy H(X1, X2, Y), we get the backbone seen in Table 2.

Table 1 The truth table for the logical XOR system
Table 2 The α-synergistic entropy decomposition for the logical XOR system

Initially, this may appear “wrong”: the XOR system is “pure synergy”, but the value of the ∂H1 atom (the most synergistic part of the entropy) is 0 bit and the rest of the entropy is in the “robust” atoms. Why? The answer is that the XOR gate is resilient to single failures: it is possible to compute the joint state of the whole given the states of any two variables. For example, if X1 = 0 and X2 = 0, it doesn’t matter if Y fails; there is only one joint state consistent with the given information: (0,0,0).

This highlights what will become a recurring theme in this paper: that different “kinds” of synergy can have different profiles over a system. In this case, we are talking about synergistic entropy which, as we will see below, is a qualitatively different thing than synergistic information.

Giant bit

In contrast to the XOR gate, which is purely synergistic, the giant bit distribution is generally considered to be the ur-example of redundancy (see Table 3): knowing the state of any single variable immediately resolves all uncertainty about the state of the whole system. We can see this reflected in the α-synergistic entropy decomposition recorded in Table 4.

Table 3 The truth table for the giant bit system
Table 4 The α-synergistic entropy decomposition of the giant bit system

Unlike the logical XOR system, the α-synergistic entropy decomposition of the giant bit system is straightforward and behaves exactly as expected: it is robust to single- and double-element failures, as the global state can be reconstructed from any single surviving node.

W distribution

The W-distribution (Table 5) resembles a set of stop-lights or a one-hot encoding: every channel is mutually exclusive, with only one allowed to be on at a given time. The α-synergistic entropy decomposition backbone can be seen in Table 6.

Table 5 The truth table for the W-system
Table 6 The α-synergistic entropy decomposition of the W distribution

Like the logical XOR system results, this backbone might seem a little bit counter-intuitive: all the information is robust to single- and double-element failures. However, this can be understood by recognizing that a given global configuration can be completely reconstructed from a single channel if that channel is in the 1 state. If X1 = 0, X2 = 0, and Y = 1, then the failure of X1 and X2 is of no consequence, since Y can only equal 1 when the other channels are 0. Since this definition of synergy considers the minimum information lost upon single- or double-element failures, in the W-distribution, it is the elements in the 0 state that will be counted as contributing to the highest-order synergies.

Maximum entropy

The maximum entropy distribution (Table 7) can be thought of as the “random” or “unstructured” distribution: there are no statistical dependencies between any of the three elements, all of which are independent of each other and are themselves maximum entropy. When we compute the α-synergistic entropy decomposition (Table 8), we find:

Table 7 The truth table for the maximum entropy system
Table 8 The α-synergistic entropy decomposition of the maximum entropy distribution

These results indicate that the maximum entropy distribution, despite having no “structure” in the form of informative dependencies between variables, nevertheless has structured entropy: every scale has its own allotment of synergistic entropy.

Alternative formulations

The α-synergy decomposition, like the PID, is a flexible framework and can be modified in a variety of ways to suit different problems. Here we will briefly discuss how it might be applied to continuous random variables, as well as alternative functions that could be used instead of the minimum function.

Discrete vs. differential entropy

In this paper, we assume that all random variables are discrete, with finite-sized support sets. This is typical in information theory, which is most naturally at home among discrete random variables. However, the natural world is not discrete (at least, not at the levels that most scientists are interested in), and so the question of how to apply a given information-theoretic analysis to continuous random variables inevitably comes up. The generalization of the discrete Shannon entropy to the continuous, differential entropy is reasonably straightforward, and closed-form estimators exist for a variety of parametric distributions30. The most commonly seen differential entropy estimators are for univariate and multivariate Gaussian distributions, where the local entropy $h^{\mathcal{N}}(\mathbf{x})$ can be computed as:

$$h^{\mathcal{N}}(\mathbf{x}) = -\ln\left[\frac{\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)}{\sqrt{(2\pi)^{k}|\boldsymbol{\Sigma}|}}\right]$$
(12)

Where μ is the vector of marginal means, Σ is the covariance matrix of X, k is the dimensionality of x, Σ−1 is the matrix inverse of Σ, and |Σ| is its determinant. The local differential entropy can be used to build up all the usual information-theoretic measures, both local and expected (for a review, see ref. 31), although it is important to note that the differential entropy differs from the discrete entropy in several key ways that could complicate the analyses presented here.

The most significant is that the local differential entropy can be negative, since the local Gaussian probability density can be greater than unity (depending on the specific values of μ and Σ). Since much of the interpretation of the partial entropy relies on the non-negativity of joint and conditional entropies, this presents a conceptual, if not a practical, problem for attempts to generalize discrete information-theoretic tools to continuous random variables. Despite this, Gaussian estimators of differential entropy are common in complex systems science, particularly in neuroimaging, where continuous-valued time series are ubiquitous (refs. 8,9,32), and in some cases, exact, closed-form Gaussian estimators of previously-discrete measures can be derived analytically33,34. Future work understanding the behaviour of the α-synergy decomposition for differential entropies will help deepen our understanding of higher-order information sharing in many types of complex systems.
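For reference, here is a minimal sketch of Eq. (12) under a jointly Gaussian assumption (in nats); it makes the point above concrete, since the returned value can be negative whenever the local density exceeds one:

```python
import numpy as np

def local_gaussian_entropy(x, mu, cov):
    """Local differential entropy -ln N(x; mu, Sigma), Eq. (12)."""
    x, mu, cov = np.asarray(x, float), np.asarray(mu, float), np.asarray(cov, float)
    k = len(x)
    diff = x - mu
    mahalanobis = diff @ np.linalg.solve(cov, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    _, logdet = np.linalg.slogdet(cov)               # ln |Sigma|
    return 0.5 * (mahalanobis + k * np.log(2.0 * np.pi) + logdet)
```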

Alternative functions

So far, we have defined synergy as the minimum information loss over all possible sets of α failures. This matches the intuition that the α-synergy is a kind of intersection: the information whose loss is common to all failure sets of size α. However, other formulations may also be useful. For example, instead of computing $\min_{\mathbf{a}}\left[h(\mathbf{x})-h(\mathbf{x}^{-\mathbf{a}})\right]$, one might compute the average value of $h(\mathbf{x})-h(\mathbf{x}^{-\mathbf{a}})$ over all failure sets $\mathbf{a}$. In this case, the interpretation is slightly different: rather than asking “what information is guaranteed to be lost if any set of α elements failed”, it asks “how much information would the observer expect to lose if any set of α elements failed”. These are subtly different definitions of synergy (perhaps analogous to the problem of multiple redundancy functions in the PID), although the behaviour of the resulting α-synergy functions is largely the same (the average case is also non-negative and monotonically increasing, see Supplementary Note 3, Proof 3). This alternative formulation more closely resembles the Tononi-Sporns-Edelman (TSE) complexity32, discussed in detail below.

Alternately, one could adopt a worst-case-scenario perspective and replace the min function with a max function. Then the synergy becomes a measure of the most fragile information contained in any combination of elements. Like the average approach, the use of the max function also preserves the required behaviour of the α-synergy function, albeit with a different interpretation (see Supplementary Note 2, Proof 2). Different specific cases and contexts may call for different formulations.
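In code, these alternative formulations amount to nothing more than swapping the aggregator applied to the same set of losses; a sketch building on the earlier listing (the function names are our own):

```python
def local_alpha_synergy_mean(joint, x, alpha):
    """Expected information lost over all failure sets of size alpha."""
    losses = alpha_losses(joint, x, alpha)
    return sum(losses) / len(losses)

def local_alpha_synergy_max(joint, x, alpha):
    """Worst case: the most fragile information lost under any alpha failures."""
    return max(alpha_losses(joint, x, alpha))
```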

Computing the α-synergy decomposition for large systems

While the α-synergy decomposition scales more manageably than the PID, it can still get unwieldy for large systems. While the runtime complexity of the PID grows with the Dedekind numbers, the complexity of the α-synergy decomposition grows with the Bell numbers. As we can see in Eq. (8), the α-synergistic entropy implicitly requires computing the conditional entropy of every subset of x of size α given its complement, over the range of integers 1…k. This works out to every possible subset of x. For a system with eight elements, that is ≈ 4140 bipartitions to test (although compare that to the number of partial information atoms in an eight-element system: 56,130,437,228,687,557,907,788).

The problem of finding the bipartition of size α that minimizes the local conditional entropy is closely related to the problem of finding the minimum information bipartition in the context of integrated information theory35, and has been explored extensively over the last two decades. In cases where a system is too large for the minimum local conditional entropy to be computed directly, a number of heuristic options are available. One is simple random sampling of bipartitions, recording the minimally entropic bipartition found so far. Alternately, one might consider an optimization approach, such as simulated annealing, or a more bespoke approach like Queyranne’s algorithm36, to find the winning bipartition. If one uses the alternative formulation based on expected values over bipartitions discussed above, the approach is somewhat easier, as it requires merely sampling a large number of possible $\mathbf{x}^{\mathbf{a}}$, as was done with the TSE complexity in refs. 8,37, rather than optimizing a particular minimum value.
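A minimal sketch of the random-sampling heuristic, reusing the earlier helper functions; the returned value is an upper bound on the true α-synergy that tightens as more failure sets are sampled:

```python
import random

def sampled_alpha_synergy(joint, x, alpha, n_samples=1000, seed=None):
    """Approximate h_alpha^syn(x) by sampling failure sets instead of enumerating them."""
    rng = random.Random(seed)
    k = len(x)
    h_full = local_entropy(joint[x])
    best = h_full  # worst case: everything is lost
    for _ in range(n_samples):
        a = set(rng.sample(range(k), alpha))
        keep = tuple(j for j in range(k) if j not in a)
        if keep:
            marg = marginalize(joint, keep)
            best = min(best, h_full - local_entropy(marg[tuple(x[j] for j in keep)]))
    return best
```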

If taking an optimization or random-sampling approach to find the α-synergy for a large system, care must be taken that the monotonicity criterion is not violated. If the algorithm fails to find the global minimum for each value of α, it is possible (although unlikely) that the estimated synergy at scale β is less than the estimated synergy at scale α, even though α < β. This will result in negative partial synergistic entropy atoms, compromising the interpretation of the decomposition. Maintaining the non-negativity of partial entropy atoms becomes important when generalizing the synergistic entropy decomposition to other measures, such as the Kullback-Leibler divergence.

Extending the α-synergy decomposition

The Shannon entropy forms the foundation of a large number of more complex information-theoretic measures. By decomposing the component entropies of these measures, we can generalize the α-synergistic entropy decomposition, formalizing notions of synergy in a variety of other measures. This section will largely recapitulate the logic of ref. 26: first showing how the local entropy decomposition induces a more general decomposition of the Kullback-Leibler divergence, from which it is then possible to construct α-synergy decompositions of the negentropy, total correlation, and single-target mutual information. By the end, we will have seen that the α-synergy decomposition can be deployed in almost every situation that the “full” partial information decomposition is deployed in.

The Kullback-Leibler divergence

Recently, Varley showed that a partial entropy decomposition can be used to construct a decomposition of the Kullback-Leibler divergence26. The divergence quantifies the information gained when one updates from a prior set of beliefs (typically indicated by $\mathbb{Q}(\mathbf{X})$) to a new set of posterior beliefs (typically indicated by $\mathbb{P}(\mathbf{X})$). The typical formal presentation is:

$$D^{\mathbb{P}||\mathbb{Q}}(\mathbf{X}) = \sum_{\mathbf{x}\in\mathcal{X}}\mathbb{P}(\mathbf{x})\log\frac{\mathbb{P}(\mathbf{x})}{\mathbb{Q}(\mathbf{x})}$$
(13)
$$= \mathbb{E}_{\mathbb{P}(\mathbf{X})}\left[\log\frac{\mathbb{P}(\mathbf{x})}{\mathbb{Q}(\mathbf{x})}\right]$$
(14)

Which is read as “the divergence of $\mathbb{P}$ from $\mathbb{Q}$.” This can be re-written in information-theoretic terms to show:

$$D^{\mathbb{P}||\mathbb{Q}}(\mathbf{X}) = \mathbb{E}_{\mathbb{P}(\mathbf{X})}\left[h^{\mathbb{Q}}(\mathbf{x}) - h^{\mathbb{P}}(\mathbf{x})\right]$$
(15)

Where $h^{\mathbb{Q}}(\mathbf{x})$ indicates that the local entropy of x is computed from the distribution $\mathbb{Q}(\mathbf{x})$. The α-partial synergy decomposition allows us to decompose $h^{\mathbb{Q}}(\mathbf{x})$ and $h^{\mathbb{P}}(\mathbf{x})$ into k pairs of non-negative partial synergy atoms. The difference between the elements of each pair is the local α-synergistic divergence:

$$\partial d_{\alpha}^{\mathbb{P}||\mathbb{Q}}(\mathbf{x}) = \partial h_{\alpha}^{\mathbb{Q}}(\mathbf{x}) - \partial h_{\alpha}^{\mathbb{P}}(\mathbf{x})$$
(16)

Unlike the partial entropy atoms, the partial divergence atoms are not non-negative; however, the negative values are easily interpretable: $\partial d_{\alpha}^{\mathbb{P}||\mathbb{Q}}(\mathbf{x}) < 0$ simply implies that there is more α-synergistic surprise in x when we believe our posterior $\mathbb{P}$ than when we believed our prior $\mathbb{Q}$. The local α-synergistic divergence can be aggregated into an expected value:

$$\partial D_{\alpha}^{\mathbb{P}||\mathbb{Q}}(\mathbf{X}) = \mathbb{E}_{\mathbb{P}(\mathbf{X})}\left[\partial d_{\alpha}^{\mathbb{P}||\mathbb{Q}}(\mathbf{x})\right]$$
(17)

Which is also not non-negative, although again, this is not particularly mysterious: it just means that, on average, there was more α-synergistic surprise in the prior than the posterior at the scale given by α. This completes the decomposition of the expected Kullback-Leibler divergence:

$$D^{\mathbb{P}||\mathbb{Q}}(\mathbf{X}) = \sum_{i=1}^{k}\partial D_{i}^{\mathbb{P}||\mathbb{Q}}(\mathbf{X})$$
(18)
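In code, the divergence backbone is simply the difference of two entropy backbones, one computed under the prior and one under the posterior. A sketch reusing the earlier listings (it assumes the prior assigns positive probability to every state in the support of the posterior):

```python
def local_divergence_backbone(P, Q, x):
    """[∂d_1(x), ..., ∂d_k(x)] for the divergence of P from Q, Eq. (16)."""
    return np.array(local_backbone(Q, x)) - np.array(local_backbone(P, x))

def expected_divergence_backbone(P, Q):
    """[∂D_1(X), ..., ∂D_k(X)], expectation under the posterior P, Eq. (17)."""
    k = len(next(iter(P)))
    out = np.zeros(k)
    for x, p in P.items():
        if p > 0:
            out += p * local_divergence_backbone(P, Q, x)
    return out  # the entries sum to the Kullback-Leibler divergence, Eq. (18)
```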

A variety of different measures can be written in terms of the Kullback-Leibler divergence, and are now fair game for α-synergy decomposition. Below we will discuss three: the negentropy and the total correlation (two measures of divergence from maximum-entropy independence) and the single-target mutual information (the original impetus for the partial information decomposition).

Special cases of the Kullback-Leibler divergence

The Kullback-Leibler divergence defines a large number of different information-theoretic metrics, which can be decomposed into various levels of α-synergy. An exhaustive discussion of all of them is beyond the scope of this manuscript, but two areas that warrant further exploration are the negentropy38 and the total correlation39.

Both are measures of “deviation from independence”, although they define “independence” in different ways. For a random variable X with probability distribution $\mathbb{P}(\mathbf{X})$, the negentropy is:

$$N(\mathbf{X}) = \sum_{\mathbf{x}\in\mathcal{X}}\mathbb{P}(\mathbf{x})\log\frac{\mathbb{P}(\mathbf{x})}{\mathbb{U}(\mathbf{x})}$$
(19)

Where $\mathbb{U}(\mathbf{X})$ is the uniform probability distribution: for all x in $\mathcal{X}$, $\mathbb{U}(\mathbf{x}) = 1/|\mathcal{X}|$. The uniform distribution is maximally entropic: it has no “structure”, and under the uniform distribution every subset of X is also maximum-entropy. The negentropy, then, quantifies something like “how different is X from a kind of ideal gas?” Alternately, Rosas et al. describe it as “the information about the system that is contained in its statistics”40.

A closely-related measure, the total correlation, is also a measure of independence, but it preserves the first-order marginal structure:

$$TC(\mathbf{X}) = \sum_{\mathbf{x}\in\mathcal{X}}\mathbb{P}(\mathbf{x})\log\frac{\mathbb{P}(\mathbf{x})}{\prod_{i=1}^{k}\mathbb{P}(x_{i})}$$
(20)

The total correlation is zero if every Xi ∈ X is independent of every other Xj ∈ X, even if the individual Xi themselves are not maximum entropy. In some sense it is a more conservative definition of “deviation from independence.”

Both the negentropy and the total correlation quantify how much information is gained when we update our beliefs from a prior that X’s statistics are “unstructured” to a posterior of the true statistics. They define what it means to be “structured” in different ways (maximum entropy versus marginally independent), but the basic intuition is the same. When we apply the α-synergy decomposition, what we get is a partition of the information gain into k atoms, each of which depends on progressively lower-order combinations of elements that remain stable to perturbation.
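As a sketch of how these special cases reduce to the divergence backbone above, the total correlation uses the product of the first-order marginals as its prior (for the negentropy one would instead use the uniform distribution over the full support); the giant bit example anticipates the discussion of Table 9 below:

```python
def product_of_marginals(P):
    """Q(x) = prod_i P(x_i), defined over the full product support."""
    k = len(next(iter(P)))
    marginals = [marginalize(P, (i,)) for i in range(k)]
    alphabets = [sorted(m.keys()) for m in marginals]
    Q = {}
    for combo in product(*alphabets):  # combo is a tuple of 1-tuples
        x = tuple(c[0] for c in combo)
        Q[x] = float(np.prod([marginals[i][combo[i]] for i in range(k)]))
    return Q

giant_bit = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
print(expected_divergence_backbone(giant_bit, product_of_marginals(giant_bit)))
# under this sketch: [1. 1. 0.] bits, summing to the giant bit's 2-bit total correlation
```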

Examples, revisited

Here, we will revisit the same distributions described above: logical XOR (Table 1), giant bit (Table 3), W (Table 5), and maximum entropy (Table 7), but this time we will be decomposing the total correlation, rather than the entropy. Table 9 shows:

Table 9 The α-synergistic total correlation decompositions for four simple systems

When decomposing the total correlation rather than the entropy, we are presented with a very different distribution of α-synergies. The XOR gate now behaves as expected, with all of the synergy in the “whole”. Similarly, the W-distribution now shows a mixture of fragile and robust synergies (analogous to a mixture of synergies and redundancies in the context of a PID). The maximum entropy distribution, which previously had α-synergy at every scale, now has no information at any scale, since the prior and posterior are the same.

The curious system is the giant bit: in the context of α-synergistic entropy, the distribution behaved “as expected”: maximally robust. In the context of the α-synergistic total correlation, however, there is an apparent shift towards synergy, which is surprising given that the giant bit is considered to be the maximally redundant system. Mathematically, it is clear where this comes from: it follows directly from the subtraction of Table 4 from Table 8, but the interpretation is non-obvious. One clue is to remember that the total correlation is a measure of uncertainty resolved when updating our beliefs. From this perspective, when we update our beliefs, we are removing higher-order uncertainties that would be vulnerable to single random failures in the prior distribution, but are not vulnerable in the posterior, highly redundant distribution. The information we are gaining in the case of the giant bit is all the higher-order uncertainty in the maximum entropy distribution that is “resolved” when learning the true, redundancy-dominated distribution.

This shows that, in the context of total correlation, “synergy” means something qualitatively different than it does in the context of entropy (and, as we will see below, single-target mutual information).

Single-target mutual information

The original impetus for the partial information decomposition was the problem of decomposing the information that a set of inputs discloses about a single target: I(X1, …, Xk; Y). Thus far, we have only considered undirected statistics: the joint entropy, the Kullback-Leibler divergence, etc. However, it would be nice if the α-synergy decomposition could also be brought to bear on the problem of directed information. This turns out to be straightforward: once again following the logic of Varley26, we can reconstruct the directed mutual information from undirected Kullback-Leibler divergences. Recall that:

$$I(\mathbf{X};Y) = \mathbb{E}_{\mathbb{P}(Y)}\left[\sum_{\mathbf{x}\in\mathcal{X}}\mathbb{P}(\mathbf{x}\mid y)\log\frac{\mathbb{P}(\mathbf{x}\mid y)}{\mathbb{P}(\mathbf{x})}\right]$$
(21)

To decompose the information that X discloses about Y then, requires doing the α-synergy decomposition on the Kullback-Leibler divergence of $\mathbb{P}(\mathbf{X}\mid y)$ from $\mathbb{P}(\mathbf{X})$ for every $y \in \mathcal{Y}$. The resulting set of α-synergy atoms gives the average amount of information that X discloses about Y that would be lost if any single Xi failed, if any pair Xi, Xj failed, and so on for all α in {1, …, k}.
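A sketch of this construction, reusing the divergence backbone above; purely as a convention of this sketch, the target Y is assumed to sit in the last position of each state tuple:

```python
def mi_backbone(joint_xy):
    """alpha-synergy backbone of I(X; Y), decomposed via Eq. (21)."""
    k = len(next(iter(joint_xy))) - 1             # number of source channels
    p_y = marginalize(joint_xy, (k,))             # marginal over the target Y
    p_x = marginalize(joint_xy, tuple(range(k)))  # marginal over the sources X
    out = np.zeros(k)
    for (y,), py in p_y.items():
        p_x_given_y = {x[:k]: p / py for x, p in joint_xy.items() if x[k] == y}
        out += py * expected_divergence_backbone(p_x_given_y, p_x)
    return out

print(mi_backbone(xor))  # [1. 0.]: the 1 bit that (X1, X2) carry about Y is entirely 1-synergistic
```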

Like the lattice-based decomposition of the Kullback-Leibler divergence, the α-synergy decomposition of the mutual information inherits the unusual property that the number of atoms will change depending on exactly how the mutual information is defined26. If we use the formulation given in Eq. (21), there will be as many atoms as there are elements of X. However, there is another, equivalent formulation of I(X; Y), also based on Kullback-Leibler divergences:

$$I(\mathbf{X};Y) = \sum_{\mathbf{x},y\,\in\,\mathcal{X}\times\mathcal{Y}}\mathbb{P}(\mathbf{x},y)\log\frac{\mathbb{P}(\mathbf{x},y)}{\mathbb{P}(\mathbf{x})\mathbb{P}(y)}$$
(22)

In this case, rather than returning an atom for every Xi in X, it will return k + 1 atoms, since in this formulation, the joint state of {x, y} is considered directly, rather than separated as in Eq. (21). Previously, Varley speculated as to the conceptual implications of this26 and interested readers are invited to visit the cited literature for further discussion.

One derivative of the partial information decomposition that does not have an obvious counterpart in the α-synergy framework is the integrated information decomposition (ΦID)41. The ΦID relaxes the single-target specification of the classic PID, allowing for the decomposition of the information shared by two sets of random variables: I(X; Y). The ΦID is not describable as a Kullback-Leibler divergence, instead being based on a product-lattice structure, and so cannot currently be reconstructed from the framework presented here. Future work on the α-synergy decomposition should focus on finding a way to produce an analogue of the ΦID, in the same way that the single-target decomposition above is analogous to the PID.

Examples, revisited again

Here we will, for the last time, revisit our four example distributions: logical XOR (Table 1), giant bit (Table 3), W (Table 5), and maximum entropy (Table 7). In this case, we will be decomposing the mutual information I(X1, X2; Y), using the decomposition induced by Eq. (21). This is the α-synergistic information decomposition’s take on the problem originally solved by Williams and Beer with the partial information decomposition19: how can the information that a set of inputs X1 and X2 disclose about a target Y be decomposed? Table 10 shows:

Table 10 The α-synergistic single-target mutual information decompositions for four simple systems

Unlike the α-synergistic entropy and α-synergistic total correlation, these results all match the basic intuitions about their respective distributions. The logical XOR gate is purely synergistic: the loss of any input Xi destroys all information about Y. The giant bit distribution is purely redundant: the loss of any single Xi has no impact on the information disclosed about Y. The W-distribution combines features of synergy and robustness, while the maximum-entropy distribution contains no information at all.

Collectively, these results highlight the fact that the question of how one chooses to operationalize “synergy”, be it synergistic entropy, synergistic total correlation, or synergistic mutual information, can make a significant difference to how the distribution under study is represented. Going forward, it may not make sense to talk about “synergy” as if it is a single, fixed feature of a given distribution: synergy must be defined in terms of a function, and the nature of that function will result in different distributions of synergies.

Considerations for researchers

The α-synergy decomposition is very flexible: it can be applied to a large variety of information-theoretic measures (entropy, Kullback-Leibler divergence, single-target mutual information, etc.), and it can be modified to account for different perspectives on the definition of synergy by replacing the min function with alternatives. This makes the framework powerful, but it also presents a prospective researcher with a number of apparently ad hoc choices to make: what measure should be decomposed? What synergy function should be used?

The “correct” parameters will vary depending on the nature of the questions being asked, but here we will provide a brief discussion of the various considerations that should be taken into account when designing an analysis.

The first is the question of whether one is interested in a decomposition of the entropy of a system, or of a measure of information like the Kullback-Leibler divergence, mutual information, or total correlation. As we have seen in the examples, these two broad categories behave very differently, and conflating the two may lead to incorrect conclusions. Assuming that one wants to decompose information, rather than entropy, the next question is whether a directed (where information is disclosed about a shared target) or undirected analysis is more appropriate. If directed, then a decomposition of the single-target mutual information is appropriate, while if undirected, then the total correlation or negentropy will be more appropriate.

Finally, there is the question of whether the synergy should be defined in terms of the minimum information lost, the maximum information lost, or the expected information lost. In general, the minimum information lost is probably the most principled approach, especially when it is possible to brute-force solve all possible bipartitions of the system. However, if one is working on a large system where it is impossible to find the globally optimal minimum, then the expected loss may be more appropriate, as it can be more easily estimated by a random-sampling approach (as was done in ref. 8).

Extensions beyond information: structural synergy

Thus far, we have focused on the information-theoretic measure of local entropy, and the various measures that can be constructed from sums and differences of it. However, the logic of the α-synergy decomposition can be applied beyond the world of information theory to a larger family of functions on sets, so long as a few conditions are met. For a function f() on a set X to induce an interpretable α-synergy decomposition, we conjecture that it must satisfy a minimal set of properties:

  1. Localizability: if f(X) is an expected value, then f() must also be defined on every local instance of X = x.
  2. Symmetry: f(x) is invariant under any permutation of the elements of x.
  3. Non-negativity: f(x) ≥ 0.
  4. Monotonicity: if y ⊆ x, then f(y) ≤ f(x).

If f() satisfies these rules, then one can define an α-synergy function on it:

$$f_{\alpha}^{syn}(\mathbf{x}) = \min_{\mathbf{a}\subset\{k\},\,|\mathbf{a}|=\alpha}\left[f(\mathbf{x}) - f(\mathbf{x}^{-\mathbf{a}})\right]$$
(23)

and recover the same non-negative backbone decomposition as described above. This opens the door to the analysis of higher-order interactions beyond the information-theoretic perspective, and how the structure of a system changes as it disintegrates.
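A generic sketch of Eq. (23), in which f is supplied as a black-box set function over tuples of surviving element indices (this interface, like the function name, is our own choice):

```python
from itertools import combinations

def f_alpha_synergy(f, k, alpha):
    """Minimum of f(whole) - f(whole minus a) over all failure sets a of size alpha."""
    whole = tuple(range(k))
    f_whole = f(whole)
    return min(
        f_whole - f(tuple(i for i in whole if i not in a))
        for a in combinations(whole, alpha)
    )
```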

To demonstrate this, we will consider a classic family of models in complexity science: the bivariate network, and the question of communication over edges from node to node. There are many different strategies for communicating over a complex network42, and generally, the ease with which a signal can be sent from an arbitrary source to an arbitrary target is considered a measure of network “integration”43. Given a sufficiently well-behaved measure (satisfying the desiderata detailed above), we can ask “how much of the overall integration depends on the particular pattern of edges in the network?” We refer to this idea as the “structural synergy.”

The measure that we use here is the communicability44, which measures how readily a signal radiating from a source node will reach all of the others, assuming unbiased, random diffusion. For an adjacency matrix M, the communicability is given by:

$$C(\mathbf{M}) = \mathbb{E}\left[\left(e^{\mathbf{M}}\right)_{ij}\right]$$
(24)

Where the expectation is taken over all edges Mij. The communicability can be readily localized: the integration between any two nodes i and j is just the value of cell (i, j) in the communicability matrix eM. So, for every edge, it is possible to define the difference between the communicability between i and j when the network is unperturbed and the communicability between i and j after some set of edges has failed. The interpretation of the α-synergy in this case is the same as in the information-theoretic case, although the specific measure is different.

The 1-synergy is the integration that depends on the existence of all edges, and would be destroyed by the failure of any single edge. Likewise, the 2-synergy is the integration that would be destroyed by the failure of any two edges, and so on.
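A brute-force sketch of this construction for a small network, assuming a symmetric weighted adjacency matrix stored as a NumPy array; for a given pair of nodes it returns the minimum communicability lost when any α edges are removed:

```python
import numpy as np
from itertools import combinations
from scipy.linalg import expm

def local_comm_alpha_synergy(M, i, j, alpha):
    """Minimum communicability between nodes i and j destroyed by any alpha edge failures."""
    edges = list(zip(*np.triu(M, k=1).nonzero()))  # undirected edge list
    base = expm(M)[i, j]                           # unperturbed communicability, Eq. (24)
    losses = []
    for failing in combinations(edges, alpha):
        damaged = M.copy()
        for (u, v) in failing:                     # delete the failing edges
            damaged[u, v] = damaged[v, u] = 0.0
        losses.append(base - expm(damaged)[i, j])
    return min(losses)
```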

In Fig. 2, an example decomposition is demonstrated for a small Erdos-Renyi graph with ten nodes and nineteen edges. It is clear that there is very little synergistic communicability in this graph: the difference between the 1-synergy and the total communicability spans four orders of magnitude. This is unsurprising given that the communicability strategy inherently involves diffusing signals over every possible path: almost all edges have to fail before there is no way to get a signal from i to j, due to the redundancy inherent in the measure. The structural synergy, while arguably negligible, is nevertheless calculable and present in the network. Future work exploring how the topology of pairwise edges influences the higher-order synergies, and the significance for dynamics on the networks, may reveal novel links between the structure of lower-order causal mechanisms and the emergence of higher-order statistical structures.

Fig. 2: Communicability example.

For a randomly generated Erdos-Renyi graph with ten nodes, and edge weights drawn from an exponential distribution with λ = 1, we compute the α-synergy spectrum of the communicability across all possible subsets of the edge set, recording the α-synergy and ∂-synergy at each scale. A shows the weighted adjacency matrix, with the associated network in panel (B). C shows the expected α-synergistic communicability against the number of failing edges, with the points coloured by the partial synergistic communicability.


The structural synergy is one of a number of recently developed approaches for using information theory to explore emergent structures in graphs and networks. Rosvall et al. used the notion of information compression as a framework for finding communities in complex networks (known as the Infomap algorithm)45,46. Later, Klein et al. formulated an approach for characterizing coarse-grained higher scales, taking an explicitly emergentist perspective47, finding that some networks can be coarse-grained in such a way that the predictive information at the macro-scale is greater than the predictive information at the micro-scale. Further work has found that biological evolution seems to optimize for this property in molecular interaction networks48. Klein et al.'s approach operationalizes the information content of a network via a random-walker model (how does the uncertainty about the location of a walker on the network at time t change upon learning the location of the walker at time t − 1). Subsequently, Varley showed that the notion of coarse-graining emergence explored by Klein could be localized to individual edges5, coining the term “flickering emergence” to describe the local variability in emergent structure over the edges of the network, or through time in a dynamic process. Since the structural synergy can also be formulated with respect to random-walker dynamics (e.g. the communicability), we propose that there may be links between the structural synergy and the effective emergence, although the precise nature of the relationship remains an area of future work. Recently, Luppi et al. proposed an approach by which the interactions of different layers of a multi-layer network could be understood in terms of redundant, unique, and synergistic interactions between shortest paths, which they term the partial network decomposition (PND)49. The notion of structural synergy in the context of the α-synergy decomposition is distinct from the notion of synergy in the PND for several reasons: the first is that the structural synergy is not specifically defined on shortest paths, but rather on any functional graph invariant that satisfies the given desiderata (although we note that the efficiency, or the reciprocal of the shortest path length, would work). Second, the PND compares two networks defined on the same set of nodes, while the structural synergy characterizes a single network qua itself.

In the context of networks, the notion of structural synergy is something of an inversion of the usual approach to analysing complex networks. Typically, analyses focus on individual nodes (first-order structures), or pairwise edge-centric perspectives50,51 (second order structures). In contrast, the structural synergy takes a top-down approach, describing the irreducible structure in the whole as a function of the joint-state of all of the parts.

Discussion

In this work, we have introduced the general notion of a decomposition of synergistic “structure” in complex systems, one that encompasses both information-theoretic and topological definitions of “structure”, and which can be applied to a variety of further contexts.

This approach was developed initially to address the limitations in commonly used information-theoretic approaches for understanding statistical “synergy”. The two most commonly used approaches are the partial information decomposition19 (and subsequent derivatives23,24,26,41), and the O-information21 (and its derivatives52,53). The partial information decomposition-based approach provides a “complete” map of the structure of multivariate information in the form of the antichain lattice; however, the super-exponential growth of the lattice makes it impractical for all but the smallest toy systems. Furthermore, it is a “redundancy-first” approach: synergistic information is implicitly defined in terms of redundancy, and different definitions of redundancy imply qualitatively and quantitatively different notions of synergy. In contrast, the O-information scales much more gracefully than the information decompositions; however, it only reports whether redundancy or synergy dominates a system, and has a very conservative notion of “synergy”, analogous to the 1-synergy in our approach.

In summary, the PID is “complete”, but does not scale, and hinges on a measure of redundancy, while the O-information scales nicely and does not require defining redundancy, but does not provide a map of the structure of the system in question. The α-synergy decomposition was designed to balance these trade-offs. It is a “synergy-first” approach, hinging on a definition of synergy based on channel failures (somewhat like the dual total correlation54) and so scales more elegantly than the antichain lattice. Unlike the O-information, however, we get a measure of partial synergy for every scale, rather than just the top. Nevertheless, the α-synergy has its limitations as well. The most significant is that it does not reveal how synergy is distributed over the various elements and sets of elements, instead homogenizing the system under the summary statistics of the minimum/maximum/average functions: all of the structure is squished down onto the one-dimensional backbone. Furthermore, while it scales more elegantly than the PID, the complete α-synergy decomposition still eventually becomes intractable, requiring optimizations that may break some of the mathematical guarantees that make it work. Whether a given scientist reaches for the PID/PED/GID, the O-information and related measures, or the α-synergy decomposition will depend on the specific nature of the questions being asked, and care should be taken to ensure that the right approach is being used for a given analysis.

The measure that the α-synergy decomposition has the most in common with is the Tononi-Sporns-Edelman complexity32, which takes a similar approach to inferring the structure of a system by sweeping all possible partitions of its elements. The TSE complexity is based on the total correlation, and explicitly compares the expected deviation from independence at each scale to a built-in null model. In contrast, the α-synergy decomposition can be applied to any function that can be constructed from sums or differences of entropies (including the total correlation) and does not have the built-in null. Furthermore, in the TSE complexity, each subset is considered individually, while in our approach, it is the interaction between a subset and its complement that is significant (the term $h(\mathbf{x}^{\mathbf{a}}\mid\mathbf{x}^{-\mathbf{a}})$). The TSE complexity is just one of a number of information-theoretic approaches to multi-scale structure inference in complex systems. Another measure closely related to both the TSE complexity and the α-synergy is the connected information55, which presents an elegant approach to higher-order interactions by comparing how the information structure of a system changes when the correlations of different orders are disrupted. Like the α-synergy decomposition, the connected information presents a one-dimensional backbone describing the scale-specific information, and can also be generalized to account for directed mutual information. A crucial difference, however, is that the connected information requires serially computing maximum entropy distributions subject to marginal constraints of varying degrees. For even modestly sized systems, this is generally an intractable problem, which puts the connected information in the same category as the PID: a mathematically elegant and conceptually powerful framework that struggles with application to modestly sized datasets. Other, related constructs include the entropy and complexity profiles56 and the marginal utility of information57, both of which also take a multivariate information-theoretic approach to the problem of structure at multiple scales in complex systems.

The α-synergy also bears some similarity to the “synergistic disclosure” decomposition from Rosas et al.27. The synergistic disclosure framework is a synergy-first approach (unlike the usual PID); however, it still relies on the antichain lattice structure, and so suffers from the explosive growth in the number of terms. To address this, Rosas et al. provide a compressed, one-dimensional representation of the lattice (the “backbone”, from which our approach takes its name), with a very similar structure to the backbone presented here. The synergistic disclosure approach relies on a very different notion of synergy, however, one based on the logic of data privacy58, and it is specific to the expected mutual information (i.e. it cannot be readily localized or extended to other measures like the Kullback-Leibler divergence or the total correlation). Finally, the α-synergy decomposition shows some resemblances to the gradients of O-information recently introduced by Scagliarini et al.53, which also considers how the information structure of a system changes in response to sets of failures, although like the TSE complexity, it depends on a particular measure (in this case, the O-information) rather than being a general approach. The gradients of O-information are also focused on finding how specific elements, or low-order combinations of elements, contribute to higher-order information circuits, while the α-synergy decomposition ignores node-specific information entirely to generate summary statistics. Finding a way to combine these approaches (gradients of α-synergy) might help get the best of both worlds: a tractable picture of multi-scale synergy that also shows how it is represented over the specific elements.

One possible application of structural synergy is the study of cascading failures in complex systems and the inherent tradeoffs between robustness and efficiency. Following the COVID-19 pandemic, many global supply chains that relied on just-in-time logistics failed when disruptions to manufacturing and distribution compromised the precise timing necessary for highly-efficient organization to function59. This could be seen as a case of structural synergy in action: just-in-time supply chains require every part of the system to be in the correct “state” at the correct time. Any deviation from that state (analogous to a failure in our framework) compromises the required synergy and pushes the system towards a new configuration. Extending the structural synergy and α-synergy decompositions to dynamical processes may help provide new insights into the design of efficient and robust systems.

In this paper, we have introduced the α-synergy decomposition, an approach to measuring the synergistic information in a multivariate system. The α-synergy decomposition was designed to balance the tradeoffs associated with existing measures: it scales more gracefully than the partial information decomposition19 at the expense of losing element-specific information. Conversely, it is less scalable than the O-information21, but it provides a “spectrum” of synergies for each scale, rather than just the redundancy-synergy balance. From the decomposition of the entropy, it is possible to reconstruct almost all other formulations of the partial entropy decomposition, including the single-target mutual information and the generalized Kullback-Leibler decomposition. Excitingly, the same logic can be applied outside of the domain of information theory: we introduce the notion of “structural synergy” and describe the contexts in which it can be applied. The structural synergy can be used to assess how the particular pattern of elements or interactions in a complex system contributes to some property of interest such as integration or productive capacity.
