Dynamical reversibility and a new theory of causal emergence based on SVD

Introduction
We live in a world surrounded by a multitude of complex systems. These systems engage in time-irreversible stochastic dynamics, leading to entropy production and disorder accumulation1. Despite this, there exists a belief that beneath the disorder in complex systems lie profound patterns and regularities2,3. Consequently, researchers endeavor to derive causal laws from these dynamic systems at a macroscopic scale, while disregarding detailed micro-level information as appropriate4,5,6. Ultimately, the goal is to develop an effective theory or model capable of elucidating the causality of complex systems at a macroscopic level.
This idea is captured by the theoretical framework known as causal emergence (CE) proposed by Hoel et al.5,6,7. This framework builds upon an information-theoretic measure called Effective Information (EI)8, which quantifies the causal influence between successive states within a Markov dynamical system. Through illustrative examples, they demonstrate that coarse-grained Markov dynamics, when measured by EI at the macro level, can exhibit stronger causal power than at the micro level. Nevertheless, one of the foremost challenges is that the manifestation of CE depends on the specific manner in which the system is coarse-grained: different coarse-graining methods may yield entirely disparate outcomes for CE7. Although this issue can be mitigated by maximizing EI5,6,9,10, challenges such as computational complexity, the non-uniqueness of solutions, ambiguity, and the non-commutativity of the marginalization and abstraction operations persist11. Is it possible to construct a more robust theory of CE that is independent of the coarse-graining method?
Rosas et al.12 proposed a new framework for CE based on integrated information decomposition theory12,13,14, which uses the synergistic information shared by all micro-variables across two consecutive time steps to quantify emergence and therefore does not require a predefined coarse-graining strategy. However, it still involves iterating through all variable combinations to derive the synergistic information, resulting in significant computational complexity. Rosas et al. also proposed an approximate method to mitigate this complexity12, but it requires a pre-defined macro-variable. In addition, Barnett and Seth15 introduced a framework for quantifying emergence based on the concept of dynamical independence: if the micro-dynamics are irrelevant to the prediction of the macro-dynamics, the complex system is considered to exhibit emergent macroscopic processes. However, this framework has only been applied to linear systems to date. Both of these methods for quantifying emergence are built on mutual information estimated from data, so their outcomes are influenced by the data distribution. Consequently, these results may not exclusively capture the “causal” or dynamical essence of the system.
Building on Hoel’s theory, this paper aims to develop a CE framework grounded in an intriguing yet distinct concept: dynamical reversibility. Although there have been many discussions of the reversibility of a Markov chain and its connections with causality16,17,18,19,20,21, the reversibility discussed here differs from the conventional concept. It refers to the invertibility of the transition probability matrix (TPM) of a Markov dynamics, which reflects the dynamics’ ability to preserve information about the process’s states (similar to reversible computing22 or unitary processes in quantum mechanics23). In contrast, the conventional notion of reversibility concerns recovering the same state distributions from a stationary distribution by reversing the process. Notably, exact dynamical reversibility implies the conventional reversibility of a Markov chain, as we demonstrate in this paper. There appears to be a contradiction between the reversibility of Markov dynamics and the conventional view that macroscopic processes are always irreversible. However, we do not assert that macro-dynamics are reversible; instead, we introduce an indicator called approximate dynamical reversibility to quantify their degree of reversibility.
The concept of causality explored here focuses on the measure of causation in Markov dynamics rather than traditional notions involving interventions and counterfactuals. As noted by Hoel et al.24, all measures of causation are combinations of two causal primitives: sufficiency and necessity. Sufficiency quantifies the probability of an effect e occurring given that the cause c occurs, while necessity measures the probability of e not occurring if c does not occur. A high measure of causation arises only when c is both a sufficient and a necessary condition for e. This actually implies a bijective (reversible) functional map between all possible causes (c or ¬c) and all effects (e or ¬e) if the causation measure is maximized.
This leads to an intriguing connection between reversibility and causality in Markov dynamics: when the TPM of a Markov chain approaches invertibility, the previous state effectively becomes, approximately, both a sufficient and a necessary condition for determining the state at the next time step. This point is further supported by examples found in refs. 5,6, where EI, as a measure of causation, is maximized when the underlying dynamics are reversible. Therefore, we can re-frame the theory of CE as an endeavor to obtain a reversible macro-level dynamics by appropriately disregarding micro-level information.
The close connection between EI, as a measure of dynamical causality, and the approximate dynamical reversibility of the same Markov chain allows us to arrive at a simple understanding of emergence: the emergence of causality is essentially equivalent to the emergence of reversibility. However, the concept of reversibility can provide even deeper insights. First, if we view a Markovian dynamics as a communication channel that transmits information about the system’s state into the future6, where each state’s probabilistic transition can be seen as an information pathway, then the more reversible the dynamics, the more efficient these information pathways become, meaning that the average amount of information transmitted through each pathway increases. Through singular value decomposition (SVD), we reveal that the essence of CE lies in the presence of redundant information pathways in the system’s dynamics. These pathways correspond to linearly dependent row vectors; they transmit either no information or very little information (corresponding to singular vectors associated with zero or near-zero singular values). Most of the information, however, is transmitted through the fewer but more reversible core pathways of the dynamics (corresponding to the singular vectors associated with larger singular values). As a result, the degree of CE in the dynamics can be quantified as the potential maximal improvement in information efficiency (which is equivalent to reversibility efficiency) under certain accuracy constraints. Moreover, the optimal coarse-graining strategy for maximizing EI should eliminate ineffective information pathways while preserving the dynamics6, aligning with the directions of the largest singular vectors.
Another related topic, the lumpability of Markov chains25, also delves into the process of coarse-graining a Markov chain26 and its relationship with reversibility27. This concept mainly focuses on the legitimacy of a grouping method for states during coarse-graining28. A lumpable grouping method should guarantee that the coarse-grained macro-dynamics is a legitimate Markov chain and that its TPM, the time-evolution operator, commutes with the coarse-graining operator. However, the criteria for determining lumpability primarily focus on the consistency of the Markov dynamics rather than on the causality assessed by EI, and the reversibility they are concerned with is not dynamical reversibility27. Therefore, the concepts explored in this paper serve as a complement to the understanding of the lumpability of Markov chains.
This paper commences by introducing an indicator designed to measure the proximity of a Markov chain to be dynamically reversible, utilizing the singular values of its TPM. Subsequently, it provides several formal definitions and mathematical theorems to establish the validity of the indicator and its close association with EI. Additionally, a novel definition for CE based on the approximate dynamical reversibility is introduced, followed by the validation of this definition and a demonstration of its equivalence to, and distinctions from, EI maximization-based causal emergence using multiple examples including Boolean networks, cellular automata, and complex networks. Finally, a more streamlined and potent coarse-graining method for general Markov chains, employing the singular value decomposition (SVD) of the TPM, is proposed.
Results
Theories
Effective information and causal emergence
First, we will briefly introduce Hoel et al.’s theory of Causal Emergence (CE), which is grounded in the information-theoretic measure known as Effective Information (EI). This measure was initially introduced in ref. 8 and has since been employed to quantify causal emergence in the work of Hoel et al.5. For a given Markov chain χ with a discrete state space $\mathcal{S}$ and transition probability matrix (TPM) P, whose element at the ith row and jth column, pij, is the conditional probability that the system transitions to state j at the current time step given that it was in state i at the previous time step, EI is defined as:

$$EI\equiv I(X_{t+1};X_{t}\,|\,do(X_{t}\sim U(\mathcal{S}))), \qquad (1)$$

where Xt, Xt+1, ∀ t ≥ 0 represent the state variables defined on $\mathcal{S}$ at time steps t and t + 1, respectively. The do-operator, denoted as do(Xt ∼ U), embodies Pearl’s intervention concept as outlined in ref. 29. This intervention enforces Xt to adhere to a uniform (maximum entropy) distribution on $\mathcal{S}$, specifically Pr(Xt = i) = 1/N, where $i\in\mathcal{S}$ and N represents the total number of states in $\mathcal{S}$. Given that $\Pr(X_{t+1}=j)=\sum_{i\in\mathcal{S}}p_{ij}\cdot \Pr(X_{t}=i)$, the do-operator indirectly influences Pr(Xt+1) as well. Consequently, the EI metric quantifies the mutual information between Xt and Xt+1 subsequent to this intervention, thereby measuring the strength of the causal influence exerted by Xt on Xt+1.
The rationale behind using the do-operator is to ensure that the EI metric purely captures the characteristics of the underlying dynamics, specifically the TPM P, while remaining unaffected by the actual distribution of Xt7. This point can be made clearer by another equivalent form of EI5:

$$EI=\frac{1}{N}\sum_{i=1}^{N}P_{i}\cdot \log\frac{P_{i}}{\bar{P}}=\frac{1}{N}\sum_{i=1}^{N}D_{KL}(P_{i}\parallel \bar{P}), \qquad (2)$$

where Pi = (pi1, pi2, ⋯, piN) is the ith row vector of P, so that P can be written as $P=(P_{1}^{T},P_{2}^{T},\cdots,P_{N}^{T})^{T}$. In Equation (2), ⋅ represents the scalar product between two vectors, $\log$ is the element-wise logarithm of a vector, DKL( ⋅ ∥ ⋅ ) is the KL divergence between two probability distributions, and $\bar{P}=\frac{1}{N}\sum_{i=1}^{N}P_{i}$ is the average of the N row vectors of the TPM. Thus, EI measures the average KL divergence between each Pi and their average $\bar{P}$. Note that this form of EI is the generalized Jensen-Shannon divergence mentioned in ref. 30. All the logarithms in Eqs. (2) and (3) are base 2.
This vector-form representation indicates that the dynamics can be regarded as an information channel, as pointed out in ref. 6, with each row vector representing an information pathway. Similarities among the row vectors then represent redundancy in this channel5. As demonstrated in Equation (2), EI quantifies the average difference between these row vectors.
Furthermore, EI can be decomposed into two terms5:

$$EI=-\bar{H}(P_{i})+H(\bar{P}), \qquad (3)$$

where the first term is the determinism, $-\bar{H}(P_{i})\equiv \frac{1}{N}\sum_{i=1}^{N}P_{i}\cdot \log P_{i}$, which measures how deterministically (sufficiently) the current state influences the state at the next time step, and the second term is the non-degeneracy, $H(\bar{P})\equiv -\sum_{j=1}^{N}\bar{P}_{\cdot j}\log \bar{P}_{\cdot j}$, which measures how exactly (necessarily) we can infer the state at the previous time step from the current state. In the original definition in ref. 5, both the determinism and the degeneracy terms have $\log N$ added to guarantee that they are positive. This decomposition reveals why EI can measure the strength of the causal effect of a Markov chain and its connection with the reversibility of the dynamics, because determinism can be understood as a kind of sufficiency and non-degeneracy as a kind of necessity7,24.
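As a minimal numerical illustration of Eqs. (2) and (3), the following numpy sketch (our own helper, not code from the original work) computes EI in bits for a row-stochastic TPM and returns its determinism and non-degeneracy terms; the function names are ours.

```python
import numpy as np

def xlog2x(x):
    """Elementwise x * log2(x) with the convention 0 * log2(0) = 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x * np.log2(np.where(x > 0, x, 1.0)), 0.0)

def ei_decomposition(P):
    """EI of a row-stochastic TPM (in bits), split into determinism + non-degeneracy (Eq. (3))."""
    P = np.asarray(P, dtype=float)
    N = P.shape[0]
    P_bar = P.mean(axis=0)                    # average row vector
    determinism = xlog2x(P).sum() / N         # -H_bar(P_i)
    non_degeneracy = -xlog2x(P_bar).sum()     # H(P_bar)
    return determinism + non_degeneracy, determinism, non_degeneracy

P_perm = np.eye(4)[[1, 2, 3, 0]]              # a permutation matrix: EI = log2(4) = 2 bits
P_unif = np.full((4, 4), 0.25)                # all rows identical and uniform: EI = 0
print(ei_decomposition(P_perm)[0], ei_decomposition(P_unif)[0])
```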
This point is made clearer by the examples shown in Fig. 1, where four TPMs are shown together with their EI values and the normalized forms $eff\equiv EI/\log_{2}N$. It is not difficult to observe that the closer the TPM is to an invertible matrix (a permutation matrix, see Proposition 1 in Supplementary A.2.1), the larger EI is.

Fig. 1: a and b are calculation results for different example TPMs; c and d show the calculation results for the micro and macro TPMs.
Causal emergence occurs when the coarse-grained TPM possesses larger EI than the original TPM, as in the example of Fig. 1d, which is the coarse-grained TPM of the example in Fig. 1c. The degree of CE can be calculated as the difference between the two EIs, as mentioned in ref. 5:

$$CE=EI(P^{\prime})-EI(P), \qquad (4)$$

where P′ is the coarse-grained TPM of P. In this example, the coarse-graining is implemented by collapsing the first three rows and columns of the TPM in Fig. 1c into one macro-state. Thereafter, EI(P′) = 1 (or eff = 1.0) in (d) is clearly larger than EI(P) = 0.81 (or eff = 0.41) in (c), which manifests that the strength of the cause-effect relationship at the macro level (the coarse-grained TPM in (d)) is larger than at the micro level (c); thus, causal emergence occurs, and the degree of CE is 1 − 0.81 = 0.19.
Note that the extent of causal emergence can vary with the choice of coarse-graining method, and in certain cases it may even be negative, as shown in refs. 5,7. Thus, to quantify CE, it is essential to search for an optimal coarse-graining strategy that maximizes the EI of the macro-dynamics. However, this optimal solution may not be unique11,31, and the best coarse-graining strategy could violate the lumpability requirement. This can result in ambiguity when merging different causal states and disrupt the commutativity between marginalization (the time-evolution operator, i.e., the TPM) and abstraction (the coarse-graining operator), which is essential for maintaining consistent dynamics before and after coarse-graining, as previously discussed11.
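The CE calculation of Equation (4) can be reproduced with a few lines of numpy. The 4-state TPM below is our reconstruction of the kind of example shown in Fig. 1c (three equivalent states plus one isolated state); it is chosen because it yields the reported values EI ≈ 0.81 and eff ≈ 0.41, but the exact matrix in the figure may differ.

```python
import numpy as np

def ei(P):
    """EI in bits: average KL divergence of the rows from the mean row (Eq. (2))."""
    P = np.asarray(P, dtype=float)
    Pb = P.mean(axis=0)
    def kl(p, q):
        m = p > 0
        return np.sum(p[m] * np.log2(p[m] / q[m]))
    return np.mean([kl(row, Pb) for row in P])

# Hypothetical micro TPM: states 1-3 jump uniformly among {1, 2, 3}; state 4 maps to itself.
P_micro = np.array([[1/3, 1/3, 1/3, 0],
                    [1/3, 1/3, 1/3, 0],
                    [1/3, 1/3, 1/3, 0],
                    [0,   0,   0,   1]])
# Macro TPM after collapsing states 1-3 into one macro-state.
P_macro = np.array([[1.0, 0.0],
                    [0.0, 1.0]])

CE = ei(P_macro) - ei(P_micro)                            # Eq. (4)
print(round(ei(P_micro), 2), ei(P_macro), round(CE, 2))   # 0.81  1.0  0.19
```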
Dynamical reversibility
Second, we will introduce the concept of dynamical reversibility for a Markov chain and propose a quantitative indicator to measure how close a general Markov chain is to being dynamically reversible.
a. Definitions and Properties
Definition 1
For a given Markov chain χ and the corresponding TPM P, if P simultaneously satisfies: 1. P is an invertible matrix, that is, there exists a matrix P−1, such that P ⋅ P−1 = I; and 2. P−1 is also an effective TPM of another Markov chain χ−1, then χ and P can be called dynamically reversible.
It is important to clarify that in this context, a dynamically reversible Markov chain differs from the commonly used term “time reversible” Markov chain20,32, as dynamical reversibility necessitates the ability of P to be reversibly applied to each individual state, whereas the latter focuses on the reversibility of state space distributions. Actually, we can prove that the former implies the latter (Lemma 3 in Supplementary A.2.1).
Further, the following proposition states that the TPMs satisfying the two conditions mentioned in Definition 1 are exactly the permutation matrices.
Proposition 1
For a given Markov chain χ and the corresponding TPM P, P is dynamically reversible as defined in Definition 1 if and only if P is a permutation matrix.
Proof.
The proof is given in Supplementary A.2.
However, permutation matrices are rare within the class of all possible TPMs, which means that a general TPM is not inherently reversible. Hence, an indicator that quantifies how close a general TPM is to being reversible is required.
It may seem that the rank r of P could serve as this indicator, because P is non-invertible if and only if r < N, and P becomes more degenerate as r decreases. However, a non-degenerate (full-rank) P is not always dynamically reversible. Even if an inverse P−1 exists, it may not be a valid TPM, since this requires all elements of P−1 to lie between 0 and 1 and the normalization condition to hold (the one-norm of the ith row vector $(P^{-1})_{i}$ of P−1 should equal one: $\|(P^{-1})_{i}\|_{1}=1$). According to Proposition 1, a TPM must be a permutation matrix to be dynamically reversible; thus, matrices that are close to permutation matrices should be “more” reversible. One important observation is that all the row vectors of a permutation matrix are one-hot vectors (vectors with a single element equal to 1 and all other elements equal to zero). This characteristic can be captured by the Frobenius norm of P, ∥P∥F: indeed, ∥P∥F is maximized if and only if the row vectors of P are one-hot vectors (see Lemma 5 in Supplementary A.2.1).
Therefore, the indicator that characterizes the approximate dynamical reversibility should be a kind of mixture of the rank and the Frobenius norm. The rank of P can be written in terms of its singular values as the number of non-zero ones:

$$r=\#\{\,i\,|\,\sigma_{i}>0\,\},$$

where σi ≥ 0 is the ith singular value of P. Furthermore, according to Lemma 5 in Supplementary A.2.1, the Frobenius norm can be written as:

$$\|P\|_{F}=\sqrt{\sum_{i=1}^{N}\sigma_{i}^{2}},$$

so that its square is the sum of the squares of the singular values. Both the rank and the Frobenius norm are therefore connected through the singular values of P.
The approximate dynamical reversibility of P can then be formalized by the following definition:
Definition 2
Suppose the transition probability matrix (TPM) of a Markov chain χ is P, and its singular values are σ1 ≥ σ2 ≥ ⋯ ≥ σN ≥ 0. Then the α-ordered approximate dynamical reversibility of P is defined as:

$$\Gamma_{\alpha}\equiv \sum_{i=1}^{N}\sigma_{i}^{\alpha},$$

where α ∈ (0, 2) is a parameter.
Actually, Γα is the αth power of the Schatten α-norm of P, $\Gamma_{\alpha}=\|P\|_{\alpha}^{\alpha}$, when α ≥ 1 (for α = 1 it is the nuclear norm), while $\|P\|_{\alpha}$ is a quasi-norm when 0 < α < 133,34,35,36.
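A small numpy sketch (ours) makes Definition 2 concrete: it computes the singular values of a TPM and, from them, the rank, the squared Frobenius norm, and Γα, illustrating the identities stated above. The matrix choices and function names are illustrative.

```python
import numpy as np

def gamma_alpha(P, alpha=1.0):
    """Approximate dynamical reversibility: sum of the alpha-powers of the singular values."""
    sigma = np.linalg.svd(np.asarray(P, dtype=float), compute_uv=False)
    return np.sum(sigma ** alpha)

P_perm = np.eye(4)[[1, 2, 3, 0]]              # permutation matrix
P_soft = np.array([[0.7, 0.1, 0.1, 0.1],      # noisy but still full-rank TPM
                   [0.1, 0.7, 0.1, 0.1],
                   [0.1, 0.1, 0.7, 0.1],
                   [0.1, 0.1, 0.1, 0.7]])
sigma = np.linalg.svd(P_soft, compute_uv=False)
print(np.sum(sigma > 1e-12),                         # rank = number of non-zero singular values
      np.linalg.norm(P_soft, 'fro') ** 2,            # ||P||_F^2 ...
      np.sum(sigma ** 2))                            # ... equals the sum of squared singular values
print(gamma_alpha(P_perm), gamma_alpha(P_soft))      # maximum N = 4 vs. a smaller value
```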
This definition reasonably characterizes the approximate dynamical reversibility because exact dynamical reversibility is obtained by maximizing Γα, as stated in the following proposition:
Proposition 2
The maximum of Γα is N for any α ∈ (0, 2), and it can be achieved if and only if P is a permutation matrix.
Proof.
The proof is given in Supplementary A.2.2.
Further, Γα is lower bounded by $\|P\|_{F}^{\alpha}$ according to Lemma 11, and this lower bound increases as the dynamics P becomes more deterministic (more one-hot row vectors in P), according to Lemma 12. For fully deterministic TPMs (all row vectors are one-hot vectors), Γα increases further as the number of orthogonal row vectors grows, as claimed by Lemma 13. In general, when P is close to a permutation matrix, Γα approaches its maximum. These propositions and lemmas guarantee that Γα is a reasonable indicator of the approximate dynamical reversibility of any given P. All the mathematical proofs are given in Supplementary A.2.2.
b. Determinism and degeneracy
Note that by adjusting the parameter α ∈ (0, 2), we can make Γα more reflective of P’s determinism or degeneracy5. When α → 0, Γα converges to the rank of P, which resembles the non-degeneracy term in the definition of EI (Equation (3)), because r decreases as P becomes more and more degenerate. However, α is not allowed to take the exact value 0 in Definition 2, because rank(P) is not a continuous function of P, and maximizing rank(P) does not necessarily lead to permutation matrices.
Similarly, Γα converges to $\|P\|_{F}^{2}$ when α → 2, but α cannot be exactly 2 in Definition 2, because maximizing Γα=2 does not imply that P is reversible. ∥P∥F is comparable to the determinism term in the definition of EI (Equation (3)), because as more and more row vectors of P become one-hot vectors, the maximum transition probabilities in P become larger and larger, which means the underlying dynamics becomes more deterministic.
In practice, we always take α = 1 to balance the propensity of Γα for measuring determinism and degeneracy; Γα=1 is the nuclear norm, which has many potential applications36,37. When α < 1, the measure Γα quantifies the non-degeneracy of P more strongly; in the literature, Γα≤1 is often utilized as an approximation of the rank function38,39. On the other side, the measure Γα characterizes the determinism of P more strongly when α > 1.
Considering the importance of α = 1, we mostly show results for α = 1 and abbreviate Γ1 as Γ in the following text.
c. Normalization and Examples
Since Γα is size-dependent, we normalize it by dividing by the size of P:

$$\gamma_{\alpha}\equiv \frac{\Gamma_{\alpha}}{N},$$

which characterizes a size-independent approximate dynamical reversibility, so that comparisons between Markov chains of different sizes are more reasonable. It can be proven that γα is always smaller than or equal to 1 as a direct consequence of Proposition 2.
This quantity also evaluates the average dynamical reversibility, or the efficiency of information transmission through P per information pathway, treating the Markov dynamics as an information channel, as stated in ref. 6, with each state’s transitions constituting an information pathway.
In Fig. 1, we show Γα and the normalized γα for the four examples with α = 1. Γ varies from 2 (cases c, d) to 3.81 (case a), and γ = Γ/N varies from 0.5 (case c) to 1 (case d). It is clear that γ is larger when the TPM is closer to a reversible (permutation) matrix, and the correlation between γ and eff can be observed in these examples.
Connections between Γα and EI
On the one hand, EI characterizes the strength of the causal effect of a Markov chain; on the other hand, Γα quantitatively captures the approximate dynamical reversibility of the same Markov chain. We claimed in the introduction that causality and reversibility are deeply connected; thus, we discuss the connections between EI and Γα in this subsection.
First, we find that EI and $\log\Gamma_{\alpha}$ share the same minimum and maximum, as stated in the following proposition.
Proposition 3
For any TPM P and α ∈ (0, 1), both the logarithm of Γα and EI share the identical minimum value of 0 and one common minimum point at $P=\frac{1}{N}\mathbb{1}_{N\times N}$. They also exhibit the same maximum value of $\log N$, with the maximizing points corresponding to P being a permutation matrix. Here, the notation $\mathbb{1}_{N\times N}$ denotes a matrix in which all elements are equal to 1.
Proof.
The proof is given in Supplementary A.3.
Thus, $\log\Gamma_{\alpha}$ and EI reach their maximal value $\log N$ when P is reversible (a permutation matrix). They also achieve their minimal value (0) when $P_{i}=\mathbb{1}/N,\ \forall i\in\{1,2,\cdots,N\}$, where $\mathbb{1}\equiv(1,1,\cdots,1)$. However, we can prove that $\mathbb{1}/N$ is not the unique minimum point of EI: any TPM with $P_{i}=P_{j}$ for all $i,j\in\{1,2,\cdots,N\}$ makes EI = 0 (see Lemma 1 and Corollary 1 in Supplementary A.1).
Second, EI is upper and lower bounded by affine functions of $\log\Gamma_{\alpha}$. This can be formally stated as the following theorem.
Theorem 1
For any TPM P, its effective information EI is upper bounded by $\frac{2}{\alpha}\log\Gamma_{\alpha}$ and lower bounded by $\log\Gamma_{\alpha}-\log N$.
Proof.
The proof is given in Supplementary A.3.
Therefore, we have the following inequality:

$$\log\Gamma_{\alpha}-\log N\;\le\; EI\;\le\;\frac{2}{\alpha}\log\Gamma_{\alpha}.$$
Actually, a tighter upper bound for EI, $EI\le \log\Gamma_{\alpha}$, is found empirically and numerically, as shown by the results in the next section. We also find that EI and Γα are positively correlated in many cases. Therefore, we propose that an approximate relationship exists:

$$EI\sim \log\Gamma_{\alpha}.$$
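The bounds of Theorem 1 and the conjectured tighter bound can be checked numerically with a short script such as the following (our own sketch; it assumes base-2 logarithms throughout, consistent with the definition of EI):

```python
import numpy as np

def ei(P):
    """EI in bits: average KL divergence of the rows from the mean row (Eq. (2))."""
    Pb = P.mean(axis=0)
    def kl(p, q):
        m = p > 0
        return np.sum(p[m] * np.log2(p[m] / q[m]))
    return np.mean([kl(row, Pb) for row in P])

rng = np.random.default_rng(1)
N, alpha = 20, 1.0
for _ in range(3):
    P = rng.random((N, N))
    P /= P.sum(axis=1, keepdims=True)                      # a random row-stochastic TPM
    log_gamma = np.log2(np.sum(np.linalg.svd(P, compute_uv=False) ** alpha))
    print(round(log_gamma - np.log2(N), 3),                # lower bound of Theorem 1
          round(ei(P), 3),                                 # EI
          round(log_gamma, 3),                             # conjectured tighter upper bound
          round(2 / alpha * log_gamma, 3))                 # upper bound of Theorem 1
```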
A new quantification for causal emergence
One of the major contributions of this paper is a new quantification of causal emergence based on dynamical reversibility and singular values, and this quantification is independent of any selection of coarse-graining method. Because Γα is the summation of the α powers of the singular values of P, removing zero singular values does not change Γα at all, and removing near-zero singular values changes it only slightly. Therefore, we can use the spectrum of singular values to characterize the property of CE for a Markov dynamics.
First, two new definitions about causal emergence are given.
Definition 3
For a given Markov chain χ with TPM P, if r ≡ rank(P) < N, then clear causal emergence occurs in this system, and the degree of CE is:

$$\Delta\Gamma_{\alpha}=\frac{\sum_{i=1}^{r}\sigma_{i}^{\alpha}}{r}-\frac{\sum_{i=1}^{N}\sigma_{i}^{\alpha}}{N}. \qquad (11)$$
Definition 4
For a given Markov chain χ with TPM P, suppose its singular values are σ1 ≥ σ2 ≥ ⋯ ≥ σN ≥ 0. For a given real value ϵ ∈ [0, σ1], if there is an integer i ∈ {1, 2, ⋯, N} such that σi > ϵ, then vague causal emergence with level of vagueness ϵ occurs in the system, and the degree of CE is:

$$\Delta\Gamma_{\alpha}(\epsilon)=\frac{\sum_{i=1}^{r_{\epsilon}}\sigma_{i}^{\alpha}}{r_{\epsilon}}-\frac{\sum_{i=1}^{N}\sigma_{i}^{\alpha}}{N}, \qquad (12)$$

where $r_{\epsilon}=\max\{i\,|\,\sigma_{i}>\epsilon\}$.
These definitions are independent of any coarse-graining method. As a result, they represent an intrinsic and objective property of Markov dynamics. Thus, the occurrence of both clear and vague CE, as well as the extent of such emergence, can be objectively quantified.
It is not difficult to see that clear CE is a special case of vague CE with ϵ = 0; it is of theoretical value particularly when the singular values can be solved analytically. Furthermore, the judgment of whether CE occurs is independent of α, because it relates only to the rank. As a result, the notion of clear CE is determined solely by P and is parameter-free.
In practice, however, the threshold ϵ must be given, because the singular values may approach 0 arbitrarily closely while P remains full rank. ϵ can be selected according to a relatively clear cut-off in the spectrum of singular values (or of the logarithms of the singular values). If ϵ is very small (say ϵ < 10−10), we can also say that CE occurs approximately. Some indicators, e.g., the effective rank, can help us select an appropriate cut-off40,41.
As Proposition 4 claims, ΔΓα(ϵ) ∈ [0, N − 1] for any ϵ ≥ 0, and CE occurs if and only if ΔΓα(ϵ) > 0, according to Corollary 2. The proposition, the corollary, and their proofs can be found in Supplementary A.3.1.
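A hedged sketch of how ΔΓα(ϵ) can be computed from the singular spectrum is given below; the formula follows Equations (11) and (12) as written above, and the tiny threshold used in the example merely absorbs floating-point zeros.

```python
import numpy as np

def delta_gamma(P, alpha=1.0, eps=0.0):
    """Degree of CE from the singular spectrum, following Eqs. (11)-(12): the gain in
    state-averaged reversibility when only the r_eps singular values above eps are kept."""
    sigma = np.sort(np.linalg.svd(np.asarray(P, dtype=float), compute_uv=False))[::-1]
    r = int(np.sum(sigma > eps))                 # r_eps = max{i : sigma_i > eps}
    N = len(sigma)
    return np.sum(sigma[:r] ** alpha) / r - np.sum(sigma ** alpha) / N

# A rank-2, 4-state TPM: clear CE occurs. A tiny eps absorbs numerical zeros.
P = np.array([[1/3, 1/3, 1/3, 0],
              [1/3, 1/3, 1/3, 0],
              [1/3, 1/3, 1/3, 0],
              [0,   0,   0,   1]])
print(delta_gamma(P, eps=1e-12))                 # 0.5 > 0, hence causal emergence
```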
Second, the rationale behind Definitions 3 and 4 for causal emergence stems from the observation that, when the coarse-graining strategy is chosen by projecting the probability masses of micro-states onto directions aligned with the singular vectors corresponding to the largest singular values, both γα and the EI of the coarse-grained TPM can be increased. Consequently, aligning the coarse-graining with the major singular vectors is a necessary condition for EI maximization, as supported by a theoretical analysis and two numerical examples in Supplementary B.
Similar to Equation (4), Equations (11) and (12) characterize the potential maximal increase in information-transfer efficiency, or averaged dynamical reversibility, of the TPM that can be achieved with the least information loss through an optimal coarse-graining strategy for EI maximization (even though the strategy itself does not need to be explicitly determined). In this context, the threshold ϵ in Equation (12) can be interpreted as the precision requirement for the optimal coarse-grained macro-dynamics.
In Equations (11) and (12), the efficiency is quantified using the state-averaged dynamical reversibility $\gamma_{\alpha}(r)\equiv \sum_{i=1}^{r}\sigma_{i}^{\alpha}/r$, where r = rϵ or r = N represents the effective number of states, which also corresponds to the number of effective information pathways. Alternatively, the efficiency can be quantified by replacing γα(r) with $\log(\sum_{i=1}^{r}\sigma_{i}^{\alpha})/\log r$, which serves as an analog of $eff\equiv EI/\log N$ by leveraging the approximate relationship $EI\sim \log\Gamma_{\alpha}$. In this formulation, Equations (11) and (12) can be interpreted as the differences in eff between the macro- and micro-dynamics under the potentially optimal coarse-graining strategy, further reinforcing their role as measures of CE.
As a result, we can conclude that the essence of the CE phenomenon lies in the presence of redundant information pathways underlying P, which are non-reversible and are represented by the singular vectors corresponding to zero or near-zero singular values. The quantification of CE is achieved by measuring the potentially best improvement in averaged reversibility (γα(r)), or information-transmission efficiency, when these redundant channels are removed by a potentially optimal coarse-graining strategy.
Experiments
In this section, we present numerical results comparing EI and Γα and quantifying CE under the new framework. A new coarse-graining method based on SVD is proposed in order to compare with the results derived from EI. Throughout this section, Γ abbreviates Γα=1 unless otherwise stated.
Comparisons of EI and Γ
a. Similarities
In the section “Connections between Γα and EI”, we derived that EI is upper and lower bounded by linear terms of $\log\Gamma_{\alpha}$, and we conjectured an approximate relationship, $EI\sim \log\Gamma_{\alpha}$. We further verify these conclusions by numerical experiments in this section.
We compare Γα and EI on a variety of normalized TPMs generated by three different methods: (1) softening of permutation matrices; (2) softening of degenerate matrices; and (3) totally random matrices. Permutation matrices consist of orthogonal one-hot row vectors, whereas degenerate matrices may contain repeated row vectors.
The approach to softening permutation and degenerate matrices involves assigning an intensity value pij to each entry positioned at the ith row and the jth column of the generated permutation matrix P. Here, $p_{ij}=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(j-c_{i})^{2}}{\sigma^{2}}\right)$, where ci represents the position of the element with a value of one in the ith row vector (a one-hot vector), and σ is a parameter that regulates the softening intensity. For detailed information, refer to Supplementary Section C.1.
Notice that, in all cases, the final values of pij are normalized by dividing by $\sum_{j=1}^{N}p_{ij}$, ensuring that the pij represent transition probabilities. The results are shown in Fig. 2. The details of these generative models are in Supplementary C.
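The following sketch (ours) implements the softening procedure described above for permutation matrices; the function name and defaults are illustrative, and degenerate matrices can be softened analogously by repeating rows before normalization.

```python
import numpy as np

def soften_permutation(N, sigma, rng=None):
    """Build a TPM by Gaussian-softening a random N x N permutation matrix.
    sigma controls the softening intensity; rows are re-normalized at the end,
    so the Gaussian prefactor is immaterial."""
    rng = np.random.default_rng(rng)
    c = rng.permutation(N)                       # c[i]: column of the 1 in row i
    j = np.arange(N)
    P = np.exp(-((j[None, :] - c[:, None]) ** 2) / sigma ** 2)
    return P / P.sum(axis=1, keepdims=True)

P = soften_permutation(50, sigma=2.0, rng=0)
print(np.allclose(P.sum(axis=1), 1.0))           # rows are probability distributions
```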

Fig. 2: a shows the approximate relationship EI ~ Γ for randomly generated TPMs obtained by softening permutation matrices controlled by different σ (details can be found in Supplementary Section C.1). The theoretical and empirical upper bounds are also shown as dashed lines; the lower bound is omitted because it is size-dependent. Each curve is obtained by tuning up the magnitude of softening on a randomly generated permutation matrix of a different size. b demonstrates the same relationship between EI and Γ for TPMs generated by a similar softening method but based on a variant of the N × N identity matrix. The variant changes N − r row vectors of the identity matrix to the same one-hot vector (with the value one as the first element), where r is the rank of the matrix and N = 50; the number N − r can be treated as a control of the degeneracy of the TPM. In this panel, all upper and lower bounds are shown as dashed lines. c shows the same relationship for TPMs composed of randomly sampled normalized row vectors for various sizes (N ∈ {2, 3, ⋯, 100}). For each size, 100 such random matrices are sampled to obtain the scatter points. The scatter points for particular sizes N = 20, 30, 50 are rendered in red to show the nearly logarithmic relation between EI and Γ. The empirical upper bound is also shown as a dashed line. d demonstrates the dependence of the difference between EI and log Γ (log Γ − EI) on the softening magnitude σ for the matrices generated with a controlled number of independent row vectors (see the subsection “Differences”). e, f show the density plots of EI (e) and Γ (f) as functions of the parameters p and q, computed for the simplest parameterized TPM $P=\left(\begin{array}{cc}p & 1-p\\ 1-q & q\end{array}\right)$ of size 2 × 2.
As shown in Fig. 2a–c, a positive correlation is observed in all these examples, and the approximate relationship $EI\sim \log\Gamma$ is confirmed for large N ≫ 1. This relation is clearly observed in Fig. 2a, c, but degenerates to a nearly linear relation in Fig. 2b, since only a limited range of Γ values is covered there. More results for different α can be found in Supplementary Section C.1.
We also show the upper and lower bounds of EI by the red dashed lines in Fig. 2a, b. However, in Fig. 2c, the theoretical bounds are not visible due to the concentration of points in a small area.
Empirically, a tighter upper bound of EI by $\log\Gamma_{\alpha}$ is found, shown as the grey broken lines in Fig. 2. Therefore, we conjecture that a new relationship, $EI\le \log\Gamma_{\alpha}$, holds, but a rigorous proof is left for future work.
We also obtain analytic solutions for EI and Γ for the simplest parameterized TPM of size N = 2 (see Supplementary D), and we show the landscape of how EI and Γ depend on the parameters p and q, the diagonal elements of the 2 × 2 TPM $P=\left(\begin{array}{cc}p & 1-p\\ 1-q & q\end{array}\right)$. It is clear that both EI and Γ attain their maximum values at the two diagonal corners (p = q = 0 or p = q = 1, where P is a permutation matrix). The differences between Fig. 2e, f are apparent: (1) Γ has a peak when p ≈ 1 − q, but EI does not, because EI takes the same value (0) whenever p ≈ 1 − q; (2) a broader region with EI ≈ 0 is observed, while the region with Γ ≈ 1 is much smaller; (3) an asymptotic transition from 0 to the maximum N = 2 is observed for Γ but not for EI, because Γ is a convex function while EI is not (see Corollary 1 in Supplementary A.1).
While differences exist, our observations indicate that, over the majority of the (p, q) region, there is a strong correlation between EI and Γα in Fig. 2e, f.
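The (p, q) landscapes of Fig. 2e, f can be reproduced on a grid with a few lines of numpy (our own sketch; plotting is omitted):

```python
import numpy as np

def ei(P):
    Pb = P.mean(axis=0)
    def kl(p, q):
        m = p > 0
        return np.sum(p[m] * np.log2(p[m] / q[m]))
    return np.mean([kl(row, Pb) for row in P])

def gamma(P):
    return np.sum(np.linalg.svd(P, compute_uv=False))

ps = qs = np.linspace(0, 1, 101)
EI = np.array([[ei(np.array([[p, 1 - p], [1 - q, q]])) for p in ps] for q in qs])
G = np.array([[gamma(np.array([[p, 1 - p], [1 - q, q]])) for p in ps] for q in qs])
# Both peak at p = q = 0 and p = q = 1 (permutation matrices): EI = 1 bit, Gamma = 2.
print(EI[0, 0], G[0, 0], EI[-1, -1], G[-1, -1])
```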
b. Differences
Although deep connections between EI and Γα have been found, differences between these two indicators exist.
Firstly, EI quantifies the difference between each row vector and the average row vector of P through the KL divergence, as defined in Equation (2); put differently, EI gauges the similarity among the row vectors. Conversely, Γα assesses the dynamical reversibility, which, particularly as α approaches 0, reflects the linear dependence among the row vectors. While similarity between row vectors implies linear dependence (two identical row vectors are linearly dependent), the reverse is not necessarily true. Consequently, Γα captures not only the similarities among row vectors but also the proximity of P to a dynamically reversible matrix, which EI cannot accomplish.
This assertion can be validated through the following numerical experiment: we create TPMs by blending linearly dependent row vectors with linearly independent row vectors, where the number of independent vectors, i.e., the rank, is a controlled parameter. The matrices are constructed using r orthogonal one-hot vectors as base vectors and N − r additional row vectors. The base vectors are softened with magnitude σ using a method similar to that of the previous section and Supplementary Section C.1. The remaining N − r row vectors are generated by linearly combining the base vectors, with coefficients sampled uniformly from the interval [0, 1]. The size N = 50 is fixed in these experiments. We then quantify the disparity between log Γ and EI, with the outcomes depicted in Fig. 2d; a sketch of this construction is shown below.
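A sketch of this construction, under our reading of the description above (exact details such as row ordering or seeding may differ from the authors’ generator), is:

```python
import numpy as np

def rank_controlled_tpm(N, r, sigma, rng=None):
    """TPM with r softened, linearly independent base rows and N - r rows that
    are random non-negative combinations of them; rows are normalized at the end."""
    rng = np.random.default_rng(rng)
    j = np.arange(N)
    c = rng.choice(N, size=r, replace=False)         # centers of the one-hot base rows
    base = np.exp(-((j[None, :] - c[:, None]) ** 2) / sigma ** 2)   # r softened rows
    coeff = rng.random((N - r, r))                   # coefficients ~ Uniform[0, 1]
    rest = coeff @ base                              # linearly dependent rows
    P = np.vstack([base, rest])
    return P / P.sum(axis=1, keepdims=True)

P = rank_controlled_tpm(50, r=10, sigma=1.0, rng=0)
print(np.linalg.matrix_rank(P))                      # at most r; generically r (here 10)
```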
It becomes evident that for small values of r, as σ (the softening magnitude) increases, the disparity between log Γ and EI diminishes, since the linear independence within P strengthens as the vectors become more distinct. This underscores that linear dependence does not equate to similarity between row vectors. However, as the number of independent row vectors grows, if σ remains small, P converges towards a permutation matrix; consequently, both EI and log Γ reach identical maximum values. This explains why a slight bump is noticeable when r is large.
Secondly, a significant distinction exists between EI and Γα even in the scenario where all row vectors are identical, resulting in EI = 0 while $\Gamma_{\alpha}=\|\bar{P}\|^{\alpha}\cdot N^{\alpha/2}$, a quantity that can vary with $\|\bar{P}\|$ (refer to Lemma 8, Supplementary A.2). This discrepancy implies that Γα, as opposed to EI, can provide more comprehensive insight into the row vectors beyond their similarity to the average row vector.
The differences between EI and Γα suggest that the linear dependency of the information pathways, represented by the row vectors in P, may influence both their correlations and the CE of the dynamics, but cannot be captured by EI. This resembles the concept of proportional lumpability in Markov chains42.
Quantifying causal emergence
Examples of clear and vague CE are shown in this section. First, to show the validity of our new framework for CE, in particular its equivalence to the framework of EI maximization, several Markov dynamics on Boolean networks proposed in Hoel et al.’s papers5,6 are selected to test our method, and the results are compared with those for CE derived from EI. All the coarse-graining strategies in these examples are optimal for EI maximization, as claimed by Hoel et al. in ref. 5.
Two examples of TPMs generated from the same Boolean network model with identical node mechanisms, exhibiting clear emergence and vague emergence, are shown in Fig. 3d and g, respectively. The TPM in Fig. 3d is derived directly from the Boolean network and its node mechanism in Fig. 3a, b. The corresponding singular value spectra are shown in Fig. 3e and h, respectively.

Fig. 3: a The original stochastic Boolean network model; each node can only interact with its network neighbors. b Shared node dynamics for all nodes in (a): each row corresponds to one combination of states of a node’s neighbors at the previous time step, and each column is the probability of the node taking 0 or 1 at the current time step. c The coarse-grained Boolean network of (a), which is extracted from the TPMs and the relations between micro- and macro-nodes illustrated in (f) and (i) by identifying the macro-state α = 0, β = 0 with the micro-states (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0), (0, 1, 0, 0), (0, 1, 0, 1), (0, 1, 1, 0), (1, 0, 0, 0), (1, 0, 0, 1), (1, 0, 1, 0); α = 0, β = 1 with the micro-states (0, 0, 1, 1), (0, 1, 1, 1), (1, 0, 1, 1); α = 1, β = 0 with the micro-states (1, 1, 0, 0), (1, 1, 0, 1), (1, 1, 1, 0); and α = 1, β = 1 with the micro-state (1, 1, 1, 1). d The corresponding TPM of (a) and (b). e The singular value spectrum of (d). g A perturbed TPM derived from (d). h The singular value spectrum of (g). f, i The reduced TPMs and the projection matrices (below) after applying our coarse-graining method to the original TPMs in (d) and (g), respectively. j Visualization of the original network sampled from the stochastic block model with p = 0.9 (the probability of intra-community connections) and q = 0.1 (the probability of inter-community connections); the nodes are colored with different hues to distinguish the blocks to which they belong. There are 5 blocks in total, and the edges are undirected and binary. The TPM is obtained by normalizing the adjacency matrix by each node’s degree. k The singular value spectra of three samples of the stochastic block model network with different p and q. l The reduced network of (j) obtained through our coarse-graining method, with the node grouping results aligning with the initial block settings.
There are only 4 non-zero singular values (Fig. 3e) for the first example in (d); therefore, clear CE occurs, and the degree of CE is ΔΓ = 0.75. This judgment of the occurrence of CE agrees with the example discussed in ref. 5.
Vague CE can be shown for the TPM in Fig. 3g, which is derived from (d) by adding random Gaussian noise with strength std = 0.03 to the TPM in (d) and re-normalizing. The resulting singular spectrum is shown in Fig. 3h. We select ϵ = 0.2 as the threshold, such that only the 4 largest singular values are kept, and the degree of CE is ΔΓ(0.2) = 0.69. The value of ϵ is selected according to the spectrum of singular values in Fig. 3h, where a clear cut-off at the index of 3 with ϵ = 0.2 can be observed.
Figure 4a–f shows another example of clear CE for a more complex Boolean network model from ref. 5, where 6 nodes with the same node mechanism can be grouped into 3 super-nodes to exhibit CE. The corresponding TPM of the original Boolean network model is shown in Fig. 4c. The spectrum of singular values is shown in Fig. 4d, where 8 non-zero values are found. The degree of this clear CE is ΔΓ = 2.23, and the same judgment of the occurrence of CE is obtained as in ref. 5. More examples on Boolean networks from refs. 5,6 can be found in Supplementary Section E.1.

Fig. 4: a The Boolean network model with 6 nodes and 12 edges; the micro-mechanism can be found in ref. 5. b The coarse-grained Boolean network model corresponding to the coarse-grained TPM in (e). c The corresponding TPM of (a). d The spectrum of the singular values of (c), with only 8 non-zero values. e The coarse-graining of (c). f The projection matrix from the micro-states to the macro-states obtained by our SVD-based coarse-graining method. g The evolution of the 40th elementary cellular automaton (the rule is 000 → 0, 001 → 0, 010 → 1, 011 → 0, 100 → 1, 101 → 0, 110 → 0, 111 → 0). h The two spectra of the singular values for four distinct local TPMs of the same cellular automaton; a local TPM describes how a cell moves from its present state to the subsequent state within a specified environment comprising the states of the focal cell and its two neighboring cells. i The quantification of CE for the local TPMs (ΔΓ; the red dots indicate the cells where the quantities of CE are non-zero), overlaid on the original evolution of the automaton (the background).
The quantification of CE can also be applied to complex networks (Fig. 3j–l) and cellular automata (Fig. 4g–i). An example of vague CE is shown for complex networks generated by stochastic block models (SBM) with three sets of parameters (intra- or inter-community connection probabilities) in Fig. 3j–l, all with the same number of blocks (communities), which is 5. The TPMs are obtained by normalizing the adjacency matrix of the network by each node’s degree. One example network with 100 nodes and 5 blocks (communities) is shown in Fig. 3j, and its spectrum of singular values is shown in Fig. 3k, where a clear cut-off (ϵ = 0.3, rϵ = 5) can be observed at a horizontal coordinate equal to the number of blocks. We can ascertain that vague CE occurs in this network model, with a degree of CE of ΔΓ(0.3) = 0.56. Two more spectra of networks generated by the SBM with the same size and number of blocks (5) but different parameters are shown in the same figure.
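A hedged sketch of this experiment using networkx is given below; because the network is re-sampled with an arbitrary seed, the numbers will differ slightly from those reported for Fig. 3j–l.

```python
import numpy as np
import networkx as nx

# Sample an SBM network (5 blocks of 20 nodes), build the TPM by normalizing the
# adjacency matrix by each node's degree, and inspect its singular spectrum.
sizes = [20] * 5
p_in, p_out = 0.9, 0.1
probs = [[p_in if i == j else p_out for j in range(5)] for i in range(5)]
G = nx.stochastic_block_model(sizes, probs, seed=0)
A = nx.to_numpy_array(G)
P = A / A.sum(axis=1, keepdims=True)

sigma = np.sort(np.linalg.svd(P, compute_uv=False))[::-1]
eps = 0.3
r_eps = int(np.sum(sigma > eps))                     # expect a cut-off near the 5 blocks
dGamma = sigma[:r_eps].sum() / r_eps - sigma.sum() / len(sigma)
print(r_eps, round(dGamma, 2))                       # vague CE occurs if dGamma > 0
```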
The definition of clear CE can be applied to cellular automata to discover their local emergent structures, as shown in Fig. 4g–i. In this example, we quantify clear CE for local TPMs of a cellular automaton (the number-40 elementary one-dimensional cellular automaton). The local TPMs are obtained from local windows comprising each cell and its two neighbors. The possible spectra of singular values for these local TPMs are shown in Fig. 4h, where clear CE may or may not occur. Figure 4i shows the distribution of clear CE (ΔΓ) over all cells and time steps, with red dot markers for ΔΓ > 0; the original evolution of the automaton is plotted as the background. From this experiment, we conclude that local emergent structures can be identified by quantifying causal emergence for local TPMs.
Coarse-graining based on SVD
Although our quantification method for CE is agnostic to the coarse-graining method, to compare its results against the EI-based theory we introduce a concise coarse-graining method based on SVD. The details of this method are given in the section ‘Methods’.
We test our method on all the examples shown in Figs. 3 and 4. First, for the two TPMs generated from the same Boolean network model shown in Fig. 3d, g, their coarse-grained TPMs are shown in Fig. 3f, i, respectively. The macro-level Boolean network model (Fig. 3c) can be read off from these TPMs and the projection matrix Φ. Notably, the values of Γ for the coarse-grained TPMs are almost identical to those of the original ones, which means that, in this scenario, our method effectively preserves Γ with minimal alteration. We further test the CE examples in refs. 5,6, and almost identical coarse-grained TPMs are obtained.
Second, the reduced TPM of the original one (Fig. 4a) can be obtained by the same coarse-graining method, as shown in Fig. 4e, and the projection matrix Φ is shown in (f). The coarse-grained Boolean network can be read off from the reduced TPM and the projection matrix, as shown in Fig. 4b. In this example, although Γ is much reduced (from Γ = 20.39 to Γ = 8.0) due to the loss of information during coarse-graining, the normalized approximate dynamical reversibility increases (from γ = 0.32 to γ = 1.0). The consistent coarse-graining results from the SVD and EI-maximization methods confirm the equivalence of the two theories of CE.
The same coarse-graining method can also be applied to complex networks generated by the SBM. The reduced network with 5 nodes derived from the original network (Fig. 3j) is shown in Fig. 3l. In this example, a relatively large decrease in Γ (from Γ = 13.30 to Γ = 3.33) and a large increase in γ (from γ = 0.13 to γ = 0.67) are observed simultaneously. This indicates that a large amount of information is lost during coarse-graining, while a relatively more effective small network model with larger normalized approximate dynamical reversibility is obtained.
We also compared the results of the SVD method and the EI maximization method on SBM networks and cellular automata, and found similarities between the vectors representing the optimal coarse-graining strategy for EI maximization and the singular vectors associated with the largest singular values, as detailed in Supplementary Section B.4.
Discussion
In summary, our investigation reveals that the Schatten norm, defined as the sum of the α powers of all singular values of a TPM, serves as a valuable indicator of the approximate dynamical reversibility of Markov dynamics. Both theoretical insights and empirical examinations substantiate a positive association, indicating a rough correspondence between the Effective Information (EI) metric and the logarithm of the measure Γα. Notably, it is crucial to distinguish between the two metrics: while Γα effectively quantifies the degeneracy of a TPM, particularly as α → 0, including scenarios where the row vectors are linearly dependent, EI solely characterizes the similarity between these row vectors and falls short in distinguishing degenerate yet dissimilar row vectors.
Expanding on the concept of Γα, we have introduced a novel CE theory that offers a more refined definition, capturing the intrinsic properties of a system regardless of the specific coarse-graining technique employed. By applying SVD to a TPM, we have revealed that the core of CE lies in the presence of redundant irreversible information pathways in the dynamics. Therefore, the degree of CE can be quantified by the maximum potential improvement in the averaged approximate dynamical reversibility or the efficiency of information transfer within the dynamics if the redundant information pathways are discarded.
The validity of our CE framework is supported by the approximate relationship $EI\sim \log\Gamma_{\alpha}$ and the strong positive correlation observed in numerical examples. We also demonstrated the equivalence of the SVD-based and EI maximization-based frameworks for CE quantification, with the former serving as a necessary condition for the latter. All numerical experiments conducted on Boolean networks, cellular automata, and complex networks discussed in this paper support this conclusion.
Although this theoretical framework parallels the EI-based CE theory, quantifying CE using SVD offers greater insight. Firstly, this work demonstrates that the redundancy in Markov dynamics is evident in the correlations between different information pathways, represented by the row vectors of the TPM. While these correlations can be partially characterized by EI, the differences between Γα and EI discussed in the section ‘Comparisons of EI and Γ’ indicate that further correlations may arise from linear dependencies among row vectors; these cannot be quantified by EI but can be captured by Γα. Secondly, because our framework does not depend on a concrete coarse-graining strategy, the problems of finding the optimal coarse-graining strategy for EI maximization, of ambiguity, and of the violation of commutativity do not arise: the SVD-based quantification of CE concerns a potentially optimal increase in the efficiency of information transmission. Finally, our method suggests a way to find the optimal coarse-graining strategy for maximizing EI: project more probability mass of the states onto the directions of the singular vectors corresponding to the largest singular values; this is the basic idea of the coarse-graining method described in the section ‘Methods’.
Our framework has several potential extensions. One interesting direction is to incorporate Rosas’ framework of CE based on integrated information decomposition12. The synergetic nature of dynamics should be represented by the TPM and characterized through SVD. Another promising extension is to explore the relationship between SVD and Φ, the degree of information integration, as integrated information theory is also grounded in effective information43,44. Our work aligns with the low rank hypothesis in complex systems as discussed in refs. 40,45. While these studies also utilized the SVD method to uncover emergent phenomena in complex systems, their objectives differ. This paper focuses on the transition probability matrix (TPM) that describes system dynamics, whereas Thibeault et al. analyze the coefficient matrix of complex dynamics, and Chen et al. concentrate on the matrix of fluctuations. Further research is needed to explore the relationships among these approaches.
Furthermore, this research establishes a potential link between statistical physics and artificial intelligence by framing both coarse-graining and macro-dynamics (coarse-grained TPM) as computational processes performed by an intelligent agent. Traditional statistical mechanics indicate that the information lost during coarse-graining corresponds to Boltzmann entropy46. The resulting emergent macro-dynamics can be viewed as the agent’s internal model representing the underlying micro-dynamics47. Consequently, this work on CE suggests that the agent seeks to develop a reversible dynamical model of reality while incurring information loss during the coarse-graining observation process. Therefore, the intelligent agent’s challenge lies in finding an optimal balance between the information loss from coarse-graining and the resulting gains in reversibility. Ironically, as quantum mechanics suggests, the fundamental world is dynamically reversible.
This work has several weak points. First, the definition and quantification of emergence are abstract due to the absence of a coarse-graining strategy, which fails to capture the conventional idea that the whole is greater than the sum of its parts. Second, discussions have primarily focused on state space, while an approach incorporating variable space would be more practical, despite the exponential growth of state space size with the number of variables, posing a significant challenge. Lastly, this work relies on the TPM of dynamics, but estimating the TPM from data is difficult.
Another weak point of this study is that the theoretical upper and lower bounds of EI in terms of $\log\Gamma_{\alpha}$ should be studied further, since a tighter upper bound is observed empirically in numerical experiments but has not yet been proved. More mathematical tools and methods should be developed to study this problem. As a result, further research is warranted to address these issues.
Methods
Although the clear and vague CE phenomena mentioned in the section “A new quantification for causal emergence” can be defined and quantified without coarse-graining, a simpler coarse-grained description of the original system is needed to compare with the results derived from EI. Therefore, we also provide a concise coarse-graining method based on the singular value decomposition of P to obtain a macro-level reduced TPM. The basic idea is to project the row vectors Pi, ∀ i ∈ {1, 2, ⋯, N}, of P onto the sub-space spanned by the leading singular vectors of P, such that the major information in P is conserved and Γα is kept (almost) unchanged.
Coarse-graining strategy based on SVD
Concretely, the coarse-graining method contains five steps:
1. First, perform the SVD of P (we suppose P is irreducible and recurrent so that a stationary distribution exists):

$$P=U\cdot \Sigma \cdot V^{T}, \qquad (13)$$

where U and V are two orthogonal and normalized matrices of dimension N × N, and Σ = diag(σ1, σ2, ⋯, σN) is a diagonal matrix containing all the ordered singular values.
2. Select a threshold ϵ and the corresponding rϵ according to the singular value spectrum.
3. Reduce the dimensionality of the row vectors Pi of P from N to r (with r = rϵ) by calculating:

$$\tilde{P}\equiv P\cdot V_{N\times r}, \qquad (14)$$

where $V_{N\times r}=(V_{1}^{T},V_{2}^{T},\cdots,V_{r}^{T})$ consists of the first r columns of V.
4. Cluster all row vectors of $\tilde{P}$ into r groups with the K-means algorithm to obtain a projection matrix Φ, defined as:

$$\Phi_{ij}=\begin{cases}1 & \text{if } \tilde{P}_{i} \text{ is in the } j\text{th group},\\ 0 & \text{otherwise},\end{cases} \qquad (15)$$

for ∀ i ∈ {1, 2, ⋯, N} and ∀ j ∈ {1, 2, ⋯, r}.
5. Obtain the reduced TPM from P and Φ, as explained below.
To illustrate how the reduced TPM is obtained, we first define a matrix called the stationary flow matrix:

$$F\equiv \mathrm{diag}(\mu)\cdot P, \qquad \text{i.e.,}\quad F_{ij}=\mu_{i}\,p_{ij},$$

where μ is the stationary distribution of P, which satisfies μ ⋅ P = μ.
Secondly, we derive the reduced flow matrix according to Φ and F:

$$F^{\prime}=\Phi^{T}\cdot F\cdot \Phi,$$

where F′ is the reduced stationary flow matrix. In fact, this matrix multiplication inherently aggregates the flux across all micro-states (nodes) within a group to derive the flux between the corresponding macro-states (nodes). Finally, the reduced TPM is derived directly by normalizing the rows of F′:

$$P^{\prime}_{ij}=\frac{F^{\prime}_{ij}}{\sum_{k=1}^{r}F^{\prime}_{ik}},$$

so that P′ is the coarse-grained TPM.
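Putting the five steps together, a minimal Python sketch of the coarse-graining pipeline might look as follows. It is our reconstruction, not the authors’ code: the flow-matrix step implements our reading of the stationary-flux description (F_ij = μ_i p_ij, F′ = Φ^T F Φ, followed by row normalization), K-means is taken from scikit-learn, and the example TPM is an illustrative irreducible 4-state chain in which two pairs of equivalent states are merged.

```python
import numpy as np
from sklearn.cluster import KMeans

def svd_coarse_grain(P, eps):
    """SVD-based coarse-graining of a row-stochastic TPM P (steps 1-5 above).
    Assumes P is irreducible so that a unique stationary distribution exists."""
    P = np.asarray(P, dtype=float)
    N = P.shape[0]
    U, S, Vt = np.linalg.svd(P)                      # step 1: SVD of P
    r = int(np.sum(S > eps))                         # step 2: threshold -> r_eps
    P_tilde = P @ Vt[:r].T                           # step 3: project rows onto top-r directions
    labels = KMeans(n_clusters=r, n_init=10, random_state=0).fit_predict(P_tilde)
    Phi = np.zeros((N, r))                           # step 4: projection matrix
    Phi[np.arange(N), labels] = 1.0
    w, v = np.linalg.eig(P.T)                        # stationary distribution: mu P = mu
    mu = np.real(v[:, np.argmin(np.abs(w - 1))])
    mu = mu / mu.sum()
    F = np.diag(mu) @ P                              # stationary flow: F_ij = mu_i * p_ij
    F_macro = Phi.T @ F @ Phi                        # step 5: reduce the flow matrix
    P_macro = F_macro / F_macro.sum(axis=1, keepdims=True)
    return P_macro, Phi

# Usage: an irreducible 4-state chain whose two pairs of equivalent states are merged.
P = np.array([[0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5, 0.5],
              [0.5, 0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0]])
P_macro, Phi = svd_coarse_grain(P, eps=1e-8)
print(P_macro)      # approx. [[0, 1], [1, 0]]: the two macro-states swap deterministically
```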
Explanation
Here we explain why the coarse-graining strategy outlined in the previous subsection works.
In the first step, the reason we apply the SVD to the matrix P is that the singular values of P are the square roots of the eigenvalues of P^T ⋅ P, because $P^{T}\cdot P=(U\cdot\Sigma\cdot V^{T})^{T}\cdot(U\cdot\Sigma\cdot V^{T})=(V\cdot\Sigma\cdot U^{T})\cdot(U\cdot\Sigma\cdot V^{T})=V\cdot\Sigma^{2}\cdot V^{T}$. Thus, we try our best to utilize the corresponding eigenvectors in V to reduce P, because they may contain the most important information about P.
In the third step, the eigenvectors with the largest eigenvalues are the major axes explaining the auto-correlation matrix P^T ⋅ P. Therefore, we can use the PCA method to reduce the dimensionality of the Pi, ∀ i ∈ {1, 2, ⋯, N}; this is equivalent to projecting each Pi onto the subspace spanned by the first r eigenvectors.
In the fourth step, we cluster all the row vectors Pi, ∀ i ∈ {1, 2, ⋯, N}, into r groups according to their new feature vectors. Actually, according to previous studies48, these r major eigenvectors can be treated as the centroids of the clusters obtained by the K-means algorithm for the row vectors Pi, ∀ i ∈ {1, 2, ⋯, N}. Therefore, we cluster all row vectors of P with the K-means algorithm, and the row vectors in one group aggregate around the corresponding eigenvector.
The final step obtains the reduced TPM according to the clustering result, i.e., Φ, from the previous step. This is a classic problem of lumping a Markov chain25,27. There are many lumping methods; this one is employed because directly coarse-graining the TPM based on the clustering results may violate the lumpability and probability-normalization conditions. Instead, we coarse-grain the dynamics using the stationary distribution, ensuring that both the stationary distribution and the total stationary flux remain unchanged during coarse-graining, as they function as conserved quantities like energy or material flows49. Consequently, we developed the method described in the previous subsection, which ensures adherence to the normalization conditions and maintains the commutativity between the macro-dynamics TPM and the coarse-graining operator, as proved in Proposition 5.