Correlating measures of hierarchical structures in artificial neural networks with their performance

Introduction
It is widely recognized that the performance of neural networks depends on their architecture1. On one hand, the design of these structures is inspired by biological neural networks. For instance, convolutional networks draw from the concept of visual receptive fields2,3. On the other hand, the introduction of new network architectures often leads to significant performance leaps, such as the Transformer, which incorporates aspects of attention mechanisms4. While it is known that the architecture of artificial neural networks is a key determinant of their functionality and performance, in practice, the development of neural networks is moving towards increasingly larger scales, more complex structures, and greater numbers of layers. For example, the scale of networks, in terms of the number of parameters, has grown from the thousands in LeNet to the millions in AlexNet, and then to the billions in GPT; while the architecture has evolved from LeNet to AlexNet, VGG, GoogLeNet, various Autoencoders, and more recently to Transformer5,6,7,8,9.
While large-scale networks demonstrate formidable capabilities in practical applications, the increased complexity of neural networks implies a need for more training data, longer training times, higher expenses, and a growing concern for energy consumption10. If it were possible to estimate the appropriate neural network structure and its complexity for specific tasks in advance, substantial computational resources could be saved. Additionally, in situations where computational resources are limited, such as in small devices or space equipment, it might be necessary to use the smallest possible neural networks11. Therefore, an understanding of the general laws governing the relationship between the structure and performance of neural networks also holds significant practical value.
Researchers are actively investigating the relationship between neural network structures and their performance from various perspectives. By representing neural networks as graphs, one can employ existing graph metrics or define new ones to elucidate the link between network performance and these metrics across different applications12,13. Some research directly focuses on the particular structure of networks14,15,16, while other studies introduce novel neural network structures to examine their impact on network performance enhancement17. These studies have raised numerous fundamental questions that urgently need answers, such as whether there is a systematic relationship between neural network structures and their performance, and whether high-performing neural networks possess distinct structural characteristics; these questions are central to this research area.
Hierarchy is an important feature among various structural characteristics. Research into both artificial and biological neural networks has revealed the importance and potential universality of hierarchical structures (see the two paragraphs at the end of this section). However, quantitatively characterizing the hierarchical structure of neural networks is not a trivial task. The Ladderpath approach, a recently developed mathematical tool within the broader category of Algorithmic Information Theory (AIT), offers a rigorous analytical framework for examining the structural information of a target system18,19.
This approach pinpoints repeating substructures (or “modules”) within the target system, which may themselves consist of even smaller repeating units, progressively breaking down to the most basic building blocks. These repeating substructures can then be reorganized into a hierarchical, modular-nested, tree-like structure. One extreme scenario is a system with a flattened hierarchy, essentially devoid of repeating structures, which can be likened to completely disordered structures or entirely random genetic sequences. The other extreme involves simple repetition of small substructures, like a doubling sequence (2 becomes 4, 4 becomes 8, 8 becomes 16, and so forth), which is comparable to crystals. Systems deviating from these extremes and positioned in the middle are those with a rich hierarchical structure. We hypothesize that the degree of richness of this hierarchical structure is positively correlated with the performance of neural networks.
The Ladderpath approach has recently been successfully applied in living and chemical systems, such as analyzing the evolution of protein sequences19; a similar method has also been used to investigate the origins of life through the analysis of molecular structure20. The applicability of this approach in these areas is linked to the tight structure-function relationship prevalent in evolutionary systems (e.g., the structure of molecules dictates their physicochemical properties, and the structure of drug molecules determines their binding with receptor proteins). Similarly, this structure-function relationship is likely to exist in another key evolutionary system: intelligence. This paper employs the Ladderpath approach to analyze the structure-function relationship in artificial neural networks. We acknowledge that our work is currently limited to multilayer perceptrons (MLP) and networks of constrained sizes due to the technical challenges of ladderpath calculations, as we have shown that these calculations are an NP-hard problem18. However, our primary motivation is to demonstrate the effectiveness and feasibility of this AIT approach in studying the hierarchical and nested structures of artificial neural networks. More importantly, coupled with the recent work on protein sequences as demonstrated in19, this type of structure-function relationship appears to manifest in several seemingly unrelated systems, suggesting a certain universality that merits attention.
Before moving forward, let us review the previous literature on hierarchical structures, first in artificial neural networks. For certain specific networks, such as reservoir computing, which is an increasingly prominent neural network framework for studying dynamical systems21, research has shown that their computational capacity is maximized when they are in a critical state22 (although some scholars believe that this maximization of computational capacity is conditional and holds only under specific circumstances23). In 2021, Wang et al. discovered that in well-trained reservoir networks, the nodes synchronize in clusters, and the size distribution of the synchronization clusters follows a power law15. It is important to note that a power-law distribution is often a key characteristic of a critical state, and this distribution is typically a manifestation of a richly hierarchical structure24. In 2020, You et al. proposed a method to represent neural networks as graphs, termed relational graphs, which characterize the inter-layer connections of neural networks13. They discovered that when a neural network performs well, the average path length and clustering coefficient of its corresponding relational graph tend to fall within a certain range (notably, these two indices intuitively represent characteristics that are typically opposite in nature). Furthermore, the training process tends to shift these two structural metrics towards this range. In 2018, Ying et al. introduced a differentiable graph pooling method, designed to integrate complex hierarchical representations of graphs, and further connect deeper graph neural network models to these representations16. This method has significantly enhanced accuracy in graph classification benchmark tasks compared to other pooling methods, underscoring the critical role of hierarchical information for neural network performance. In 2016, Bengio et al. proposed three complexity metrics for the architecture of recurrent neural networks from a graph perspective: recurrent depth, feedforward depth, and recurrent skip coefficient. Experiments revealed that increasing recurrent and feedforward depth can enhance network performance, and augmenting the recurrent skip coefficient can improve performance in tasks with long-term dependencies12.
Turning to biological neural networks, similar studies have also shown a positive correlation between hierarchical structure and good performance. Baum et al. analyzed samples from youths aged 8–22 in the Philadelphia Neurodevelopmental Cohort to study the evolution of structural brain networks over time25. They observed that as individuals age, the network structure undergoes a modular separation process, in which connections within modules strengthen and inter-module connections weaken; this separation appears to be beneficial for the development of brain functions in youths. Vidaurre et al., using whole-brain resting-state fMRI data, employed methods such as hidden Markov models and hierarchical clustering to analyze the organizational dynamics of brain networks over time26. They discovered a distinct hierarchical structure within brain dynamics, which shows significant correlations with individual behavior and genetic predispositions. Their research underscores a close connection between hierarchical structure and function in actual brain networks. In 2021, Zhou et al. studied the relationship between network structure and its dynamic properties from different perspectives. They discovered that modular network topologies can significantly reduce both operational and connectivity costs, thus achieving a joint optimization of efficiency27. Also in 2021, Luo reviewed common circuit motifs and architectural plans, exploring how these circuit architectures assemble to achieve various functions28. These circuit motifs can be considered as “words”, which combine into circuit architectures that might operate at the level of “sentences”. This work emphasizes the importance of hierarchical structures: only after gaining a better understanding of the nested modular hierarchical structures among these neuronal connections can we begin to understand the “paragraphs” (e.g., brain regions) and eventually the “article”, which may inspire further important advances in artificial intelligence.
Methods
Recap the Ladderpath approach
The Ladderpath approach, detailed in refs. 18,19, provides a quantitative way to analyze structural information in systems ranging from sequences and molecules to images. It iteratively identifies repeating substructures, called ladderons, which are essentially reused modules. These substructures may be composed of smaller repeating units, cascading down to the system’s most basic building blocks. Together, these ladderons form a hierarchical, nested, tree-like structure known as a laddergraph.
As an illustration, consider the sequence ABCDBCDBCDCDACAC (denoted as $\mathcal{X}$). Its laddergraph, computed as shown in Fig. 1, demonstrates how to construct the target sequence $\mathcal{X}$ from the four basic building blocks A, B, C, and D in the most efficient manner. First, combine C and D to construct CD, and A and C to construct AC, which takes 2 steps; then combining B with CD to construct BCD is another step. Next, construct the target sequence $\mathcal{X}$ using the already generated substructures (or “modules”) and basic building blocks: combining A, BCD, BCD, BCD, CD, AC, and AC takes 6 steps, bringing the total to 9 steps; the final step is outputting $\mathcal{X}$. Therefore, constructing $\mathcal{X}$ in the most efficient manner takes 10 steps in total.

a Laddergraph of a short sequence, for illustrative purposes. The gray hexagons represent the basic building blocks, and the gray squares represent the ladderons. In b–d, basic building blocks are omitted, and ladderons are represented as gray ellipses. b Laddergraph of a completely random and disordered sequence, with η = 0.04, BBDCDDCAACABACCDADABABDDD…, which shows minimal hierarchical relationships among repeating substructures. c Laddergraph of a sequence composed entirely of A’s (the most ordered sequence), with η = 1. The ladderons are AA, AAAA, AAAAAAAA, etc. d Laddergraph of the sequence ABABABDABABBAABAABCACABABDA…, which has η = 0.53. This sequence displays the richest hierarchical structure. The sequences shown in (b-d) are each composed of the basic building blocks A, B, C, and D, and have a length of 300, representing three typical categories of sequences.
$\mathcal{X}$ has 16 letters, but only 10 steps were needed, thanks to the reuse of repeating substructures: for example, the previously constructed CD is reused in constructing BCD; and although BCD appears three times, the later instances can directly reuse the initially constructed BCD, saving many steps. This principle of reuse is a fundamental concept in AIT and the Ladderpath approach.
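To make this step-counting concrete, the toy sketch below greedily decomposes a sequence by repeatedly extracting the longest repeated substring and counting the joining steps. This is only an illustration and yields an upper bound: the algorithm used in this work tackles a much harder (NP-hard) optimization and is not this greedy procedure; the placeholder symbols are assumed not to occur in the input.

```python
from itertools import count

def longest_repeat(seq):
    """Longest substring (length >= 2) that occurs at least twice without overlapping."""
    n = len(seq)
    for length in range(n // 2, 1, -1):              # try longer candidates first
        first_pos = {}
        for i in range(n - length + 1):
            sub = seq[i:i + length]
            j = first_pos.setdefault(sub, i)
            if i >= j + length:                      # a second, non-overlapping occurrence
                return sub
    return None

def greedy_ladderpath_index(seq):
    """Rough upper bound on the ladderpath-index of a single sequence."""
    placeholders = (chr(c) for c in count(0x2460))   # ①, ②, ...; assumed absent from the input
    steps = 0
    while True:
        sub = longest_repeat(seq)
        if sub is None:
            break
        steps += len(sub) - 1                        # build the repeated block once from its parts
        seq = seq.replace(sub, next(placeholders))   # every later use then costs a single step
    return steps + len(seq)                          # join the remaining pieces and output the result

x = "ABCDBCDBCDCDACAC"
print(greedy_ladderpath_index(x))           # 11 -- the greedy bound; the optimal ladderpath-index is 10
print(len(x) - greedy_ladderpath_index(x))  # 5  -- the corresponding estimate of the order-index (true value: 6)
```

The gap between the greedy bound (11) and the optimum (10) arises because the greedy pass builds BCD from scratch instead of reusing CD, which illustrates why finding the truly most efficient construction is hard.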
Note that in our example, some steps can be interchanged: whether to construct CD or AC first does not matter. However, the order between CD and BCD must not be reversed, because BCD is built upon CD. Therefore, the laddergraph also corresponds to a partially ordered multiset, which can be denoted as

$$\{\mathrm{A},\ \mathrm{B},\ \mathrm{C},\ \mathrm{D}\ \;/\!\!/\;\ \mathrm{CD},\ \mathrm{AC}\ \;/\!\!/\;\ \mathrm{BCD(2)}\ \;/\!\!/\;\ \mathcal{X}\} \qquad (1)$$
Steps within the same level (i.e., separated by “$/\!\!/$”) can be interchanged, but not across levels. This is why the sequence exhibits a hierarchical structure.
Now, we can introduce several important notions within the Ladderpath approach. In the example above, the “10 steps” are defined as the ladderpath-index of $\mathcal{X}$, while another quantity, the order-index ω, is defined as the length of $\mathcal{X}$ minus its ladderpath-index, i.e., $\omega(\mathcal{X}) = 16 - 10 = 6$. Mathematically, we have shown in ref. 18 that ω always equals the sum of the “reduced lengths” $l$ of all ladderons, counted with their multiplicities (where the reduced length is defined as the length of the ladderon minus one). In this case, $\omega(\mathcal{X}) = l_{\mathrm{AC}} + l_{\mathrm{CD}} + 2\times l_{\mathrm{BCD}}$, where $l_{\mathrm{AC}}$ is the length of AC minus 1, namely 2 − 1; $l_{\mathrm{CD}} = 2 - 1$; and $l_{\mathrm{BCD}} = 3 - 1$. The multiplier of 2 for $l_{\mathrm{BCD}}$ arises because its multiplicity in the partially ordered multiset, as shown in Eq. (1), is 2, indicating that it was reused twice. Thus, the order-index ω counts the sizes of all repeating substructures in a system, thereby characterizing, in an absolute sense, how ordered the system is. For a more detailed theoretical explanation and mathematical derivation, please refer to ref. 18.
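This bookkeeping can be checked directly: ω is each ladderon’s reduced length weighted by its reuse multiplicity, with the multiplicities read off Eq. (1). A minimal check:

```python
# Order-index ω from the ladderons' reduced lengths, weighted by how often
# each ladderon is reused (its multiplicity in the partially ordered multiset).
ladderons = {"CD": 1, "AC": 1, "BCD": 2}            # ladderon -> reuse multiplicity
omega = sum((len(lad) - 1) * mult for lad, mult in ladderons.items())
print(omega)   # 6, i.e. len("ABCDBCDBCDCDACAC") - 10 (length minus ladderpath-index)
```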
This recap may seem lengthy, and some notions might appear unnecessary at first glance. This is because the Ladderpath approach within AIT was not developed specifically to characterize neural networks but from a more general perspective. However, it has been successfully applied in seemingly unrelated fields such as protein sequences19, and our subsequent findings demonstrate that it also performs well on artificial neural networks. This indicates that the approach, and the relationship it describes between hierarchical structure and performance, are universal to a certain extent, making the generality worthwhile.
Ladderpath characterizes hierarchical structures
After the recap, we proceed by taking more complex sequences as examples. We will demonstrate the laddergraphs of three typical categories of sequences: minimal repetition, akin to completely random, disordered systems (Fig. 1b); simple repetition, similar to a crystalline structure, where the pattern progresses from 2 to 4, 4 to 8, 8 to 16, and so on (Fig. 1c); and the ones that lie between these two extremes, showcasing the richest hierarchical structure (Fig. 1d).
To better characterize the hierarchical structure, we have, in ref. 19, defined a relative measure on top of the absolute measure ω, called the order-rate η. To begin with, let us examine the distribution of ω for various sequences versus their length S, as shown in Fig. 2. For a given length S, ω has both a maximum and a minimum value: the maximum, denoted $\omega_{\max}(S)$, corresponds to sequences whose letters are all identical; the minimum, denoted $\omega_0(S)$, corresponds to purely random sequences. The minimum is nonzero because, with a finite number of basic building blocks (here A, B, C, and D), repeating substructures inevitably appear as the length increases, so even a purely random sequence does not have an ω of zero. We therefore normalize ω, leading to the definition of the relative measure order-rate η(x) for a sequence x:

$$\eta(x) = \frac{\omega(x) - \omega_0(S)}{\omega_{\max}(S) - \omega_0(S)},$$
where S is the length of the sequence x. An η of 0 means complete disorder (Fig. 1b), 1 implies full order (Fig. 1c), and around 0.5 indicates a richly structured hierarchy (Fig. 1d).

Distribution of the order-index ω for sequences of varying lengths, illustrating the calculation of the order-rate η.
In fact, in the Ladderpath approach, η is related to the complexity of a system. A system is not considered complex if entirely random (η ≈ 0) or ordered (η ≈ 1); complexity emerges only in the intermediate state (η ≈ 0.5). This distinguishes Ladderpath from similar concepts such as Kolmogorov complexity, addition chain, assembly theory, and the “adjacent possible”29,30,31.
In the context of neural networks, we can apply the Ladderpath approach to systematically organize the network’s repeating substructures in a nested hierarchical manner. As illustrated in Fig. 3a, b, the parts highlighted by red, yellow, and green lines represent these substructures (with the same color indicating identical, reused modules). From this, we can establish a hierarchical relationship (and illustrate the laddergraph): the red substructure is encompassed within both the yellow and green ones. Nevertheless, directly calculating the laddergraph is quite challenging (and this problem is inherently NP-hard18). Hence, we first transform the network into a set of sequences and then compute it (the method for this transformation will be detailed in section “Sequence representation of a neural network”, and there are already algorithms developed for computing laddergraphs of sequences at the scale of 10,000 in length19). Finally, we hypothesize that the neural network’s ability to extract and integrate information is at its peak (achieving its best performance) when its hierarchical structure is the richest (i.e., when module reuse and tinkering are most pronounced), corresponding to the order-rate η around 0.5.

a A schematic diagram of an MLP. b A schematic diagram of the laddergraph of this MLP. c Representing an MLP as a set of sequences.
Experiment setup
To investigate the connection between the structure of a neural network and its functionality, we chose a rather simple task: recognizing whether a three-digit number is odd or even. This was addressed using an MLP. The input consists of three neurons, representing the hundreds, tens, and ones place, respectively, while the output has two neurons indicating odd or even. The number of hidden layers varies from 1 to 4, each layer comprising a different number of neurons. To simplify the analysis, we limited each MLP to a maximum of 200 edges. With this constraint, an MLP with one hidden layer has only 40 distinct variations: given that the input layer has 3 neurons and the output layer has 2 neurons, a hidden layer of 40 neurons (we denote this MLP [3,40,2]) yields 3 × 40 + 40 × 2 = 200 edges, whereas the MLP [3,41,2] would have 205 edges. For MLPs with two, three, and four hidden layers, we constructed 200 distinct architectures for each category (such as [3,7,17,2], [3,8,3,5,2], and [3,3,5,14,2], which have 2, 3, and 3 hidden layers, respectively), giving a total of 640 different architectures.
The network connection weights were randomly initialized, and all 640 architectures underwent an identical training period, each for 2000 epochs. During the training, we monitored the changes in performance (measured by accuracy) and the network’s order-rate η (calculated based on the Ladderpath approach), to explore the relationship between these aspects. Section “Results” presents the evolutionary and statistical relationships between structure and function.
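For concreteness, the sketch below sets up one of these architectures ([3,7,17,2]) together with the three-digit odd/even dataset and trains it for 2000 epochs while recording accuracy. The text does not specify the activation function, loss, optimizer, or learning rate; ReLU, cross-entropy, Adam, and lr = 1e-3 are assumptions made here purely for illustration, and the η computation is left as a commented placeholder for the procedure described in the following sections.

```python
# Minimal PyTorch sketch of the experimental setup (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_mlp(layer_sizes):
    """Build an MLP such as [3, 7, 17, 2]: 3 inputs, hidden layers, 2 outputs."""
    layers = []
    for i, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        layers.append(nn.Linear(n_in, n_out))
        if i < len(layer_sizes) - 2:            # no activation after the output layer
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Every three-digit number, encoded digit-wise as (hundreds, tens, ones).
numbers = torch.arange(100, 1000)
X = torch.stack([numbers // 100, (numbers // 10) % 10, numbers % 10], dim=1).float()
y = (numbers % 2).long()                        # class 1 = odd, class 0 = even

model = make_mlp([3, 7, 17, 2])                 # 3*7 + 7*17 + 17*2 = 174 edges (within the 200-edge budget)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

history = []
for epoch in range(2000):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(X), y)
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        accuracy = (model(X).argmax(dim=1) == y).float().mean().item()
        # Here one would also record the order-rate η of the current weights,
        # computed with the Ladderpath-based procedure described below.
        history.append((epoch, accuracy))
```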
Sequence representation of a neural network
To utilize the algorithm based on the Ladderpath approach19 for studying the structure-function relationship in neural networks, we need to represent a neural network’s structure as a sequence or a set of sequences. If we view a neural network as a signal (i.e., input) propagation problem, or an information-flow problem on a graph, then the collection of all paths that these signals traverse can represent the network structure. Since the connection weights between neurons are real numbers, we first need to coarse-grain them, using different symbols to represent weights within the same range.
The higher the degree of coarse-graining, the more amenable the extracted information is to sequence analysis; the lower the degree of coarse-graining, the more information is retained, but this also introduces more unnecessary detail, which can make the subsequent analysis more difficult. Therefore, we conducted experiments with various degrees of coarse-graining to find an interval that is as large as possible without significantly diminishing functional performance. For the particular systems we selected, our numerical experiments showed that a coarse-graining interval of up to 0.1 has a minimal impact on the neural network’s performance (see Supplementary Note 1 for details).
Having obtained the appropriately coarse-grained graph, we can convert it into a set of sequences. Neurons are the nodes of this graph, and nodes within the same layer are assigned the same symbol because they share the same activation function. Connections between neurons are the edges of the graph; edges with distinct (discretized) weights are considered to carry distinct information and are thus represented by different symbols. For example, the path highlighted in gray in Fig. 3 can be denoted as AzBxC. Through this method, we can describe every path an input signal traverses and aggregate these paths into a set of sequences that depicts the graph. We can then employ the Ladderpath approach to examine these sequences and thereby investigate the characteristics of the neural network’s structure.
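A minimal sketch of this conversion is given below. The exact symbol alphabet, the binning convention for negative weights, and the treatment of biases are not specified in the text; the choices here (0.1-wide weight bins, lowercase letters for weight bins, one uppercase letter per layer) are illustrative assumptions that reproduce the “AzBxC”-style path strings.

```python
# Sketch: coarse-grain each weight into 0.1-wide bins, give every layer one
# node symbol and every weight bin one edge symbol, then enumerate all
# input-to-output paths as sequences such as "AzBxC".
from itertools import product
import string

def weight_symbol(w, bin_width=0.1, alphabet=string.ascii_lowercase):
    """Map a weight to an edge symbol; weights in the same bin share a symbol."""
    b = int(w // bin_width)
    return alphabet[b % len(alphabet)]          # wraps around for this toy alphabet

def mlp_to_sequences(weight_matrices, bin_width=0.1):
    """weight_matrices[k][i][j]: weight from neuron i in layer k to neuron j in layer k+1."""
    layer_symbols = string.ascii_uppercase       # 'A' for layer 0, 'B' for layer 1, ...
    sizes = [len(W) for W in weight_matrices] + [len(weight_matrices[-1][0])]
    sequences = []
    for path in product(*(range(n) for n in sizes)):   # one neuron chosen per layer
        seq = layer_symbols[0]
        for k in range(len(weight_matrices)):
            w = weight_matrices[k][path[k]][path[k + 1]]
            seq += weight_symbol(w, bin_width) + layer_symbols[k + 1]
        sequences.append(seq)
    return sequences

# Toy example: a [2, 2, 1] network; each path reads like "AbBcC".
W = [[[0.12, -0.31], [0.07, 0.18]],             # layer 0 -> layer 1
     [[0.25], [-0.05]]]                          # layer 1 -> layer 2
print(mlp_to_sequences(W))
```

The resulting set of sequences is what the Ladderpath algorithm is then applied to.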
Results
Best performance when the hierarchical structure is the richest
To begin with, we statistically investigate the relationship between the order-rate η and the performance of neural networks (measured by the accuracy of the odd-even recognition task). Figure 4 shows the distribution of η vs. accuracy for the aforementioned 640 different architectures after the same duration of training (2000 epochs). We observed that the overall distribution of η values was relatively concentrated, and that networks with good performance predominantly had η values between about 0.3 and 0.6. Note that one reason for the smaller number of networks on either side in Fig. 4 is that the η of neural networks naturally tends to move towards the center during training (as we shall see in more detail in section “Network evolves to enrich hierarchical structures through training”), regardless of whether the accuracy was eventually maximized.

The statistics include all of the aforementioned 640 different architectures.
As discussed in section “Ladderpath characterizes hierarchical structures”, the order-rate η characterizes the hierarchical relationships within a system. When η is approximately 0, the network structure is completely disordered; in contrast, when η is around 1, the structure is highly ordered. An η value near 0.5 indicates the richest hierarchical structure, or the most “complex” structure, according to the Ladderpath approach. We conjecture that neither a completely disordered nor an excessively ordered network structure is conducive to efficient information extraction and integration, which in turn limits the performance of a neural network. The observation in Fig. 4 supports this conjecture, suggesting that neural networks often perform best when they exhibit the richest hierarchical structure, corresponding to an η value in the middle range.
We also employed MLPs for several other tasks to observe their behaviors on a broader scale. For details on the additional five tasks, which include one more classification task, two regression tasks, and two time series forecasting tasks, please refer to Supplementary Note 2. The results from these five tasks are consistent with the observations reported in this section.
Lastly, note that there are some cases where the order-rate η is around 0.5 but the performance is poor. The reason is that a rich hierarchical and nested structure (characterized by η around 0.5) is not a sufficient condition for good performance. Consider, for example, an MLP that has been well-trained to perform task A and exhibits a rich hierarchical structure (with η around 0.5). If we use this network for task B, it will almost certainly not perform well on the new task (even though the network structure remains unchanged), necessitating re-training with data specific to task B. Even if the re-trained network for task B later has an η around 0.5, its structure might still differ significantly from the original one for task A. Ultimately, this is because a single η value can correspond to many different network structures; thus, η can serve as a reference for assessing a network structure but cannot uniquely define a network.
Network evolves to enrich hierarchical structures through training
From the statistical analysis presented above, we observe a strong correlation between the order-rate η and the performance of neural networks after extensive training. Our focus now shifts to how the η value of a neural network evolves during the course of training. Figure 5a illustrates the behavior of the randomly initialized MLP [3,7,17,2] (whose η is therefore initially between about 0.1 and 0.2). As training progresses, the η value tends towards 0.5, accompanied by a gradual increase in accuracy, indicating a robust correlation between the two. Figure 5b shows the same MLP initialized uniformly, with all weights set identically (resulting in an initial η approximately equal to 1). Similar to the previous case, with ongoing training the η value of this network also approaches the center, and its accuracy improves correspondingly. This trend is also observed in another MLP architecture, [3,6,15,2], as demonstrated in Fig. 5c, d.

a The evolution of η in an MLP with architecture [3,7,17,2], with random initialization, that is, η starts with a very small value. b The same MLP as in a, but initialized with identical weights for each link between neurons, that is, η starts with 1. c The evolution of η in an MLP with architecture [3,6,15,2], with random initialization. d The same MLP as in c, initialized with η ≈ 1.
Note that it is not the case that all networks that ultimately reach an η value in the center will perform sufficiently well (as can be seen from Fig. 4, where some networks with η in the middle still exhibit poor performance). Nonetheless, if a network is to effectively handle the task, its η needs to be in the middle range. This observation suggests two key points: (1) Training, namely the process of gradient descent in search of solutions, essentially leads to the neural network developing a richer hierarchical structure; (2) While a rich hierarchical structure does not guarantee successful problem handling, having such a rich hierarchical structure is essential if the network is to perform well.
Initial value of order-rate η makes a difference
For many neural networks and other areas in machine learning, the selection of initial values is crucial. An improper choice of these initial values can result in slower convergence and worse performance. Based on the previously discussed relationship between the order-rate η and the performance of a neural network, we hypothesize that initializing a neural network so that its order-rate η is set to a mid-range value might facilitate easier training success or yield better performance. The following experiments are conducted to test this hypothesis.
We first selected a network architecture, specifically [3,8,3,5,2], and used various initializations to establish a broad spectrum of initial η values. We then trained each network for 2000 epochs and observed the final performance of these neural networks, namely their accuracy on the task of differentiating between odd and even numbers.
The results, illustrated in Fig. 6, were quite revealing. We found that networks with an initial η value set at a medium level (approximately within the range of 0.35 to 0.7) demonstrated a higher likelihood of achieving an accuracy exceeding 0.8, while networks initialized outside of this range almost invariably failed to reach such an accuracy through training. Furthermore, we observed that the average accuracy after training was also at its peak when η was initialized at this medium level (noting that the baseline accuracy for the odd-even recognition task is 0.5, which is achievable through random guessing). This suggests a critical correlation between the initial value of η and the network’s eventual performance.

Note that both curves share the left y-axis, ranging from 0 to 1. The right y-axis indicates the counts for this histogram. When these counts are converted into proportions, they are represented by the blue curve. This curve specifically indicates the proportion of neural networks (NN) that have achieved an accuracy greater than or equal to 0.8.
Discussion
Research into the relationship between the structure and functionality of neural networks aids theoretically in understanding the mathematical principles behind neural networks and artificial intelligence. Practically, it helps in optimizing configurations for specific tasks and also provides fresh ideas for designing new types of neural networks. We employ the recently developed Ladderpath approach which is within the broader category of AIT to study the structure-function relationship in neural networks. The Ladderpath approach suggests that the complexity of a system should be evaluated on two axes: one is the ladderpath-index, relating to the difficulty of reconstructing the system, and the other is the order-index, which is roughly positively correlated with the number of repeating substructures in the system and their hierarchical relationships. Systems with a high ladderpath-index but a low order-index correspond to disordered systems, while those with a high order-index but a low ladderpath-index are more akin to systems like crystals; both are special cases. The Ladderpath approach suggests that “complexity” should lie in the middle of these two axes, thus the order-rate η has been proposed19, which describes this relative relationship: when η is around 0.5, complexity is at its peak, and the hierarchical relationships are the richest. Our hypothesis is that the complexity of a structure may correspond to the capacity or performance of its functionality. This structure-function relationship has already been validated in areas like chemical molecules20 and amino acid sequences19.
Our experiments confirmed our hypothesis in two respects. On one hand, as seen in section “Best performance when the hierarchical structure is the richest”, the networks that perform best statistically have η values between about 0.3 and 0.6, suggesting that networks with richer hierarchical structures and higher complexity tend to perform better. On the other hand, as observed in section “Network evolves to enrich hierarchical structures through training”, regardless of whether the initial neural network is completely disordered (randomly initialized) or completely ordered (all weights being the same), η tends to move towards the middle range during training, indicating that the network’s hierarchical structure becomes increasingly rich as training progresses. These observations are similar to the work of You et al. (2020)13, where a well-performing neural network’s corresponding relational graph (which depicts the inter-layer connections of the neural network) has an average path length and clustering coefficient (which intuitively have opposite effects) within a certain suitable range, and the training process moves these indicators towards this range. As mentioned earlier, the Ladderpath approach has been supported in distinct systems, including chemical molecules and amino acid sequences. This suggests that the observed structure-function relationship could be more universally applicable.
In addition to the odd-even recognition task, we conducted five additional tasks: one more classification task, two regression tasks, and two time series forecasting tasks using MLPs (the results are shown in Supplementary Note 2). The results of these five tasks align well with the conclusions drawn here.
Taking these observations into account, we may consider whether setting the initial η of a neural network in the middle range could improve its performance after training. An appropriate initialization may significantly boost the neural network’s effectiveness and performance. Our experimental results, as presented in section “Initial value of order-rate η makes a difference”, support this idea: initializing η in the middle range tends to result in the highest average accuracy and the greatest likelihood of achieving the best performance. Therefore, η could be a useful metric for assessing the quality of neural network initialization. However, due to the technical challenges of ladderpath calculations, we must acknowledge that our analysis is currently limited to networks of constrained sizes and to MLPs. Nonetheless, our results still provide valuable insights. For example, it is not difficult to see that the shared-weight architecture, as utilized in convolutional networks, effectively increases the system’s reuse, or in other words, its orderliness. Weight sharing facilitates the transition of η values towards the middle range, in contrast to MLPs, where weights are entirely independent. Therefore, practically speaking, although it is currently challenging to directly compute η for large-scale MLPs, the implication is that even merely adopting some weight sharing in MLPs, as opposed to fully independent random initialization of weights, could potentially enhance training speed and performance. This is a perspective that has not been considered in this light before.
Finally, it should be mentioned that this work has limitations and areas that merit further exploration. First, our observations raise the question of whether backpropagation, in its search for the optimal solution, is a process of increasing the complexity of the neural network structure, since the most complex structures correspond to η values in the middle range, which signify the richest hierarchical and nested relationships among repeating substructures. Second, as mentioned, our work is currently limited to networks of constrained sizes and to MLPs (although MLPs are extremely useful in practice, as they are often employed in the final layers of many architectures to perform output classification). This limitation is due to technical constraints: the algorithm we developed for ladderpath calculation can currently only handle sequences on the order of 10,000 to 100,000 in length, owing to the NP-hard nature of the problem. Although our current algorithm already represents a 2–3 order of magnitude improvement over previous works, which could only handle sequences of a few hundred18,20, it still falls short of dealing with the sequences corresponding to larger networks. For larger networks, a potential approach might involve random sampling of all signal paths, which is a direction for future exploration.
Nevertheless, this type of analysis, which utilizes the Ladderpath approach and AIT, has shown promising value. The significance of this work currently lies more at the theoretical level than at the practical level. It connects the hierarchical and nested structure of neural networks with their functional performance. Introducing the Ladderpath approach and AIT into this analysis is crucial, although it is not yet widely recognized within the artificial neural network community. More practical applications are expected to follow. Additionally, and perhaps even more importantly, the results obtained under the same theoretical framework are applicable to other seemingly unrelated systems, such as protein sequences (this work was recently published19). These findings highlight the universal significance of this relationship and deserve special attention.