Strong and weak random walks on signed networks


Introduction

Random walks play a central role in several aspects of network science1. Random walks are often used as a linear, tractable model for diffusive processes on networks, allowing us to explore questions such as the impact of certain network properties2, e.g. their community structure or their degree distribution, on the speed of the diffusion, e.g. via the mixing time3. Random walks are also a key ingredient in several methods to extract information from large-scale networks. Important areas of application include node centrality, which can be captured by the density of walkers on a node in the stationary state4, community detection, where groups of nodes can be defined as communities based on their tendency to retain flows of walkers for long times5,6, or the design of kernels/embeddings capturing the similarity between nodes on a graph7,8. Among the advantages of random-walk-based approaches, one can cite their intuitive interpretation in terms of flows along the edges, their connection to well-established theories such as that of Markov chains, their linearity, which opens up the toolbox of linear algebra, their ease of use in a distributed setting, each walker being independent from the others, and their capacity to probe the structure of the network at multiple scales.

In this article, our purpose is to explore how random walks can be defined on, and help us understand the structure of, signed networks. In signed networks, edges carry a value, positive or negative, to model trustful or mistrustful, activatory or inhibitory, cooperative or antagonistic interactions, depending on the context9,10,11. Random walks on signed networks have been studied in the past, either implicitly, by investigating the properties of the signed Laplacian, e.g. for node embedding12, or explicitly, by defining rules for how traversing a negative edge affects the random walker13, in a model that we will refer to as a strong random walk for reasons that will become clear later. As with many generalisations of “classical networks”, the generalisation of random walks to signed networks is not unique, and the choice should be motivated either by the process that one aims to model or by the underlying structure that one would like to extract. The power of random walks on unsigned networks stems from their deep connection to specific structural patterns, communities and bottlenecks, e.g. via the Cheeger inequality14. Communities can be seen as noisy connected components, i.e. connected components to which random edges have been added. The important question when defining the right model of random walks on signed networks is thus: what is the prototypical structure that one aims to identify?

The large-scale structure of signed networks is usually viewed through the prism of balance theory. There exist two forms of balance. Strong balance15 refers to the situation when the network can be partitioned into two groups, such that positive edges are inside the groups and negative edges between them. Equivalently, balanced networks are such that any cycle in the graph has an even number of negative edges. A weaker version of balance16 relies on the condition that there is no cycle with one single negative edge. Graphs that exhibit weak balance can be partitioned into k clusters with positive edges inside, and negative edges connecting them. It is well known that the signed Laplacian and the equivalent strong random walk allow one to efficiently recover networks with polarisation between two groups, akin to strong balance12. It is also well known that they may fail to uncover and properly characterise networks exhibiting a structure related to weak balance, and thus to serve as a model for inferring a partition into more than two groups10. Alternative approaches have been proposed to solve this problem, for instance through linear programming, as in the method proposed in17, which locates the optimal partition of the graph, or through other notions of Laplacian, e.g. SPONGE18 or the repelling Laplacian19, but they do not have a random walk interpretation and thus do not inherit the advantages described above. The main contribution of this article is the definition of a random walk process, called the weak random walk, specifically designed for structures derived from weak balance, hence particularly fitted for clustering problems involving more than two groups in signed networks. In essence, the process consists in allowing the walker to explore the graph around a seed node only until it encounters a negative edge for a second time.

This article is organised as follows. In section II, we provide an overview of the key concepts on unsigned networks and random walks, in particular related to the definition of a similarity matrix between nodes. In section III, we introduce a first type of random walk on signed networks, called the strong walk, and show its connections to the signed Laplacian. In section IV, we then discuss how to exploit such a random walk in order to define a similarity matrix between nodes of a signed network. In section V, we identify the limitations of the strong walks for networks exhibiting more than two clusters, and introduce weak walks to remedy the problem. In sections VI and VII, we compare the results of simple unsupervised and semi-supervised algorithms based on the two types of walks, showing that weak walks present an advantage in terms of accuracy for certain families of networks. Finally, in section VIII, we conclude and discuss implications of our work for the design of algorithms.

Results

Notations and basics

We consider an undirected signed graph, characterised by the adjacency matrix As;ij, where As;ij = 1 indicates the presence of a positive edge and As;ij = −1 that of a negative edge. We decompose this matrix into two matrices encoding the positive and negative edges separately, each associated to an unsigned, undirected network: As;ij = A+;ij − A−;ij. The total number of connections of each node, positive or negative, defines its degree, ki = ∑j∣As;ij∣. The total number of nodes is denoted by N, and that of edges by M. In the following, we will also use the diagonal matrix of degrees Kij = kiδij and the adjacency matrix of the graph where signs are discarded, Au;ij = A+;ij + A−;ij. Note that ki = ∑jAu;ij. Our purpose is to model diffusive processes on signed networks. Here, we focus on discrete-time walks, but everything can be generalised to continuous time. In the following derivations and experiments, we consider unweighted networks, but the generalisation to weighted ones is straightforward. The methods we describe can be applied to both complete and non-complete graphs, with the condition that each node has at least one edge.

In the case when the graph has no negative edges, i.e. Au;ij = A+;ij, a discrete-time random walk evolves according to

$$n_j(t+1)=\sum_i n_i(t)\,T_{u;ij}$$
(1)

where ni(t) is the density of walkers on node i at time t and Tu = K−1Au is the transition matrix of the associated Markov chain. This equation describes a microscopic process where an ensemble of walkers all move synchronously at discrete times. At each step, a walker located on some node i jumps to one of its neighbours, with probability equal to one over the number of such neighbouring nodes. The spectral properties of the transition matrix, and how they relate to the associated graph structure, are well known1. In particular, we note that Tu shares the same spectrum as the symmetric matrix K−1/2AuK−1/2, which implies that its eigenvalues are all real and that its left and right eigenvectors, uα and vα respectively, are related by uα = Kvα. The eigenvalues λα lie in the interval [−1, 1], and the multiplicity of the dominant eigenvalue λ1 = 1 equals the number of connected components in the graph. The difference λ1 − λ2 between λ1 and the second largest eigenvalue, called the spectral gap, determines the mixing time of the process and quantifies the presence of bottlenecks hindering the flow of walkers across the system2. When the graph is connected, the left eigenvector corresponding to λ1, denoted by π, is given by πi = ki/(2M), which characterises how walkers are distributed across the nodes in the stationary state.
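These properties are easy to check numerically. The following sketch builds $T_u$ on a small hypothetical graph (not taken from this article) and verifies the stationary state and the realness and range of the spectrum:

```python
import numpy as np

# Toy 4-node unsigned graph (hypothetical example, not from the article).
A_u = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

k = A_u.sum(axis=1)              # degrees k_i
T_u = A_u / k[:, None]           # T_u = K^{-1} A_u, row-stochastic

# Stationary state of a connected graph: pi_i = k_i / (2M).
pi = k / k.sum()
assert np.allclose(pi @ T_u, pi)

# T_u shares its spectrum with the symmetric K^{-1/2} A_u K^{-1/2},
# so all eigenvalues are real and lie in [-1, 1].
evals = np.linalg.eigvalsh(A_u / np.sqrt(np.outer(k, k)))
assert evals.min() >= -1 - 1e-12 and evals.max() <= 1 + 1e-12
```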

As we mentioned in the introduction, random walks are at the core of several algorithms6,20. In particular, they are useful to capture the strength of indirect connections between nodes. In situations when the graph is sparse, so that only a small fraction of the pairs of nodes are connected (this is the case in the vast majority of real-world applications, especially when the graph is sufficiently large), random-walk-based kernels allow one to estimate the proximity of pairs of nodes that are not directly connected, by incorporating, in an unbiased (or in a supervised) way, the information of walks of any length. In the case of unsigned networks, the resulting similarity between two nodes is obtained from an appropriately weighted sum of the walks between them. Typically, the existence of many short walks between two nodes ensures their proximity. Important examples include the diffusion, or heat, kernels21,22,23 and the mean commute time24,25, but also the random-walk-with-restart similarity26,27,28. For reasons that will become clear later, namely that random walks in our model on signed networks tend to stop after a finite number of steps, the latter generalises most naturally (in our view) to the signed context.

Random walks with restart are a special type of random walk with teleportation. Given a node i, we consider a process where a walker may, at each step, either perform a jump to one of the neighbouring nodes with probability p, or teleport back to the original node i with probability (1 − p). It is straightforward to show that its stationary state is given by

$$\mathbf{n}=(1-p)\,\mathbf{e}_i\,(I-pT_u)^{-1},$$
(2)

where ei is a row vector made of zeros except for entry i, where it takes the value 1. The stationary state is, intuitively, peaked around the pinned node i, and spreads out, giving more importance to distant neighbours, as p increases. When p approaches 1, more and more importance is given to longer walks, and the vector converges to the stationary state π defined above. Expanding this expression as a Taylor series shows that it is very similar to the kernel used in Deepwalk29, and it defines the weighted neighbourhood of each node at different scales.

A similarity matrix can be obtained by repeating the process for each node and by stacking the N vectors, leading to

$$\kappa(p)=(1-p)\,K\,(I-pT_u)^{-1},$$
(3)

where each row is scaled by the node degree, to ensure that the matrix is symmetric. This quantity, or variations around it, has been used in a variety of contexts7, including node classification and clustering. Note that

$$\begin{aligned}\kappa_{ij}(p)&=(1-p)\,k_i\left[(I-pT_u)^{-1}\right]_{ij},\\ \kappa_{ij}(p)&=(1-p)\,k_i\left[\sum_k p^k T_u^k\right]_{ij}.\end{aligned}$$
(4)

As each term $k_i[T_u^k]_{ij}$ is the (i, j) element of AuK−1Au⋯K−1Au, κ(p) is symmetric for any value of p (the symmetry arises from detailed balance when teleportation to nodes is done proportionally to the stationary state of the walk without restart). Moreover, the ith row sums to ki, meaning that κ(p) can be interpreted as a weighted graph, interpolating between the original graph (with the addition of self-loops) when p approaches 0 and the low-rank matrix obtained from the outer product of the stationary states, kikj/(2M), when p approaches 1. For any value of p, κij(p) can be fed into standard methods for weighted networks. For instance, its Newman-Girvan modularity30 can be used to uncover its multi-scale modularity, an approach reminiscent of Markov stability31,32,33.
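The three properties just mentioned (symmetry, row sums equal to ki, and the rank-one limit as p → 1) can be verified directly; a minimal numerical sketch on a hypothetical triangle graph:

```python
import numpy as np

# Kernel kappa(p) = (1 - p) K (I - p T_u)^{-1} on a toy triangle graph
# (hypothetical example); checks symmetry, row sums and the p -> 1 limit.
A_u = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]], dtype=float)
k = A_u.sum(axis=1)
T_u = A_u / k[:, None]
N = len(k)

def kappa(p):
    return (1 - p) * np.diag(k) @ np.linalg.inv(np.eye(N) - p * T_u)

K_half = kappa(0.5)
assert np.allclose(K_half, K_half.T)          # symmetric for any p
assert np.allclose(K_half.sum(axis=1), k)     # row i sums to k_i
# As p -> 1, kappa approaches the rank-one matrix k_i k_j / (2M).
assert np.allclose(kappa(0.999), np.outer(k, k) / k.sum(), atol=1e-2)
```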

Strong walks

Introduction and basic properties

To generalise the classical methods defined in the previous section and, more generally, any algorithm based on random walks, one needs a way to define a random walk on a signed network. The natural way to do so is to consider the dynamics of two types of walkers, positive walkers and negative walkers13. In the following, we therefore start by describing how the number of positive walkers, n+;i(t), and that of negative walkers, n−;i(t), on each node i evolve in time, so as to define different versions of random walks on signed networks. In this section, we give an interpretation of established methods for signed networks based on the signed Laplacian in the language of random walks.

The strong random walk, as we call it, is defined as follows. Positive and negative walkers perform a random walk by taking an edge, independently of its sign. If the walker takes a positive edge, it keeps its own sign; else it changes sign. The rate equation for the process is:

$$\begin{aligned}n_{+;j}(t+1)&=\sum_i\left(n_{+;i}(t)\,T_{+;ij}+n_{-;i}(t)\,T_{-;ij}\right),\\ n_{-;j}(t+1)&=\sum_i\left(n_{-;i}(t)\,T_{+;ij}+n_{+;i}(t)\,T_{-;ij}\right),\end{aligned}$$
(5)

where

$$T_{+;ij}=A_{+;ij}/k_i,\qquad T_{-;ij}=A_{-;ij}/k_i,$$
(6)

indicate the probabilities for a walker to move from i to j while keeping or changing its sign, respectively. In matrix notation:

$$T_{+}=K^{-1}A_{+},\qquad T_{-}=K^{-1}A_{-}.$$
(7)

If we consider the dynamics of the walkers independently of their sign and consider the time evolution of ni(t) = n+;i(t) + n−;i(t), one recovers the classical equation for random walk on networks

$$n_j(t+1)=\sum_i n_i(t)\,T_{u;ij},$$
(8)

where Tu = K−1Au. If, in contrast, we consider the time evolution of the difference Δi = n+;i − n−;i, that can be understood as the total “charge” on node i, we find

$$\Delta_j(t+1)=\sum_i \Delta_i(t)\,T_{s;ij},$$
(9)

where Ts = K−1As is the signed transition matrix from which one can build the standard signed Laplacian, whose spectral properties are closely associated to the notion of strong balance and have been used for node embeddings and clustering10,12.

Another interesting perspective is to consider the sets of equations (5) as one single system of equations driven by the matrix

$$T=\begin{pmatrix}T_{+}&T_{-}\\ T_{-}&T_{+}\end{pmatrix}.$$
(10)

The latter is simply the (unsigned) transition matrix of a larger graph, composed of 2N nodes (each node appears twice, once for positive walkers, and once for negative walkers), with adjacency matrix

$$A=\begin{pmatrix}A_{+}&A_{-}\\ A_{-}&A_{+}\end{pmatrix}.$$
(11)

The whole diffusive process can thus be understood in terms of the spectral properties of that transition matrix. The system of equations decouples when performing the following change of variable: for each i, we define Δi = n+;i − n−;i and ni = n+;i + n−;i, where the determinant of the transformation matrix is 2, leading to the transition matrix

$$T'=\begin{pmatrix}T_{s}&0\\ 0&T_{u}\end{pmatrix},$$
(12)

for these variables. The spectral properties of T are therefore determined by those of the two matrices Ts and Tu, with well-known special cases, e.g., when the graph is disconnected or (strongly) structurally balanced13.
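This decoupling is easy to confirm numerically: the spectrum of the enlarged matrix T is the union of the spectra of Ts and Tu. A sketch on a small hypothetical signed graph:

```python
import numpy as np

# Spectrum of the enlarged strong-walk matrix T = [[T+, T-], [T-, T+]] equals
# the union of the spectra of T_s = K^{-1}A_s and T_u = K^{-1}A_u.
# Toy 3-node signed graph (hypothetical example).
A_plus = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
A_minus = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]], dtype=float)
k = (A_plus + A_minus).sum(axis=1)
T_plus, T_minus = A_plus / k[:, None], A_minus / k[:, None]

T = np.block([[T_plus, T_minus],
              [T_minus, T_plus]])
T_s = (A_plus - A_minus) / k[:, None]
T_u = (A_plus + A_minus) / k[:, None]

# Both T_s and T_u are similar to symmetric matrices, so spectra are real.
spec_T = np.sort(np.linalg.eigvals(T).real)
spec_split = np.sort(np.concatenate([np.linalg.eigvals(T_s).real,
                                     np.linalg.eigvals(T_u).real]))
assert np.allclose(spec_T, spec_split, atol=1e-8)
```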

Walk lengths

The strong walk can be equipped with a teleportation mechanism in order to keep walkers concentrated in the neighborhood of a target node. Given a teleportation probability (1 − ps), we consider the expected walk length before a teleportation. If ps = 1, the expected length is infinite, so we assume ps < 1 here. If a walk has length k, teleportation did not happen at the first k − 1 time steps but did at the kth; the number of steps during which teleportation does not occur thus follows a geometric distribution (a negative binomial with a single success) with success probability 1 − ps. Hence, the expected walk length is the expected number of steps without teleportation plus 1, i.e.,

$$\mathbb{E}[l\mid X_0=i]=\frac{p_s}{1-p_s}+1=\frac{1}{1-p_s}.$$
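A quick Monte Carlo sanity check of this geometric walk-length distribution (an illustrative sketch; ps = 0.8 is an arbitrary choice, not a value used in the article):

```python
import numpy as np

# Monte Carlo check of E[l] = 1/(1 - p_s) for the strong walk with restart:
# the walk length is a geometric random variable with success prob. 1 - p_s.
rng = np.random.default_rng(0)
p_s = 0.8
lengths = rng.geometric(1.0 - p_s, size=200_000)
assert abs(lengths.mean() - 1.0 / (1.0 - p_s)) < 0.1   # expected length 5
```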

Signed network similarity matrix

In the case of unsigned networks, most kernels are based on the principle that similar nodes tend to be connected by several, ideally short walks, which can be captured by the flow of walkers between them. In the case of signed graphs, one needs to distinguish between the type of walks, as some of them will make a pair of nodes more similar, while others will make that pair more dissimilar. Hence, a modelling choice has to be made to define “positive” versus “negative” walks.

Let us start with an intuition based on the notion of strong balance. A graph is said to be strongly balanced if it does not possess any cycle with an odd number of negative edges. Equivalently, this graph can be partitioned into two sets, say A and B, of nodes such that the positive edges are within each cluster, and the negative edges are between the two. This also implies that any walk between two nodes of the same set, say A, possesses an even number of negative edges, while any walk between two nodes of different sets possesses an odd number of negative edges.

Given this extra ingredient, we follow the steps leading to the similarity matrix inspired by random walks with restart. For each node i, we consider a process where, at each step, a walker either jumps to one of its neighbouring nodes with probability ps, possibly changing sign, or teleports back to the pinned node with probability 1 − ps. Upon teleportation, the walker automatically becomes positive, independently of its previous sign. This choice is motivated by the fact that a node should be similar to itself. As before, the process defines for each node i a vector, but it is now a 2N vector, where the first N entries encode the density of positive walkers (arriving through walks with an even number of negative edges, or positive walks), while the last N entries encode the density of negative walkers (arriving through negative walks).

The walker begins as a positive walker on some node i, described by the 2N row vector ei. The state of the walker after k steps, uk, can be written in terms of the state after k − 1 steps, uk−1, as

$$\mathbf{u}_k=(1-p_s)\,\mathbf{e}_i+p_s\,\mathbf{u}_{k-1}T,$$
(13)

according to the dynamics defined by the transition matrix T. The stationary state of this process is defined by

$$\mathbf{n}=(1-p_s)\,\mathbf{e}_i\,(I-p_sT)^{-1}.$$
(14)

The first N entries of n encode the flow of walkers on positive walks from i to the other nodes, hence capturing similarity between the nodes. The last N entries encode the flow of walkers on negative walks from i to the other nodes, hence encoding dissimilarity between the nodes. More precisely, stacking the first N elements of this vector defines the i-th row of a similarity matrix κ+. Stacking the last N elements defines the i-th row of a dissimilarity matrix κ−. The two matrices can be seen as the positive and negative adjacency matrices of a graph where weights involve longer walks that explore a larger area of the network as ps increases. Taking the difference between the matrices defines a signed similarity matrix, which compares the relative importance of positive and negative walks between nodes. Note that in Eq. (14), we give the same importance to each node, in contrast to Eq. (3), as there is no straightforward way to make the similarity matrix symmetric by a weighting of the nodes. Other choices could be made, for instance weighting the nodes by their degree. In practice, for the experiments where we need the entire kernel, we symmetrise the signed similarity matrix by adding it to its transpose.
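The construction above can be sketched on a toy strongly balanced graph (a hypothetical 4-node example with clusters {0, 1} and {2, 3}; ps = 0.8 is an arbitrary choice):

```python
import numpy as np

# Strong-walk signed similarity (Eq. (14)) on a toy balanced 4-node graph:
# positive edges inside clusters {0, 1} and {2, 3}, negative edges between.
A_plus = np.array([[0, 1, 0, 0], [1, 0, 0, 0],
                   [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
A_minus = np.array([[0, 0, 1, 0], [0, 0, 0, 1],
                    [1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
N, p_s = 4, 0.8
k = (A_plus + A_minus).sum(axis=1)
T = np.block([[A_plus / k[:, None], A_minus / k[:, None]],
              [A_minus / k[:, None], A_plus / k[:, None]]])

i = 0
e_i = np.zeros(2 * N); e_i[i] = 1.0             # walker pinned (positive) on i
n = (1 - p_s) * e_i @ np.linalg.inv(np.eye(2 * N) - p_s * T)

signed_sim = n[:N] - n[N:]                      # similarity minus dissimilarity
assert np.isclose(n.sum(), 1.0)                 # total walker density conserved
assert signed_sim[1] > 0 and signed_sim[2] < 0  # same vs opposite cluster
```

On this balanced example, the signed similarity is positive exactly for nodes in the same cluster as the seed, as expected from the gauge argument above.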

Weak walks

Definition and basic properties

The walk (5) appears appropriate to uncover the similarity of two nodes from a strong balance perspective, as walks with an even number of negative edges increase the similarity between the nodes, while walks with an odd number decrease it. However, this is not the case when searching for similar nodes based on the notion of weak balance; see Fig. 1. A graph is said to be weakly balanced if it does not possess cycles with one single negative edge. Weakly balanced graphs are k-clusterable, meaning that they can be partitioned into k factions, where the negative edges are between the factions and the positive edges are within the factions. The condition based on the presence of a single negative edge suggests defining the dissimilarity between two nodes via the existence of walks with one single negative edge between them. Walks involving more than one negative edge are instead considered uninformative and should not be taken into account when estimating the similarity between two nodes.

Fig. 1: When estimating the similarity between two nodes, here colored in black, strong walks may lead to conflicting signals when the network is composed of more than 2 clusters, here illustrated via a positive walk in blue and a negative walk in red (solid: positive; dashed: negative).

Weak walks only consider walks with at most one negative edge and thus properly consider the two nodes as dissimilar.


These arguments suggest modifying the strong walk by imposing that a walker can become negative only once, and that it disappears when taking a negative edge a second time. This rule introduces an asymmetry between the node states, as positive and negative walkers now play different roles. The dynamics is encoded by the enlarged transition matrix

$$T_w=\begin{pmatrix}T_{+}&T_{-}\\ 0&T_{+}\end{pmatrix},$$
(15)

It is easy to show that powers of Tw keep the same block structure:

$$T_w^n=\begin{pmatrix}T_{+}^{n}&U_{n}\\ 0&T_{+}^{n}\end{pmatrix},$$
(16)

where $U_n=\sum_{i=0}^{n-1}T_{+}^{\,i}\,T_{-}\,T_{+}^{\,n-1-i}$, the sum running over the position of the single $T_-$ factor. As defined here, the process leads to a leak of probability, as walkers disappear after taking a negative edge for the second time, leading to a vanishing number of walkers asymptotically (except in trivial cases).
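The block structure of the powers of Tw can be checked numerically; since T+ and T− do not commute, each placement of the single T− factor contributes its own term. A sketch on a hypothetical 3-node signed graph:

```python
import numpy as np

# Check the block structure of T_w^n on a toy 3-node signed graph:
# upper-left is T_+^n, lower-left stays zero, upper-right is U_n with
# one term per position of the single T_- factor.
A_plus = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
A_minus = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]], dtype=float)
N = 3
k = (A_plus + A_minus).sum(axis=1)
T_plus, T_minus = A_plus / k[:, None], A_minus / k[:, None]
T_w = np.block([[T_plus, T_minus],
                [np.zeros((N, N)), T_plus]])

n = 5
T_w_n = np.linalg.matrix_power(T_w, n)
U_n = sum(np.linalg.matrix_power(T_plus, i) @ T_minus
          @ np.linalg.matrix_power(T_plus, n - 1 - i) for i in range(n))

assert np.allclose(T_w_n[:N, :N], np.linalg.matrix_power(T_plus, n))
assert np.allclose(T_w_n[N:, :N], 0.0)   # negative walkers never turn positive
assert np.allclose(T_w_n[:N, N:], U_n)
```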

A natural way to deal with this leakage is to adopt a random walk with restart. We assume that a negative walker teleports back to a given target node when taking a negative edge. In contrast with the case of the strong walk, an additional teleportation is not necessary for the walks to remain finite before the restart, except in trivial situations (e.g., a connected component without a single negative edge). Moreover, the length of the walks before the restart is not controlled by a parameter, but by the structure of the signed network.

Suppose we start with one positive walker on node x, given by the initial state vector ex of length 2N. The random walk with restart is then defined by the matrix:

$$T_x=\begin{pmatrix}T_{+}&T_{-}\\ M_{x}&T_{+}\end{pmatrix},$$
(17)

where Mx is the zero matrix except for column x, which is given by the vector v with:

$$v_i=\sum_j T_{-;ij},$$
(18)

such that the matrix Tx is stochastic, with row sums equal to 1. The negative walker returns to node x as a positive walker when it encounters another negative edge. The evolution is obtained by repeatedly multiplying the initial state vector ex by the matrix Tx.
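The construction of Tx can be sketched as follows (a hypothetical 3-node example; the assertion confirms that the restart column restores stochasticity):

```python
import numpy as np

# Weak-walk restart matrix T_x of Eq. (17) on a toy 3-node signed graph:
# a negative walker taking a second negative edge restarts, positive, on x.
A_plus = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
A_minus = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]], dtype=float)
N, x = 3, 0
k = (A_plus + A_minus).sum(axis=1)
T_plus, T_minus = A_plus / k[:, None], A_minus / k[:, None]

M_x = np.zeros((N, N))
M_x[:, x] = T_minus.sum(axis=1)    # v_i = sum_j T_{-;ij}, routed back to x
T_x = np.block([[T_plus, T_minus],
                [M_x, T_plus]])

assert np.allclose(T_x.sum(axis=1), 1.0)   # no probability leaks any more
```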

The walk can also be modified to include an additional teleportation parameter pw, whereby each walker teleports back to the pinned node as a positive walker with probability 1 − pw, in addition to the restart triggered when a negative walker encounters a negative edge. This increases the teleportation rate as compared to the strong walk, and thus decreases the expected walk length, as we will show. In this framework, we can then construct a signed similarity matrix in the very same way as for strong walks. For a random walk starting as a positive one on node x, the state of the weak walker after k steps, uk, can be written in terms of the state after k − 1 steps, uk−1, as

$$\mathbf{u}_k=(1-p_w)\,\mathbf{e}_x+p_w\,\mathbf{u}_{k-1}T_x,$$
(19)

in much the same fashion as in Equation (13). We note that we have primarily considered unweighted networks, where all edges have a uniform weight of 1. However, the procedures outlined here generalise easily to weighted networks, where the weights can be interpreted as biasing the transitions of the walker. We also note that although both the strong and weak walks are designed with the intuition provided by strong and weak balance, they are applicable to incompletely balanced graphs as well, which is the case for most real-life datasets. We show in Section VII the results of these walks on unbalanced graphs.

Walk length

We now characterise the weak walks from the perspective of expected walk length. We consider a teleportation probability 1 − pw. In this case, teleportation is not the only event stopping (or restarting) the random walker: traversing a negative edge for the second time can also trigger it. Hence, the walk-length distribution is not network-independent, and we need to take the network structure into account. Specifically, we consider the enlarged transition matrix (15). We encode the expected number of visits in the matrix $Q\in\mathbb{R}^{2N\times 2N}$, where N dummy nodes separate the cases where a node is visited via paths with zero or one negative edge. Then,

$$Q=\sum_{k=0}^{\infty}p_w^k\,T_w^k=(I-p_w T_w)^{-1},$$

where I is the identity matrix.

Hence, the expected walk length starting from node i is

$$\mathbb{E}[l\mid X_0=i]=\sum_{j=1}^{2N}Q_{ij}.$$

Note that we only count the first N rows in Q. Therefore, the expected walk length for weak walk is

$$\mathbb{E}[l]=\sum_{i=1}^{N}\mathbb{E}[l\mid X_0=i]\,P(X_0=i).$$
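The computation above can be carried out in a few lines; the sketch below uses a hypothetical 3-node signed graph with pw = 0.8 (an arbitrary choice) and checks that killing walkers on the second negative edge can only shorten walks relative to the strong-walk value 1/(1 − pw):

```python
import numpy as np

# Expected weak-walk length via Q = (I - p_w T_w)^{-1} on a toy 3-node
# signed graph (hypothetical example; p_w = 0.8 is an arbitrary choice).
A_plus = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
A_minus = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]], dtype=float)
N, p_w = 3, 0.8
k = (A_plus + A_minus).sum(axis=1)
T_plus, T_minus = A_plus / k[:, None], A_minus / k[:, None]
T_w = np.block([[T_plus, T_minus],
                [np.zeros((N, N)), T_plus]])

Q = np.linalg.inv(np.eye(2 * N) - p_w * T_w)
E_len = Q[:N].sum(axis=1)    # E[l | X_0 = i], walker starts positive on i
E_avg = E_len.mean()         # uniform initial distribution

# The second-negative-edge rule only shortens walks: 1 <= E[l] <= 1/(1 - p_w).
assert np.all(E_len >= 1.0) and np.all(E_len <= 1.0 / (1.0 - p_w) + 1e-9)
```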

We verify the theoretical results with Monte Carlo simulations on Signed Stochastic Block Models (SSBMs, see section IV A for details), where edges occur with probability 0.05 and do not flip sign from the initial configuration; see Fig. 2. We consider a uniform initial distribution, repeat the random walk starting from each node ns = 100 times, and measure the walk length until either teleportation occurs or the walker traverses a second negative edge.

Fig. 2

Distribution of the expected walk lengths starting from different nodes, from both the theoretical results (blue) and Monte Carlo simulations (orange), on a 3-community symmetric SSBM with 100 nodes per community, p = 0.05 and different levels of teleportation (left: pw = 1; middle: pw = 0.8; right: pw = 0.5; for comparison, the strong walk with ps = 0.8, 0.5 has expected walk length 5 and 2, respectively).


Applications

In this section we present an application of the weak walk method to signed network clustering. We first show the difference between the resulting signed similarity matrices for the weak and strong walks on two small signed network examples, for which ground truth clusters exist. Next, we compare the weak walk against two other methods by performing semi-supervised clustering on a series of synthetic SSBM networks, with varying levels of noise, community numbers, and link densities. Specifically, we compare the weak walk against (1) the strong walk and (2) the Signed Random Walk with Restart (SRWR), a three-parameter signed network walk designed for semi-supervised clustering which generates a ranking of nodes in signed networks34. The SRWR model considers a random walker that changes its sign as it takes positive or negative edges according to two parameters, as well as a teleportation parameter, and exhibits high accuracy at signed link prediction. The semi-supervised clustering for each random walk is performed in much the same fashion as with PageRank35; we begin with (positive) walkers on a set of labeled seed nodes. We use the similarity vector generated from the weak, strong and SRWR walks to obtain a ranking of each of the other nodes in the graph. Each unclassified node is assigned the label of the seed node in whose walk it ranks the highest. We use the parameter-free version of the weak walk with no teleportation (pw = 1). The strong walk has one parameter to adjust, the teleportation parameter (ps), while SRWR has three parameters: two parameters controlling whether the walker changes sign when it takes a negative edge, as well as a teleportation parameter. In these experiments, we sweep through the parameter space for the strong walk and the SRWR and take the parameters with the best results. We show that the weak walk outperforms the other two methods in certain cases, despite being parameter-free.
To evaluate the accuracy of the clustering assignment obtained in each case we use the Adjusted Rand Index (ARI)36 to compare against the ground truth clusters. These methods can be applied both to completely balanced and incompletely balanced graphs, and Section VII provides a sense of the lower bound of the degree of balance required for these methods to work.

Computational details

In the experiments that follow, we construct the signed similarity matrices for both the strong and weak walks by taking the state of the walk vector at k = n, where n is large enough for the walk to approach its stationary state. This can be written as an iterative process, allowing us to solve for the similarity matrix as a whole in one computation, which is more efficient than inverting a matrix as required by Equation (14). As matrix multiplication scales well, this reduces the computational complexity of the algorithm. Moreover, the matrix Tx appearing in Equation (19) is specific to the starting node x, so the direct approach would require N matrix inversions. See section IV B for a description of the algorithm. Specifically, if the state of the walker (strong or weak) is given by uk, the similarity between node x and any other node is given by the vector sx:

$$s_x=u_{k=n}[1:N]-u_{k=n}[N+1:2N],$$
(20)

since we subtract the negative walks from the positive. For the experiments on the empirical networks for which we calculate the entire similarity matrices, the kernel is made symmetric by adding it to its transpose.
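The iterative procedure can be sketched as follows on a hypothetical 4-node graph with two ground-truth clusters (pw = 1 gives the parameter-free version; all names and sizes here are illustrative, not the article's implementation):

```python
import numpy as np

# Iterative weak-walk signed similarity, one row per seed node, following
# Eqs. (19)-(20); toy graph with clusters {0, 1} and {2, 3}.
A_plus = np.array([[0, 1, 0, 0], [1, 0, 0, 0],
                   [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
A_minus = np.array([[0, 0, 1, 1], [0, 0, 1, 1],
                    [1, 1, 0, 0], [1, 1, 0, 0]], dtype=float)
N, p_w, n_steps = 4, 1.0, 500
k = (A_plus + A_minus).sum(axis=1)
T_plus, T_minus = A_plus / k[:, None], A_minus / k[:, None]

S = np.zeros((N, N))
for x in range(N):                            # one restart matrix T_x per seed
    M_x = np.zeros((N, N)); M_x[:, x] = T_minus.sum(axis=1)
    T_x = np.block([[T_plus, T_minus], [M_x, T_plus]])
    e_x = np.zeros(2 * N); e_x[x] = 1.0       # start as a positive walker on x
    u = e_x.copy()
    for _ in range(n_steps):
        u = (1 - p_w) * e_x + p_w * u @ T_x   # Eq. (19)
    S[x] = u[:N] - u[N:]                      # Eq. (20): positive minus negative

S = S + S.T                                   # symmetrize, as in the experiments
assert S[0, 1] > 0 and S[0, 2] < 0            # same vs different cluster
```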

Highland Tribes

There are very few empirical signed networks available that are labeled with ground truth cluster assignments. We show here two small examples with labeled data to demonstrate the distinction between the strong and weak walk kernels on these networks. The first network comes from37, and represents the antagonistic or friendly relationships between 16 tribes of the Eastern Central Highlands of New Guinea, encoded by edge weights of −1 or +1, respectively. The network can be clustered into three communities, which characterise the higher order groupings of the society38. Figure 3 gives the adjacency matrix for the Highland Tribes network, where the nodes have been ordered by their community groupings. This network is not perfectly balanced, but exhibits weak balance (see39 for a full characterization of the network’s local and global balance). As the network has more than two communities, we expect that the weak walk will outperform the strong walk at locating the three communities. The resulting strong and weak walk kernels generated by this network with 500 steps are also given in Fig. 3. The walks themselves produce a similarity matrix which, like the adjacency matrix of the graph, can be used as input into other methods to produce community assignments. For example, the adjacency matrix can be combined with other methods, such as the spin-glass community detection method for signed networks based on simulated annealing introduced in40, to obtain the community assignments in an unsupervised fashion, which can be compared against the ground truth community assignments using the ARI score. As our goal here is to test whether the weak walk is able to outperform the strong walk, we take the parameter-free version of the weak walk, with no teleportation.
We compare against the strong walk’s best performance across all possible parameter choices: we select the teleportation parameter by inputting each resulting kernel into the spin-glass algorithm, comparing the community assignments against the ground truth, and keeping the parameter that yields the best ARI score, here ps = 0.7. The strong walk kernel exhibits far more incorrect signs than the weak walk kernel: in particular, it exhibits several positive connections between the second and third communities. Like the adjacency matrix, these kernels can be used as input to the spin-glass algorithm to obtain unsupervised community assignments, allowing us to compare the clusterings produced by the three matrices. In this case, the graph is quite small, dense and almost balanced, and so the adjacency matrix, the weak walk kernel and the strong walk kernel all produce an ARI of 1.0 when used as input to the spin-glass algorithm.

Fig. 3: (left) Adjacency matrix for the Highland Tribes network.

(centre) Strong walk kernel resulting from a 500 step walk on the Highland Tribes network with teleportation parameter ps = 0.7. (right) Weak walk kernel resulting from a 500 step walk with no teleportation pw = 1.0. Positive and negative values have been set to +1 or −1 respectively for clearer visualization.


Sampson’s monastery

In this section we present another example of a small empirical signed network with ground-truth labels. The Sampson’s Monastery network comes from an ethnographic study of the community structure within a New England monastery, where each relationship between the 18 novice monks has been designated as positive (friendly) or negative (antagonistic). Sampson, the ethnographer, divided the men into four groups according to the community structure he observed41. Although this example also represents a small network, the graph is more sparse than the Highland Tribes example, and has more noise in terms of edge signs that violate structural balance (see39 for a full description of the network’s structural balance). Figure 4 gives the network adjacency matrix, with the nodes arranged according to the four ground-truth communities. Again, since this network has more than two communities, we expect the weak walk to outperform the strong walk at identifying the four communities. The resulting strong and weak walk kernels generated by this network with 500 steps are also given in Fig. 4. Again, we use the parameter-free weak walk with no teleportation, while for the strong walk, we choose the teleportation parameter ps = 0.8, which achieves the highest ARI score when combined with the spin-glass algorithm. The strong walk kernel exhibits far more incorrect signs than the weak walk kernel, generating several positive connections between the second and third communities, and between the third and fourth communities. When combined with the spin-glass community detection algorithm, the parameter-free weak walk kernel obtains the best ARI (0.639), compared to the adjacency matrix (0.442) and the strong walk kernel (0.518).

Fig. 4: (left) Adjacency matrix for the Sampson’s Monastery network.

(centre) Strong walk kernel resulting from a 500 step walk on the Sampson’s Monastery network with teleportation parameter ps = 0.8. (right) Weak walk kernel resulting from a 500 step walk with no teleportation. Positive and negative values have been set to +1 or −1 respectively for clearer visualization.


Application: semi-supervised clustering

In this section, we use the weak walk kernel to perform semi-supervised clustering of several different synthetic SSBMs with community structure. This application is semi-supervised, as we use prior information about the labels of a select group of nodes to label the rest of the graph. We compare against the kernel generated by the strong walk, and against the SRWR. In these experiments we take the parameter-free version of the weak walk, with no teleportation, while we sweep through the parameter space for the strong walk and the SRWR and take the parameters that achieve the highest ARI. Consequently, these methods perform better than the weak walk in some cases, but they may be less useful in practice, since the ideal parameters are not known a priori and indeed tend to change with each realization of the SSBM, even when the SSBM parameters are held constant.
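As a minimal sketch of this semi-supervised labeling step (in Python; `label_from_seeds`, its signature, and the toy kernel are our own illustration, not the paper’s code), each node can be assigned the community of the labeled seed node to which it is most similar under the signed kernel:

```python
import numpy as np

def label_from_seeds(kernel, seeds):
    """Assign each node to the community of the most similar labeled seed.

    kernel : (N, N) signed similarity matrix; row i holds the similarities
             obtained from walks started at node i (hypothetical input,
             e.g. a weak walk kernel).
    seeds  : dict mapping a seed node index to its known community label.
    """
    seed_ids = list(seeds)
    scores = kernel[seed_ids, :]        # (n_seeds, N): similarity to each seed
    best = np.argmax(scores, axis=0)    # most similar seed for every node
    return np.array([seeds[seed_ids[b]] for b in best])

# Toy kernel with two obvious blocks and one labeled seed per block
# (illustrative values only).
K = np.array([[ 1.0,  0.8, -0.5, -0.6],
              [ 0.8,  1.0, -0.4, -0.5],
              [-0.5, -0.4,  1.0,  0.7],
              [-0.6, -0.5,  0.7,  1.0]])
labels = label_from_seeds(K, {0: "A", 2: "B"})
```

The signed kernel makes this assignment natural: negative similarities push a node away from seeds in antagonistic communities, rather than merely indicating a lack of connection.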

Strong balance: two-community SSBMs

Here, we look at 2-community SSBMs with 100 nodes in each community and probability 0.05 of link existence between each pair of nodes. We perform the semi-supervised clustering with varying levels of noise on the signs of the links, via the sign-flip parameter ν (see section IV A for a description of the noise parameter). For the strong and weak walks, we take the state vector at iteration k = 100, and for the SRWR we set the maximum number of iterations to 100. The other parameters for the SRWR and the strong walk are chosen by a parameter sweep and change for each realization of the graph. We take the mean and standard deviation over 10 graph realizations; the results are given in Fig. 5. We observe that the SRWR performs best, a result to be balanced against the fact that it has the most parameters. The strong walk improves on the weak walk; this is expected, as the weak walk is designed to outperform the strong walk when there are more than two communities, and the strong walk has the advantage of teleportation.

Fig. 5

ARI from the semi-supervised clustering for the weak walk, strong walk and SRWR on 2 community SSBMs (strong balance).


Weak balance: three-community SSBMs

Here, we look at 3-community SSBMs with 100 nodes in each community and probability 0.05 of link existence between each pair of nodes. We perform the semi-supervised clustering with varying levels of noise on the signs of the links, as in the previous example. For the strong and weak walks, we take the state vector at iteration k = 100, and for the SRWR we set the maximum number of iterations to 100. As before, the other parameters are chosen by a parameter sweep and change for each realization of the graph. We take the mean and standard deviation over 10 graph realizations; the results are given in Fig. 6 (left). We observe that the SRWR again outperforms the other two methods. The strong walk and the weak walk exhibit similar performance, despite the strong walk having one extra parameter. In different empirical networks, the noise may occur along positive or negative edges with different probabilities. Consequently, we investigate the results when the noise on the signs of the edges is restricted either to inside the communities, by varying νinside and setting νbetween = 0, or to between communities, by varying νbetween and setting νinside = 0 (see section IV A for details). The weak walk is much more robust to noise inside communities (Fig. 6 (centre)), and indeed performs almost as well as the SRWR in this case. When the noise is between the communities (Fig. 6 (right)), the weak walk is less robust and the SRWR outperforms both the strong and weak walks.

Fig. 6: ARI from the semi-supervised clustering for the weak walk, strong walk and SRWR on 3 community SSBMs (weak balance).

(left) Noise level general to the whole graph. (centre) Noise is only inside communities. (right) Noise is only between communities.


Weak balance: six-community SSBMs

Here, we perform the same experiments but with 6-community SSBMs with 50 nodes in each community and probability 0.05 of link existence between each pair of nodes. As the number of communities increases, the problem of clustering signed networks becomes more difficult. We again perform the semi-supervised clustering with varying levels of noise on the signs of the links. For the strong and weak walks, we take the state vector at iteration k = 300, and for the SRWR we set the maximum number of iterations to 300. The other parameters are chosen by a parameter sweep and change for each realization of the graph. We take the mean and standard deviation over 10 graph realizations; the results are given in Fig. 7 (left). In this case, we observe that the weak walk performs much better than the strong walk and the SRWR, although the performance decreases quickly as the noise level increases. As in the previous example with 3 communities, we investigate the results when the noise on the signs of the edges is restricted either to inside the communities, as in Fig. 7 (centre), or to between the communities, as in Fig. 7 (right). The weak walk is indeed much more robust to noise inside communities than to noise between communities, outperforming both the strong walk and the SRWR by a larger margin. The distinction between the methods is even clearer than in the previous case of 3 communities. The weak walk, despite having no parameters, outperforms the other methods when the weak balance structure is more clear.

Fig. 7: ARI from the semi-supervised clustering for the weak walk, strong walk and SRWR on 6 community SSBMs (weak balance).

(left) Noise level general to the whole graph. (centre) Noise is only inside communities. (right) Noise is only between communities.


Asymmetry

In most of the previous literature on clustering signed networks, the density of links between each pair of communities is held constant. In practice, this may not be the case: different pairs of communities may be connected with different link densities. This produces what we call here an asymmetric SSBM. We define an asymmetry parameter a which controls the link density between communities, by defining the link-probability matrix of a 3-community asymmetric SSBM as [[0.05, 0.05a, 0.05], [0.05a, 0.05, 0.05], [0.05, 0.05, 0.05]]. One realization of the adjacency matrix for a 3-community asymmetric SSBM with 100 nodes in each community and a = 0.2 is given in Fig. 8.
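The link-probability matrix above can be built directly; a minimal sketch in Python (the function name and defaults are ours):

```python
import numpy as np

def asymmetric_prob_matrix(p=0.05, a=0.2):
    """Link-probability matrix for the 3-community asymmetric SSBM:
    the density between communities 1 and 2 is scaled down by a,
    while all other pairs (and the diagonal) keep density p."""
    P = np.full((3, 3), p)
    P[0, 1] = P[1, 0] = p * a
    return P
```

Sweeping `a` towards 0 makes the direct route between the first two communities increasingly rare relative to the two-hop route through the third community, which is the regime discussed below.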

Fig. 8

Adjacency matrix of 300 node 3-community asymmetric SSBM, with asymmetry parameter a = 0.2.


We perform the semi-supervised clustering on 300-node 3-community SSBMs with varying levels of asymmetry. For the strong and weak walks, we take the state vector at iteration k = 100, and for the SRWR we set the maximum number of iterations to 100. The results are given in Fig. 9. In every case, the weak walk that we have introduced here outperforms the strong walk and the SRWR. This is because walks between the first two communities can happen in two ways: either by taking a negative edge directly between the two, or by taking two negative edges, passing through the third community. The second kind of walk cannot happen in the case of the weak walk, which cannot take two negative edges. When the asymmetry is strong enough, i.e. when a is close to 0, the latter case is more likely, and introduces noise into the strong walk that does not occur in the case of the weak walk.

Fig. 9

ARI from the semi-supervised clustering for the weak walk, strong walk and SRWR on 3 community SSBMs (weak balance), with varying asymmetry.


Discussion

Random walks play a central role in the study of networks, and are at the core of several algorithms, e.g. for community detection and network embedding. Random walks allow us to explore the indirect patterns of connectivity between nodes, and their spectral properties are directly related to important network patterns, such as the cut size. The specific rules of a random walk process are a non-trivial choice, however, and it is now recognised that tuning the parameters of the model, e.g. to favour network exploration in depth versus in width, may lead to significant improvements, as was demonstrated by node2vec8 for instance. More generally, the choice of a random walk model should be driven by the type of structure that one aims to isolate and/or by the nature of the diffusive processes taking place on empirical data42.

This paper contributes to the problem of designing a random walk model that is “optimal” within the family of random walks on signed networks. We argue that the random walk process associated with the standard matrices for the mining of signed networks, e.g. the signed Laplacian, is inherently (and implicitly) biased towards the search for a strong balance structure. When the system exhibits multiple clusters, associated with weak balance, we argue that another random walk model, which we call the weak walk, is more compatible with the network structure.

After introducing the model and exploring some of its properties, we have considered a classical random-walk based measure of node similarity, in three versions based on the strong walk, the weak walk and the so-called SRWR walk respectively. Through a series of experiments on artificial and real-life data, including unsupervised and semi-supervised clustering, we have shown that the results of weak walks outperform those of strong walks when the graph has more than two communities or exhibits asymmetry in the density of links. We have also shown that the weak walk is competitive against the three-parameter SRWR walk in many cases, despite being parameter-free.

A wide range of algorithms for network mining take, as an input, trajectories generated by a random walk model. For the design of a similarity matrix, we have made a specific, yet classical, choice for how to combine the multiple walks that may connect a pair of nodes, but other choices could have been made. For example, the similarity matrix could be defined by summing over the walks at each step, by taking the state vector after k steps, by taking the stationary state of the walk as in PageRank35, or by combining the results from walks of different lengths in a non-linear fashion to give greater weight to longer or shorter walks, as in the similarity kernel introduced in7. Future research perspectives include feeding weak walk trajectories, instead of strong walk ones, into other algorithms, and exploring applications to other typical signed network problems, such as unsupervised clustering10 and link prediction43. As a specific example, the strong walk has been used to learn stances for users in signed social networks44. Combining weak walks with these state-of-the-art methods could yield, at minimal development cost, an improvement in performance and flexibility on empirical networks, especially when the assumption of strong balance does not hold. Another challenge for future research is the identification of the “right” number of factions when confronted with real-world data, in order to guide the user to methods building on strong or weak walks in practice.

Methods

Signed stochastic block model

We perform most of our experiments on a signed stochastic block model (SSBM), which is a random graph model with a cluster structure that exhibits either strong or weak balance. The SSBM has N nodes, separated into k clusters. The edge probabilities between each pair of nodes are independent, and are defined by a k × k probability matrix P, where Pij gives the probability that an edge exists between a node in cluster i and another node in cluster j. In the initial configuration, edges connecting nodes in the same cluster are positive, while edges connecting nodes in different clusters are negative. As in18, unlike in the unsigned case, the edge existence probabilities do not need to be different for inter- and intra-community edges, as the structure of the network is determined by the signs. We introduce a noise parameter called the sign-flip probability ν ∈ (0, 1), which gives the probability that the sign of each edge is flipped from the correct sign (positive inside communities and negative between communities). In section VI, we also introduce parameters νinside ∈ (0, 1) and νbetween ∈ (0, 1) which separately control the level of noise on the signs of the edges inside and between communities, respectively.
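The sampling procedure above can be sketched as follows (in Python; `signed_ssbm` and its signature are our own illustration of the model, not the paper’s code):

```python
import numpy as np

def signed_ssbm(sizes, P, nu, rng=None):
    """Sample a signed SSBM adjacency matrix.

    sizes : list of community sizes (length k)
    P     : (k, k) matrix of edge-existence probabilities
    nu    : probability of flipping each edge's "correct" sign
            (+1 inside communities, -1 between communities)
    """
    rng = np.random.default_rng(rng)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    N = len(labels)
    A = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            if rng.random() < P[labels[i], labels[j]]:
                # Correct sign: positive within a community, negative across.
                sign = 1.0 if labels[i] == labels[j] else -1.0
                if rng.random() < nu:   # sign-flip noise
                    sign = -sign
                A[i, j] = A[j, i] = sign
    return A, labels
```

Restricting the noise, as with νinside and νbetween, amounts to applying the flip only when `labels[i] == labels[j]` (inside) or only when they differ (between).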

Strong/weak walk experiments: iterative design

We construct the signed similarity matrix by taking the final state vector at k = n, where n is large and the state is approaching the stationary state. We compute the state vector at each step with an iterative process, which defines the similarity matrix as follows. Beginning with a positive walker on node x:

$${\mathbf{u}}_{0}={\mathbf{e}}_{x}.$$
(21)

For the strong walk, at each step the new vector uk is computed from uk−1 as:

$${\mathbf{u}}_{k}=(1-{p}_{s})\,{\mathbf{e}}_{x}+{p}_{s}\,{\mathbf{u}}_{k-1}T,$$
(22)

where ps is the strong walk teleportation parameter. For the weak walk, the process is defined as in Equation (19), although it is computationally easier to use the non-Markovian matrix Tw to compute uk from uk−1 as:

$${\mathbf{u}}_{k}=\left(1-({p}_{w}\,{\mathbf{u}}_{k-1}{T}_{w})\cdot {\mathbf{1}}\right){\mathbf{e}}_{x}+{p}_{w}\,{\mathbf{u}}_{k-1}{T}_{w},$$
(23)

where pw is the weak walk teleportation parameter and 1 is the all-one vector; thus (pw uk−1 Tw) ⋅ 1 sums over the probability that has already been allocated to the corresponding neighbours, and the first term re-sets the lost probability to the original node x as a positive walker. The advantage of using an iterative process over the matrix given in Equation (17) is that the iteration can be carried out for several initial states at the same time, while the matrix in Equation (17) is specific to the state where the walker begins, i.e. node x. However, the processes are equivalent, since the term Mx in Equation (17) returns all probability leakage back to node x as a positive walker, exactly as the first term in Equation (23). Finally, for both the strong and weak walks, the similarity between node x and any other node is given by the vector sx:

$${\mathbf{s}}_{x}={\mathbf{u}}_{k=n}[1:N]-{\mathbf{u}}_{k=n}[N+1:2N],$$
(24)

since we subtract the negative walks from the positive.

Note that in the experiments we perform, we sweep over all possible choices for ps, the strong walk teleportation parameter, and select the best one for each realization of the graph. However, we keep pw fixed at 1, such that the weak walk teleportation rate is determined entirely by the structure of the graph, keeping the weak walk parameter-free.
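A minimal sketch of the strong walk iteration of Equations (21), (22) and (24) is given below (in Python). The 2N-state transition matrix is built here under the common lifted convention that a walker crossing a negative edge flips its sign; this block construction is our own illustration and may differ in detail from the matrix T defined earlier in the paper.

```python
import numpy as np

def strong_walk_similarity(A, x, p_s=0.7, n_steps=100):
    """Iterate Eq. (22) from u_0 = e_x and return s_x as in Eq. (24).

    A : (N, N) signed adjacency matrix.
    x : starting node (positive walker).
    """
    N = A.shape[0]
    Ap, An = np.maximum(A, 0.0), np.maximum(-A, 0.0)   # positive / negative parts
    deg = np.abs(A).sum(axis=1)
    deg[deg == 0] = 1.0                                 # guard isolated nodes
    Pp, Pn = Ap / deg[:, None], An / deg[:, None]
    # States 0..N-1: positive walker; N..2N-1: negative walker.
    # Crossing a negative edge swaps the walker's sign block.
    T = np.block([[Pp, Pn], [Pn, Pp]])
    e_x = np.zeros(2 * N)
    e_x[x] = 1.0                                        # Eq. (21): u_0 = e_x
    u = e_x.copy()
    for _ in range(n_steps):
        u = (1 - p_s) * e_x + p_s * (u @ T)             # Eq. (22)
    return u[:N] - u[N:]                                # Eq. (24)
```

On a two-node graph joined by a single negative edge, the similarity of the other node comes out negative, as expected: all walker mass reaching it arrives in the negative state.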
