Rapid and sensitive protein complex alignment with Foldseek-Multimer

Main

The similarity between two protein complexes is reflected in their optimal structural alignment, which also dictates a pairing of their chains. Aligning and comparing quaternary structures is essential for quantifying their structural diversity and identifying structural similarities and changes across different conformations or homologs. Furthermore, it is important for understanding protein function because many proteins operate as complexes1.

Recently, Foldseek2 has been developed as a fast structural aligner to detect similarity between two single-chain proteins, expressed using 3Di, a designated alphabet for describing tertiary amino acid interactions. Using Foldseek allows searching for similar single-chain structures in large databases, such as the AFDB3. However, because aligning two complexes requires knowing the correct pairing of their chains, Foldseek cannot be used directly to find the alignment between them.

US-align4 is a structural aligner for various types of molecules, including protein complexes. Its strategy for complex alignment is TM-score maximization. As there is a factorial number of possible assignments of chain pairings, US-align uses a greedy search heuristic for proposing candidate assignments, which are refined by dynamic programming. This heuristic was shown to make US-align up to five times faster than the state-of-the-art MM-align5, while producing higher scoring alignments, making US-align the gold standard for pairwise complex alignment.

Aiming to discover pairs of structurally conserved interfaces in large databases, Dey et al.6 developed QSalign for the detection of similar homomeric complexes. QSalign saves computation time by performing the full pairwise structural alignment only on complex pairs prefiltered based on their sequence similarity, retaining pairs with around 25% sequence identity or more. This speed-up comes at the expense of sensitivity, limiting its ability to discover structurally similar pairs in the twilight zone or below. Despite this speed-up, QSalign still took several months to conduct an all-versus-all search encompassing about 100,000 complexes in the 3DComplex DB V5 (ref. 7) using 100 threads. An alternative approach to reduce computational time during database search was presented by Guzenko et al.8, who compared the shapes between two complexes through 3D Zernike descriptors, avoiding the need to pair their chains. This approach can query through hundreds of thousands of structures in less than a second. However, it can only discover global matches between molecules of similar shapes, limiting its sensitivity, compared to chain-pairing methods, like US-align and QSalign. Furthermore, it is unable to find local matches within chains that do not match globally.

The challenge of sensitively searching large databases is expected to intensify as the computational prediction of protein complexes using tools like AlphaFold-Multimer9 can now be performed on entire proteomes to systematically predict complexes10,11,12 and on sequences from metagenomic samples. This will enrich our databases with a plethora of structures, potentially in the millions, in the coming years.

To address the need for large-scale structural comparisons between complexes, we developed Foldseek-Multimer (Fig. 1). Three factors contribute to its speed: (1) using Foldseek for fast chain-to-chain comparison, (2) describing chain-to-chain alignments as superposition vectors, and using them to identify complex alignments by efficient clustering, and (3) utilizing clustered databases during searches. Through benchmarks, we show that Foldseek-Multimer is: (1) nearly as accurate as US-align, while being orders of magnitude faster, (2) sensitive and suitable for metagenomic studies of complexes with low sequence similarity to others, (3) capable of all-versus-all searches, examining billions of complex pairs in 11 h.

Fig. 1: Foldseek-Multimer schematic.

a, Foldseek-Multimer allows fast querying of input complex(es) against a large database, potentially containing millions of targets. b, All chains from the query (gray) are compared to those of each target (red). A prefilter quickly rejects non-matching chain pairs so that the full alignment is applied only to promising complex pairs. c, Foldseek-Multimer represents each chain-to-chain alignment as a superposition, described by rotations and translations, required for superposing the target chain onto the query. In this simplified example, two chain-to-chain alignments (top, bottom) are a rotation along one axis (yellow and green highlights), while one (middle) is a rotation along a different axis. d, The complex-to-complex alignment is inferred from chain-to-chain alignments as the superpositions of chain pairs in the complex alignment are similar (‘Algorithm: overview’). Foldseek-Multimer uses the DBSCAN algorithm iteratively, with increasing radii, to identify superposition clusters and the best-scoring valid cluster for computing the complex alignment (Supplementary Fig. 1). e, Based on the best-scoring cluster, the complex TM score is computed across all chain alignments between query and target.


The quality of Foldseek-Multimer’s alignments was compared to that of US-align on a benchmark of 931 pairs of protein complexes, known to be structurally similar, using either tool to align them. Foldseek-Multimer was run in two modes, differing in the algorithm used for chain-to-chain alignment: 3Di+AA (Foldseek-MM) or TM-align13 (Foldseek-MM-TM). Both tools detected the vast majority (>95%) of pairs as similar (US-align: 97.6%, Foldseek-MM-TM: 97.4%, Foldseek-MM: 95.8%), aligning them with a TM score ≥ 0.65, which is a cutoff found to be optimal for detecting structural similarity among complexes6. Using either mode, Foldseek-Multimer computed highly correlated TM scores to those of US-align (Fig. 2a and Supplementary Fig. 2) and produced the same chain pairing in >99% of the cases (see ‘Data availability’).

Fig. 2: Performance of Foldseek-Multimer.

a, Query-length normalized TM scores (target-normalized: Supplementary Fig. 2) computed for 931 pairs of structurally similar complexes by US-align or Foldseek-Multimer. Both measures correlated highly (Pearson’s r). b, Execution time based on the dataset used for a. Complexes were binned by their number of chains; selected bins are shown (for all bins, see Supplementary Fig. 3). Box plots depict quartiles, each point is a complex pair (top) or complex (bottom), sample sizes are indicated as N, and whiskers are drawn to the maximum (minimum) point within 1.5 times the interquartile range over (under) the 75th (25th) percentile. Pairwise mode (top): Foldseek-Multimer is 10–100 times faster than US-align due to efficient chain-to-chain alignment and superposition clustering. Database search (bottom): complexes were queried against 3DComplexV7. Foldseek-Multimer is further accelerated by its prefilter, making it 10^3–10^4 times faster. c, An AlphaFold-Multimer prediction of a part of a CRISPR–Cas ribonucleoprotein from an environmental sample (top left) was queried by Foldseek-Multimer and US-align against PDB100. Foldseek-MM-TM identified the same hits as US-align, while being >3,000 times faster. These hits were the top ranks by Foldseek-MM (red) with TM score > 0.5. Non-aligned components of 7xg4 (top right) are set as transparent. d, Foldseek-Multimer was run on 57 billion pairs of complexes from 3DComplexV7. It discovered nearly all homomeric pairs previously identified as similar by QSalign, and found an additional 1.7 million homomeric pairs (Supplementary Fig. 5).


We measured the runtime of the tools, breaking down the contribution of Foldseek-Multimer’s components to its speed. First, given the task of computing 931 pairwise alignments, we observed a speed-up of 1–2 orders of magnitude over US-align (Fig. 2b and Supplementary Fig. 3), reflecting the efficiency of the chain-to-chain alignment (Foldseek-MM) and superposition clustering (Foldseek-MM and Foldseek-MM-TM). The performance of Foldseek-MM-TM thus highlights the key contribution of Foldseek-Multimer’s innovative use of superpositions as an alternative to US-align’s global alignment. Next, the tools queried each of the 677 complexes in the benchmark (Methods) against the 3DComplexV7 database7. Here, Foldseek-Multimer was 3–4 orders of magnitude faster than US-align (Fig. 2b and Supplementary Fig. 3) due to an additional speed-up by its prefilter.

Recently, Altae-Tran et al.14 discovered the first CRISPR–Cas type IV-A system with a specified interference mechanism in an environmental sample of Sulfitobacter sp. JL08. Intrigued by their finding, we predicted a part of its ribonucleoprotein complex structure using ColabFold-AlphaFold-Multimer9,15. The prediction was of acceptable quality (0.564 pTM), and we provided it as a query to Foldseek-Multimer and US-align in a search against the Protein Data Bank (PDB)100 database (Methods). Foldseek-MM and Foldseek-MM-TM demonstrated remarkable efficiency in comparing a query consisting of six chains and spanning 1,843 amino acids against the 426,347 entries of PDB100. These comparisons took only 27 s and 6 min, respectively, on a single core of a server (23 s and 96 s on an 8-core MacBook Pro). By contrast, it took US-align 13 days.

Here, in addition to its fast core algorithm (Fig. 1), Foldseek-Multimer gained further acceleration since PDB100 is a clustered database, allowing it to search against the 343,785 representatives, instead of all entries, and to expand the search only within promising clusters (Methods). Foldseek-MM-TM and US-align scored five entries above 0.65. These entries were the top ranks by Foldseek-MM, scoring above 0.5 but below 0.65 (Fig. 2c, rank indicated by ‘#’). All five hits were from a recently reported type IV-A system in Pseudomonas aeruginosa16, which belongs to a different class (Gammaproteobacteria) than that of the query (Alphaproteobacteria). When examining the best match, 7xg4, we found that Foldseek-Multimer could identify similarity, despite low sequence similarity (11.1–19.8% sequence identity and 19–33.3% sequence similarity using the BLOSUM62 substitution matrix) between the six subunit pairs of Sulfitobacter sp. JL08 and those of 7xg4. This provides further support for the previous identification of the Sulfitobacter sp. JL08 system as type IV-A and highlights the potential of Foldseek-Multimer for investigating protein complex structures predicted in distant organisms from environmental samples (Supplementary Fig. 4: prediction quality effect).

Next, we examined Foldseek-Multimer in an all-versus-all setting, using the 3DComplexV7 database7 as it had been previously analyzed in this setting using QSalign (Methods). QSalign relies on the time-consuming Kpax17 structural alignment method, which prohibits it from conducting an exhaustive structural search. Thus, it first identified around 58 million pairs, which shared sequence similarity and then applied Kpax only to them, detecting around 4.5 million pairs of similar homomers (‘QSalign pairs’).

Using 128 cores, Foldseek-MM then queried the clustered 3DComplexV7 (Methods) against itself, examining 57 billion pairs in 11 h. Applying the same TM score ≥ 0.65 cutoff as QSalign, Foldseek-MM identified 98.6% of the homomeric pairs previously identified by QSalign and found an additional 1.7 million similar homomeric pairs: ‘Foldseek-MM pairs’ (Fig. 2d). We used US-align to evaluate a randomly selected sample of 10% of the Foldseek-MM pairs (Methods). US-align confirmed 98.2% of the sampled pairs and rejected 1.8% (TM score < 0.65). We thus conclude that over 1.6 million of the homomeric pairs are new discoveries by Foldseek-Multimer, owing to its ability to detect similar complex structures below the twilight zone (Supplementary Fig. 5).

In addition to developing a command-line tool, we extended the Foldseek web server to support Foldseek-Multimer and visualize its search results using the NGL viewer library18. The web server overlays chain-to-chain assignments by using translucently colored protein surfaces. Users can choose between Foldseek-Multimer’s alignment modes, and apply taxonomic filters, restricting the search to specific clades. To accompany the web server with predicted structures, we organized 297,570 multimer predictions from community efforts10,11,19,20,21 into a single database (BFMD; Methods). BFMD is available in the web server and for local use.

In conclusion, we presented a strategy for complex-to-complex alignment, which quickly identifies compatible sets of chain-to-chain alignments by their superpositions. Demonstrated here on protein complexes, the Foldseek-Multimer strategy can be extended to other modalities, such as RNA and DNA complex structures, given a way to align their individual subunits. Together, the unprecedented sensitivity and speed offered by Foldseek-Multimer make it an essential tool for investigating complex structures in the AlphaFold era.

Methods

Algorithm: overview

Foldseek-Multimer examines all possible chain-to-chain pairings between the compared complexes, using Foldseek (Fig. 1b). It then uses the fact that a structural alignment between two complexes, Q and T, indicates a superposition: a set of rotations and translations, which minimize the sum of squared distances between their aligned residue pairs22. For simplicity, assume Q and T to be two structurally similar dimers, consisting of the chains Q_A, Q_B and T_A, T_B, where Q_A is similar to T_A and Q_B is similar to T_B. The physical meaning of the complex-level structural similarity is that Q_A is positioned and oriented relative to Q_B within Q in the same way that T_A is positioned and oriented relative to T_B within T. Thus, the same superposition, that is, the same set of rotations and translations, would minimize the distance between Q_A and T_A as well as the distance between Q_B and T_B. In other words, all individual chain-to-chain superpositions (for example, the one between Q_A and T_A) are equal to one another and to the complex-to-complex superposition. Therefore, a set of chain-to-chain alignments is compatible and can define a complex-to-complex alignment, only if all chain-to-chain superpositions computed from that set are equal. Therefore, Foldseek-Multimer computes for each chain-to-chain alignment a vector, representing its superposition (Fig. 1c). Next, it uses DBSCAN23 for clustering these vectors to identify compatible sets of chain-to-chain alignments, which share the same superposition and define valid complex alignments (Fig. 1d and Supplementary Fig. 1). Once complex alignments are identified, Foldseek-Multimer computes their TM score24 and reports them (Fig. 1e).

Algorithm: input

Foldseek-Multimer allows for searching one or more query protein complex structures against a target complex structure, a database of complex structures or a database of clustered structures. Structures can be provided in PDB/mmCIF format or as a Foldseek-formatted database. Formatting structures is possible using the createdb command.

Algorithm: chain-to-chain alignments

By utilizing Foldseek, Foldseek-Multimer offers two main modes for chain-to-chain structure comparison. The default mode, 3Di+AA, encodes structures as sequences over a 20-state 3Di alphabet, as fully described by ref. 2. Additionally, chain-to-chain alignments can be computed using TM-align13, which is a global, albeit slower, alignment method. During database search, a prefilter, which is based on the 3Di+AA mode, allows for fast removal of most chain pairs, continuing to compute chain-to-chain alignments only on promising candidates.

Algorithm: chain-to-chain superposition vectors

Given a chain-to-chain alignment, Foldseek-Multimer computes the superposition of the target chain onto the query chain, using nine rotations (U) and three translations (T). In preparation for aligning complex structure Q and complex structure T, Foldseek-Multimer creates a matrix with 12 columns, whose rows are the superposition vectors, computed from all chain-to-chain alignments, belonging to Q and T. The mean and the standard deviation (s.d.) of each column are then used to compute the coefficient of variation (CV = s.d./mean) of the column and exclude less-informative columns (CV < 0.1; Supplementary Table 1: the effect of this parameter on Foldseek-Multimer’s performance). If the mean value of the column is <1, the s.d. value is used instead of the CV for the exclusion criterion. Finally, the retained columns undergo normalization since they can have different scales. To that end, Foldseek-Multimer subtracts from each column its mean and divides it by its s.d. We denote the resulting reduced and normalized matrix as supQT.
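The column filtering and normalization described above can be sketched in a few lines. This is a minimal illustration rather than Foldseek-Multimer's implementation: the function name `build_sup_matrix` is hypothetical, the use of the population standard deviation is an assumption (the text does not say which estimator is used), and the rule for means below 1 is applied literally as stated.

```python
import statistics

def build_sup_matrix(vectors, cv_cutoff=0.1):
    """Drop less-informative columns, then z-score the retained ones.

    vectors: list of equal-length rows, one per chain-to-chain alignment
    (12 values each in the text: nine rotation entries, three translations).
    """
    kept = []
    for col in zip(*vectors):
        mean = statistics.fmean(col)
        sd = statistics.pstdev(col)  # assumption: population s.d.
        # CV = s.d./mean, except that the s.d. itself is used when mean < 1.
        spread = sd if mean < 1 else sd / mean
        if spread >= cv_cutoff:
            # Normalize: subtract the column mean, divide by the column s.d.
            kept.append([(x - mean) / sd for x in col])
    # Transpose back so that each row is again one alignment.
    return [list(row) for row in zip(*kept)]

# Toy input with three columns; the first is constant and gets excluded.
toy = [[1.0, 0.0, 10.0], [1.0, 1.0, 10.5], [1.0, 0.0, 20.0]]
sup_qt = build_sup_matrix(toy)
```

Any column that passes the filter has a non-zero s.d., so the normalization step never divides by zero.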

Algorithm: chain-to-chain clustering

DBSCAN is used iteratively for clustering the rows of supQT as it does not require knowing the number of clusters a priori. The stages of this procedure are described below and demonstrated on a small example in Supplementary Fig. 1.

Initialization

The Euclidean distances between all row pairs in supQT are computed and the minimum (minDist) initializes the parameter epsilon. The biggest cluster(s) encountered during the procedure are recorded in a candidate list alongside their size (maxClusterSize), which is initialized to 0.

The DBSCAN iteration

For each supQT row, all rows within a radius of epsilon from it, are defined as its ‘neighbors’. Then, all rows, which have at least one more neighbor (at least two neighbors, including itself) are considered as ‘core points’ and the rest as ‘non-core points’. Next, a core point is selected at random to start the first cluster. All its core-point neighbors are added to the first cluster. Each added core-point neighbor also adds its core-point neighbors and so on, until no more core points can be added to the first cluster. Then, all non-core points, which are neighbors of members of the first cluster, are added to it as well (without adding their neighbors). The second cluster is constructed similarly, operating on the remaining unclustered points.
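A single pass of this procedure can be sketched as follows. The function name `dbscan_pass` is hypothetical, and two simplifications are assumptions of this sketch: seeds are taken in index order instead of at random, and non-core neighbors are attached while the cluster grows rather than in a separate final step (the resulting membership is the same).

```python
import math

def dbscan_pass(rows, eps):
    """One DBSCAN pass in which a core point has at least one other neighbor."""
    n = len(rows)
    neighbors = [
        [j for j in range(n) if math.dist(rows[i], rows[j]) <= eps]
        for i in range(n)
    ]
    # Each neighbor list contains the point itself, hence the threshold of 2.
    core = [len(nb) >= 2 for nb in neighbors]
    label = [None] * n
    clusters = []
    for seed in range(n):
        if not core[seed] or label[seed] is not None:
            continue
        cid = len(clusters)
        label[seed] = cid
        members, stack = [seed], [seed]
        while stack:
            p = stack.pop()  # p is always a core point
            for q in neighbors[p]:
                if label[q] is None:
                    label[q] = cid
                    members.append(q)
                    if core[q]:
                        stack.append(q)  # only core points keep expanding
        clusters.append(sorted(members))
    return clusters

rows = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 9.0)]
print(dbscan_pass(rows, 0.2))  # the isolated last point joins no cluster
```

Points that belong to no cluster are simply left out of the returned list, so a later pass with a larger epsilon can still pick them up.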

Cluster validity and rescuing by nearest neighbors

During the DBSCAN iteration, after each cluster is computed, Foldseek-Multimer evaluates its validity. If a cluster includes the same chain in multiple chain-to-chain alignments, Foldseek-Multimer attempts to rescue it by selecting a compatible subgroup of points (that is, chain-to-chain alignments) from that cluster. To that end, points are selected for the subgroup in the order of their distance to the core point, which was used to initiate the cluster. Selection for the subgroup is stopped once the process encounters a point that includes a chain, which was already added by a previous point.
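The rescue step can be illustrated with a short sketch. The names `rescue_cluster`, `pairs` and `seed` are hypothetical, and the sketch assumes that a conflict means either the query chain or the target chain of a point has already been used by an earlier point.

```python
import math

def rescue_cluster(members, seed, rows, pairs):
    """Select a compatible subgroup from an invalid cluster.

    members: indices of the points in the cluster
    seed:    index of the core point that initiated the cluster
    rows:    normalized superposition vectors, one per alignment
    pairs:   pairs[i] = (query_chain, target_chain) of alignment i
    """
    ordered = sorted(members, key=lambda i: math.dist(rows[seed], rows[i]))
    used_query, used_target, subgroup = set(), set(), []
    for i in ordered:
        q_chain, t_chain = pairs[i]
        if q_chain in used_query or t_chain in used_target:
            break  # stop at the first point that reuses a chain
        used_query.add(q_chain)
        used_target.add(t_chain)
        subgroup.append(i)
    return subgroup

rows = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0)]
pairs = [("A", "X"), ("B", "Y"), ("A", "Z"), ("C", "W")]
print(rescue_cluster([0, 1, 2, 3], 0, rows, pairs))
```

In this toy input the third point reuses query chain A, so selection stops there and the fourth point is not considered even though it would not conflict, which follows the stopping rule stated above.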

At each DBSCAN iteration, valid clusters which are at least as big as maxClusterSize are added to the candidate list. The value of maxClusterSize is updated each time a bigger cluster is encountered and all previously added clusters are removed from the list owing to being smaller.

Iterativity

Next, the value of the radius epsilon is increased by a delta of 0.1 (Supplementary Table 1: the effect of this parameter on Foldseek-Multimer’s performance) and a new DBSCAN iteration starts, potentially forming new clusters. If all new clusters are smaller than maxClusterSize, the procedure stops. Otherwise, the candidate list and maxClusterSize will be updated with the iteration’s clusters and epsilon will increase again, up to a maximal value of the distance between the two furthest points (maxDist).
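The outer loop over epsilon can be sketched as below. Everything here is illustrative: `iterative_search` and `components` are hypothetical names, `components` is a stand-in clustering step (with a core-point threshold of two, DBSCAN clusters coincide with groups of two or more points linked by distances of at most epsilon), and `validate` is a hook for the validity and rescue step, defaulting to a no-op.

```python
import math
from itertools import combinations

def components(rows, eps):
    """Stand-in clustering: groups of >= 2 points linked by distances <= eps."""
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(rows)), 2):
        if math.dist(rows[i], rows[j]) <= eps:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) >= 2]

def iterative_search(rows, cluster_fn=components, validate=lambda c: c, delta=0.1):
    """Grow epsilon from minDist towards maxDist, keeping the biggest clusters.

    Requires at least two rows.
    """
    dists = [math.dist(a, b) for a, b in combinations(rows, 2)]
    eps, max_dist = min(dists), max(dists)
    max_size, candidates = 0, []
    while True:
        clusters = [validate(c) for c in cluster_fn(rows, eps)]
        if all(len(c) < max_size for c in clusters):
            break  # every new cluster is smaller than the best so far
        for c in clusters:
            if len(c) > max_size:
                max_size, candidates = len(c), [c]  # discard smaller ones
            elif len(c) == max_size and c not in candidates:
                candidates.append(c)
        if eps >= max_dist:
            break
        eps = min(eps + delta, max_dist)
    return max_size, candidates

rows = [(0.0, 0.0), (0.1, 0.0), (3.0, 0.0)]
print(iterative_search(rows))
```

Without a real validity check, clusters only grow with epsilon, so the first stopping rule matters mainly once `validate` trims clusters that reuse a chain.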

Early stop condition

Let CQ and CT be the number of chains in Q and T, respectively. Without loss of generality, assume CQ ≤ CT. If maxClusterSize equals CQ, then no bigger valid cluster exists, because a valid cluster uses each query chain at most once. Since there is a total of CQ × CT chain-to-chain alignments, the number of clusters in the candidate list cannot exceed CT once maxClusterSize is equal to CQ. Foldseek-Multimer checks these two conditions and avoids unnecessary DBSCAN iterations if they are met.
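One possible reading of this check is sketched below. The function name `can_stop_early` is hypothetical, and interpreting "these two conditions" as "the biggest cluster has reached the smaller chain count, and the candidate list has reached the larger chain count" is an assumption of this sketch.

```python
def can_stop_early(max_cluster_size, n_candidates, n_query_chains, n_target_chains):
    """Return True when further DBSCAN iterations cannot improve the result."""
    smaller, larger = sorted((n_query_chains, n_target_chains))
    # Clusters are disjoint sets of alignments; with smaller * larger
    # alignments in total and `smaller` alignments per cluster, at most
    # `larger` full-size clusters can exist.
    return max_cluster_size == smaller and n_candidates >= larger

print(can_stop_early(2, 3, n_query_chains=2, n_target_chains=3))
```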

Discovered clusters

At the end of the iterative DBSCAN procedure, the biggest valid clusters are returned. Each of them is equivalent to a set of compatible chain-to-chain alignments with a similar superposition that together define a complex alignment between Q and T.

Algorithm: TM-score computation

TM scores are computed for the complex alignment derived from each of the valid clusters found for a Q–T complex pair as follows. First, the chains of complex Q are concatenated to each other in some order. Given the concatenation order of the chains in Q, Foldseek-Multimer concatenates the chains of complex T, in the order of their pairwise matches to the chains of Q, as defined by the cluster. Then, the TM score between the concatenated Q and concatenated T is computed the same way Foldseek computes it for single-chain pairwise alignments, using the Cα coordinate vectors of both chains (concatenated chains in this case). Using this computation, all complex alignments a given query complex Q has with a specific target T and with all other target complexes can be ranked and reported by their TM score.
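The concatenation and scoring steps can be illustrated with the standard TM-score formula, TM = (1/L) Σ 1/(1 + (d_i/d0)^2) with d0 = 1.24 (L − 15)^(1/3) − 1.8. This is a sketch under stated assumptions, not Foldseek's code: the function names are hypothetical, the coordinates are assumed to be already superposed, and flooring d0 at 0.5 Å for short inputs is an assumption borrowed from common practice.

```python
import math

def concatenate(chains, order):
    """Join the C-alpha coordinates of the chains in the given chain order."""
    return [xyz for chain_id in order for xyz in chains[chain_id]]

def tm_score(query_ca, target_ca, aligned_pairs, norm_length):
    """TM score of superposed C-alpha coordinates, normalized by norm_length."""
    if norm_length > 15:
        d0 = 1.24 * (norm_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5
    d0 = max(d0, 0.5)  # assumed floor for short inputs
    total = 0.0
    for qi, ti in aligned_pairs:
        d = math.dist(query_ca[qi], target_ca[ti])
        total += 1.0 / (1.0 + (d / d0) ** 2)
    return total / norm_length

# Toy dimers: target chain Y matches query chain A, and X matches B.
query = {"A": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)], "B": [(0.0, 5.0, 0.0), (3.8, 5.0, 0.0)]}
target = {"X": [(0.0, 5.0, 0.0), (3.8, 5.0, 0.0)], "Y": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]}
q = concatenate(query, ["A", "B"])
t = concatenate(target, ["Y", "X"])  # target order follows the chain pairing
identity = [(i, i) for i in range(len(q))]
score = tm_score(q, t, identity, norm_length=len(q))
```

Passing the query length or the target length as `norm_length` gives the query-normalized or target-normalized score, respectively.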

Algorithm: utilizing clustered databases

To further accelerate Foldseek-Multimer, we aimed to reduce the redundancy in the target database, an approach, which is also adopted by TM-search25. To that end, we introduced a new capability to Foldseek, which allows it to efficiently search through clustered databases in MMseqs2 or Foldseek format (for example, PDB100, see below). If the input has M cluster representatives and N cluster members (M < N), Foldseek will first search (prefilter + alignment) against the M representatives, finding candidates below a specific E-value threshold (the default value of 10 was used in this study). Extending to promising clusters only, the alignment step will then be carried out on all cluster members of the candidates. Foldseek-Multimer will use the alignment results of all extended clusters for computing superposition matrices and the following procedure steps, as described above.
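The two-stage logic can be sketched with plain Python data structures. All names here (`cluster_search`, `members_of`, `evalue_fn`) are hypothetical and do not correspond to Foldseek functions; the sketch also assumes that each representative is listed among the members of its own cluster.

```python
def cluster_search(query, representatives, members_of, evalue_fn, evalue_cutoff=10.0):
    """Search representatives first, then expand only the promising clusters."""
    hits = []
    for rep in representatives:
        if evalue_fn(query, rep) > evalue_cutoff:
            continue  # the whole cluster is skipped
        for member in members_of[rep]:
            hits.append((member, evalue_fn(query, member)))
    return sorted(hits, key=lambda hit: hit[1])

# Toy E-values standing in for the prefilter and alignment steps.
toy_evalues = {"r1": 0.001, "m1a": 0.01, "r2": 50.0, "m2a": 0.5}
toy_members = {"r1": ["r1", "m1a"], "r2": ["r2", "m2a"]}
result = cluster_search(
    "query", ["r1", "r2"], toy_members, lambda q, target: toy_evalues[target]
)
```

With M representatives and N members, the first stage costs M comparisons and the second stage only touches the members of the retained clusters.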

The 3DComplex database and QSalign comparisons

For the analyses presented in Fig. 2a,b,d, we downloaded the 3DComplex database version 7 (3DComplexV7 DB; see ‘Data availability’). In brief, this database holds 238,965 structures, consisting of 557,146 chains and was created from the ‘Biological Units/Assemblies’ downloaded from the PDB using the method described previously7. Before this study, QSalign6 had been applied to 3DComplexV7 DB and yielded a list of 57,953,513 compared structural pairs.

Similar pairs benchmark

Dataset

Starting with the list of 57,953,513 QSalign-compared pairs, we selected entries with varying numbers of subunits (from 2 to 24). For each size, the criteria for selection were that the TM score computed by Kpax17 was greater than 0.8, and that pairs of homomers had less than 80% sequence identity. If more than 100 pairs matched the criteria, only the first 100 were selected, resulting in a total of 931 complex pairs included in the benchmark.

Runtime evaluation

Performance was measured on a server with a 1x AMD EPYC 7702P 64-core CPU and 1 TB RAM, using a single core. The queries for the time measurements in Fig. 2b and Supplementary Fig. 3 were the 677 unique complexes associated with the 931 pairs. Owing to its high computational demand, the runtime of US-align on these 677 complexes against 3DComplexV7 was extrapolated from running against 1,000 randomly sampled 3DComplexV7 entries. We report the average over the number of cases N_c = 142, 109, 124, 18, 101, 7, 42, 8, 41, 44, 17, 5, 5, 14 for each number of chains c = 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 24: \(\mathrm{avg} = \frac{1}{N_c}\sum_{i=1}^{N_c} t(q_i, \mathrm{sample})\,\frac{238{,}965}{1{,}000}\). Foldseek-Multimer was run against the full database, without extrapolation.
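The extrapolation amounts to scaling the mean sampled runtime by the ratio of database size to sample size. The helper below is a hypothetical illustration, and the timings in the example are invented placeholders, not measurements from this study.

```python
def extrapolate_runtime(sample_times, db_size, sample_size):
    """Scale the mean runtime on a sample up to the full database."""
    mean_time = sum(sample_times) / len(sample_times)
    return mean_time * db_size / sample_size

# Placeholder timings (seconds) for two queries against a 1,000-entry sample.
estimate = extrapolate_runtime([2.0, 4.0], db_size=238_965, sample_size=1_000)
```

The estimate assumes that runtime grows linearly with the number of targets and that the sample is representative of the database.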

Environmental CRISPR–Cas

The PDB100 database

A version of the PDB, termed PDB100, was used to search for structural homologs of an environmental CRISPR–Cas as well as to measure the runtimes of Foldseek-Multimer and US-align. PDB100 was first introduced by ref. 2, but further developed in this study, as described here. First, PDB, containing the asymmetric unit of 207,937 entries, consisting of 1,047,615 chains, was downloaded in November 2023 (see ‘Data availability’). Of these, 11,901 entries were associated with more than one structural model (for example, the NMR experiment 2KOX). In total, 426,347 structural models were associated with the PDB entries. Next, all chains were clustered using Foldseek (parameters: -c 0.95 --min-seq-id 1.0), resulting in 343,785 redundancy-reduced representatives. In contrast to van Kempen et al.2, PDB100 is now a cluster database, which holds the representatives alongside information to associate them to their cluster chains and structural models. PDB100 is updated regularly and is available through the Foldseek web server and can be downloaded using the ‘databases’ command.

Complex structure prediction

Four Sulfitobacter sp. JL08 protein sequences, identified as CRISPR–Cas type IV-A components by Altae-Tran et al.14—Csf1, Csf2, Csf3 and Cas6—were obtained from the plasmid map ‘pHS1068 NZ_CP025815 DinG HNH proteins (Escherichia coli codon optimized) CRISPR array in pACYCDuet-1 with Lac promoters.gb’, released by the authors. Following the reported stoichiometry of the CRISPR–Cas type IV-A core complex26, we constructed an input file for ColabFold-AlphaFold-Multimer15 with eight chains: 1xCsf1 + 5xCsf2 + 1xCsf3 + 1xCas6. When examining the structure, we noticed that AlphaFold-Multimer did not predict an interaction between Csf1 and Cas6 and the rest of the complex, so we omitted them and re-predicted the structure: 5xCsf2 + 1xCsf3. Comparing the four sequences of Sulfitobacter sp. JL08 to protein nr27 was performed using the blastp web server (February 2024).

Runtime evaluation

Performance was measured on a server with a 1x AMD EPYC 7702P 64-core CPU and 1 TB RAM, using a single core. Owing to its high computational demand, the total runtime was extrapolated when measuring US-align on the Sulfitobacter sp. JL08 structure against the PDB100, using five samples: \(\mathrm{avg} = \frac{1}{5}\sum_{i=1}^{5} t(q, \mathrm{sample}_i)\,\frac{426{,}347}{1{,}000}\). Foldseek-Multimer was run against the full database, without extrapolation. For the MacBook runtime measurements, we used a 13-inch MacBook Pro (M1; 2020; model A2338) with 16 GB RAM.

Comparison to QSalign on 3DComplexV7

QSalign pairs

Starting with the list of 57,953,513 QSalign-compared pairs, high-scoring homomeric pairs (maximum TM score ≥ 0.65) were selected, excluding pairs with a PISA structure. This resulted in 4,524,503 structurally similar unique homomeric pairs, which we denoted ‘QSalign pairs’.

A clustered 3DComplexV7

The 557,146 chains of 3DComplexV7 were clustered using Foldseek (parameters: -c 0.99 --min-seq-id 0.9 -e 0.00001), resulting in 142,957 redundancy-reduced representatives. This procedure took 18 s, using 64 threads.

Foldseek-MM all-versus-all search of 3DComplexV7

During this search, all temporary files were kept in memory and 128 cores were used (2 × AMD EPYC 7742). The entire search finished in 10 h and 23 min. Most of the time was spent in the module for matching chains, which took 7 h and 32 min.

Evaluation of ‘Foldseek-MM pairs’

About 1.7 million pairs of homomeric complexes were detected as similar only by Foldseek-MM. Since running US-align over all pairs is prohibitively slow, we randomly selected 160,252 pairs (around 10% of all pairs) and computed their alignment using US-align. For 2,844 of these (1.8%), US-align reported a TM score < 0.65, which we used as an estimate for the false-positive rate among the full set of novel ‘Foldseek-MM pairs’. A further 157,391 pairs (98.2%) were confirmed as matches by US-align and the rest (17 pairs, approximately 0.01%) were aligned as monomers.

The BFMD resource

In an effort to generate a large multimer database, we gathered 297,570 multimer predictions, consisting of 597,640 chains, from several community efforts. These were turned into a clustered Foldseek database using the parameters: -c 0.95 --min-seq-id 1.0 -e 0.00001, resulting in 51,757 redundancy-reduced representatives. All predictions’ accessions are prefixed by the resource name. The sources were: multimers extracted from the ModelArchive19; Predictomes20, an all-versus-all prediction of a set of human genome maintenance proteins; the LevyLab atlas of predicted homomers10; HuIntAF2 (ref. 11), protein–protein predictions from the Human Reference Interactome28 and the Human Protein Complex Map29; and ProtVar, multimers predicted as part of an effort to understand missense variance21. The BFMD is available through the Foldseek web server and is downloadable as a standalone database using the ‘databases’ module.

Tool commands and arguments

Foldseek-MM commit c27a629 (default, using 3Di+AA):

foldseek easy-complexsearch query.pdb target.pdb/targetDB result tmp --threads 1

Foldseek-MM-TM commit c27a629 (using tmalign):

foldseek easy-complexsearch query.pdb target.pdb/targetDB result tmp --threads 1 --alignment-type 1

Additionally, the flag ‘--exhaustive-search 1’ was used for the benchmark of similar pairs and the flag ‘--cluster-search 1’ was used when searching a clustered database. For database searches, we pre-indexed the database using ‘createindex targetDB’ and kept it in memory. We set --db-load-mode to 2 in easy-complexsearch, to indicate that the pre-indexed database is already in memory. During database search, Foldseek-Multimer can include or reject monomeric targets in the reference database using the --monomer-include-mode parameter. For this study, we set the parameter to reject all monomer matches.

US-align version 20220924:

US-align query.pdb target.pdb -mm 1 -ter 0 -mol prot

Additionally, the flag ‘-fast’ was set during the runtime assessments in Fig. 2b. For speed measurements, we kept the PDB/mmCIF files in memory to avoid input/output-related bottlenecks.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


Targeting of TAMs: can we be more clever than cancer cells?

With its increasing incidence and geographic spread, cancer is one of the leading causes of death, reduced quality of life and disability worldwide. Substantial progress in the development of new anticancer therapies, in improving the efficiency of immunotherapeutic tools, and in the personalization of conventional therapies requires consideration of the cancer-specific and patient-specific programming of innate immunity. Intratumoral TAMs and their precursors, resident macrophages and monocytes, are principal regulators of tumor progression and therapy resistance. Our review summarizes the accumulated evidence for the subpopulations of TAMs and their increasing number of biomarkers, indicating their predictive value for the clinical parameters of carcinogenesis and therapy resistance, with a focus on solid cancers of non-infectious etiology. We present the state-of-the-art knowledge about the tumor-supporting functions of TAMs at all stages of tumor progression and highlight biomarkers, recently identified by single-cell and spatial analytical methods, that discriminate between tumor-promoting and tumor-inhibiting TAMs, where both subtypes express a combination of prototype M1 and M2 genes. Our review focuses on novel mechanisms involved in the crosstalk among epigenetic, signaling, transcriptional and metabolic pathways in TAMs. Particular attention is given to the recently identified link between cancer cell metabolism and the epigenetic programming of TAMs by histone lactylation, which can be responsible for the unlimited protumoral programming of TAMs. Finally, we explain how TAMs interfere with currently used anticancer therapeutics and summarize the most advanced data from clinical trials, which we divide into four categories: inhibition of TAM survival and differentiation, inhibition of monocyte/TAM recruitment into tumors, functional reprogramming of TAMs, and genetic enhancement of macrophages.
