Constructing multicomponent cluster expansions with machine-learning and chemical embedding

Introduction

The cluster expansion (CE)^1,2 is a versatile tool to model atomistic interactions across several material classes. On-lattice CE models are routinely used to compute order-disorder^3,4,5,6, vibrational^7,8,9,10 and magnetic^11,12,13,14 thermodynamics of multicomponent materials. Though CE models primarily serve as surrogates for formation energies of atomic configurations, the method has been extended to compute tensorial properties of materials¹⁵, energies of defects^16,17, and to parameterize effective Hamiltonians that couple several microscopic degrees of freedom¹⁸.

Cluster expansion models are generally trained on zero Kelvin data, such as the formation energies of a set of orderings computed with density functional theory (DFT). CE models are then used together with statistical mechanics techniques to compute finite temperature properties of materials. Within the CE formalism, the formation energy (or other material property), is expanded as a linear series of cluster basis functions multiplied with expansion coefficients. Researchers have made significant strides towards simplifying the parametrization of CE models from a limited pool of first-principles calculations. For instance, predictive CE models can be obtained by regressing against cluster functions chosen from genetic algorithms¹⁹. Conventional data science techniques such as weighting²⁰ and cross-validation²¹ are reported to improve the predictive power of these lattice models. More recently, regularization and cross-validation have been employed to obtain sparse cluster expansions^{22,23,24,25,26,27,28}. Efforts have also been made towards choosing training datasets that minimize the number of expensive electronic structure calculations and in characterizing errors in finite temperature properties with Bayesian techniques^{29,30,31,32,33}.

Effective Hamiltonians based on the CE have provided critical insights into the behavior of structural^3,34,35,36, catalytic^37,38, electrochemical^39,40,41,42, thermoelectric⁶ and semiconductor^8,43 materials. However, the use of CE models is usually limited to alloys containing ≈3-4 chemical species. Although there are no theoretical limitations to applying CE models to multicomponent materials, several practical difficulties arise when parameterizing and deploying multi-element CE. For instance, the number of fitting coefficients rises polynomially with the number of chemical species. As a result, parameterizing CE models for chemically-complex alloys requires large training datasets and is often computationally expensive to coarse-grain with statistical mechanics techniques. CE models have found limited applicability in predicting the finite-temperature properties of multi-principal element alloys. Alloys containing mixtures of several elements, also referred to as high-entropy alloys, are attractive candidates for high-performance structural materials^44,45,46,47, energy storage⁴⁸ and catalytic applications⁴⁹. Accurate atomistic models are crucial to enabling the design of the next generation of high-performance multicomponent materials.

Inspired by recent efforts in chemical dimensionality reduction, we describe a formalism to build on-lattice cluster expansion models for alloys containing several elements or site degrees of freedom. The embedded cluster expansion (eCE), employs machine-learning techniques to simultaneously learn an embedding of site basis functions within a lower dimensional space, and the weights of an energy model. Symmetrized cluster functions constructed from the transformed site basis functions are used to compute the energy of a configuration. We apply the eCE formalism to build a formation energy cluster expansion for 6-component mixtures of elements in groups 5 and 6. Our results show that eCE models accurately predict formation energies in the complex alloy. Chemical trends in the alloying elements, such as the similarities between elements of the same group are naturally learnt by the model based on a small pool of electronic structure calculations. Site basis functions embedded in a three-dimensional space with the eCE model are sufficient to reproduce the energetics of the senary alloy to within 4 meV/atom. As the eCE models learn chemical similarities based on the electronic structure calculations, they are also able to extrapolate into chemical spaces that have not been sampled. As a proof of concept, we compare finite-temperature predictions of short-range order in a binary Cr-W alloy to a conventional CE and employ our new eCE framework to predict SRO in the equiatomic senary alloy.

Results

We begin by reviewing key aspects of the CE formalism, following which we describe a toy model to illustrate the embedded cluster expansion (eCE) before presenting the general eCE method. The eCE model is then applied to model the thermodynamics of a 6-component V-Nb-Ta-Cr-Mo-W refractory alloy.

On-lattice cluster expansions

Consider a crystal with N sites where each site, i, can be occupied by c chemical species (denoted ϵ₁, ϵ₂, ⋯ , ϵ_c). In general, there are c^N distinct arrangements of the c elements over the N sites. Any chemical decoration may be represented as a vector of occupation variables, (vec{sigma }=[{sigma }_{1},{sigma }_{2},cdots ,,{sigma }_{N}]). The occupation variable, σ_i, takes a unique value for each element ϵ_l. For instance σ_i = l if ϵ_l occupies site i. The site basis functions at site i can be represented as:

$$vec{varphi }({sigma }_{i})={[{varphi }_{1}({sigma }_{i}),{varphi }_{2}({sigma }_{i}),cdots ,{varphi }_{c}({sigma }_{i})]}^{T}$$

(1)

where (vec{varphi }({sigma }_{i})) is a vector containing c site basis functions. To ensure completeness of the cluster expansion, the site functions, {φ₁, φ₂, ⋯ , φ_c}, must be linearly independent. Common choices for site basis functions include Chebychev polynomials¹, occupation or indicator functions², and trigonometric or sinusoidal functions⁵⁰. Cluster functions are then constructed by taking products of site basis functions across all N sites of the crystal:

$${Phi }_{vec{alpha }}(vec{sigma })=mathop{prod}limits_{(i,nu )in vec{alpha }}{varphi }_{nu }({sigma }_{i})$$

(2)

(vec{alpha }) is a list of size N with each entry being a tuple that contains the site index i and the site function index ν. Every cluster function is a product of N site basis functions. It is usually convenient to choose site basis functions such that one of the functions is constant, i.e. ϕ₁(σ_i) = 1, ∀ σ_i. This allows for the construction of a hierarchical cluster expansion. For example, if all site basis functions are chosen to be 1 except that of the first site, the corresponding cluster function is given by Φ₁ = ϕ₂(σ₁) × 1 × 1 × ⋯ × 1 = ϕ₂(σ₁). Φ₁ is the cluster function associated with a point cluster located at site 1. Similarly, choosing all site functions to be 1 except for two sites will result in a cluster function that represents a pair cluster.

Any scalar property, such as the formation energy of a configuration, is given by^1,2:

$$begin{array}{ll}E(vec{sigma }),=,mathop{sum}limits_{vec{alpha }in Lambda }{tilde{J}}_{vec{alpha }}{Phi }_{vec{alpha }}(vec{sigma })\qquadquad ,=,{tilde{J}}_{0}+mathop{sum}limits_{vec{alpha }in {Lambda }_{point}}{tilde{J}}_{vec{alpha }}{Phi }_{vec{alpha }}(vec{sigma })\qquadquad ,+mathop{sum}limits_{vec{beta }in {Lambda }_{pair}}{tilde{J}}_{vec{beta }}{Phi }_{vec{beta }}(vec{sigma })+cdots end{array}$$

(3)

where (E(vec{sigma })) is the formation energy of configuration (vec{sigma }), ({tilde{J}}_{vec{alpha }}), are effective cluster interations (ECI) for cluster (vec{alpha }), (Lambda ={{vec{alpha }}_{1},{vec{alpha }}_{2},cdots ,}) is the set of all clusters in the crystal, Λ_point is the set of point clusters, and Λ_pair is the set of pair clusters. Choosing the first site basis functions to be constant partitions the total energy of a crystal into contributions arising from points, pairs, triplets etc.

Symmetry reduces the number of independent expansion coefficients in eq. (3). If cluster (vec{alpha }) can be transformed to another cluster (vec{lambda }) by a symmetry operation of the undecorated crystal, the interaction coefficients for both functions must be equal, i.e. ({tilde{J}}_{vec{alpha} }={tilde{J}}_{vec{lambda} }). All symmetrically equivalent cluster functions can be collected together in an orbit, denoted as ({Omega }_{vec{alpha }}={vec{alpha },vec{lambda },cdots ,}). (vec{alpha }) refers to a prototype cluster function that represents the entire orbit. The symmetrized cluster expansion contains energy contributions from each orbit:

$$E(vec{sigma })=mathop{sum}limits_{{Omega }_{vec{alpha }}in Lambda }{tilde{J}}_{{Omega }_{vec{alpha }}}mathop{sum}limits_{vec{beta }in {Omega }_{alpha }}{Phi }_{vec{beta }}(vec{sigma })$$

(4)

where (Lambda ={{Omega }_{vec{alpha }},{Omega }_{vec{gamma }},cdots ,}) is the set of all orbits. The expansion of eq. (4) can be further partitioned into energy contributions arising from each site in the crystal⁵¹:

$$E(vec{sigma })=mathop{sum}limits_{i=1}^{N}{E}_{i}(vec{sigma })=mathop{sum}limits_{i=1}^{N}mathop{sum}limits_{{Omega }_{vec{alpha }}^{i}in {Lambda }^{i}}frac{{tilde{J}}_{{Omega }_{vec{alpha }}}}{| vec{alpha }| }mathop{sum}limits_{vec{delta }in {Omega }_{vec{alpha }}^{i}}{Phi }_{vec{delta }}(vec{sigma })$$

(5)

E_i is the energy contributed by site i to the total energy of the crystal, Λⁱ is the set of all clusters radiating from site i, ({Omega }_{vec{alpha }}^{i}) is the orbit of cluster function (vec{alpha }) centered around site i and (| vec{alpha }|) is the number of sites in the cluster. The additional factor (frac{{tilde{J}}_{{Omega }_{vec{alpha }}}}{| vec{alpha }| }) arises due to overcounting of each cluster in the site-centric cluster expansion. For simplicity of notation we will define ({J}_{{Omega }_{vec{alpha }}}=frac{{tilde{J}}_{{Omega }_{vec{alpha }}}}{| vec{alpha }| }). The symmetrized cluster functions ({Theta }_{{Omega }_{vec{alpha }}^{i}}=mathop{sum}nolimits_{vec{delta }in {Omega }_{vec{alpha }}^{i}}{Phi }_{vec{delta }}(vec{sigma })) are site-centric descriptors that can distinguish between all symmetrically distinct neighborhoods around site i. Figure 1 schematically shows the symmetrically equivalent sets of nearest neighbor and next-nearest neighbor pair clusters on a triangular lattice. The symmetrized site-centric descriptors can serve as inputs to any regression model that parameterizes the site energy E_i⁵¹:

$$E(vec{sigma })=mathop{sum}limits_{i=1}^{N}{E}_{i}({{Theta }_{{Omega }_{vec{alpha }}},{Theta }_{{Omega }_{vec{beta }}},cdots ,})$$

(6)

Although there are an infinite number of site-centric descriptors, in practice, the number of cluster functions must be truncated. Linear CE models typically enumerate cluster functions up to a maximal cluster size and cluster radius before fitting the CE model with techniques such as compressed sensing, genetic algorithms etc. Non-linear CE models may be advantageous as they have been found to converge at significantly smaller cluster sizes than linear models⁵¹.

Constructing multicomponent cluster expansions with machine-learning and chemical embedding — **Fig. 1: Schematic illustration of a triangular lattice with two symmetrically distinct pair clusters.**

Though exact and complete, the cluster expansion formalism is not easily amenable to capture atomistic interactions in multicomponent alloys. The number of cluster functions in a cluster containing k sites with c chemical species scales polynomially as (c−1)^k. As shown in Fig. 2, in alloys with over 5 chemical species, the number of features can exceed a thousand even for relatively small cluster sizes. The rapid growth in the number of cluster functions with chemical complexity is thus a major impediment to parameterizing accurate multicomponent cluster expansions.

**Fig. 2: Polynomial increase in the number of features with the number of alloying elements.**

Encoding chemical similarity through a linear transformation of site basis functions

Multicomponent alloys often exhibit additional degeneracies that arise due to chemical similarities between elements. This is typically manifested as relationships between interaction coefficients of the multicomponent CE. For instance, consider a ternary alloy with three chemical elements A, B and C that can occupy each site of the triangular lattice (Fig. 1). If the elements B and C are chemically similar, we would expect that the ECI of cluster functions that involve either the B or C element are related. This is readily seen in a CE that employs the occupation basis. Occupation site basis functions adopt the following values:

$${{boldsymbol{varphi }}}=left[begin{array}{ccc}{varphi }_{1}(A)&{varphi }_{1}(B)&{varphi }_{1}(C)\ {varphi }_{2}(A)&{varphi }_{2}(B)&{varphi }_{2}(C)\ {varphi }_{3}(A)&{varphi }_{3}(B)&{varphi }_{3}(C)end{array}right]=left[begin{array}{ccc}1&1&1\ 0&1&0\ 0&0&1end{array}right]$$

(7)

The matrix φ contains the value of the three site basis functions φ₁, φ₂, φ₃. The columns of the matrix correspond to the values of the basis functions if either A, B or C occupies the site. Assuming that nearest neighbor pair interactions are sufficient to describe the ordering energies in the ternary alloy, the formation energy of a configuration is given by:

$$begin{array}{ll}E(vec{sigma }),=,N{J}_{0}+{J}_{B}mathop{sum}limits_{i}{varphi }_{2}({sigma }_{i})+{J}_{C}mathop{sum}limits_{i}{varphi }_{3}({sigma }_{i})\ qquadqquad,+,{J}_{BB}^{NN}mathop{sum}limits_{i,jin NN}{varphi }_{2}({sigma }_{i}){varphi }_{2}({sigma }_{j})\ qquadqquad,+,{J}_{CC}^{NN}mathop{sum}limits_{i,jin NN}{varphi }_{3}({sigma }_{i}){varphi }_{3}({sigma }_{j})\ qquadqquad,+,{J}_{BC}^{NN}mathop{sum}limits_{i,jin NN}({varphi }_{2}({sigma }_{i}){varphi }_{3}({sigma }_{j})+{varphi }_{2}({sigma }_{j}){varphi }_{3}({sigma }_{i}))end{array}$$

(8)

where J₀ is the energy of the empty cluster, J_B, J_C are the point energies of the B and C chemical elements, ({J}_{BB}^{NN},{J}_{CC}^{NN},{J}_{BC}^{NN}) are the pair energies of a B-B, C-C and B-C pair respectively. The chemical similarity of B and C should manifest in the energy expansion as equalities between the interaction coefficients. Specifically, J_B = J_C and ({J}_{BB}^{NN}={J}_{CC}^{NN}={J}_{BC}^{NN}). In practice, the relationships between ECI are learnt based on a training dataset of electronic structure calculations, but are never exploited.

The degeneracies between ECI for this system can be used to learn a simpler CE. Consider the following linear transformation of the site basis functions of eq. (7):

$$begin{array}{lll}qquadqquad{{mathcal{T}}}vec{varphi }({sigma }_{i})qquadquadqquadquad=,vec{tilde{varphi }}({sigma }_{i})\ left[begin{array}{l}1quad0quad0\ 0quad1quad1end{array}right]left[begin{array}{l}{varphi }_{1}({sigma }_{i})=1\ quad{varphi }_{2}({sigma }_{i})\quad {varphi }_{3}({sigma }_{i})end{array}right],=,left[begin{array}{l}qquadquad1\ {varphi }_{2}({sigma }_{i})+{varphi }_{3}({sigma }_{i})end{array}right]end{array}$$

(9)

where the original site basis functions, (vec{varphi }({sigma }_{i})), have been transformed by the matrix ({{mathcal{T}}}), into a set of new site basis functions (vec{tilde{varphi }}({sigma }_{i})). The transformed site basis functions, (vec{tilde{varphi }}({sigma }_{i})) span a two-dimensional sub-space within the three-dimensional space of the original site functions. The projected space is embedded as shown in Fig. 3. The new site functions are not mathematically complete and evaluate to identical values for both the B and C specie:

$$tilde{{{boldsymbol{varphi }}}}=left[begin{array}{ccc}{tilde{varphi }}_{1}(A)&{tilde{varphi }}_{1}(B)&{tilde{varphi }}_{1}(C)\ {tilde{varphi }}_{2}(A)&{tilde{varphi }}_{2}(B)&{tilde{varphi }}_{2}(C)end{array}right]=left[begin{array}{ccc}1&1&1\ 0&1&1end{array}right]$$

(10)

In the site basis function space of (vec{tilde{varphi }}), it is impossible to distinguish between the B and C specie. However, a CE constructed through tensor products of the transformed site basis functions, (vec{tilde{phi }}) can reproduce the ordering energies of the three chemical species on a triangular lattice:

$$begin{array}{ll}E(vec{sigma }),=,N{J}_{0}+{J}_{B}mathop{sum}limits_{i=1}^{N}{tilde{varphi }}_{2}({sigma }_{i})\ qquadqquad,,+,{J}_{BB}^{NN}mathop{sum}limits_{i,jin NN}^{N}{tilde{varphi }}_{2}({sigma }_{i}){tilde{varphi }}_{2}({sigma }_{j})end{array}$$

(11)

As the transformed site basis functions evaluate to identical values for both the B and C specie, the CE of Eq. (11) will compute the exact same energy for orderings where B atoms are replaced with C or vice-versa. Embedding the “chemical similarity” of elements B and C into the site basis functions allowed us to reduce the number of cluster basis functions from 6 to 3. Though chemical rules are often not known a-priori, they can be simultaneously learnt together with the interaction coefficients based on a small pool of electronic structure calculations.

**Fig. 3: Effect of the embedding matrix on site basis functions.**

Embedded cluster expansions (eCE)

In general, any set of linearly independent and complete site basis functions can be embedded in a lower dimensional space through a linear transformation:

$$vec{tilde{varphi }}({sigma }_{i})={{mathcal{T}}}vec{varphi }({sigma }_{i})$$

(12)

(vec{varphi }({sigma }_{i})) is a vector of size c × 1, the transformed site basis (vec{tilde{varphi }}({sigma }_{i})) is a vector of size k × 1 and the transformation ({{mathcal{T}}}) is a matrix of size k × c, where k ≤ c and the rank of the transformation matrix is k. The elements of ({{mathcal{T}}}) linearly mix the site functions of (vec{varphi }) to obtain a lower-dimensional site basis. Maintaining the hierarchy of the cluster expansion requires that the constant site basis function remains in the transformed site basis. This can be enforced by fixing the first row and column of ({{mathcal{T}}}) to [1, 0, 0 ⋯ ]. The remaining entries of the transformation are learnable parameters of the model.

Cluster functions at each site are computed from the transformed site basis functions similar to eq. (2) and symmetrized as detailed in eq. (5):

$${tilde{Theta }}_{{Omega }_{alpha }^{i}}=mathop{sum}limits_{vec{delta }in {Omega }_{alpha }^{i}}mathop{prod}limits_{(j,nu )in vec{delta }}{tilde{varphi }}_{nu }({sigma }_{j})$$

(13)

Site-centric energies can then be expanded in terms of the symmetrized cluster functions:

$$E(vec{sigma })=mathop{sum}limits_{i=1}^{N}{E}_{i}({{tilde{Theta }}_{{Omega }_{alpha }^{i}},cdots ,})$$

(14)

The symmetrized cluster functions are invariant under all symmetry operations of the disordered phase⁵¹ and can be used as inputs to any regression method to parameterize E_i.

We refer to the CE formalism that embeds the site basis functions in a lower dimensional space as embedded cluster expansions or eCE. Throughout this study, we will indicate the number of effective chemical elements, i.e. the number of rows in ({{mathcal{T}}}) with a number. For instance, 2-eCE refers to a model with the site functions embedded in a two-dimensional space. Though the model is mathematically incomplete, exploiting chemical similarities between the elements significantly reduces the complexity of the energy expansion and results in fewer site-centric descriptors. For instance, 2-eCE models have an identical number of descriptors as a binary cluster expansion. A 2-eCE model of a 6-component alloy with all clusters shown in Fig. 2 will require ≈10¹ features, while the exact CE will contain ≈10³ descriptors.

The elements of the transformation matrix and regression coefficients for the energy model can be simultaneously learnt by minimizing a loss function through gradient descent techniques:

$${{mathcal{L}}}= mathop{argmin}limits_{{{boldsymbol{w}}},{{mathcal{T}}}}mathop{sum}limits_{vec{sigma }}{left({E}^{DFT}(vec{sigma })-mathop{sum}limits_{i = 1}^{N}{E}_{i}^{eCE}(vec{sigma },{{boldsymbol{w}}},{{mathcal{T}}})right)}^{2}+{{{mathcal{L}}}}_{reg}$$

(15)

where the first term is a least-squares error between the computed formation energies and the values predicted by the model, ({{{mathcal{L}}}}_{reg}) is a regularization term to prevent overfitting, w are the energy weights and ({{mathcal{T}}}) is the transformation matrix.

We demonstrate the advantages of the eCE formalism in a senary V-Nb-Ta-Cr-Mo-W alloy. This senary alloy is of current interest for high-temperature applications^44,45,46,47. Mixtures of elements in groups 5 and 6 of the periodic table form disordered solid solutions on the body-centered cubic crystal structure, or ordered Laves phases⁵². We focus on the configurational thermodynamics of orderings on the bcc crystal structure for the rest of this study.

Hyperparameter optimization

Parameterizing embedded cluster expansions (eCE) with Eqs. (12)–(14) requires several hyperparameters to be carefully tuned. Figure 4 shows the variation in validation error with the number of effective chemical species in the eCE (i.e. the embedding dimension), the number of symmetrically distinct clusters, and the number of datapoints in the training dataset. The CE models of Fig. 4a are trained to reproduce formation energies of 2936 randomly sampled datapoints. Separate models were parameterized over 10 random instantiations of the training dataset. The site energies of eCE models are computed with neural networks. Additional information about the neural network parameters, and regularization are provided in the methods section.

**Fig. 4: Learning curves for eCE and conventional CE.**

The mean validation error in Fig. 4a is computed over 1147 configurations that are not included in the training set. A cluster expansion containing two effective chemical species, 2-eCE, results in high validation errors of ≈15meV/atom. As 2-eCE embeds the 6 linearly independent site basis functions in a two-dimensional sub-space, this model is similar to a “binary” cluster expansion. Unlike a conventional binary cluster expansion where one of the site functions can take two possible values, the site function in 2-eCE can take six distinct values. The saturation in validation error with increasing number of clusters suggests the model with two site basis functions lacks the chemical flexibility to reproduce the ordering energies of this senary alloy. Increasing the dimensionality of the projected site basis functions to three lowers the validation error to ≈3 meV/atom. In fact, the “ternary” 3-eCE model is essentially as accurate as cluster expansions with six linearly independent site basis functions (purple line in Fig. 4a). The relatively small spread in validation errors for eCE models suggests that the models are not very sensitive to the exact configurations included in the training dataset.

Figure 4 a also compares a linear CE model (parameterized with ridge regression) to non-linear models that use neural networks. Similar to a previous study⁵¹, the linear model requires more cluster basis functions to reach accuracies comparable to non-linear models. Parameterizing a linear cluster expansion model with triplet cluster sizes larger than 4Å was prohibitively expensive due to the large memory requirements needed for ridge regression. Cluster functions built with pair clusters up to a size of 10Å and triplet clusters with a size of 4Å are found to be highly accurate for non-linear models. Linear eCE models are found to have significantly higher prediction errors than non-linear eCE models that employ neural networks to compute site energies.

Having identified the optimal number of clusters, we next study the variation in validation error with the size of the training dataset. Figure 4b shows learning curves of eCE models with projection dimensions ranging from 2-6. 2-eCE models trained with up to 3000 datapoints have validation errors of over 15 meV/atom. Interestingly, ≈500−1000 randomly chosen training datapoints are sufficient to reproduce the complex chemical interactions in the senary alloy for eCE models that contain 3 or more embedding dimensions. The validation errors for a linear senary CE are higher than the non-linear models.

Often, despite low validation errors, atomistic models may fail to reproduce phase stability at low temperatures. Figure 5 depicts the formation energies of orderings in three binary alloys. DFT calculations (gray circles in Fig. 5) predict a single binary ground state in Cr-Ta, no stable ground states in the Nb-V alloy, and several stable ground states in the Mo-Ta alloy system. A 3-eCE model trained with 2936 datapoints and 14 symmetrically distinct clusters reproduces all the salient features of the zero Kelvin energies in binary alloys (green crosses in Fig. 5). The stable states in Cr-Ta and Nb-V are exactly reproduced by the 3-eCE model, while most ground states are captured by the 3-eCE in Mo-Ta. Minor discrepancies between eCE and the DFT calculations may be due to fitting errors or small numerical errors in the electronic structure calculations. Nevertheless, there is excellent agreement between the 3-eCE model and DFT, indicating that eCE models are able to capture thermodynamic ground states in addition to the overall ordering energetics of multicomponent alloys.

**Fig. 5: Formation energies in the Cr-Ta, Mo-Ta and Nb-V binary alloys.**

Extrapolating in chemical space

Training datasets for conventional CE models usually contain orderings that span the entire composition space of an alloy. As a result, multicomponent alloys with 3 or more chemical species can require orders of magnitude more data than simpler binary alloys. eCE models are able to leverage chemical similarities between alloying elements to extrapolate into unsampled regions of composition space. Figure 6 compares two models that are trained on all configurations except those containing both molybdenum and tantalum. The linear CE and 3-eCE models should be unable to learn the interactions between Mo and Ta as configurations containing both elements are not included in the training dataset. Surprisingly, the 3-eCE model in Fig. 6 reproduces the ordering energies of binary Mo-Ta configurations. 3-eCE is able to reproduce several ground states and the overall shape of the convex hull. In contrast, the conventional linear CE of Fig. 6 fails to reproduce the energies of binary orderings. The failure of the conventional CE is not surprising as the model has to extrapolate into the binary Mo-Ta space. On the other hand, the 3-eCE model learns from chemical similarities in the dataset to effectively extrapolate into unseen composition spaces.

Figure 7 benchmarks the extrapolation ability of eCE models across all possible pairs of left-out elements. Similar to Fig. 6, 15 separate training datasets were constructed by including all configurations from the senary dataset, except those containing a specific pair of elements. Each training dataset was then used to parameterize five eCE models with chemical embedding dimensions ranging from 2 to 6. A linear senary CE was also parameterized for each dataset. The linear CE serves as a baseline against which we compare the predictions of eCE. The validation errors computed over all left-out data and the left-out binary configurations are shown in Fig. 7. The benchmarks of Fig. 7 are similar to the leave one out cross validation (LOOCV) metric. Rather than individual datapoints being left out of the dataset, entire alloys are used to cross-validate models.

**Fig. 7: Extrapolating the formation energies over left-out element pairs.**

Figure 7 clearly demonstrates that eCE models are able to extrapolate into unseen composition spaces with significantly higher accuracy than conventional CE models. Average extrapolation errors range from ≈30 meV/atom for a linear CE to ≈8 meV/atom for 3-eCE models. eCE models with 3 or 4 embedding dimensions are found to have the smallest extrapolation error, while all other eCE models have significantly larger errors. Although the average energy errors of 3-eCE models are small, we do find some degree of sensitivity to the exact pair of elements left out of the training dataset. Figure 7 shows the range of energy errors across all 15 pairs of elements that are left out of the training dataset. For instance, the 3-eCE model is found to have extrapolation errors on left-out binary configurations that range from ≈5−30 meV/atom. The binary alloy comprised of the left-out elements corresponds to the most challenging datapoints to reproduce with eCE models. As a result the average prediction errors over configurations containing just the left-out elements (green bars in Fig. 7) is higher than the prediction error over configurations containing other elements in addition to the left-out pair. The highest extrapolation errors are found to occur for binary alloys containing either Cr and V or Cr and Nb. This suggests that even with chemical compression, the bonding between some elements may be too complex to extrapolate from interactions in other alloys.

Embedded cluster expansions are able to utilize chemical similarities to achieve low prediction errors in chemical spaces that would be entirely extrapolative when using conventional CE. For instance, as shown in Fig. 7, eCE models with embedding dimensions of 3 and 4 perform significantly better than eCE models with higher embedding dimensions. As the number of effective chemical species increase, the eCE model treats chemical species with a greater degree of independence. Thus, higher dimensional embeddings lose the ability to leverage chemical trends to lower prediction errors. In Fig. 7, where all configurations containing a pair of elements are left out, 6-eCE models perform poorly as they are fully extrapolative. In contrast, 3-eCE models utilize chemical trends to achieve lower prediction errors.

Similar to the CALPHAD method, CE models of multicomponent alloys can be parameterized starting from training data that spans lower-order constituent systems such as binaries, ternaries, etc⁵³. Figure 8 tests the ability of eCE models to extrapolate into multicomponent space when it is trained on lower-order chemical spaces. All models in the first panel of Fig. 8 were parameterized with a dataset containing only binary orderings of the elements in groups 5 and 6 of the periodic table. The models were then tested on higher-order composition spaces containing 3 or more elements. As shown in Fig. 8, 3-eCE models have extrapolation errors ranging between 13-16 meV/atom, with the more complex orderings being better determined than the simpler ternary orderings. In contrast, 2-eCE and linear CE models have significantly larger errors. Similar to Figs. 4, 6 and 7, 2-eCE lacks sufficient flexibility to capture the chemical interactions of this alloy system. The energies of binary datapoints are likely insufficient to accurately describe the multicomponent energetics in the senary alloy as 3-eCE models parameterized over separate random instantiations resulted in errors ranging from 10 to 20 meV/atom.

**Fig. 8: Extrapolation error in multicomponent chemical space.**

Adding configurations with orderings of up to 3 elements drastically lowers the extrapolation error of most CE models. In fact, adding configurations beyond ternary orderings shows only marginal improvement in extrapolation errors (≈3–4 meV/atom). While configurations with multiple alloying elements may be critical to accurately capturing low-energy ground states, the results of Fig. 8 suggest that the senary alloy formed from elements in groups 5 and 6 of the periodic table are primarily composed of unary, binary and ternary interactions. This is in excellent agreement with our findings from previous sections indicating that 3-eCE models are able to accurately reproduce the ordering energetics of this senary alloy.

Finite-temperature predictions

Surrogate models such as CE are ultimately used to compute finite-temperature quantities such as free energies, heat capacities, or short-range order parameters. Figure 9 compares the nearest neighbor Warren-Cowley short-range order parameters (SRO) in the Cr-W binary alloy computed with a linear CE to the values obtained from a 3-eCE model. The linear CE was parameterized with binary orderings of Cr and W in our dataset. Clusters containing up to 4 sites and a distance of 5.5Å were included in the model that achieved an RMSE of 3.5 meV/atom. The 3-eCE model is identical to the model used in Fig. 5. Canonical Monte-Carlo simulations are employed to compute ensemble averages of the Cr-W SRO as a function of temperature. Both models show essentially identical values of the SRO at elevated temperatures and start to slightly deviate at temperatures approaching ≈1600 K. The Cr-W SRO value is found to be slightly negative, indicating an elevated number of Cr-W pairs as compared to the truly disordered alloy. Similar values for the Cr-W alloy have been computed recently with both CE and off-lattice interatomic potentials³⁶. The agreement in finite-temperature properties indicates that the 3-eCE model has comparable accuracies to a conventional CE.

**Fig. 9: Warren-Cowley short-range order (SRO) parameters for Cr-W pairs in an equiatomic Cr-W binary alloy.**

Having established the accuracy of eCE, we employ the 3-eCE model to compute finite-temperature SRO values in an equiatomic senary V-Nb-Ta-Cr-Mo-W alloy (Fig. 10). Figure 10a shows the SRO values for elements plotted based on their group numbers. Pairs of elements in the same group display positive values of SRO, corresponding to fewer nearest neighbor pairs as compared with a random alloy. Choosing one element from group 5 and another from group 6 results in negative values of SRO. The short-range order parameter values for all pairs of elements at 3000 K are shown in Fig. 10b. Element pairs involving one element from group 5 and another from group 6 either have SRO values close to zero or are strongly negative even at elevated temperatures. Our finite-temperature simulations suggest that pairs of elements from groups 5 and 6 are attractive and there should be an elevated number of such pairs even at elevated temperatures. In turn, this causes a decrease in the number of element pairs from the same group.

**Fig. 10: Warren-Cowley short-range order (SRO) parameters in an equiatomic senary V-Nb-Ta-Cr-Mo-W alloy.**

Discussion

Cluster expansions are the tool of choice to study phase transformations and finite temperature properties of multicomponent alloys. The embedded cluster expansion (eCE) model introduced in this study leverages chemical similarities between elements to construct atomistic models for alloys containing several elements. The eCE model simultaneously learns a lower-dimensional embedding of site basis functions along with the regression coefficients of a site-centric energy model. The site energies within eCE use cluster functions constructed from the transformed site functions that lie in a lower-dimensional space. As fewer site functions are required to describe occupants at any given site, there is a drastic reduction in the number of cluster functions. The results of Fig. 4 show that eCE models can reach accuracies comparable to conventional CE models. Zero Kelvin phase stability predicted by eCE models is also found to be quantitatively accurate (Fig. 5). Allowing the model to learn chemical similarities between elements enables robust extrapolation into unsampled chemical spaces. Despite leaving pairs of elements out in Figs. 6, 7, eCE models capture the energetics of the left-out alloy. The results of Fig. 8 suggest that orderings from simpler chemical sub-systems may be sufficient to capture multicomponent interactions in concentrated alloys. Finite temperature properties such as short-range order are also found to be well-reproduced by eCE models (Fig. 9), allowing us to investigate trends in SRO for the multicomponent senary alloy (Fig. 10).

The values of the site basis functions learnt by an eCE model can shed light on chemical similarities between alloying elements. Site function values learned by a 3-eCE model for the six refractory elements in groups 5 and 6 of the periodic table are shown in Fig. 11. 10 separate 3-eCE models were parameterized starting from an identical initialization for the elements of the embedding matrix (({{mathcal{T}}}) in Eq. (12)). The initial values of the site basis functions for each element are shown as circles and the final values learnt by the 3-eCE model are represented as triangles in the figure.

**Fig. 11: Site basis function values learnt by 3-eCE models.**

Figure 11 shows several chemical trends across all 3-eCE models. Pairs of elements, such as molybdenum and tungsten or tantalum and niobium have similar site basis function values. Chromium and vanadium on the other hand are separated from the site basis function values of the other elements. Both elements are not found to cluster with any other element in Fig. 11. The clustering of site basis function values can be correlated with chemical similarities between elements. For example, molybdenum and tungsten have similar metallic radii and belong to group 6 of the periodic table. This results in very similar chemical interactions as reflected by embedded values of the site basis functions for both elements in Fig. 11. In contrast, the other element of group 6, chromium, is smaller than Mo and W. This results in qualitatively different interactions of Cr and causes the model to separate the element in the projected space. The elements of group 5 in the periodic table have very different site basis function values than the elements of group 6. The chemically similar elements, niobium and tantalum, have embedded site function values that are in close proximity, while vanadium, that is smaller than both elements is clearly differentiated in Fig. 11. The lack of elements similar to Cr and V may be related to the large extrapolation errors observed for Cr-V containing configurations in Fig. 7. Some degree of scatter is evident over the models due to the stochasticity of the gradient descent technique used to minimize eq. (15). Nevertheless, all parameterizations show similar trends.

Our results also suggest that learning the transformation matrix, ({{mathcal{T}}}), and careful initialization of the embedding matrix is crucial to obtaining predictive models. Figure 12 compares the validation errors of four different learning schemes. Two sets of models are parameterized while allowing for the transformation matrix to be learnt during model training. In the other two groups of models, the embedding matrix is fixed to its initial value. We also attempt two different initialization strategies. The first group of models are initialized based on similarities in the chemical properties of each element (details are outlined in the methods section). Random orthogonal projection vectors are used for the initial embedding matrix in the other models. Figure 12 shows the range of validation errors obtained over 10 instantiations of a 3-eCE model. Random initializations result in a large variance of the validation error. Difficulty in learning predictive eCE models from random initializations could be due to the existence of multiple local minima of the loss function. All models that are initialized with a transformation matrix containing some chemical information are able to achieve significantly lower prediction errors than random embedding matrices. The 3-eCE model with the lowest validation error that was initialized with a random projection matrix was also able to learn elemental similiarities like those shown in Fig. 11. Interestingly, a learnable transformation matrix seems to be necessary to enhance the predictive power of the model. Comparing the models with chemically informed initializations of the transformation matrix, Fig. 12 suggests that learning the best embedding matrix could lower errors by ≈3 meV/atom. This can also be seen in the reduced spread of validation errors for random initializations with a learnable embedding matrix. While the benefits of learning the embedding matrix for the senary refractory alloy system are not very large, this may be important for materials with more alloying elements.

**Fig. 12: Effect of the embedding matrix initialization and learning scheme on validation error.**

The eCE framework of eqs. (12) to (14), is similar to recently proposed chemical embedding schemes for off-lattice interatomic potentials^{54,55,56,57,58}. The application of chemical compression schemes to off-lattice models have enabled researchers to investigate alloys containing several elements^59,60. Parameterizing such off-lattice models can be very expensive, often requiring tens of thousands of calculations. Additionally, as suggested in a recent study⁵⁹, resolving the small energy differences between competing intermetallic orderings with a general interatomic potential can be challenging. The eCE surrogate model could provide sufficient energy accuracy and computational speed to bridge this gap and allow researchers to study complex alloy thermodynamics. As shown by Fig. 4, relatively small training datasets are required to parameterize on-lattice models with chemical compression schemes. The eCE models are sufficiently flexible to extrapolate into higher dimensional composition spaces. Additionally, computing finite-temperature properties from eCE models is relatively straight forward and computationally cheap. This could enable alloy designers to rapidly screen materials for compositions with desirable properties through eCE based surrogate models. Promising alloy chemistries that require more accurate simulations that account for all sources of entropy can subsequently be studied with bespoke interatomic potentials.

The chemical flexibility of the eCE formalism and the smaller dataset sizes needed to parameterize these models will enable the systematic exploration of high-dimensional composition spaces. eCE models will provide significant advantages against conventional CE in alloys where some chemical trends (similarities or dissimilarities) exist between groups of elements. In materials where elements are chemically uncorrelated or with very complex chemical trends, eCE models may require larger embedding dimensions, perhaps even approaching the number of elements in the alloy. Benchmarks such as the learning curves of Fig. 4 can be used to discern the appropriate dimensionality of the embedding space.

The eCE formalism is also subject to several of the same restrictions as conventional CE. For instance, alloys with long-range interactions, or significant structural relaxations will continue to remain challenging to parameterize with eCE models. The problem can be somewhat alleviated through structure matching algorithms⁶¹ to prune large relaxations out of training databases and the explicit inclusion of additional terms accounting for long-range interactions^39,62. Further, all significant sources of entropy will need to be included in the formalism to enable a rigorous comparison with experiment. This may require the coupling of eCE models to other site degrees of freedom such as magnetic moments, vibrations or lattice distortions. The extension of the eCE model to such coupled effective Hamiltonians can be done similarly to existing methods that couple site occupancy with other discrete or continuous degrees of freedom.

Methods

DFT calculations

Formation energies of 4083 symmetrically distinct orderings between elements of groups 5 and 6 (Cr-Mo-Nb-Ta-V-W) are calculated with the generalized gradient approximation (GGA-PBE) to density functional theory (DFT) and projector augmented-wave (PAW) pseudopotentials as implemented in the Vienna Ab-Initio Simulation Package (VASP)^63,64,65,66. A plane-wave cutoff energy of 550 eV with a k-point grid density of 55Å and smearing of 0.1 eV were used to relax the positions of atoms and lattice parameters of all orderings. Symmetrically distinct orderings on a parent bcc crystal structure are enumerated with the CASM code⁴. 2487 symmetrically distinct orderings on supercells of sizes up to 12 are enumerated within the binary and ternary sub-systems of the senary V-Nb-Ta-Cr-Mo-W alloy. Symmetrically distinct equiatomic orderings in the quaternary, quinary, and senary alloys are enumerated in supercells containing up to 6 atoms. 387 random arrangements of the 6 elements in a supercell containing 8 atoms are also included in the training dataset. SRO values are computed with canonical Monte-Carlo simulations performed in a 10 × 10 × 10 supercell of the conventional bcc cell. The short-range order parameters are computed by averaging over 1000 Monte-Carlo passes.

Embedded cluster expansion

The embedded cluster expansion was implemented in python using the PyTorch library. Gradient descent of the loss function in eq. (15) is performed with the stochastic Adam algorithm for 100 epochs. A learning rate scheduler is applied and overfitting is controlled through the L2-regularization. A graph for the site energy is built within PyTorch starting from Chebyshev site basis functions that are projected into a lower dimensional space through a learnable embedding matrix. Symmetrized site-centric cluster functions constructed as tensor products of the embedded site basis functions are used as input to a 4-layer neural network (32 × 32 × 8 × 1) that uses the ReLU activation function for each node except the final layer, where we use a linear activation function. The rows of the learnable linear transformation ({{mathcal{T}}}) are re-normalized after each iteration.

eCE models were initialized with a projection matrix, ({{mathcal{T}}}) computed from chemical properties of each element. 8 elemental properties (atomic number, radius, electronegativity, density, melting point, bulk modulus, Youngs modulus and Brinell hardness) were collected for each element from pymatgen⁶⁷. The material properties for each element were used to form the columns of a matrix, A. The rows of A were standardized to have zero mean and a standard deviation of 1. An embedding matrix, with an embedding dimensionality of k was initialized with the first k right-singular vectors of A.