Discovery of new topological insulators and semimetals using deep generative models

Introduction
Topological materials are materials whose electronic band structures have a special topological arrangement, which can produce robust surface states and unconventional electromagnetic behavior1,2. Topology here means that the electronic structure of the material does not change under parameter tuning unless an energy gap is opened. These materials, encompassing topological insulators (TIs)3,4,5 and topological semimetals (TSMs)6,7,8,9, have attracted considerable attention over recent decades due to their extraordinary properties. They host robust boundary states that are resilient to the perturbing effects of static disorder, which distinguishes them from conventional materials. Moreover, their unique features, including the topological magneto-electric effect and anomalous transport phenomena such as the quantum anomalous Hall effect10,11,12, highlight their potential to advance our understanding of condensed matter physics and spintronics. These novel electronic properties give topological materials great potential for the development of dissipationless electronic11 and spintronic devices13. Since the inception of this field, the search for new TIs and TSMs has been a pivotal area of research. Progress has predominantly hinged on first-principles calculations anchored in topological band theory14,15. In particular, the advent of theories such as symmetry indicators16 and topological quantum chemistry (TQC)17 has led to the discovery of a large number of topological materials and the establishment of numerous databases1,18,19 based on high-throughput computation. The burgeoning volume of data has in turn allowed machine learning to be used to address the limitations of theoretical classification frameworks in terms of computational speed20, topology determination21, and low-symmetry structure classification22.
Despite these notable advancements, current methodologies predominantly focus on identifying topological materials within pre-existing databases1,18,19,20,22. Discovering novel topological materials that lie beyond these databases remains a significant challenge. This challenge is particularly pronounced for low-symmetry chiral Kramers-Weyl semimetals, which break the twofold spin degeneracy guaranteed by traditional symmetry rules such as Kramers' theorem, thereby leading to the appearance of new topological phases23,24. At the same time, materials with low-symmetry topological band structures are recognized as key candidates for achieving field-free switching and energy-efficient technologies due to their unique electronic properties. The low symmetry of these materials leads to unconventional spin polarization, which can be induced by current, enabling precise control of the magnetization state without the need for an external magnetic field25,26,27. In light of these considerations, it is critically important to develop a robust, data-driven approach that can simultaneously generate and diagnose novel and stable TIs and TSMs and that transcends the restrictions imposed by symmetry rules.
Inverse design28,29,30,31 is a data-driven strategy that combines available data on a desired material functionality with domain knowledge and artificial intelligence to discover materials that exhibit this functionality31. This strategy has emerged as a leading approach in novel materials research. A critical challenge in inverse design is searching for the desired materials among countless possible materials whose properties depend on chemical composition and crystal structure32. One solution is to apply generative machine learning models to the design of new candidate structures. However, these generative models often struggle to capture translationally and rotationally invariant representations when dealing with crystalline materials28,33,34,35,36,37,38,39. To overcome this limitation, the crystal diffusion variational autoencoder (CDVAE)40 adopts diffusion models41,42 to directly generate the atomic coordinates of crystalline lattice structures with these invariant representations and employs periodic graph neural networks (PGNNs)43 as the backbone of variational autoencoders44, enhancing its capacity to handle complex structures. These PGNNs enable CDVAE to respect the symmetries of crystals by ensuring invariance under periodic rotation, reflection, and translation, leading to higher stability and reduced screening costs. Compared to traditional methods such as trial and error, random search, and density functional theory (DFT) approaches, the advantage of diffusion-based models45,46,47 is that they can efficiently generate diverse and highly realistic material structures at an extremely low computational cost. This is because diffusion-based models generate novel materials by following the distribution of the available materials in the datasets used, which is usually called data-driven discovery.
This approach has been integrated into the inverse design framework, significantly enhancing the discovery of a wide array of materials, including one-dimensional structures48, two-dimensional (2D) materials32 and high-critical temperature superconducting materials49. Despite these advancements, to the best of our knowledge, the exploration and identification of new topological materials with generative models remains unexplored.
In this work, we demonstrate that generative machine-learning models can also be used in the quest for new topological materials. We introduce a data-driven inverse design method CTMT tailored for the discovery of novel TIs and TSMs. CTMT synergistically combines several cutting-edge technologies: the aforementioned CDVAE, Topogivity22, interatomic potentials (IAPs) as realized in M3GNet50, and TQC. This integrative approach enables CTMT not only to generate but also to effectively identify stable topological materials beyond pre-existing ones. Our method has proven its efficacy by finding 4 TIs and 16 TSMs that are absent in current material databases, including 4 chiral Kramers-Weyl semimetals. These findings exhibit the potential of CTMT as a universal tool for the exploration and discovery of novel topological materials, paving the way for uncovering even more diverse and rare material types.
Results
Workflow
Figure 1 illustrates the main framework of CTMT, including four sequential functional blocks: generation, filtering, stability verification, and topology type classification, for the systematic discovery and validation of new TIs and TSMs.

a The dataset of topological materials, b the Crystal Diffusion Variational Autoencoder (CDVAE), c the novelty and legitimacy check unit, d the Topogivity check unit, e DFT calculations of the formation energy (Eform) and the energy above hull (Ehull), f fast phonon spectrum scanning by M3GNet, g the final TQC classification, which determines h the 20 stable topological materials, including 16 TSMs and 4 TIs.
Generation of Crystal Structures
The purpose of the first functional block is to generate new crystal structures. The training dataset is taken from the topological materials database (https://www.topologicalquantumchemistry.fr/), as depicted in Fig. 1a, and includes 6,109 TIs and 13,985 TSMs. The original database also includes 18,090 trivial materials, which were excluded from the training dataset. Notably, this same training dataset has been employed in recent studies51,52,53. Supplementary Fig. S1 shows the atomic proportion percentage across the training dataset. The trained CDVAE model generates 10,000 highly realistic candidates of potential topological materials based on Langevin dynamics sampling42 (Fig. 1b, see Methods section for details). The 10,000 generated candidates are fed into the filter block, as shown in Fig. 1c.
Filtering Process
The filter block checks the novelty, legitimacy, and likelihood of nontrivial topology of the candidates. The novelty check eliminates materials that have the same chemical formula and structure as an entry in the dataset, using the StructureMatcher class of pymatgen54. The main parameters for StructureMatcher were set as follows: ltol at 0.2, stol at 0.3, and angle_tol at 5. The legitimacy of materials is determined by using the SMACT package55 to verify whether their stoichiometric chemical formulas satisfy charge neutrality and electronegativity balance. The third check uses pymatgen54 to assess structural validity by verifying that all bond lengths in the crystal are larger than 0.5 Å. After that, 4,715 valid candidates pass the first three checks and go through the ‘Topogivity’22 check, as shown in Fig. 1d. Topogivity is a machine-learned chemical rule for discovering topological materials22. A given material is diagnosed with high accuracy (typically > 80%) as topologically nontrivial (trivial) if the weighted average of its elements’ Topogivities is positive (negative). Our present work uses the stricter criterion that the weighted average Topogivity should be greater than one, in order to select topologically nontrivial materials more reliably. In addition, candidates containing elements with 4f or 5f electrons are excluded due to the challenges in obtaining accurate results from DFT calculations, which arise from their complex electronic structures and the significant relativistic effects introduced by spin-orbit coupling (SOC). Furthermore, magnetic atoms are also excluded because of the additional computational procedures required to determine the magnetic ground state. The difficulty of generating materials with heavy elements and magnetic atoms is a potential obstacle that may influence results. This filtering gives 104 potential topologically nontrivial materials for further stability verification.
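The Topogivity criterion described above can be illustrated with a short Python sketch. It assumes that the weighted average is taken over element fractions in the chemical formula; the element values in EXAMPLE_TOPOGIVITY are hypothetical placeholders chosen for illustration (the actual values are tabulated in Ref. 22), and the function names are our own, not part of any published package.

```python
import re
from collections import Counter

# Hypothetical Topogivity values, for illustration only.
# The real per-element values are tabulated in Ref. 22.
EXAMPLE_TOPOGIVITY = {"Bi": 2.0, "Te": 1.5, "Na": -1.0, "Cl": -3.0}

THRESHOLD = 1.0  # the text requires a weighted average greater than one


def parse_formula(formula):
    """Parse a simple formula such as 'Bi2Te3' into element counts.

    Parentheses and fractional subscripts are not handled in this sketch.
    """
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def weighted_topogivity(formula, table):
    """Average the element Topogivities, weighted by element fraction."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    return sum(table[element] * n for element, n in counts.items()) / total


def passes_topogivity(formula, table, threshold=THRESHOLD):
    """Return True if the weighted average exceeds the chosen threshold."""
    return weighted_topogivity(formula, table) > threshold


if __name__ == "__main__":
    for formula in ("Bi2Te3", "NaBi", "NaCl"):
        g = weighted_topogivity(formula, EXAMPLE_TOPOGIVITY)
        print(formula, round(g, 3), passes_topogivity(formula, EXAMPLE_TOPOGIVITY))
```

With these placeholder values, a formula whose weighted average is positive but below one would be labelled nontrivial by the sign rule and still be rejected by the stricter threshold used here.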
Stability verification
The stability verification block examines the stability of the candidates. First, DFT calculations are performed, as shown in Fig. 1e, to calculate the formation energy and energy above hull. If a candidate has a non-negative formation energy, Eform ≥ 0 eV/atom, or an energy above hull of Ehull ≥ 0.16 eV/atom, it is considered thermodynamically unstable and is removed. The 57 candidates with Eform < 0 eV/atom and Ehull < 0.16 eV/atom are potentially synthesizable (see Methods section for details). Since thermodynamic stability alone is not sufficient to ensure candidate stability, phonon spectrum calculations must be conducted to check whether any of the 57 candidates has imaginary phonon frequencies. As the phonon spectrum calculation based on the Vienna ab initio simulation package (VASP)56 is computationally intensive, the pre-trained M3GNet50 is directly integrated into CTMT without any further training to complete the verification. Compared to other machine-learning interatomic potential methods such as MACE57 and CHGNet58, M3GNet can directly provide phonon spectra, without converting predicted forces and energies into second-order force constants, while ensuring high prediction accuracy59. Therefore, M3GNet is adopted in the present work for stability checks. As shown in Fig. 1f, the IAPs predicted by M3GNet are utilized to perform fast phonon spectrum calculations. This step filters out candidates with imaginary phonon frequencies and leaves 32 stable candidate materials (see Methods section for details).
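The two-stage stability screen can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the Candidate container and function names are hypothetical, the energies and phonon frequencies are assumed to have been computed beforehand, imaginary modes are assumed to be stored as negative frequencies (as phonopy reports them), and the tolerance parameter is our own addition, since the text does not state one.

```python
from dataclasses import dataclass
from typing import List

E_FORM_MAX = 0.0   # eV/atom; a candidate must have Eform < 0
E_HULL_MAX = 0.16  # eV/atom; a candidate must have Ehull < 0.16


@dataclass
class Candidate:
    label: str
    e_form: float              # formation energy, eV/atom
    e_hull: float              # energy above hull, eV/atom
    phonon_freqs: List[float]  # THz; imaginary modes stored as negative values


def thermodynamically_ok(c):
    """Apply the two energy thresholds quoted in the text."""
    return c.e_form < E_FORM_MAX and c.e_hull < E_HULL_MAX


def dynamically_ok(c, imag_tol=0.0):
    """Reject candidates with imaginary (negative) phonon frequencies.

    imag_tol is a hypothetical tolerance (THz) for numerical noise;
    the default of 0.0 rejects any negative frequency.
    """
    return min(c.phonon_freqs) >= -imag_tol


def is_stable(c, imag_tol=0.0):
    return thermodynamically_ok(c) and dynamically_ok(c, imag_tol)


if __name__ == "__main__":
    # Made-up candidates, for illustration only.
    pool = [
        Candidate("A", -0.40, 0.00, [0.0, 1.2, 3.4]),
        Candidate("B", -0.10, 0.16, [0.0, 2.0, 5.0]),   # Ehull at threshold
        Candidate("C", -0.30, 0.05, [-1.5, 0.8, 2.2]),  # imaginary mode
        Candidate("D", 0.02, 0.01, [0.0, 1.0, 2.0]),    # Eform >= 0
    ]
    print([c.label for c in pool if is_stable(c)])
```

The strict inequalities mean that a candidate sitting exactly at Ehull = 0.16 eV/atom is removed, which matches the thresholds as stated.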
Topology type classification
The fourth block, as shown in Fig. 1g, uses TQC on the Bilbao Crystallographic Server1 to ultimately determine the topology type of the materials, leading to the discovery of 20 new topological materials (Fig. 1h, see Methods section for details), including 16 TSMs and 4 TIs. TQC is a theory of the band structure of crystals that links the topological properties of a crystal to the electron orbitals at the Fermi level.
Topogivity
Figure 2 illustrates the distribution of atomic proportion percentages and Topogivity values across the 104 topologically nontrivial candidates that remain after the Topogivity check, offering key insights into the diversity of their compositions. The trained CDVAE model generates candidates with a wide range of compositions, as evidenced by the fact that the atoms in these 104 materials span most elements with known Topogivity. Interestingly, elements such as oxygen (O), fluorine (F), phosphorus (P), chlorine (Cl), bromine (Br), and iodine (I) are absent from the 104 materials, because they have large negative values of Topogivity22. Of the 32 stable candidates, 20 are confirmed to be topological, an accuracy rate of 62.5%. Although this accuracy is lower than the classification accuracy of Topogivity (82.4%)22, the topological materials discovered by CTMT are novel and have passed both thermodynamic and phonon stability checks. The integration of Topogivity into CTMT not only significantly reduces the material search space but also enhances the proportion of topological materials among the candidates recommended by the CDVAE model.

Elements enclosed within the gray box are not included in the candidates.
Stability
Figure 3a displays the stability verification in detail. The DFT calculations show that 77 of the 104 candidates (74%) have formation energies Eform < 0 eV/atom and are thus considered thermodynamically stable. Of these 77 candidates, 57 (again 74%) have Ehull < 0.16 eV/atom and are therefore likely to be synthesizable. Figure 3b, c show the energy distributions of Eform for the 77 materials with Eform < 0 eV/atom and of Ehull for the 57 materials with Ehull < 0.16 eV/atom, respectively. The formation energy is clustered around -0.4 eV/atom, indicating a general trend of energetic stability. Similarly, the energy above the hull is distributed around 0 eV/atom. Notably, materials with Ehull = 0 eV/atom are of particular interest as they signify the ground state or the most stable configuration achievable. In this work, there are 11 materials with Ehull = 0 eV/atom, whose phase diagrams are shown in Supplementary Figs. S2 and S3. Furthermore, the M3GNet integrated in CTMT estimates the phonon spectra and leaves only 32 structurally stable and synthetically feasible candidates out of the 57 potential candidates, a success rate of 56%, which is comparable to the 69% reported for the prediction of 2D materials32.

a The passing candidate number in each step. b The distribution of the formation energy of 77 materials with Eform < 0 eV/atom. c The distribution of the energy above hull of 57 materials with Ehull < 0.16 eV/atom.
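The pass rates quoted in this section follow directly from the candidate counts reported for each step of the workflow. The short Python sketch below recomputes them; the stage labels are our own shorthand, and only the counts are taken from the text.

```python
# Candidate counts after each step, as reported in the text.
FUNNEL = [
    ("generated by CDVAE", 10_000),
    ("novelty, legitimacy and validity checks", 4_715),
    ("Topogivity and element filters", 104),
    ("Eform < 0 eV/atom", 77),
    ("Ehull < 0.16 eV/atom", 57),
    ("no imaginary phonon frequencies", 32),
    ("topological according to TQC", 20),
]


def pass_rates(funnel):
    """Return (stage, count, percentage of the previous stage) tuples."""
    rates = []
    for (_, before), (name, after) in zip(funnel, funnel[1:]):
        rates.append((name, after, 100.0 * after / before))
    return rates


for name, count, rate in pass_rates(FUNNEL):
    print(f"{name:42s} {count:6d}  {rate:6.2f}%")
```

The two thermodynamic steps both round to 74%, the phonon step to 56%, and the final TQC step gives 62.5%, in agreement with the values stated above.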
Topological properties
The TQC methodology1 is finally applied to ascertain the topological classification of each stable structure, resulting in 16 TSMs and 4 TIs. We use the TQC method to obtain the characters of all bands at all the relevant high-symmetry points (the maximal k-vectors). If the characters at the relevant high-symmetry points do not satisfy the compatibility conditions, the material is an enforced semimetal. Each identified TSM is classified as an enforced semimetal with Fermi degeneracy (ESFD). The ESFD classification depends on whether the material has a high-symmetry point degeneracy at the Fermi level. All 16 TSMs have a high-symmetry point degeneracy at the Fermi level, while the TIs exhibit distinct topological invariant numbers. The physical behavior of bands in momentum space is interpreted using several kinds of topological invariant numbers. A topological invariant is a quantized number that characterizes the topological status of a given system; the Chern number and the Z2 number are examples. Topological invariant numbers are conserved as long as no topological phase transition occurs60. As for the definition of a TI in this context, according to the theory of TQC17, if the set of bands below the Fermi level cannot be expressed as a linear combination of elementary band representations (EBRs), then the material is identified as a TI. It is important to note that a TI identified in this framework may lack a band gap, as many examples in the topological materials database1,17,61 show. Unlike the conventional definition of an insulator, which requires a global band gap (i.e., a nonzero indirect band gap between the conduction band minimum and the valence band maximum), TQC only requires every high-symmetry k point in the Brillouin zone to exhibit a direct band gap, such that the occupied and unoccupied states can adiabatically evolve into a global band gap17.
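The decision logic described above can be summarized as a small function. This is a schematic sketch only: the three boolean inputs are hypothetical names for results that a TQC analysis would supply, and the plain "ES" label for a compatibility violation without a high-symmetry-point degeneracy follows common TQC terminology rather than anything reported in this section.

```python
from enum import Enum


class TopoClass(Enum):
    ESFD = "enforced semimetal with Fermi degeneracy"
    ES = "enforced semimetal"
    TI = "topological insulator"
    LCEBR = "linear combination of EBRs (trivial)"


def classify(fermi_degeneracy_at_hsp, satisfies_compatibility, is_lcebr):
    """Schematic TQC-style classification from three precomputed flags.

    fermi_degeneracy_at_hsp: a degeneracy sits at the Fermi level at a
        high-symmetry point.
    satisfies_compatibility: the band characters satisfy the
        compatibility conditions between high-symmetry points.
    is_lcebr: the occupied bands can be written as a linear
        combination of elementary band representations.
    """
    if fermi_degeneracy_at_hsp:
        return TopoClass.ESFD
    if not satisfies_compatibility:
        return TopoClass.ES
    if not is_lcebr:
        return TopoClass.TI
    return TopoClass.LCEBR


if __name__ == "__main__":
    print(classify(True, False, False).name)
    print(classify(False, True, False).name)
    print(classify(False, True, True).name)
```

Note that nothing in this logic requires a global band gap for the TI label, which is why a TI in the TQC sense can still be gapless away from the high-symmetry points.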
TIs identified through TQC may exhibit band structures without a global gap, such as GeTa31 and Bi22. The detailed characteristics of these topological materials are systematically cataloged in Table 1. The space group of each structure is determined using the SpacegroupAnalyzer class in the pymatgen54 package. A particularly noteworthy finding is the identification of nonmagnetic chiral crystals, including 3 Kramers-Weyl semimetals with space group P1 and 1 Kramers-Weyl semimetal with space group C2. These Kramers-Weyl semimetals represent a new category of materials, hosting Kramers-Weyl fermions at time-reversal-invariant momenta23,24. Kramers-Weyl fermions have attracted intense attention for their unique physical properties, including magneto-chiral dichroism62, large optical activity63,64, quantized chiral charges65 and negative longitudinal magnetoresistance66, which arise from the intricate interplay of SOC, structural chirality and time-reversal symmetry. Previous research efforts aimed at discovering new topological materials have relied on symmetry rules23,24, which have difficulty handling low-symmetry and chiral structures. In contrast, our inverse design process, which does not rely on any symmetry-based rules, successfully identified a number of chiral structures with low symmetry. This achievement underscores a significant advantage of our method, highlighting its potential to explore and uncover a broader spectrum of topological materials, particularly those with unconventional and complex structures.
In Fig. 4, we highlight the four novel topological materials that are considered the most likely to be synthesized, together with their crystal structures, band structures, and phonon spectra. Among them, CdAu5 (Fig. 4a, e, i) is a Kramers-Weyl semimetal with space group C2. In this material, band splitting is observed at all points except the time-reversal-invariant momenta. Li2YBi2 (Fig. 4b, f, j) and Zr2ScC (Fig. 4c, g, k) are TSMs with space groups P3̄m1 and R3m, respectively. Mg4Pt2 (Fig. 4d, h, l) is a TI, characterized by a set of topological invariant numbers: Z2w,1 = 1, Z2w,2 = 1, Z2w,3 = 1, Z4 = 2, Z2 = 0, Z8 = 6. The absence of any obvious imaginary frequencies in their phonon spectra corroborates their structural stability67. As an example of a trivial band structure, Supplementary Fig. S4 presents the band structure and phonon spectrum of the crystal Ba2Sn4 generated by CTMT. This material, identified by TQC as a linear combination of EBRs (indicating trivial topology), is further confirmed to be stable through phonon analysis. The band structures of the other topological materials identified in this study are available in Supplementary Figs. S5–S8, which provide detailed information on the structures and stability of these materials.

a–d Crystal structures for CdAu5, Li2YBi2, Zr2ScC, and Mg4Pt2, respectively. e–h Band structures for CdAu5, Li2YBi2, Zr2ScC, and Mg4Pt2, respectively. i–l Phonon spectra for CdAu5, Li2YBi2, Zr2ScC, and Mg4Pt2, respectively.
Discussion
Overall, by systematically combining CDVAE, the Topogivity rule, M3GNet, and TQC, we have developed a novel data-driven method, CTMT, for the inverse design of new TIs and TSMs based on deep generative models. This approach has led to the discovery of 20 novel and stable topological materials, including 16 TSMs and 4 TIs. Compared with the traditional methods1,18,19 that determine the topology type of materials directly from calculations, CTMT narrows the range of calculations and achieves higher accuracy while saving computational resources, because it preliminarily screens for nontrivial materials based on Topogivity before any calculation. In CTMT, the success rate of finding topological materials among stable materials is 62.5%, which is much higher than the success rate of less than 30% for the traditional methods. This outcome highlights the effectiveness of the individual components within the CTMT framework in searching for new topological materials and demonstrates the potential of CTMT for exploring a broad range of topological materials. Meanwhile, it is important to acknowledge that, due to the constraints related to computational accuracy and the stringent screening criteria applied for topology type and stability, many potential topological materials within the generated dataset may remain undiscovered. Our work opens up a novel and efficient path for finding new topological materials, and holds great potential for the exploration of other advanced functional materials, such as topological superconductors, nodal line semimetals, and layered room temperature ferromagnetic materials. A future direction is to use topological properties as the generation condition in CTMT, so that it can consider both the stability of the crystal structure and the topological type of the material.
Methods
CDVAE training details
In this work, the dataset of topological materials was partitioned into training and validation subsets at a ratio of 8:2 for the CDVAE training. The model was trained by backpropagation with the Adam optimization algorithm and a learning rate of 0.001. The training was completed after 800 epochs, and the model with the minimum loss on the validation set was kept. The hyperparameters were set to be consistent with those used to train CDVAE on the MP-20 dataset. The trained CDVAE model generates 10,000 candidates of potential topological materials, which are sent to a series of filters and thermodynamic stability checks by first-principles calculations based on DFT.
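As an illustration of the 8:2 partition, the following Python sketch splits a list of dataset entries into training and validation subsets. The shuffling, the fixed seed, and the rounding rule are assumptions made for this example; the text does not specify how the split was drawn.

```python
import random


def train_val_split(items, train_fraction=0.8, seed=0):
    """Shuffle the entries and split them into training and validation sets.

    The seed and the rounding of the training size are illustrative choices.
    """
    items = list(items)
    rng = random.Random(seed)
    rng.shuffle(items)
    n_train = int(round(train_fraction * len(items)))
    return items[:n_train], items[n_train:]


# Dataset size from the text: 6,109 TIs and 13,985 TSMs.
N_TI, N_TSM = 6109, 13985
entry_ids = range(N_TI + N_TSM)

train_ids, val_ids = train_val_split(entry_ids)
print(len(train_ids), len(val_ids))
```

With the 20,094 entries of the training dataset, this rounding yields 16,075 training and 4,019 validation entries.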
The CDVAE40 model used in this work is the implementation by Xie et al. (https://github.com/txie-93/cdvae). This model incorporates DimeNet++68, adapted for periodicity, as the encoder and GemNet-dQ69 as the decoder. Both the encoder and decoder are invariant to structure changes, and they comprise 2.2 million and 2.3 million parameters, respectively. The training dataset is extracted from the Topological Materials Database website (https://www.topologicalquantumchemistry.fr) by using the requests package (https://requests.readthedocs.io) in Python to collect the crystal structure information and topology type. After that, the crystal structure information is converted into a crystallographic information file (CIF), and the Structure class in pymatgen54 is used to feed the CIF into the CDVAE model. During training, we set the parameters of these networks to those used by Xie et al.40. Additionally, because the collected topological material dataset lacks formation energy data and a partitioned test set, we adopted the hyperparameters of the MP-20 dataset with the following modifications. We set the predicted property (“prop” in the code) to “scaled_lattice” and set the number of targets (“num_targets” in the code) to 6. Furthermore, we excluded the test dataset configuration and limited the maximum training epochs (“train_max_epochs” in the code) to 800. The sampling process was conducted by executing the file at https://github.com/txie-93/cdvae/blob/main/scripts/evaluate.py. Before execution, we configured the model path to the saved model parameter path and set the “tasks” parameter to “gen”, which generated 10,000 candidate structures.
For the analysis involving Topogivity and M3GNet, we employed their pre-trained models. The Topogivity values were directly retrieved from Fig. 2 of Ref. 22, and we included these values in Fig. 2 of our work as well. The IAPs used in M3GNet for phonon filtering were obtained by loading the pre-trained model from MP-2021.2.8-EFS (https://github.com/materialsvirtuallab/m3gnet/tree/main/pretrained/MP-2021.2.8-EFS). Following this, we applied the structure relaxation demo from the M3GNet package (https://github.com/materialsvirtuallab/m3gnet) to perform relaxation on the selected structures and calculated the phonon spectra using the phonopy package.
Calculation parameter settings
VASP is used to carry out the DFT calculations with the exchange-correlation potential of the generalized gradient approximation in the Perdew-Burke-Ernzerhof form. The convergence criteria for energy and force are 10⁻⁶ eV and 0.01 eV/Å, respectively, and the cutoff energy for the plane-wave expansion is 500 eV. The pre-trained M3GNet calculations use a 2×2 supercell, with the number of relaxation steps preset to 10,000 and a maximum force threshold of 0.0001 eV/Å. The phonon spectrum calculations employ the M3GNet force field and the phonopy package70. The “Check topological mat” module from the TQC method54 is integrated in CTMT, and all calculations relevant to topological properties include SOC. The energy convergence precision is preset to 10⁻⁸ eV. This approach categorizes structures into TSM, TI, and linear combination of EBRs, where the last indicates a topologically trivial state of the set of bands below the Fermi level1. The use of VASPKIT71 and the pymatgen package54 significantly expedites the processing of DFT data.