Secure and federated genome-wide association studies for biobank-scale datasets

Main

Secure computation frameworks from modern cryptography offer promising strategies to address privacy concerns in collaborative genomic studies^3,7,8,9. These techniques allow a group of parties to jointly analyze their data by exchanging encrypted information while ensuring that each party’s data remain private from others. Although recent work has illustrated the potential of this strategy for genome-wide association studies (GWAS)^4,5,6, existing methods remain limited in practical utility. Prior work based on the framework of secure multiparty computation (MPC)⁴ is prohibitively slow for large biobank-scale cohorts (as demonstrated in this work) and requires the external sharing of entire input datasets in encrypted form, which may not be feasible in practice due to security risks. More recent methods based on homomorphic encryption (HE) schemes^5,6 improve efficiency but fail to support the standard analysis pipelines commonly used by researchers due to their computational complexity. These pipelines include those based on principal-component analysis (PCA) and linear mixed models (LMMs), which provide strategies to account for population structure within a study cohort to accurately estimate association signals^10,11.

We developed secure federated genome-wide association studies (SF-GWAS), a secure and federated algorithm for multisite GWAS, to enable collaborative genomic studies at scale with cryptographic privacy guarantees (Fig. 1). SF-GWAS builds upon the following two key conceptual advances. First is our ‘federated’ approach to secure computation, whereby each input genomic dataset is kept at the respective source site. This approach minimizes computational costs by both avoiding large data transfers between sites and allowing the use of efficient cryptographic operations that leverage the unencrypted input data locally available at each site. To enable this strategy, we combine MPC and HE—two cryptographic schemes for analyzing encrypted (masked) private data, which previous work on secure GWAS has separately leveraged—into a hybrid framework (Methods). Our framework employs HE for local computations over large matrices and vectors (for example, multiplication between a genotype matrix and a vector of individual phenotypes) and MPC for nonlinear operations such as division and sign functions (for example, used in statistical calculations) to improve numerical precision and efficiency. Previous work on a multiparty extension of HE⁵ did not utilize computational MPC routines, which we found crucial for our large-scale analysis pipelines due to the wide range of values encountered during the analysis.

Secondly, SF-GWAS introduces an efficient algorithmic design to support the federated execution of various end-to-end GWAS pipelines. Our workflow includes, in addition to association tests, quality control (QC) procedures and two standard strategies to account for population structure and sample relatedness in the cohort^10,11: PCA or LMMs. Both PCA and LMMs involve highly complex linear algebra computations, which thus far have limited the development of privacy-preserving algorithms for these tasks that can be applied to large biobank datasets. Using a set of algorithmic strategies aimed at optimizing performance in a distributed setting, such as techniques to maximize the use of local plaintext (unencrypted) data, we developed practical secure algorithms for both PCA- and LMM-based GWAS workflows, demonstrating the broad utility of our framework (Methods and Supplementary Note).

We first compared SF-GWAS with the prior state-of-the-art method for secure PCA-based GWAS⁴, referred to as S-GWAS. We used both methods to analyze the three datasets from the S-GWAS publication for lung cancer (n = 9,178 individuals), bladder cancer (n = 13,060) and age-related macular degeneration (AMD; n = 22,683) (Methods). Each dataset was divided into two subsets to simulate a joint study involving two cohorts that could not be combined. For all three datasets, we observed a substantial reduction in both runtime (Fig. 2a) and communication (Fig. 2b) for SF-GWAS compared to S-GWAS. Communication refers to the transfer of encrypted information between parties required during the interactive protocol, which is a key performance metric as its impact on overall runtime can vary depending on the network conditions in practice. SF-GWAS ran consistently an order of magnitude faster than S-GWAS (for example, 4.6 h versus 64.3 h for AMD data, representing a 14× reduction). Communication cost was three to four times lower for SF-GWAS (for example, 173.7 GB versus 666.6 GB for AMD data), largely because S-GWAS requires sharing the entire encrypted dataset between the parties, which SF-GWAS circumvents through our federated approach. The output of SF-GWAS closely matched a direct analysis of the pooled plaintext data (Extended Data Fig. 1). Furthermore, SF-GWAS provides stronger security properties than S-GWAS by avoiding the sharing of entire encrypted input datasets while adopting an encryption scheme that is resilient to quantum computer-based attacks (Supplementary Note).

**Fig. 2: SF-GWAS is more computationally efficient than the prior art for secure collaborative GWAS.**

We evaluated the scalability of SF-GWAS on two larger datasets: the eMERGE consortium (n = 31,293) and UK Biobank (UKB; n = 275,812), focusing on body mass index (BMI) as the analysis trait in both. These datasets are more than 100 and 2,000 times larger, respectively, than the largest AMD dataset analyzed in the S-GWAS publication, in part due to the number of variants analyzed (509,000 in AMD versus 38 and 93 million in eMERGE and UKB, respectively). We split both datasets across multiple centers (six for eMERGE and seven for UKB) according to the original data collection sites, considering each center’s dataset as local and private (Methods). S-GWAS could not be evaluated on either dataset due to its infeasible runtime requirements, estimated through linear extrapolation to be several months for eMERGE and several years for UKB.

Consistent with our previous results, the association statistics computed by SF-GWAS, for BMI and using the PCA-based pipeline, closely matched a plaintext analysis based on the PLINK software¹² on each of the pooled datasets (Fig. 3a,b and Extended Data Fig. 2). In addition, we observed that a meta-analysis of summary statistics from individual centers can result in notable discrepancies compared to the pooled analysis (Extended Data Fig. 3), especially on the eMERGE dataset, likely due to limited site-specific sample sizes and the heterogeneity of study populations^13,14,15. Because the computational steps of SF-GWAS closely emulate a centralized analysis, SF-GWAS results are virtually equivalent to what the researchers would obtain if the datasets could be directly combined. This equivalence holds regardless of how the data are split (for example, even when the data distribution is heterogeneous).

**Fig. 3: SF-GWAS accurately reproduces biobank-scale GWAS without data centralization.**

The total runtime of SF-GWAS on eMERGE was 17.5 h, including 2.8 h for QC filtering, 8 h for PCA and 6.7 h for association tests. For UKB, the runtime was 5.3 days in total, including 4.5 h for QC, 44 h for PCA and 77.8 h for association tests. A concurrent work on multisite GWAS¹⁶ reported a runtime of 5 h for 160,000 single-nucleotide polymorphisms (SNPs) and 16,000 samples, which extrapolates to more than 4 months for the UKB dataset, while providing weaker security guarantees than SF-GWAS and addressing only the PCA-based workflow. The runtime of PCA-based SF-GWAS scales linearly with dataset size and thus can be readily estimated in practice (Extended Data Fig. 4 and Supplementary Note). Furthermore, given the reduced reliance of SF-GWAS runtime on interactive steps as dataset size grows, we expect it to remain practical for international collaborations with higher network communication delays, such as those between the US and the UK (Extended Data Fig. 5).

Next, we evaluated our secure and federated algorithm for LMM-based association tests, based on the plaintext method of REGENIE¹⁷, on a dataset of 409,548 individuals of European descent from UKB including related individuals (Methods). We split this dataset among six centers, as in the previous experiment. SF-GWAS produced association statistics that accurately matched those of REGENIE, where the latter was run directly on a pooled dataset without encryption (Fig. 3c and Extended Data Fig. 2). We additionally validated the accuracy of our method on the lung cancer dataset from S-GWAS (Extended Data Fig. 6). Owing to our optimizations, LMM-based SF-GWAS exhibits near-constant runtime scaling with the size of local datasets and achieves runtimes on the order of days, even for large datasets containing hundreds of thousands of individuals (Extended Data Fig. 7 and Supplementary Note). In our experiment, it achieved a runtime of 6 days for the UKB dataset.

GWAS of binary traits (for example, disease status) is typically performed using logistic regression, requiring iterative model fitting in the absence of an analytical solution. Given the high computational costs of nonlinear operations under encryption, existing works on secure GWAS have focused on linear regression or have been limited to supporting logistic models on small datasets^4,6. SF-GWAS incorporates a practical, secure federated algorithm for score-based tests for logistic models using Newton’s method, ensuring fast convergence to accurate parameter estimates (Methods and Extended Data Fig. 8). Reanalyzing the three S-GWAS datasets using our logistic workflow produces results consistent with PLINK and achieves practical runtimes of less than 5.3 hours for all datasets, comparable to those of the linear pipeline (Extended Data Figs. 8 and 9).

Collaborative analysis performed using SF-GWAS identified genetic variants that are concordant with prior GWAS results. Comparing with the published summary statistics from the Pan-UK Biobank project¹⁸, we observed that 71 out of 73 significant variants (P < 5 × 10⁻⁸) identified by SF-GWAS on eMERGE coincided with previously reported associations for BMI, whereas independent analysis of each center’s data resulted in at most two significant variants. Similarly for UKB, 1,778 out of 2,200 significant variants for LMM-based SF-GWAS, and 21,544 out of 24,357 for our PCA-based analysis (based on a larger set of imputed genotypes), were previously reported to be significant, indicating a large overlap despite differences in the analysis setting. In contrast, independently analyzing each center-specific dataset (PCA based) identified 2,600 significant variants across all centers, considerably fewer than SF-GWAS (24,357). Furthermore, when treating UKB as a validation cohort, meta-analyses of seven center-specific GWAS based on eMERGE data yielded fewer variants validated in UKB compared to SF-GWAS, illustrating the improved statistical power of the joint analysis enabled by SF-GWAS (Extended Data Fig. 10).

Finally, we demonstrate an application of SF-GWAS to analyzing datasets collected independently by different organizations. In addition to the AMD GWAS cohort from the International AMD Genomics Consortium (IAMDGC; n = 21,692, including 9,284 cases), we selected ancestry-matched (European) AMD cohorts in the eMERGE consortium (n = 4,252, including 574 cases) and the UKB (n = 85,863, including 2,632 cases). This resulted in a GWAS dataset of 111,807 individuals in total, split among three parties representing the original data sources. Although the three cohorts were genotyped using different array platforms, a large number of shared genetic variants (22,191,946) could be chosen for joint analysis by imputing each dataset and finding an intersecting set of genomic positions and allele definitions. The total runtime of SF-GWAS (PCA-based, logistic) was 3 days, including 53 min for QC filtering, 8 h for PCA and 68 h for association tests. For comparison, we estimate a runtime of 42 days for an existing cryptographic solution for GWAS⁶ (which does not support QC or PCA), assuming linear scaling with both the number of computing cores (inversely) and the number of variants. Consistent with previous experiments, SF-GWAS obtained accurate results (Fig. 3d and Extended Data Fig. 9), and the top five strongest associations identified by SF-GWAS correspond to loci and genes with well-established roles in AMD, including CFH, C2/CFB, HTRA1, C3, and APOE/APOC1 in chromosomes 1, 6, 10, 19 and 19, respectively.

When using SF-GWAS in collaborative studies, data harmonization may be required to identify shared genetic variants across sites and unify allele or phenotype definitions. This can be facilitated by sharing non-private metadata, such as genomic positions, without exposing sensitive raw data. Notably, SF-GWAS facilitates cross-site variant QC by securely calculating global statistics, taking all parties’ data into account. This can improve data quality, for instance, by rescuing rare variants that are too infrequent at individual sites. In addition, a recent method for securely detecting related individuals across sites could be incorporated to aid cohort selection¹⁹. Another important practical consideration is the ease of deployment with security assurances. To address this, we developed our open-source software to operate in any environment with minimal dependencies, and the sfkit web server²⁰ can be used to automate the deployment of our workflows using either cloud resources or private machines. Although server networking may require organizational approval, SF-GWAS’s federated design and encrypted sharing of information minimize risks and support regulatory compliance²¹.

In summary, our work demonstrates a secure and federated approach to multisite GWAS with rigorous privacy guarantees and scalable performance. Further extending our methods to other useful analysis tasks, including logistic mixed-effects models and statistical corrections for imbalanced or biased datasets^22,23,24, as well as ensuring security even against malicious actors who may deviate from the prescribed protocols, are meaningful directions for future work. We expect our algorithmic techniques and open-source software to accelerate the development of secure genomic data analysis methods. In a global environment where there are increasing concerns and legal restrictions over sharing of sensitive data about individuals, our work lays the foundation for future biomedical advances by facilitating cross-institutional collaboration.

Methods

Ethics and inclusion statement

The use of all controlled-access datasets in this study were approved by the respective data access committees through the NIH Database of Genotypes and Phenotypes (dbGaP) and the UKB Access Management System (project ID: 46341 and 41910).

Review of secure MPC

MPC techniques enable multiple entities to securely and interactively perform computation on private inputs (Supplementary Note). Standard MPC frameworks²⁵ leverage (additive) secret sharing, where each private value is split into random (encrypted) shares, which are in turn distributed to different computing parties. Although the shares collectively encode the private value, any subset of shares provably does not leak any private information. Computing parties then collaborate and use the secret shares to evaluate a function on the private input without revealing information about the private input to any entity involved. For example, the secure addition of two secretly shared numbers x and y can be executed by having each party add their individual shares for x and y. The new shares constitute a sharing of x + y, which is the desired computation result. More sophisticated functions (for example, multiplication, division, square root and sign) can be similarly defined over the secret shares but require the two parties to interact by exchanging a sequence of numbers, which also do not reveal private information. These secure routines can be composed to perform arbitrary operations on private input data held by multiple entities. However, the communication cost of MPC can introduce a bottleneck in applications involving complex tasks. In addition, secret sharing requires that the entire input data be shared with all computing parties.

Review of HE

HE is a form of encryption that allows for direct computations over encrypted data, without having to decrypt them. Initially developed for limited types/rounds of operations, HE is now widely applicable to many analysis tasks due to the recent introduction of fully HE schemes, which include a bootstrapping routine to allow an arbitrary number of operations to be performed, and the development of efficient techniques for common scientific operations. For instance, the CKKS scheme²⁶ encodes a vector of continuous values in a single ciphertext and is well-suited for calculations where a small amount of noise can be tolerated. Like other HE schemes, CKKS performs both additions and multiplications simultaneously on the encrypted values within a ciphertext (single instruction, multiple data property), which improves the overall scalability of the scheme. While HE uniquely enables a single party to perform computation on the encrypted data without interaction, the computational cost and flexibility of HE remain more limited than MPC for general tasks. Also, for multisite collaboration, one needs to transfer all of the encrypted data to a single machine for joint analysis, which can be challenging for large datasets.

Combining HE and MPC to enable practical, secure federated computation

To address the performance limitations of HE and MPC, we take a federated approach to secure computation leveraging both HE and MPC, where the input datasets are kept locally at the respective source sites and only small intermediate data are securely exchanged among the parties using cryptographic techniques to carry out a global computation. For the HE component, we build upon a multiparty HE (MHE) scheme (related to threshold HE) based on CKKS, which extends the CKKS scheme to the setting with multiple data providers by secret sharing the decryption key and constructing a shared encryption key (Supplementary Note). Under our scheme, any party can encrypt the data and perform HE computations locally, but decryption can be performed only if all parties cooperate. Our approach allows each party to perform local computations involving the unencrypted input dataset and a small amount of encrypted data, whereas certain global computations are performed by sharing intermediate results among the parties, encrypted under the shared encryption key. At the end of the protocol, all parties collaborate to decrypt the final results. By keeping each input dataset local, we minimize the communication and enable local plaintext computation, which are substantially more efficient than corresponding ciphertext computation. We also leverage an efficient interactive protocol for bootstrapping in MHE²⁷ to reduce the overall computational burden of HE.

Improving upon existing works on MHE^5,28,29,30, we switch between the MHE and secret sharing representations of intermediate data, which enables efficient MPC routines to be used in conjunction with MHE operations to carry out the global computations (Supplementary Note). We convert between the two schemes to perform each operation under the most efficient scheme throughout the GWAS pipeline. For example, we perform large-scale matrix and vector operations using MHE to exploit its SIMD (single instruction, multiple data) property, but evaluate non-polynomial functions (division, square root and comparison) with compact bit-wise MPC protocols, which are more efficient and numerically stable than the MHE counterparts. Any operation involving the local unencrypted data is performed using MHE to avoid secret sharing of the large input datasets.

Our algorithmic design strategies for enabling secure population structure correction

Our federated framework for secure computation allows us to develop efficient and provably secure algorithms for collaborative GWAS. Our work introduces practical methods for two standard approaches to account for population structure, namely PCA and LMMs. We adopt the following algorithmic design strategies to obtain accurate and efficient performance. First, we closely adhere to the computational pipeline of the desired centralized algorithm to obtain accurate results while securely operating over private datasets held by multiple parties. This is in contrast to other collaborative approaches that simplify or approximate the analysis to address the lack of access to a pooled dataset (for example, meta-analysis). Next, we restructure the computation while maintaining its equivalence to the original algorithm to both maximize the use of low-cost operations and minimize communication by leveraging local plaintext data. We switch between MPC and MHE routines throughout the protocol to improve the efficiency and numerical robustness of our routines. We also optimize the vectorized encoding of data in encrypted representations for efficient composition of linear algebra operations. We detail our algorithmic strategies and optimization techniques in Supplementary Note.

In addition to enabling a substantial performance improvement for PCA compared to prior work⁴, our techniques facilitate the design of a practical protocol for secure and federated LMMs (Supplementary Note). LMMs typically involve operations on the genetic relatedness matrix, which scale with the number of individuals in the dataset and impose a notable computational burden for large cohorts, even in centralized analysis settings. Our approach builds upon REGENIE¹⁷, a recently developed algorithm for LMM association tests based on a stacked ridge regression approach, which directly models the ancestry-related confounding effect as the output of a genome-wide regression model, thus circumventing the use of a genetic relatedness matrix. This approach brings scalability improvements while providing accuracy comparable to other LMM methods such as BOLT-LMM³¹, fastGWA³² and SAIGE³³. Unfortunately, REGENIE cannot be directly applied in a secure federated setting. Although ridge regression can be efficiently performed in plaintext on a pooled dataset, implementing standard algorithms (for example, those based on the closed-form solution) with secure computation techniques results in impractical runtime requirements due to the complex matrix operations, such as inversion, that need to performed on large encrypted matrices.

We address this challenge by comprehensively redesigning REGENIE’s stacked regression procedure with our secure federated algorithms for ridge regression based on conjugate gradient descent and alternating direction method of multipliers (ADMM). Although the inclusion of private covariate features makes a direct application of the latter approach³⁴ impractical in our setting, we reformulated the ADMM algorithm (referred to as ADMM-Woodbury) by leveraging a matrix identity to enable plaintext precomputation of computationally expensive operations (for example, large-matrix inversion) at each site. This reformulation maximizes efficient local computation and, consequently, achieves near-constant runtime scaling with respect to cohort size (Extended Data Fig. 7).

Our scalable techniques for secure association tests based on linear and logistic models

Performing association tests at genome-wide scale involves intricate linear algebra operations (for example, projection, inversion, and factorization) applied to large-scale matrices and vectors. Following the strategies in the previous section, we designed efficient algorithms for secure federated association tests that can analyze millions of genetic variants within practical runtimes. For example, we developed optimized routines for matrix multiplications between the local genotype matrix and an encrypted low-dimensional matrix of covariates and phenotypes by identifying and precomputing intermediate terms that involve the unencrypted local dataset, which can be more efficiently computed than operations involving only encrypted data. This approach also helps to distribute the workload among the parties according to the size of the local datasets. For complex, iterative operations such as matrix factorization and inversion, we securely convert between MHE and MPC frameworks to take advantage of efficient MPC routines for costly operations such as comparison and division. Furthermore, after the initial precomputation, we process genetic variants in non-overlapping genomic blocks in parallel to accelerate the computation.

Prior to our work, logistic regression-based association tests were considered more challenging and could be performed using cryptographic techniques only on small datasets. This limitation has mainly been due to the costly iterative optimization procedure required to estimate the coefficients of logistic regression models (for example, genetic effect sizes). To address this challenge, we developed an efficient secure federated algorithm for score-based tests in logistic models (Supplementary Note). Our algorithm securely obtains statistics that are asymptotically equivalent to alternative tests (that is, Wald and likelihood-ratio) and can be applied to large biobank datasets. Unlike the Wald or likelihood-ratio tests, both of which require a separate logistic regression model to be fitted for every variant being tested, score-based tests fit a single null model including only the covariates, subsequently allowing all variants to be tested using a closed-form expression without refitting the model. To obtain the initial null model efficiently and accurately, we devised a federated algorithm for Newton’s method harnessing both MHE and MPC for efficient matrix multiplications and Hessian matrix inversion, respectively. This approach enables faster convergence to precise parameter estimates compared to standard first-order gradient-based methods (Extended Data Fig. 8). Once the null model is obtained, the remaining computation of association statistics mostly consists of linear algebraic operations, which can be performed efficiently using our secure federated approach. As our results show, these techniques enable SF-GWAS to support accurate logistic regression-based association tests with practical runtimes comparable to those of the linear regression setting.

SF-GWAS pipelines

Our SF-GWAS algorithm implements the full GWAS pipeline, including QC, correction for population structure (PCA and LMM) and association tests. Collaborating parties first agree on the phenotype, covariates and a list of genetic variants to analyze, as well as study parameters, such as filtering thresholds and other algorithmic parameters. They also agree on the security parameters and generate the required cryptographic keys for HE and MPC, for example, encryption keys and shared pseudorandom number generators (Supplementary Note). They then proceed with the interactive protocol to securely perform the desired computation. For QC, the parties independently filter their subset of samples based on public thresholds (for example, for heterozygosity and missing genotype rate), then utilize MPC routines to jointly and securely filter the variants based on global statistics (for example, minor allele frequencies and Hardy-Weinberg equilibrium). The variant filtering output is shared with all parties so that the rest of the protocol can proceed with the reduced variant set.

For the PCA-based workflow, we implemented a secure federated algorithm for randomized PCA^4,35 to compute the top principal components (PCs) over the entire dataset without constructing the pooled matrix (Supplementary Note). The jointly computed PCs are kept encrypted for use in the following steps. For linear regression-based association tests, we use both HE and MPC operations to compute the covariate-corrected association statistics based on a linear model. This step includes (i) a secure federated QR factorization for computing the joint orthogonal basis of PC and the observed covariates, (ii) secure matrix multiplication based on HE to project the genotypes onto the covariate subspace for correction and to compute genotype-phenotype covariances and (iii) MPC routines to inversely scale the statistics by the standard deviations of the genotypes and the phenotype for normalization.

For binary traits, we use our secure federated algorithm for score-based tests using logistic regression models (Supplementary Note). The first step is to fit a null logistic regression model across the parties considering only the covariates, for which we introduce a federated Newton’s method based on both HE and MPC. Then, we compute the components of the score-based test statistic with respect to each tested variant, which include (i) the score (that is, the derivative of the log-likelihood function with respect to the coefficient of the genetic effect) and (ii) the Fisher information (related to the standard error of the coefficient). Both terms and the resulting statistics are computed securely in a federated manner using a series of linear algebra operations in HE and nonlinear operations in MPC. Finally, the association statistics are collectively decrypted and shared among the parties as the final output of the analysis.

For the LMM-based workflow, we implemented a secure federated version of REGENIE¹⁷ that is based on stacked ridge regression models (Supplementary Note). After the QC step, the genetic variants are first grouped into fixed-size blocks. For each block, a ridge regression model is jointly trained across the parties using our ADMM-Woodbury algorithm to obtain encrypted phenotype predictions, which leverage only the variants within the block while accounting for linkage disequilibrium. Subsequently, the block-wise local predictions are provided as input features to a genome-wide ridge regression model for phenotype prediction, jointly trained across the parties using our conjugate gradient descent algorithm. We adopt a cross-validation scheme to determine an appropriate choice of variance parameter, representing the genomic heritability of the target phenotype. We selected K-fold cross-validation, as it is typically more efficient than other methods such as leave-one-out cross-validation and achieves almost identical accuracy in this framework¹⁷. Association tests are performed by calculating the correlation between each target variant and the phenotype residuals, excluding the contribution from the genome-wide regression model, which is estimated without the chromosome containing the tested variant. Similar to the PCA-based workflow, we compute the association statistics efficiently using optimized secure matrix multiplication routines combined with MPC routines for data normalization.

Further details on the SF-GWAS pipelines, including their security, runtime and complexity analyses, are provided in Supplementary Note and Supplementary Tables 2–5.

Related work

Several studies have proposed methods for securely performing collaborative GWAS over private datasets using secure computation frameworks such as HE or MPC^4,5,6. However, these approaches are often limited by runtime constraints or the inability to perform GWAS analyses while accounting for population structure or sample relatedness. MPC frameworks based on secret sharing⁴ are particularly communication-intensive and require encrypting and distributing the entire dataset among all participating parties, resulting in substantial overhead. Although HE enables noninteractive computation on encrypted data, joint analysis using HE⁶ still requires collaborating entities to centralize the encrypted data for analysis by a single party. This approach leads to impractical computational costs for complex analysis tasks on large genomic datasets. Prior work applying MHE to GWAS⁵ addresses some limitations of centralized HE but is restricted to association testing, excluding other critical GWAS components such as QC filtering and correction for population structure (PCA or LMMs). Furthermore, this earlier work relied solely on HE computations, without integrating MPC protocols to improve accuracy or enhance efficiency for nonpolynomial operations. Secure hardware-based approaches to GWAS, such as those based on Intel SGX³⁶, have also been proposed. However, these methods are also limited to association testing without incorporating PCA or LMMs and lack the formal privacy guarantees provided by HE or MPC, making them susceptible to known security risks^37,38,39.

Another approach to collaborative GWAS is based on federated learning techniques, which iteratively aggregate intermediate results among parties to perform global computations^15,40. These solutions are generally more accurate than meta-analysis, as they can more closely emulate a pooled analysis¹⁵. They also achieve efficient performance because plaintext (unencrypted) data are locally held and directly analyzed by each party. However, these methods require intermediate results to be shared among the parties (or with a trusted third party) in plaintext, often involving many rounds of interaction, which increases the risk of privacy leakage^41,42. Although differential privacy techniques can mitigate such leakage, existing methods are not practical for GWAS, as releasing statistics for a large number of variants would require adding substantial noise to preserve privacy⁴³. In SF-GWAS, parties keep their local data in plaintext while exchanging only encrypted intermediate results throughout the study. This approach efficiently emulates a pooled analysis in a distributed fashion while ensuring formal privacy guarantees.

Benchmark datasets

We obtained the three datasets used in the original S-GWAS publication⁴ for comparison. These include a lung cancer dataset (n = 9,178, including 5,088 cases; 612,794 SNPs; females across all age groups), a bladder cancer dataset (n = 13,060, including 9,684 cases; 566,620 SNPs; both sexes across all age groups) and an AMD dataset (n = 22,683, including 6,211 cases; 508,740 SNPs; both sexes across all age groups). We followed the steps in the prior work to prepare the data and then evenly and uniformly split each dataset between two parties to emulate a multisite study.

For the eMERGE data, we obtained a cohort of 31,292 individuals split across seven study groups: Geisinger Health System (n = 3,089), Group Health (University of Washington; n = 1,827), Marshfield Clinic (Pennsylvania State University; n = 4,736), Mayo Clinic (n = 6,208), Icahn School of Medicine at Mount Sinai (n = 5,661), Northwestern University (n = 4,424) and Vanderbilt University (n = 5,347). The individuals range in age from 3 to 75 years and include both sexes. We used a total of 38,040,168 imputed biallelic SNPs for the analysis and chose BMI as the target phenotype. Four covariates were included in the analysis: membership to each study group, age (at time of assessment), sex and age².

For the UKB data, we obtained a cohort of 409,547 individuals of European descent, including both sexes, aged 40 to 69. For use with the PCA-based GWAS pipeline, we also constructed a subset of 275,812 unrelated individuals (King⁴⁴ relatedness coefficient less than 0.062). This cohort represented 22 different health centers across the United Kingdom. To simulate a federated study, we grouped the health centers into six study groups based on geographic location: Scotland (n = 29,825 in total; 20,970 unrelated), Northern England (n = 52,909; 31,263), Northwest England (n = 101,786; 67,494), Central England (n = 93,642; 61,158), Southeast England (n = 76,568; 58,351) and Wales (n = 54,823; 36,576). We provide the list of centers in each group in Supplementary Table 1. We used a total of 92,248,310 imputed biallelic SNPs for the PCA-based pipeline, and a subset of 581,927 non-imputed genotyped SNPs for the stacked regression models in the LMM-based pipeline, following the recommendation in the REGENIE software documentation⁴⁵. We analyzed BMI as the target phenotype. Six covariates were included in the analysis: membership to each study group, age (at time of assessment), sex, age × sex, age², and age² × sex.

For the cross-biobank AMD GWAS analysis, we used data from three independent sources: eMERGE (n = 4,252, including 574 cases), UKB (n = 85,863, including 2,632 cases), and International AMD Genomics Consortium (IAMDGC; n = 21,692, including 9,284 cases) for a total of 111,807 individuals (12,490 cases). We used the imputed genotype data provided by eMERGE and UKB. For IAMDGC, we used Michigan Imputation Server (version 1) to impute the samples using the Haplotype Reference Consortium reference panel. The total number of imputed biallelic variants shared among the three datasets was 22,191,946. For UKB, AMD cases were ascertained based on the ICD-10 code H35.3, and the control samples were randomly sampled from the remaining cohort. For eMERGE and IAMDGC, we retained the AMD case and control labels provided in the original datasets. To demonstrate the scalability of our approach, we considered individuals of European ancestry in each dataset, determined based on self-reported ethnicity for UKB and GrafPop (version 1.0)-inferred ancestry for eMERGE and IAMDGC (P_e > 90%). Four covariates were included in the analysis: membership to each study group (two independent features), age (at time of assessment) and age². Other covariates of potential interest, such as sex, were not available in all datasets and thus were excluded from the analysis.

GWAS details

For the lung cancer, bladder cancer, and AMD datasets, we used the same QC filters as applied in the prior analysis of these datasets using S-GWAS⁴. For the eMERGE data, we used the following QC parameters: genotype missing rate per SNP < 0.1, minor allele frequency > 0.05, and Hardy-Weinberg equilibrium chi-squared test statistic < 23.928 (P > 10⁻⁶). We used the same set of filters for the UKB and cross-biobank AMD data, except for minor allele frequency > 0.001 to account for the larger size of these datasets.

For PCA-based GWAS, we adopt the standard approach of using a reduced set of SNPs with low levels of linkage disequilibrium for the PCA step. SF-GWAS achieves this by imposing a minimum pairwise distance threshold of 100 kb after QC filtering, which we found to obtain similar results to alternatives based on a direct linkage disequilibrium calculation. To establish parity between the plaintext, centralized analysis and our SF-GWAS approach, we use the same set of SNPs for PCA in both analyses for our main results. Alternatively, SNP selection for PCA can be performed separately and the agreed-upon list of SNPs may be provided as input to SF-GWAS. For lung cancer, bladder cancer, AMD, eMERGE, and cross-biobank AMD datasets, we kept the top five PCs as covariates for the subsequent analysis. For UKB, we kept the top ten PCs. We assess the association between each SNP and phenotype of interest based on both linear and logistic regression models including covariates.

For linear association tests, SF-GWAS first constructs an orthogonal basis (Q) for the subspace defined by the covariates (for example, top principal components, age, sex, study group memberships), then computes the Pearson correlation coefficient (r) between the genotype and phenotype vectors where the covariate effects have been projected out using Q. The coefficient r for each SNP is revealed to the collaborating entities. The corresponding χ² statistic with one degree of freedom is obtained as r²(n − c)/(1 − r²), based on which a P value can be calculated. n and c denote the total number of individuals and the number of covariates, respectively. Note that this mapping does not reveal any additional information other than r. For logistic association tests, SF-GWAS computes the score-based test t-statistic for each SNP i as (t=({{tilde{{mathbf{g}}}}_{{{i}}}}^{T}(,{mathbf{y}}-hat{{mathbf{p}}}))/sqrt{{{tilde{{mathbf{g}}}}_{{{i}}}}^{T}{mathbf{W}}{tilde{{mathbf{g}}}}_{{{i}}}}), where ({tilde{{mathbf{g}}}}_{{{i}}}) is the genotype residual vector across individuals after adjusting for covariates, y the phenotype vector, (hat{{mathrm{p}}}) the vector of estimated mean of the trait based on the null model, and ({mathbf{W}}={bf{diag}}{{{hat{p}}_{j}}(1-{{hat{p}}_{j}})}) with ({{hat{p}}_{j}}) the jth element of (hat{{mathbf{p}}}) corresponding to the jth individual, following the approach of REGENIE¹⁷. The corresponding P value is estimated via a normal approximation with ({t}^{2} sim {chi }_{1}^{2}).

For LMM-based GWAS, we follow REGENIE’s approach to first apply ridge regression to obtain the best predictive model of the covariate-corrected phenotype within a given genomic block, accounting for genotype correlations, and then perform a second regression to obtain genome-wide phenotype predictions based on the block-wise predictors. The size of each genomic block is set to 8,192 in order to maximally leverage the vectorized encryption scheme based on our cryptographic parameters. Following REGENIE, we use a five-fold cross-validation to select the best variance parameter to construct the final predictors. For the association tests, we adopt the standard leave-one-chromosome-out scheme, leaving out the chromosome including the tested variant to correct for the background genetic effect on the phenotype without interfering with the signal being tested. We then obtain the χ² statistic with one degree of freedom, analogous to the PCA-based pipeline, using the residuals of the leave-one-chromosome-out genome-wide predictors for each variant.

Evaluation approaches

We evaluated SF-GWAS by simulating each party on a separate virtual machine (VM) with 16 virtual CPUs (vCPUs) and 128 GB of memory (e2-highmem-16) on the Google Cloud Platform. For the main results, we adopt the most efficient local area network setting by creating the VMs within the same zone in the Google Cloud Platform; we illustrate the reasonable additional cost of wide-area network setting in Extended Data Fig. 5. For the UKB and cross-biobank AMD analyses, we used the same or larger VM types to utilize more vCPUs in parallel given the large size of the dataset. Specifically, we used n2-highmem-64 (64 vCPUs with 512 GB of RAM) and n2-highmem-128 (128 vCPUs with 864 GB of RAM) depending on the size of the local dataset.

We measured the runtime and communication cost of SF-GWAS (the latter measured by the number of bytes sent among the parties) on all datasets. We compared these metrics against the prior work implementing the analogous PCA-based GWAS pipeline, S-GWAS⁴. We also evaluated SF-GWAS’s scaling with respect to the number of samples, SNPs, covariates and computing parties (Extended Data Fig. 4). For the scaling experiments, we replicated the lung cancer dataset to produce a dataset of desired dimensions and modified GWAS parameters as needed to ensure that the amount of data at each step grew proportionally with the input dimensions (for example, ensuring that the number of samples passing QC doubles when the original number of samples doubles).

To evaluate the accuracy of SF-GWAS, we compared its association statistics to those obtained from a plaintext, centralized analysis where the individual datasets are combined to form a single consolidated dataset for analysis. For the three S-GWAS datasets, we used a plaintext Python (version 3.8) implementation of the same procedure as SF-GWAS (using a standard PCA implementation in the scikit-learn package⁴⁶, version 1.3.0); and for eMERGE, UKB, and the cross-biobank AMD analysis, we used the PLINK software (version 2.0; https://www.cog-genomics.org/plink/2.0/) implementing the same pipeline for PCA-based GWAS. For LMM-based GWAS, we used the REGENIE software (version 2.2.4; https://rgcgithub.github.io/regenie/) on the pooled dataset with the same parameters to obtain the ground truth association results. For eMERGE and UKB, we additionally evaluated the accuracy of meta-analysis approaches by performing a separate GWAS for each study group with the same study parameters as the global analysis and then combining the association statistics among different parties using the meta-analysis methods implemented in PLINK.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.