Addressing contemporary threats in anonymised healthcare data using privacy engineering

Introduction
“Data can be either useful or perfectly anonymous, but never both” — Paul Ohm1
Data privacy is a growing concern in healthcare. Medical cyber-attacks are increasingly frequent2 and there is ample evidence that trusted healthcare organizations have transferred personally identifiable information (PII) to industry3,4,5,6. Nevertheless, such leakage of personal data likely underestimates the scale of contemporary threats to data privacy and identity.
Technological advances now make it possible to identify important characteristics of an individual without the need for PII, even in datasets that are compliant with privacy regulations. Linkage attacks enable inference of an individual’s precise identity (identity disclosure) or specific features (attribute disclosure) without PII, by cross-referencing online repositories in a probabilistic fashion. Linkage attacks have been used to identify individual medical records7 and socio-demographic details1, yet such linkage remains permitted under the use agreements of several consumer companies8,9,10. Membership inference attacks (MIA)11,12 infer whether an individual belongs to a specific disease-related or vulnerable social group without PII, which could influence the delivery of services to them and perpetuate disparities of care13. In addition, novel artificial intelligence (AI)-based biometrics can now infer identity from traditionally de-identified sources lacking PII, utilizing data such as ECGs or even patterns of gait14 (Fig. 1). These advances in data availability, computational methods and analytical efficiency have eroded practical safeguards against privacy attacks, and have also introduced new privacy risks.

Fig. 1: a Who is involved? Contemporary privacy threats come from intended and unintended users. b What are the risks? To learn an individual’s characteristics or identity. This can occur by inferring characteristics from anonymised data, including membership of a specific group (membership inference attacks, MIA), inferring a sensitive attribute, or inferring identity. These extend the risks of direct leakage of PII. c What are the solutions? Such threats could be addressed by health data privacy frameworks that guide the selection of privacy engineering and include components such as the precise task, the type of user, the type of data required, the duration of time for which data are needed, and ethical considerations.
It is important that patients, health professionals, and researchers alike are familiar with these privacy threats, in order to make informed choices15. However, such familiarity may be uncommon16. We set out to review the contemporary health data privacy landscape, which refers to the protection and secure handling of personal health information, for clinicians, researchers, and patients. We then turn to privacy engineering17, a field that provides technical solutions in commerce, and discuss the strengths and limitations of privacy enhancing technologies (PETs) for various health tasks. We focus on a broad framework that could guide selection of appropriate PETs to strengthen privacy while maintaining health data utility between patients, caregivers, administrators, researchers, and industry.
Contemporary threats to privacy
From 2005 to 2019, there were an estimated 3,900 breaches of health data records affecting over 249 million patients18,19. Health data breaches pose unique risks. Unlike user-selected data such as account names or passwords that can be reset or deleted, biomedical identifiers may be difficult – or impossible – to change. There is also a unique tension between data privacy and sharing in healthcare. Principles of patient consent and data protection, encoded at least since the Belmont Report, empower individuals to freely disclose sensitive information and engage with health providers20. Weakening these protections may erode trust in healthcare. Conversely, there are widely held expectations that health data should be shared outside the private patient-carer relationship, to improve societal health and develop future therapies15,21.
Threats to health data privacy
Mitigating data privacy risk historically focused on restricting access to identifiers by specific groups defined by healthcare providers, professional societies, or governments. Identifiers include personally identifiable information (PII), which is any data that can identify a specific individual, or personal data as defined by the General Data Protection Regulation (GDPR)22 in the European Union. Protected Health Information (PHI) is defined in the U.S. by the Health Insurance Portability and Accountability Act (HIPAA)23 through a set of 18 identifying elements. As the healthcare landscape has evolved, so too have the types of entities vulnerable to privacy attacks. For instance, direct-to-consumer services often handle sensitive data, yet may not be covered by privacy legislation. A cyber-attack at 23andMe in 2023 exposed sensitive ancestry information revealing ethnic group membership, such as individuals with Ashkenazi Jewish heritage24.
Importantly, de-identifying data may not provide robust protections1. Technical attacks can now attach identities to data that are compliant with widely-used standards for de-identification, to infer attributes beyond what individuals explicitly disclose (Fig. 2). At the heart of these threats is the concept from information theory of the positivity of mutual information25, which informally ensures that the certainty in an unknown quantity (such as an individual’s identity) never decreases as additional information is utilized. A corollary is that reducing the extent of disclosed information may reduce the risk of disclosure.
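A minimal formal statement of this principle, in standard information-theoretic notation (an illustration, not drawn from the cited reference): for an unknown quantity $X$ (such as an individual’s identity) and released data $Y$,

$$ I(X;Y) \;=\; H(X) - H(X\mid Y) \;\ge\; 0 \quad\Rightarrow\quad H(X\mid Y) \;\le\; H(X), $$

so an adversary’s uncertainty about $X$ can only stay the same or shrink once $Y$ is observed; coarsening or withholding $Y$ therefore bounds how much certainty an attacker can gain.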

Fig. 2: a Identity Disclosure. Linkage attacks combine multiple online datasets to infer identity. Aggregated datasets include publicly reported statistical data. Artificial intelligence can be trained to infer identity from historically ‘non-identifiable’ data such as raw ECGs, chest X-rays or other data. b Attribute and Membership Disclosure. Attribute disclosure can reveal that a patient record has a feature such as alcohol use or a disease, without revealing the patient’s precise identity. Membership inference attacks (MIA) can identify whether an individual is present in a dataset, which could be a list of patients who are HIV positive or have not paid their bills. Differencing attacks operate on sequentially released data to infer information about an individual who is present in one release yet missing from another.
The ways in which characteristics of an individual are inferred depend, in part, on the format in which information is available26. One can broadly consider three formats: de-identified datasets, aggregated information from multiple individuals such as clinical trial data, and AI models.
Threats to de-identified datasets
A key observation is that seemingly innocuous data without PII can be used for ‘re-identification’. One of the earliest attacks is attributed to Sweeney in 2002, who used voter registration combined with other public data to re-identify medical records of the Governor of Massachusetts27. As recently as 2018, patients were re-identified using de-identified HIPAA-compliant datasets and data aggregated from public newspaper articles28. Such risks are amplified for patients with rare conditions, for example by cross-referencing genetic variant data in a public database with social media or news articles. While health data repositories may state that attempts to re-identify patients violate their terms of service, it is unclear whether current safeguards are sufficient to protect privacy.
Artificial intelligence has added to such identity disclosure threats. Using supervised learning paradigms29, AI can associate seemingly non-identifiable biological or behavioural data inputs with public identifiers. Trained AI systems have been used to infer identity from anonymised chest X-ray images30, ECGs31, cerebral MRI images32, or gait33. AI has also introduced new types of threats, described below.
Threats to aggregated datasets
Aggregated data were traditionally viewed as free of privacy threats, and publishing statistical analyses of population data is often required by law in the U.S. and other countries7,34. It is now clear that aggregated data may indeed introduce privacy risks. Membership inference attacks35, or more broadly tracing attacks12, can discern whether an individual of known characteristics is present in aggregated data. If the dataset contains sensitive information—such as HIV status, ethnicity or membership of a vulnerable group—this attack could facilitate discriminatory practices20,36 and further widen disparities in care13,20.
For example, in 2008, Homer et al.11 inferred which individuals were present in a Genome Wide Association Study (GWAS) of published aggregated allele counts. Since members of each GWAS subgroup had a specific disease, this analysis revealed which individuals had the disease37. Paradoxically, MIA can also be applied to synthetic data38, despite their goal of reducing the need for actual patient data, which should be considered when designing digital twins for healthcare39.
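To illustrate the intuition behind Homer-style attacks, the following is a simplified sketch (not the published method); the arrays of per-SNP allele frequencies are hypothetical:

```python
import numpy as np

def homer_statistic(individual, pool_freq, reference_freq):
    """Simplified Homer-style test statistic.

    individual:     per-SNP allele frequencies for the target (0, 0.5 or 1)
    pool_freq:      published aggregate allele frequencies of the GWAS group
    reference_freq: allele frequencies of a public reference population

    A positive value means the individual is, on average, closer to the
    published pool than to the reference population, suggesting membership.
    """
    individual = np.asarray(individual, dtype=float)
    return np.sum(np.abs(individual - np.asarray(reference_freq))
                  - np.abs(individual - np.asarray(pool_freq)))
```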
Data are increasingly aggregated piecemeal through data portals and interactive query systems. Differences between the results of two or more queries can reveal information about an individual, which is termed a differencing attack40. The interactive nature of data portals allows a user to adaptively and strategically pose queries that statistically facilitate this threat. A stylized differencing attack is presented in Fig. 2B. The explosive growth of large language models and AI-powered chatbots, which are essentially ‘evolved’ interactive data portals that learn continuously on massive repositories of data41, raises the need to implement strategies pre-emptively to mitigate privacy threats from such models.
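A minimal sketch of such a stylized differencing attack, assuming a hypothetical portal that returns exact counts with no noise addition or query auditing:

```python
def count_patients(records, **filters):
    """Aggregate query: how many records match all the given filters?"""
    return sum(all(r.get(k) == v for k, v in filters.items()) for r in records)

records = [  # hypothetical portal contents
    {"ward": "3B", "age_band": "40s", "hiv_positive": True},
    {"ward": "3B", "age_band": "70s", "hiv_positive": False},
    {"ward": "3B", "age_band": "70s", "hiv_positive": False},
]

# Two individually innocuous aggregate queries...
q1 = count_patients(records, ward="3B", hiv_positive=True)                  # 1
q2 = count_patients(records, ward="3B", hiv_positive=True, age_band="70s")  # 0
# ...whose difference isolates the only ward-3B patient in their 40s:
# if the attacker knows who that person is, their HIV status is revealed.
target_is_positive = (q1 - q2) == 1
```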
Threats to artificial intelligence models
While AI models are used as ‘prediction engines,’ from an adversarial perspective AI models may be viewed as ‘carriers of training data’. After all, the predictive behaviour of an AI model is determined by its parameters, and modern neural networks or large language models (LLM) learn their parameters from training data. These parameters encode statistical summaries of the training data, just as the intercept in a simple linear regression model represents an average and the slope represents the change in the conditional average per unit change in a predictor. AI models thus present opportunities for adversaries to learn attributes of individuals in the training data.
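As a small illustration of this point (hypothetical data): with a mean-centred predictor, the fitted intercept of an ordinary least-squares regression equals the mean of the training outcomes, i.e. the parameter is a statistical summary of the records used to fit it.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(30, 80, size=200)                  # hypothetical ages
sbp = 100 + 0.5 * age + rng.normal(0, 5, size=200)   # hypothetical blood pressures

# Design matrix: intercept column plus mean-centred age
X = np.column_stack([np.ones_like(age), age - age.mean()])
intercept, slope = np.linalg.lstsq(X, sbp, rcond=None)[0]

assert np.isclose(intercept, sbp.mean())   # the intercept *is* the training mean
```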
AI models are also susceptible to MIA, similar to aggregate statistics, based on the principle that models tend to be more ‘confident’ on predictions whose inputs are more similar to their training data42. An adversary could thus query the model, record inputs for which the model has higher confidence, and infer which subjects are likely in the training data. This vulnerability is highest for models that are overfitted, or highly influenced by their training data43. MIA of AI models is not simply a theoretical possibility but has been applied with striking success. MIA has been reported for AI models in neuroimaging44 and large language models45. Chen et al. found that, without additional safeguards, an adversary could infer an individual’s membership of a genomic dataset from convolutional neural networks trained on genetic data46. Chang et al. further demonstrated that ChatGPT/GPT-4 is vulnerable to MIA via ‘digital archaeology’, a technique that uncovers digital traces of previous literature, early websites, social media or other content, to infer whether LLMs used copyrighted works in training47.
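A minimal sketch of a confidence-thresholding membership inference heuristic; `predict_proba` is assumed to follow the scikit-learn convention, and in practice the threshold would be calibrated (e.g., with shadow models):

```python
import numpy as np

def confidence_mia(model, records, threshold=0.9):
    """Flag records on which the model is unusually confident as likely
    members of its training set. Returns a boolean guess per record."""
    confidences = model.predict_proba(records).max(axis=1)
    return confidences >= threshold
```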
AI models are susceptible to other forms of privacy attacks. Training data extraction attacks describe the ability of an adversary to recover individual examples from training data and have been successfully levied against LLMs48,49. Model inversion attacks, on the other hand, use queries to a model50,51 to reconstruct sensitive features of the data on which it was trained. A recent model inversion attack on a personalized warfarin dosing model enabled the prediction of patients’ genetic markers52.
Historically, privacy protection was often ‘structural’53—in that the time to compute linkages or to perform a differencing attack or MIA provided practical barriers to attacks. Advances in data availability via wearables and other sensing technologies, and increases in analytical techniques and computational efficiency, have eroded these safeguards. AI models are thus not only directly vulnerable to attacks but can be used to facilitate them.
Ethical considerations and legislative strategies
Perspectives on data privacy and data sharing vary considerably54,55,56,57. Traditional guidelines, such as the Belmont principles of respect for persons, beneficence and justice58, may not translate directly to contemporary data sharing such as self-disclosure in social media, nor address privacy threats that infer sensitive information without disclosing PII or PHI.
One solution is to treat all data, even without PHI or PII, as sensitive. This would have dramatic implications for medical practice and research and raises a broader debate on the ethics of withholding seemingly low-risk data that could improve or save lives. Extreme restrictions in data access would also exacerbate bottlenecks in patient care and have been blamed for the slow response to the COVID-19 pandemic in Canada59. Industry has also voiced concerns that restricting data access may impede technical innovation60, which re-emphasizes the need for discussions on how best to balance health data utility with privacy.
Several ethical frameworks can be applied to health data privacy54,55,56,57. Contextual integrity was developed to define privacy boundaries and data-sharing practices for online services and is based on ethical norms expected by individuals and society57. In general, society expects that sensitive healthcare data will not be publicly exposed61. A major concern of patients is the secondary use of data beyond their original consent, particularly by industry62, yet industry has a central role in innovation.
Several legislative protections for health data privacy have been described15. Historically, the focus has been anonymisation of data, with the primary privacy law in the U.S. being the Health Insurance Portability and Accountability Act (HIPAA)23. HIPAA does not regulate data per se but defines groups (‘covered entities’) for whom access is regulated. Such legislation may provide suboptimal protections as healthcare has evolved, and direct-to-consumer services or the providers of wearable devices are typically ‘non-covered entities’ (unregulated by HIPAA)23.
In the European Union (EU), the General Data Protection Regulation (GDPR) 2016/679 and its United Kingdom counterpart (UK GDPR) define protections for health and personal data more broadly—regardless of custodian, format or collection method22. The legislation also gives consumers greater control over the consent and use of biometric and genetic data, which are treated as a ‘special category’63.
Supplemental protection may also be provided in the U.S. by the Federal Trade Commission (FTC), which has overseen actions against direct-to-consumer services engaging in ‘deceptive or unfair’ practices in relation to consent or data sharing64, and by legislation such as the California Consumer Privacy Act (CCPA) and the Genetic Information Privacy Act (GIPA)22.
Thus, the translation of any technology for health data privacy into robust practical solutions raises ethical, legal and technical challenges. Safeguards should protect sensitive data throughout the lifetime of an individual, or even their offspring, which is challenging as healthcare organisations or consumer companies grow or consolidate65. A central component of implementation must be shared decision-making with the patient or data owner (‘No decision about me without me’). The dialogue should be transparent, considering current and future data uses, the types of data involved, and which specific protections are—or are not—available. This in turn will require education of patients and caregivers on the risks and benefits of more and less restrictive options, which is currently suboptimal16.
Privacy framework for health data
The central health data need is to facilitate access by authorised individuals while preventing unwanted disclosure of the identity, sensitive attributes or group membership of an individual or cohort. In primary data use, the patient (data owner) gives explicit consent to use their data for care, which is regulated by restricting authorisation. Nevertheless, this ‘release and forget’ model often provides unnecessary data access. For instance, it is not clear why clinicians at an institution can access data for patients they are not caring for, or long after care has been completed. Secondary data use refers to purposes other than those for which consent was originally provided, such as research; it is regulated differently23 and is vulnerable to privacy risks. A useful framework could consider how data are released for primary use, in ways that then reduce downstream risks from secondary use62. While regulations such as GDPR require data to be secured appropriately22, they do not recommend specific tools.
An alternative framework is a dynamic privacy model. Data access could be time-limited to fulfil a specific task, such as the acute care of an individual, and authorised for the care team. The types of accessible data could be minimized to reduce downstream risks of linkage attack, and time-limited for each stage that the patient reaches in a task. The system would be implemented to enable timely data utility while maintaining these privacy safeguards.
Here, we introduce a sample framework for health data privacy in which the primary unit of privacy is the biomedical task, which we map to applicable privacy strategies. In Fig. 3, the privacy framework first assesses the specific task (primary axis), followed by axes of data type, the role of user(s), and the time bounds for data access.

Fig. 3: Axis I: Task, such as care of a given patient, resource analysis, or research. Axis II: Data required for the specific task, and its progression from start to completion. Axis III: User(s) who require access. Note that data owners (patients or consumers) should have unlimited rights of access. Axis IV: Length of time for which access is required, such as for acute care, after which it can be restricted. The framework can be used to design data release that, together with privacy-enhancing technologies, can minimize downstream privacy risks.
Primary axis I. Specific task
Health data tasks broadly span individual care, research and industry, and progress from start to completion. The task largely determines the approach to privacy—patient care requires data at the individual level and is well suited to privacy solutions that operate on encrypted data (e.g., homomorphic encryption, see section Privacy Engineering for Biomedical Data), but less suited to privacy solutions that operate on aggregated data (e.g., differential privacy, see section Privacy Engineering for Biomedical Data). While some tasks share attributes across domains, such as outpatient clinical care, others may differ substantially, such as the processes followed by a geneticist versus a surgeon, population versus patient-centred research, or the development of an app.
Axis II. Data type
Differing types of data required for a task introduce different privacy threats. Fig. 4 breaks data types into categories of granular and aggregated data, although real-world cases may blend categories. Data needs will change as a task progresses from start to completion. Data types would have to be defined by experts in data science, with real-world experience in these tasks, and will require periodic review to incorporate changing practice patterns such as telehealth and crowdsourced data collection66.

Fig. 4: Patient-level care or research may require granular data for laboratory diagnostics, although bedside care often uses only data in ranges rather than precise values (such as left ventricular ejection fraction). Population research and resource planning could use aggregate data, which introduce different and potentially lower risks. Re-identification risks are elevated by self-disclosure on social media and the growth of consumer diagnostics developed by private corporations, which are often not HIPAA-covered entities.
Granular Data
Granular data may uniquely identify an individual and carry substantial privacy risks. Such data may be required for patient care, medicolegal evaluation, patient-oriented research, and algorithmic development in industry. Examples include history and laboratory tests at a clinical encounter, sequences of X-ray images over time, or mutation-level genomic data. Another form of granular data includes sequences of billing data such as CPT codes.
Granular data, however, can be protected while preserving utility by techniques such as generalization. While a pathologist requires access to actual tissue to perform their task, and a radiologist requires access to granular images, bedside clinicians often make decisions on ‘reduced’ data. Clinicians may be satisfied with knowing a patient’s age in bands such as ‘40s’ rather than 41 or 47 years precisely, a cardiologist may be satisfied with knowing left ventricular ejection fraction in ranges such as ‘35–50%’ rather than its value to single-digit precision, and a pulmonologist may rely on a tumour stage more than its precise histology or dimensions. This approach of reduced data release—of generalized rather than raw data even for primary use—has been shown to protect against privacy attacks in other domains when data are subsequently aggregated7,27,67,68.
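A minimal sketch of such generalization applied before release (field names and bands are hypothetical):

```python
def generalize_record(record):
    """Release banded values rather than raw ones."""
    age_band = f"{(record['age'] // 10) * 10}s"          # 47 -> "40s"
    if record["lvef"] < 35:
        ef_band = "<35%"
    elif record["lvef"] <= 50:
        ef_band = "35-50%"
    else:
        ef_band = ">50%"
    return {"age_band": age_band, "lvef_band": ef_band}

generalize_record({"age": 47, "lvef": 42})   # {'age_band': '40s', 'lvef_band': '35-50%'}
```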
Aggregate Data
Aggregate data may afford greater privacy protections than granular data because released elements are less specific25. This reduces the certainty that any particular record indicates a specific person, has a sensitive attribute or is a member of a specific group7,27,67,68. Aggregate data are widely used for tasks such as outcome tracking within a clinical unit, communications between healthcare providers and patients, population-based research, resource utilization, and public reporting of clinical trial data.
Axis III. Intended users
Intended users are straightforward for some use cases, such as the hospital team entrusted with caring for an individual, or related users such as subspecialty consultants. Others are less obvious, such as legal and compliance teams in clinical practice. Other users are controversial, including researchers or industry professionals developing an app. User roles may change over the natural history of a task.
Axis IV. Time bounds of data access
Data access should be limited to the time for each stage of a task. While the primary care team and consultants must have immediate data access at their point in care, access may later be restricted as appropriate. It may be reasonable for users to request longer time bounds, such as a consultant requesting continued access to a patient’s data for follow-up, or technology companies requesting extended access to update their AI models. These requests can be considered case-by-case. The Supplementary Materials illustrate use cases for the framework.
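A sketch of how the four axes might be combined into a machine-checkable access grant (all names are hypothetical; a real deployment would integrate with existing identity, consent and audit systems):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class AccessGrant:
    task: str                  # Axis I,   e.g. "acute care of patient 123"
    data_types: frozenset      # Axis II,  e.g. {"granular_labs", "imaging"}
    roles: frozenset           # Axis III, e.g. {"primary_team", "cardiology_consult"}
    expires: datetime          # Axis IV,  timezone-aware expiry of access

def is_permitted(grant: AccessGrant, user_role: str, data_type: str) -> bool:
    """Allow access only for an authorised role, a permitted data type,
    and within the time bound of the current stage of the task."""
    now = datetime.now(timezone.utc)
    return (user_role in grant.roles
            and data_type in grant.data_types
            and now < grant.expires)
```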
Comparison to other data use frameworks
In international human rights law, the Necessary and Proportionate Doctrine specifies that data collection is permissible provided it is necessary to serve a legitimate legal purpose and is narrowly tailored to be proportionate to this purpose69. The World Health Organization has extended this logic to public health priorities, alongside taking appropriate measures to minimize disclosure risks70.
Underlying these approaches are the principles of data minimization71, to collect only the minimal viable amount of information for a given task, and purpose limitation, to collect and use information only for legitimate purposes72, which are referenced in Article 5(1)(c) of the European Union’s GDPR and California’s Privacy Rights Act (CPRA, Section 1798.100)73.
The proposed framework shares the goal of data minimization, integrated with PETs. It is intended to be used as a tool to identify the factors to consider for privacy needs, which can then help drive the selection and implementation of PETs for a task. The framework augments existing consenting and data-sharing processes but is not intended to replace them.
Implementation of such frameworks may introduce complexities for healthcare organizations and research institutions, but these may be unavoidable. Task-based, user-specific and time-limited privacy is emerging elsewhere in professional life and society—in subscription-based access to online journal articles or movie content, and in financial and legal transactions. Unlike current ‘break the glass’ policies that simply restrict data access behind digital firewalls, strategies such as that proposed here must be seamless for the user to avoid delays in utility. One may argue that current privacy solutions are less cumbersome, but it is unclear whether they are adequate for emerging privacy threats. Future discussions on health data privacy frameworks must include data owners (ultimately, patients), ethicists, healthcare institutions, research bodies, industry, legal experts and government.
Privacy engineering for biomedical data
Privacy engineering focuses on the design and implementation of solutions to data privacy threats17. Several technical solutions or privacy-enhancing technologies (PETs) differ in their benefits and tradeoffs74, such as reduced data precision or increased computation time, and none offers a panacea as emphasised by Mulligan et al.55.
The selection of PETs for a given health data task is a developing field. While policies by organizations such as the OECD (Organisation for Economic Co-operation and Development)75 emphasise health data privacy, they focus on traditional authentication and authorisation policies. Nevertheless, a consensus is emerging around the need to address emerging threats raised by AI and a plethora of data, and the United Nations recently adopted language proposed by the US government on safe and reliable AI76.
We outline PETs that could be adapted to the health data ecosystem74, presenting preferred use cases as a foundation for appropriate use. Implementation of solutions is likely to be complex and will require interoperability between systems77. Table 1 and Fig. 5 summarize these PETs.

Fig. 5: PETs have varying strengths and weaknesses and should be tailored to the specific task and data type. a Differential privacy generates processed data which are statistically indistinguishable from the raw population data but cannot be used to infer details about any individual. b Homomorphic encryption enables analyses on encrypted data without first decrypting them and can be used for individual-level analysis. c Secure multiparty computation splits access between users so that no one party has access to the entire dataset, and may be valuable for distributed tasks such as clinical trials. d Federated learning integrates data from multiple sources to develop AI models. e Synthetic data have similar statistical properties to raw data but do not reveal individual properties.
Differential privacy (DP) is a mathematical strategy that guarantees, informally, that any outcome of a statistical analysis of a dataset is “essentially as likely” to occur regardless of an individual’s precise data point78. An adversary will thus not be able to discern from the output whether a specific individual’s data was included or not. When properly configured, differential privacy can thus prevent re-identification79, MIA in aggregated data12 and AI models12,46,80, differencing attacks40, training data extraction49, and model inversion51.
DP works by adding “carefully crafted noise” that introduces a privacy-accuracy tradeoff: the greater the noise the greater the privacy, yet the less useful the data. The precise loss of accuracy can be calibrated for a particular task78,81. DP has been applied for genomic analysis35, epidemiology81, and medical imaging82 and is more accurate on larger datasets78. The U.S. Census Bureau uses DP techniques when reporting public health data34, which serves as a model for institutions to publish aggregated data. Accordingly, DP may be better suited for population tasks than for individual patient care and is less suited for analyzing outliers such as specific critical laboratory values or effects in subsets of the population78,81.
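A minimal sketch of the Laplace mechanism for a counting query, the simplest differentially private primitive (the epsilon value is illustrative):

```python
import numpy as np

def dp_count(true_count, epsilon, rng=None):
    """epsilon-differentially private count via the Laplace mechanism.
    A count has sensitivity 1 (adding or removing one patient changes it
    by at most 1), so Laplace noise with scale 1/epsilon suffices."""
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon -> more noise -> stronger privacy, less accurate answer
noisy_count = dp_count(true_count=412, epsilon=0.5)
```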
While differential privacy addresses the motivating threats for this manuscript, other threats can also arise from the collection, sharing, and use of data in the health setting. We now consider several additional PETs.
Homomorphic encryption (HE). Encryption is frequently used to achieve health data privacy, yet typically requires data to be decrypted for analysis—during which time they are vulnerable to attack. HE is a mathematical approach that allows computations to be performed on encrypted data without decrypting them.
Fully homomorphic encryption (FHE) enables a diverse array of computations, while partially homomorphic encryption (PHE) supports only certain operations, such as addition or multiplication. HE has been applied for individual patient care, such as developing genetic risk scores accessible to authorised providers by storing encrypted genetic (single nucleotide polymorphism) data and pseudonymized identifiers in separate processing facilities83. HE has also been used for gene-based cancer risk modelling84 and for coronary angiography85. The National Institutes of Health have funded iDASH (Integrating Data for Analysis, Anonymization, and Sharing) as a National Center for Biomedical Computing using HE to enable institutions to conduct computations without decryption86. The tradeoff of HE is that computation on encrypted data is more burdensome than on unencrypted data. This may limit its application for mobile technologies87, or other applications or geographies where resources are constrained, but can be mitigated using partial HE.
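For illustration, a toy implementation of the Paillier cryptosystem, a well-known partially (additively) homomorphic scheme; the primes are far too small for real use and serve only to show that encrypted values can be summed without decryption:

```python
import math
import random

def keygen(p, q):
    """Toy Paillier key generation, using the common g = n + 1 simplification."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)    # lcm(p-1, q-1)
    mu = pow(lam, -1, n)
    return n, (lam, mu, n)                               # public key, private key

def encrypt(n, m):
    """Encrypt message m (0 <= m < n) under public key n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n) * mu % n

def add_encrypted(n, c1, c2):
    """Homomorphic addition: Enc(m1) * Enc(m2) mod n^2 decrypts to m1 + m2."""
    return (c1 * c2) % (n * n)

pub, priv = keygen(2003, 2011)                           # toy-sized primes
c = add_encrypted(pub, encrypt(pub, 120), encrypt(pub, 85))
assert decrypt(priv, c) == 205                           # summed while encrypted
```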
Secure multiparty computation (SMC) enables decentralized computations without the need for a central data repository, by mathematically combining information such that nothing is revealed to data holders beyond what is revealed by the computational analysis. Several potential uses for SMC include analyses across institutions such as coordinating data for clinical trials, cancer research projects, population health management88, or population research on MRI features89. The tradeoffs in SMC are its logistical and computational complexity, which may limit its use in resource-constrained applications such as mobile digital technologies.
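A minimal sketch of additive secret sharing, one building block of SMC; the per-site counts are hypothetical:

```python
import random

PRIME = 2**61 - 1   # modulus of the finite field used for sharing

def share(secret, n_parties):
    """Split a value into n random shares that sum to the secret mod PRIME;
    any subset of fewer than n shares reveals nothing about the secret."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Three hospitals compute their combined patient count without revealing
# their individual counts to one another.
counts = [1200, 850, 400]                              # hypothetical per-site counts
all_shares = [share(c, 3) for c in counts]
# Party i receives the i-th share from every site and sums locally...
partial_sums = [sum(s[i] for s in all_shares) % PRIME for i in range(3)]
# ...and only the partial sums are pooled to reveal the total.
assert reconstruct(partial_sums) == sum(counts)        # 2450
```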
Federated learning (FL) describes the development of AI models using data from multiple sources. FL has been used to train AI models to identify diabetic retinopathy using optical coherence tomography scans from different scanners90 and to predict cardiovascular hospitalizations using electronic health records from multiple institutions91. Unlike SMC, the decentralized nature of federated learning does not offer a mathematical guarantee of protection by itself but, when paired with other techniques, federated learning can inherit their mathematical guarantees. For example, federated learning has been combined with SMC and HE to develop secure AI models for outcomes prediction in oncology and GWAS92, and with differential privacy to develop secure AI models for medical image analysis93.
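A minimal sketch of federated averaging (FedAvg) for a logistic regression model, in which only model weights leave each site; the local training loop and hyperparameters are illustrative:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One site's local training: a few gradient steps of logistic regression."""
    w = weights.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def federated_averaging(global_w, site_data, rounds=10):
    """Each round: sites train locally, then the server averages the weights
    (weighted by local sample size). Raw records never leave the sites."""
    for _ in range(rounds):
        sizes = [len(y) for _, y in site_data]
        local_ws = [local_update(global_w, X, y) for X, y in site_data]
        global_w = sum(n * w for n, w in zip(sizes, local_ws)) / sum(sizes)
    return global_w
```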
Federated learning has limitations in the health setting. First, distributing data protection between centres may complicate alteration of access at any one centre. Second, centres in a network may differ in key attributes, which may introduce diversity yet fall short of a key statistical requisite that samples be independent and identically distributed.
Synthetic data are an emerging approach to developing AI models or performing tasks that require a dataset rather than summary outputs, shielding the specifics of individual records while preserving their statistical properties94. Using a commercial service (MDClone) to generate synthetic data, Lun et al. showed that patients with ischaemic stroke with and without co-existing cancer differ in their prevalence of hypertension, chronic obstructive lung disease and thromboembolic disease, findings that were confirmed in actual clinical data95. Synthetic data offer no mathematical guarantee of privacy per se96 but can be paired with techniques such as differential privacy to train AI models97. A limitation of differentially private synthetic data is that it may be difficult to preserve correlations between all variables in the original data, although preserving some correlations is achievable98. Other limitations include the types of data that can be synthesised, and whether they are realistic. AI models trained with synthetic data may have lower accuracy than those trained on real data99, albeit with higher privacy protection96.
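A minimal sketch of the underlying idea, using a multivariate Gaussian fit as a stand-in for the far richer generative models used in practice:

```python
import numpy as np

def gaussian_synthesizer(real_data, n_synthetic, rng=None):
    """Fit a multivariate Gaussian to the real records (rows = patients,
    columns = numeric variables) and sample synthetic records from it.
    The samples preserve means and covariances but are not real patients."""
    rng = rng or np.random.default_rng()
    real_data = np.asarray(real_data, dtype=float)
    mean = real_data.mean(axis=0)
    cov = np.cov(real_data, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_synthetic)
```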
Privacy engineering via accountability mechanisms
Data privacy breaches can be addressed by several mechanisms. Access controls aim to ensure that granular information is only available to approved individuals with a legitimate need for their task100. Privacy-preserving programming frameworks can be used to restrict the types of computations run or enforce that they are run in a privacy-preserving manner (e.g., using differential privacy)81. Controlled environments, such as data clean rooms101, trusted execution environments102, and secure research facilities can also be used to limit the misuse of data. For example, Dutch researchers developed a controlled environment to analyze vertically partitioned health data across multiple parties in a privacy-preserving manner103. Finally, audit logging can be used to monitor which individuals access information, and what types of computations they run71. While access controls and audit logging were originally developed for security purposes, their use has extended into modern data privacy practices to restrict the flow of information to appropriate individuals for appropriate use cases. Taken together, these mechanisms provide practical barriers to restrict improper information disclosure, by documenting when and how information was used and by whom.
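A minimal sketch of audit logging as a wrapper around a data-access function (a toy example; production systems would write to tamper-evident, centrally managed storage):

```python
import functools
import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("audit")

def audited(query_fn):
    """Record who ran which query, with what parameters, and when."""
    @functools.wraps(query_fn)
    def wrapper(user, *args, **kwargs):
        audit_log.info(json.dumps({
            "user": user,
            "query": query_fn.__name__,
            "params": repr((args, kwargs)),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }))
        return query_fn(user, *args, **kwargs)
    return wrapper
```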
Discussion and conclusions
The nature of privacy threats to health data is evolving. Traditional threats include cyber-attacks leading to leaks of personal data from healthcare entities, and these attacks are now expanding to direct-to-consumer services. These growing threats challenge existing privacy practices and legislation for health data. In addition, emerging non-traditional privacy threats could have just as great an impact.
Diverse privacy threats can now occur without traditional data leakage, enabled by advances in artificial intelligence, aggregation of massive online data resources, and data analytics. While these technologies have advanced many facets of health care, they have removed some structural obstacles to privacy attacks—making them easier and more efficient to perform—and have introduced new risks. For instance, fully trained AI models and large language models that are emerging throughout medicine carry vestiges of their training data, which could be used to infer sensitive characteristics of the individuals whose data were used for training.
Such threats introduce new ethical and societal challenges, and a need for novel solutions that maintain privacy yet ensure data utility for critical health care tasks. A variety of privacy solutions are now available, some applicable to individual patients, such as homomorphic encryption, others better suited to population data, such as differential privacy, and others suited to maintain privacy between hospitals in a trial network, such as secure multiparty computation. Each has strengths and limitations, which can be considered as part of an integrated health data privacy framework.
Future initiatives on health data privacy must start with shared decision-making with the patient (data owner). This should be transparent, considering current and future data uses, the types of data involved, and what protections are available. This will require education of data owners, caregivers and health data professionals, so that truly informed decisions can be made. The subsequent implementation of privacy technologies is likely to be complex but is unavoidable, and should include data owners, ethicists, healthcare institutions, research bodies, industry, legal experts and government.