Article
5 min read
Adrian Sutherland

Marlon Brando once snapped at an interviewer that ‘privacy is not something that I’m merely entitled to, it’s an absolute prerequisite’. This holds as true for medical patients in 2023 as it did for Hollywood's greatest leading man in 1960. 
 
Your average patient today has benefited from the digitisation of medical information. Electronic Health Records (EHRs) have undoubtedly made healthcare more efficient. But the technology’s main upsides – convenience and ease of access – raise a big question: who should have access to other people’s medical data? 
 
There is no easy answer here, only a conflict of priorities. Let’s take a closer look at the disputes in play, and how an emerging technology promises to resolve many of them. 

 

Patient data is gold dust 

 

It’s worth reminding ourselves of what is at stake here. In the hands of medical and pharmaceutical researchers, real-world data (RWD) is the key to major breakthroughs that improve patient outcomes. 
 
The more data available, the easier it is to identify trends and patterns. Scaling up development of diagnostics, therapies and treatments could spell countless gains in quality-adjusted life years. 

 

The patient data privacy quandary

 

While the benefits of sharing healthcare data are clear, clinicians tend to take a more cautious view of its uses. 
 
Physician-patient privilege matters, regardless of its consequences for scientific progress. Patients must always be able to trust that what they disclose to their doctor is strictly confidential. 
 
This is not only a matter of professional ethics. Health records are now a bigger target for fraudsters than credit cards, accounting for 95% of all identity theft. Only the most robust security measures can prevent identity theft on an industrial scale. 
 
Different jurisdictions have different rules in place to establish patients’ informed consent for their records to be used in research. In almost every case, they will stipulate that personal identifiers are removed or obscured. Yet this is not always enough to keep data safe. 

 

De-identification is not a magic bullet

 

Even rigorously anonymised data can be reconnected to its sources. On several occasions, researchers have put de-identification methods to the test and found them wanting. 
 
By triangulating de-identified data with other information available online, scientists have been able to identify participants in genomic sequencing projects. 
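 
To see how such a ‘linkage attack’ works in practice, consider this illustrative Python sketch (every name, record and dataset below is invented): joining a ‘de-identified’ dataset to a public one on shared quasi-identifiers, such as postcode, birth date and sex, is enough to re-attach names to diagnoses.

```python
import pandas as pd

# Hypothetical 'de-identified' study data: names stripped, but
# quasi-identifiers (postcode, birth date, sex) left intact.
study = pd.DataFrame({
    "postcode":   ["SW1A 1AA", "M1 1AE", "EH1 1YZ"],
    "birth_date": ["1981-03-04", "1975-11-30", "1990-07-21"],
    "sex":        ["F", "M", "F"],
    "diagnosis":  ["Type 2 diabetes", "Hypertension", "Asthma"],
})

# Hypothetical public dataset, e.g. a scraped voter roll or social profile.
public = pd.DataFrame({
    "name":       ["Alice Smith", "Bob Jones"],
    "postcode":   ["SW1A 1AA", "M1 1AE"],
    "birth_date": ["1981-03-04", "1975-11-30"],
    "sex":        ["F", "M"],
})

# Joining on the shared quasi-identifiers re-attaches names to diagnoses.
reidentified = study.merge(public, on=["postcode", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])
```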
 
As machine learning tools get better, the risks of adversarial attacks on medical databases will only grow. 
 
But what if the data never belonged to anyone in the first place? 

 

How synthetic data protects privacy

 

AI’s creative abilities stretch beyond writing superhero films and rendering 3D video game graphics. Neural networks trained on real-world data can now generate synthetic data that credibly resembles its sources. While artificial, this data preserves the key statistical properties of the real-world data it was derived from. 
 
This field is developing rapidly. In just the last few years, Generative Adversarial Network (GAN) modelling has improved upon randomised ‘Monte Carlo’ approaches, which sample each variable independently. The data produced retains the deeper relationships between variables, making it more meaningful to analyse. 
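 
A full GAN is beyond the scope of a blog post, but a toy comparison illustrates why joint modelling matters. In this sketch (all numbers invented, with a simple multivariate Gaussian standing in for the generative model), resampling each column independently preserves the marginals but destroys the relationship between age and blood pressure, while a joint model keeps it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 'real' cohort: systolic blood pressure correlates with age.
age = rng.normal(55, 12, 1000)
sbp = 90 + 0.6 * age + rng.normal(0, 8, 1000)
real = np.column_stack([age, sbp])

# Naive Monte Carlo: resample each column independently.
# Marginals survive, but the age/blood-pressure relationship is destroyed.
mc = np.column_stack([rng.choice(col, 1000) for col in real.T])

# Joint model (stand-in for a GAN): fit the full covariance and sample from it.
joint = rng.multivariate_normal(real.mean(axis=0), np.cov(real.T), 1000)

for name, data in [("real", real), ("monte carlo", mc), ("joint", joint)]:
    corr = np.corrcoef(data.T)[0, 1]
    print(f"{name:12s} age-SBP correlation: {corr:.2f}")
```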
 
Researchers can use these techniques to create imitation EHRs for patients who don’t actually exist, and analyse them without violating privacy law. So long as protocols are followed to stop real patient data ‘leaking’ into the artificial datasets, research can proceed faster and at greater scale. 
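 
What might such a leakage check look like? As a minimal, illustrative sketch (not a formal privacy guarantee; production systems use stronger tests such as membership-inference audits), one can flag synthetic records that copy, or sit suspiciously close to, real ones:

```python
import numpy as np

def leakage_report(real: np.ndarray, synthetic: np.ndarray) -> None:
    """Flag synthetic rows that copy or sit very close to real rows.

    A simple protocol check: exact copies and very-near neighbours
    suggest the generator has memorised individual patients.
    """
    # Exact copies: any synthetic row identical to a real row.
    real_set = {tuple(row) for row in real.round(6)}
    copies = sum(tuple(row) in real_set for row in synthetic.round(6))

    # Distance from each synthetic row to its nearest real row.
    dists = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=2)
    nearest = dists.min(axis=1)

    print(f"exact copies of real records: {copies}")
    print(f"median distance to nearest real record: {np.median(nearest):.3f}")

# Invented stand-in data, purely to show the check running.
rng = np.random.default_rng(1)
leakage_report(rng.normal(size=(200, 5)), rng.normal(size=(200, 5)))
```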
 
The research community needs to convince wary stakeholders that this technology respects privacy by design. 

 

Putting synthetic data into practice

 

Synthetic data’s research benefits are most pronounced when real-world data is hardest to come by. Let’s take two examples. 

 

Treating rare conditions

 

Some diseases are so rare that it’s hard to find enough data to design treatments for them. Progeria, for example, has only a few hundred sufferers worldwide, making conventional clinical trials almost impossible to design. 
 
Instead, generative AI algorithms can take data from a small sample of patients and ‘amplify’ it into a credible representation of a larger population. Researchers can then test out different variables ‘in silico’, using computational models to see what works. 
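 
As a simplified illustration of this ‘amplification’ step, the sketch below fits a kernel density estimate to an invented 40-patient cohort and samples a 5,000-patient synthetic population from it; in practice, a GAN or variational autoencoder would play this role:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)

# Toy stand-in for a rare-disease cohort: 40 patients, 3 measurements each.
small_cohort = rng.normal(loc=[60.0, 1.2, 140.0],
                          scale=[8.0, 0.3, 15.0],
                          size=(40, 3))

# Fit a kernel density estimate to the joint distribution and 'amplify'
# the cohort by sampling a much larger synthetic population from it.
kde = gaussian_kde(small_cohort.T)
synthetic_cohort = kde.resample(5000, seed=3).T   # shape (5000, 3)

print("real mean:     ", small_cohort.mean(axis=0).round(2))
print("synthetic mean:", synthetic_cohort.mean(axis=0).round(2))
```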
 
Later phases of these experiments will still require validation with real patient data. Nonetheless, the use of synthetic data could markedly accelerate these processes. This brings treatment closer for the medically underserved without risking real, vulnerable people’s health. 

 

Responding to emergencies

 

As we learned in 2020, what starts small can snowball very fast. Highly infectious diseases spread at exponential rates that far outpace our ability to track them. Confounding factors, like uneven access to testing, can dilute the quality of real-world data. 
 
Privacy concerns also come into play here. At febrile moments like the early stages of a pandemic, protecting the identities of the infected becomes even more important. 
 
Running predictive AI tools on synthetic datasets could help decision-makers act quickly and ethically during emergencies. By creating a ‘digital twin’ of a population, we can model critical variables like case numbers, deaths and hospital occupancy. 
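 
A production digital twin would be far richer and calibrated against synthetic health records; as a simplified illustration of the kind of variables involved, here is a minimal SEIR compartmental model with invented parameters, projecting hospital occupancy over an outbreak:

```python
import numpy as np

def seir_twin(pop, beta, sigma, gamma, hosp_rate, days, i0=10):
    """Minimal SEIR 'digital twin': project infections and hospital demand."""
    s, e, i, r = pop - i0, 0.0, float(i0), 0.0
    hospital = []
    for _ in range(days):
        new_exposed = beta * s * i / pop      # S -> E transitions
        new_infectious = sigma * e            # E -> I transitions
        new_recovered = gamma * i             # I -> R transitions
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
        hospital.append(hosp_rate * i)        # beds needed on this day
    return np.array(hospital)

# Illustrative parameters only: a population of 1M and R0 of roughly 2.4.
occupancy = seir_twin(pop=1_000_000, beta=0.6, sigma=1/5, gamma=1/4,
                      hosp_rate=0.03, days=120)
print(f"peak hospital occupancy: {occupancy.max():,.0f} beds "
      f"on day {occupancy.argmax()}")
```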
 
During the COVID-19 pandemic, UK drug discovery company BenevolentAI used synthetic data to correctly predict that a drug for rheumatoid arthritis could be repurposed to treat COVID-19. We expect many more examples of ersatz data delivering very real benefits in the years ahead. 

 

Taking synthetic data forward

 

While synthetic data is already bearing fruit, its path to wider use is still strewn with obstacles. 
 
First, synthetic data will only be as good as its inputs and the model used to process them. If the original data is biased, its synthetic twin will be too. This is a particular problem in healthcare, where data is often noisy and incomplete. The time and manpower needed to verify the original inputs can make synthetic data less efficient than it seems. 
 
Second, synthetic data is not a perfect replica of its source, just an approximation. This means it can lose the outliers that are critical for truly representing a living human population. 
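 
The sketch below illustrates this failure mode with invented data: a weak generator that matches only the mean and variance of a heavy-tailed measurement silently discards the outliers in its tail:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy real measurement: heavy-tailed, with rare but clinically vital outliers.
real = rng.lognormal(mean=3.0, sigma=0.6, size=2000)

# A hypothetical weak generator that only matches mean and variance:
# the heavy tail, and the outliers in it, disappear.
synthetic = rng.normal(real.mean(), real.std(), size=2000)

for name, data in [("real", real), ("synthetic", synthetic)]:
    p99, mx = np.percentile(data, 99), data.max()
    print(f"{name:9s} 99th percentile: {p99:7.1f}   max: {mx:7.1f}")
```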
 
More research should help overcome these challenges. At Endava, we are currently testing whether models trained on a combination of real and synthetic data can outperform those trained on real data alone. We’re using conditional GANs to generate synthetic mammography mass measurements and feeding them to our classification models. The results could inform the development of less intrusive breast cancer diagnostics. 
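 
Our actual pipeline uses conditional GANs, but the experimental design can be sketched with simpler stand-ins. In the illustrative example below (toy data throughout, with a class-conditional Gaussian sampler in place of the GAN), a classifier trained on real data alone is compared against one trained on real plus synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)

# Toy stand-in for mammography mass features (radius, texture, ...).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           random_state=0)
X_real, X_test, y_real, y_test = train_test_split(X, y, train_size=100,
                                                  random_state=0)

def synthesise(X, y, n_per_class):
    """Class-conditional Gaussian sampler: a crude stand-in for a cGAN."""
    Xs, ys = [], []
    for label in np.unique(y):
        grp = X[y == label]
        Xs.append(rng.multivariate_normal(grp.mean(axis=0),
                                          np.cov(grp.T), n_per_class))
        ys.append(np.full(n_per_class, label))
    return np.vstack(Xs), np.concatenate(ys)

X_syn, y_syn = synthesise(X_real, y_real, 400)

# Compare the two training regimes on the same held-out real test set.
for name, (Xtr, ytr) in {
    "real only":        (X_real, y_real),
    "real + synthetic": (np.vstack([X_real, X_syn]),
                         np.concatenate([y_real, y_syn])),
}.items():
    model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name:17s} AUC: {auc:.3f}")
```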
 
Stay tuned for more updates on this project. In the meantime, read Armin’s latest article for another perspective on how AI is reshaping medical research.

 
