Bridging the Data Divide – Harnessing Synthetic Data for Stronger Real-World Evidence


- May 7, 2025
In prior articles, we discussed the chronic, long-term impacts of obesity on the U.S. healthcare system. The impact of childhood obesity is even more serious. From January 2017 to March 2020, the prevalence of obesity among U.S. children and adolescents was 19.7%, affecting approximately 14.7 million U.S. youths aged 2–19 years. Synthetic Denver is a synthetic population used for testing by the Childhood Obesity Data Initiative (CODI) project. It comprises 6,357 simulated child patients residing in Colorado (approximately 1/100th), with records split to test identity-matching systems. Demographic data are modeled to be realistic and to reflect variation among Denver-area healthcare providers. Could we use synthetic data in combination with real-world data (RWD) to further model the impacts of childhood obesity?
The concepts of RWD and real-world evidence (RWE) are not new, but interest in them and their use increased significantly in 2016 with the passage of the 21st Century Cures Act. Leveraging RWD to generate RWE supports several goals of the Act, including accelerating medical research and the development of new medical products, and improving overall efficiency. In parallel, the value of RWD and the types and amount of information available have grown rapidly. This growth has been bolstered by large-scale adoption of the electronic health record (EHR) and of artificial intelligence (AI) and machine learning (ML) tools.
However, the generation of RWE is not without risks and challenges, including patient privacy, data scarcity, and bias introduced by the source of the data. For example, RWD, even when de-identified, are at risk of re-identification, especially when free text is accessed to uncover information not available from structured data. Synthetic data, used in combination with RWD, may offer an attractive way to address these risks and challenges.
The concept of synthetic data was first introduced by Donald Rubin in 1993 but has only recently gained traction within healthcare research. Our intention here is to provide a high-level overview of how the union of synthetic data with RWD may advance RWE generation and potentially expedite regulatory decision making.
Synthetic data refers to artificially generated data that replicates the statistical characteristics and patterns of RWD. It is widely used in data science, machine learning, and deep learning to conduct experiments, test algorithms, and develop models without exposing sensitive or confidential information. Synthetic data is created using algorithms and mathematical models to simulate the complexities found in real datasets. Synthetic data can be generated using various methods such as random sampling, bootstrapping, rule-based systems, statistical models, or generative adversarial networks (GANs). It plays a crucial role in various aspects of machine learning including data augmentation, anomaly detection, bias reduction, and simulating rare events. Synthetic data also offers solutions to data scarcity, privacy concerns, and biased datasets.
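To make two of the generation methods named above more concrete, here is a minimal sketch of bootstrapping and of a very simple statistical model. It uses only the Python standard library. The records, field names, and function names are invented for illustration and are not drawn from Synthetic Denver or any real dataset, and the model treats each field as an independent normal distribution, which is a deliberate simplification.

```python
import random
import statistics

def bootstrap_rows(records, n, seed=0):
    """Bootstrapping: resample whole records with replacement."""
    rng = random.Random(seed)
    return [dict(rng.choice(records)) for _ in range(n)]

def fit_gaussian(records, field):
    """Estimate the mean and sample standard deviation of one numeric field."""
    values = [r[field] for r in records]
    return statistics.mean(values), statistics.stdev(values)

def sample_gaussian(records, fields, n, seed=0):
    """Statistical model: draw n records, one independent normal per field.

    Assumption: fields are independent. This keeps each field's mean and
    spread but discards any correlation between fields (e.g. age and BMI).
    """
    rng = random.Random(seed)
    params = {f: fit_gaussian(records, f) for f in fields}
    return [
        {f: rng.gauss(mu, sd) for f, (mu, sd) in params.items()}
        for _ in range(n)
    ]

# Toy "real" records; all values are hypothetical.
real = [
    {"age": 6, "bmi": 15.2},
    {"age": 9, "bmi": 17.8},
    {"age": 12, "bmi": 21.5},
    {"age": 15, "bmi": 24.1},
    {"age": 17, "bmi": 27.3},
]

boot = bootstrap_rows(real, n=1000)
synth = sample_gaussian(real, ["age", "bmi"], n=1000)

print("real mean BMI: ", round(statistics.mean(r["bmi"] for r in real), 2))
print("synth mean BMI:", round(statistics.mean(r["bmi"] for r in synth), 2))
```

The two approaches trade off differently. Bootstrapped rows are copies of real records, so they stay realistic but offer no privacy protection on their own. The fitted model produces new values, but only reflects whatever structure the model was built to capture.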
The pros and cons of using synthetic data are well established. Synthetic data is cost-effective, offers faster turnaround times, allows greater control over quality and format, can improve the performance of machine learning algorithms, enables greater flexibility and collaboration, reduces bias, and improves data security. Limitations may include overfitting (models that mirror the training dataset too closely), the need for domain expertise, computational costs, and residual privacy risks.
RWD are patient health information collected during routine healthcare, outside of controlled clinical trials, typically from sources such as EHRs, medical claims databases, patient registries, and wearable devices. They provide insights into how treatments are used and experienced in everyday practice. Can synthetic data be combined with RWD to conduct RWE studies that address regulatory questions? Researchers spend a great deal of time collecting, organizing, and cleaning data prior to analysis. Although studies using RWD are mostly retrospective cohort studies, data aggregation and cleaning are resource intensive, demanding both time and money. With synthetic data, large datasets can be created in a cost-effective manner. Further, multiple datasets can be created, tweaked, and subsequently tested.
Additionally, regulatory decision makers using RWE are at the mercy of results derived from a single study, which most often seeks to replicate data generated by a randomized clinical trial. While RWE protocols can be well designed, exhaustive, and statistically sound, they can only address a limited set of questions. Frequently, related and relevant questions surface during the course of a study. Addressing these immediately would likely delay study execution and closure, increase expenses, and potentially muddle the initial findings. At some point, however, these additional questions need to be investigated. Synthetic data combined with RWD provide an alternative pathway to answer these critical questions through the selection of multiple datasets, i.e., populations of interest, while avoiding the need for a second study. Conversely, without synthetic data, analyses of secondary follow-up questions may be incomplete, underpowered, and not generalizable.
Returning to our example, let’s assume we are conducting a study examining the association between childhood obesity and postoperative outcomes, such as revision or replacement of a hip prosthesis, among a cohort of children following total hip arthroplasty, i.e., hip replacement. In this example, post-implant adverse events are relatively rare and therefore yield sample sizes too small for post-hoc or sensitivity analyses. Using synthetic data enables researchers to create an almost unlimited number of synthetic childhood cohorts with which to impute post-implant events. Further, synthetic data allow imputation for essentially any chosen combination of time, event, or subgroup. This is the potential advantage of RWD combined with synthetic data.
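The rare-event problem in this example can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not a validated method: the two revision risks are invented, the 19.7% prevalence is borrowed from the national figure cited earlier rather than from any surgical cohort, and the function name `simulate_cohort` is hypothetical. It shows how few events a single cohort of 500 children contains and how pooling many synthetic cohorts stabilizes the estimated risk ratio.

```python
import random

def simulate_cohort(n, p_obese, risk_obese, risk_other, rng):
    """Simulate one synthetic cohort of n children.

    Returns (events_obese, n_obese, events_other, n_other), where an
    "event" stands for a post-implant revision or replacement.
    """
    events_obese = n_obese = events_other = n_other = 0
    for _ in range(n):
        if rng.random() < p_obese:
            n_obese += 1
            events_obese += rng.random() < risk_obese  # True counts as 1
        else:
            n_other += 1
            events_other += rng.random() < risk_other
    return events_obese, n_obese, events_other, n_other

rng = random.Random(42)
P_OBESE = 0.197     # national prevalence from the article; assumed to apply here
RISK_OBESE = 0.04   # hypothetical revision risk, children with obesity
RISK_OTHER = 0.02   # hypothetical revision risk, other children

# 200 synthetic cohorts of 500 children each.
cohorts = [
    simulate_cohort(500, P_OBESE, RISK_OBESE, RISK_OTHER, rng)
    for _ in range(200)
]

# A single cohort contains only a handful of events in the obesity group.
print("events among children with obesity, first cohort:", cohorts[0][0])

# Pooling all cohorts gives a far more stable risk-ratio estimate.
ev_ob = sum(c[0] for c in cohorts)
n_ob = sum(c[1] for c in cohorts)
ev_ot = sum(c[2] for c in cohorts)
n_ot = sum(c[3] for c in cohorts)
pooled_rr = (ev_ob / n_ob) / (ev_ot / n_ot)
print("pooled risk ratio:", round(pooled_rr, 2))
```

Because the event risks are inputs to the simulation, the pooled risk ratio simply recovers the assumed ratio of 2. In practice, those inputs would need to be informed by the RWD, and a sketch like this is better suited to sizing and stress-testing planned analyses than to estimating the association itself.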
At ICA, we are committed to providing clients with the resources, tools, and critical ingenuity necessary to accelerate evidence generation, realize efficiencies, and positively impact public health. A hallmark of properly conducted research is that it raises more questions, which can be addressed through sensitivity analyses or in subsequent studies. We help clients by creating an RWE model that proactively anticipates these follow-on study questions or allows for the rapid creation of synthetic RWD to build suitable datasets to address them, empowering researchers to investigate virtually all study-generated questions quickly and decisively.