Resources

Synthetic data is poised to exert a profound impact on the data science, AI, and machine learning sectors in the coming years, revolutionizing the way we generate, share, and analyze data. Research efforts are underway worldwide, driven by the recognition that synthetic data can shape the future of data-driven innovation and safeguard data privacy in an increasingly digital world. We have compiled a representative list of key resources highlighting the global state of synthetic data and its impact and applications.

Research Reports, White Papers, and News Articles

AI helps Argentine small farmers and winemakers compete while luring a new generation

About a quarter of the earth’s arable land and forests are in Latin America, according to The Nature Conservancy. And Argentina in particular is an agricultural powerhouse, ranking among the top providers of food worldwide, according to the World Bank, and specializing in wine, beef and soybeans, among other products. So, the country’s vintners and farmers have vast influence over global food production… The generational shift is one of the biggest challenges farmers face in Latin America — and in agriculture worldwide, Balussi says. “Producers are aging, and we don’t have young people coming into the industry to replace them,” he says. “But young people like technology, and it’s enticing and empowering them to get into agriculture and be successful.”

– Microsoft News

Synthetic data as an enabler for machine learning applications in medicine

Synthetic data generation (SDG) is the process of using machine learning methods to train a model that captures the patterns in a real dataset. New, synthetic data can then be generated from that trained model. The synthetic data has no one-to-one mapping to the original data or to real patients, and therefore has potentially privacy-preserving properties. There is growing interest in the application of synthetic data across health and life sciences, but to fully realize the benefits, further education, research, and policy innovation are required. This article summarizes the opportunities and challenges of SDG for health data, and provides directions for how this technology can be leveraged to accelerate data access for secondary purposes.

– iScience
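
The two-step loop the abstract describes, fit a generative model to real records, then sample fresh rows from it, can be sketched in a few lines. This is a minimal illustration only: the "real" table below is simulated, and the choice of a Gaussian mixture as the generative model is an assumption made for brevity, not a method the article's authors evaluate.

```python
# Minimal sketch of synthetic data generation (SDG): fit a generative
# model to a real table, then sample new rows from the fitted model.
# The "real" data here is simulated for illustration, and the Gaussian
# mixture is an assumed stand-in for whatever generative model is used.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for a sensitive dataset: two numeric columns per "patient".
real = np.column_stack([
    rng.normal(55, 12, 1000),   # an age-like column
    rng.normal(120, 15, 1000),  # a measurement-like column
])

# Step 1: train a model that captures the patterns in the real data.
model = GaussianMixture(n_components=5, random_state=0).fit(real)

# Step 2: generate synthetic rows from the trained model. These rows
# have no one-to-one mapping to the original records.
synthetic, _ = model.sample(1000)

print("real means:     ", real.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))  # should be close
```

Production SDG pipelines typically use richer models (copulas, GANs, or diffusion models) and add explicit privacy evaluation, but the fit-then-sample structure is the same.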

Synthetic data in health care: A narrative review

Data are central to research, public health, and the development of health information technology (IT) systems. Nevertheless, access to most data in health care is tightly controlled, which may limit innovation, development, and efficient implementation of new research, products, services, or systems. Using synthetic data is one of many innovative ways that can allow organizations to share datasets with broader users. However, only a limited body of literature explores its potential and applications in health care. In this review paper, existing literature was examined to bridge the gap and highlight the utility of synthetic data in health care. PubMed, Scopus, and Google Scholar were searched to identify peer-reviewed articles, conference papers, reports, and theses/dissertations related to the generation and use of synthetic datasets in health care. The review identified seven use cases of synthetic data in health care: a) simulation and prediction research, b) hypothesis, methods, and algorithm testing, c) epidemiology/public health research, d) health IT development, e) education and training, f) public release of datasets, and g) linking data. The review also identified readily and publicly accessible health care datasets, databases, and sandboxes containing synthetic data with varying degrees of utility for research, education, and software development. The review provided evidence that synthetic data are helpful in different aspects of health care and research. While the original real data remain the preferred choice, synthetic data hold promise for bridging data access gaps in research and evidence-based policymaking.

– PLOS Digital Health

Synthetic data could be better than real data

When more than 155,000 students from all over the world signed up to take free online classes in electronics in 2012, offered through the fledgling US provider edX, they set in motion an explosion in the popularity of online courses. The edX platform, created by the Massachusetts Institute of Technology (MIT) and Harvard University, both in Cambridge, Massachusetts, was not the first attempt at teaching classes online — but the number of participants it attracted was unusual. The activity created a massive amount of information on how people interact with online education, and presented researchers with an opportunity to answer questions such as ‘What might encourage people to complete courses?’ and ‘What might give them a reason to drop out?’. “We had a tonne of data,” says Kalyan Veeramachaneni, a data scientist at MIT’s Laboratory for Information and Decision Systems. Although the university had long dealt with large data sets generated by others, “that was the first time that MIT had big data in its own backyard”, says Veeramachaneni. Hoping to take advantage, Veeramachaneni assigned 20 MIT students to run analyses of the information. But he soon ran into a roadblock: legally, the data had to be kept private. This wealth of information was held on a single computer in his laboratory, with no connection to the Internet to prevent hacking. The researchers had to schedule a time to use it. “It was a nightmare,” Veeramachaneni says. “I just couldn’t get the work done because the barrier to the data was very high.” His solution, eventually, was to create synthetic students — computer-generated versions of edX participants that shared characteristics with real students using the platform, but that did not give away private details. The team then applied machine-learning algorithms to the synthetic students’ activity, and in doing so discovered several factors associated with a person failing to complete a course. For instance, students who tended to submit assignments right on a deadline were more likely to drop out. Other groups took the findings of this analysis and used them to design interventions to help real people complete future courses.

– Nature
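
The workflow the article describes, running standard analyses over synthetic students instead of the locked-down real records, can be sketched as below. Every detail here is hypothetical: the feature names, the dropout label, and the synthetic table itself are invented for illustration. The point is only that ordinary ML tooling runs unchanged once a synthetic stand-in for the private data exists.

```python
# Hypothetical sketch: fitting a dropout model on a *synthetic* student
# table, in the spirit of the edX analysis described above. All feature
# names and data are invented for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5000

# Invented synthetic student-activity table (stand-in for generated edX data).
students = pd.DataFrame({
    "hours_on_platform": rng.gamma(2.0, 3.0, n),
    "forum_posts":       rng.poisson(1.5, n),
    "frac_last_minute":  rng.beta(2, 5, n),  # share of assignments submitted near deadline
})

# Invented ground truth: last-minute submitters drop out more often.
logit = -1.0 + 3.0 * students["frac_last_minute"] - 0.1 * students["hours_on_platform"]
dropped_out = rng.random(n) < 1 / (1 + np.exp(-logit))

clf = LogisticRegression().fit(students, dropped_out)

# A positive coefficient on frac_last_minute mirrors the article's
# finding: deadline-edge submission is associated with dropout.
for name, coef in zip(students.columns, clf.coef_[0]):
    print(f"{name:>18}: {coef:+.2f}")
```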

Synthetic Health Data Hackathon

Synthetic data is emerging as a promising technique that offers a way to accelerate research and innovation by enabling faster access to fictional yet useful data sets. Synthetically generated data does not entail the usual privacy risks. Instead of sharing sensitive data, a synthetic population is generated, which maintains most of the deep statistical properties of the real population. Synthetic data is thus one of several techniques that aim to bridge the gap between privacy and utility. These techniques are maturing rapidly due to a combination of the general evolution of advanced analytics and the specific interest in finding secure ways to work with sensitive data, not least health data. However, these techniques share a common fate of being difficult to understand for people outside the field and hence of being difficult for authorities to regulate and for organisations to adopt. We need awareness and a better understanding of relevance, potential use cases and limitations. Therefore, this first synthetic health data hackathon in Denmark was planned with the intention of exploring value, use cases and limitations, and of helping raise awareness of the method and its possible role in the advancement of research and innovation in healthcare. Twenty-two teams comprising 79 researchers and students from across the globe spent a weekend in November 2020 working on challenges related to diabetes and Alzheimer’s based on synthetic data sets. The teams were mentored by a set of international experts within the fields of data science and bioinformatics. The overarching finding was that synthetically generated data sets were considered a valuable new addition to the toolbox with multiple use cases. The evaluation resulted in a number of learnings and pointed towards next steps. We look forward to seeing the method evolve and demonstrate results in healthcare.

– Deloitte, Digital Hub Denmark, Rigshospitalet
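
The claim that a synthetic population "maintains most of the deep statistical properties of the real population" is checkable. Below is a minimal sketch of one common first-pass fidelity check: comparing each column's marginal distribution in the real and synthetic tables with a two-sample Kolmogorov-Smirnov test. Both tables and the column names are simulated assumptions, not data from the hackathon.

```python
# Minimal sketch: checking whether a synthetic table preserves the
# marginal distributions of the real one, column by column, using a
# two-sample Kolmogorov-Smirnov test. All data here is simulated.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
cols = ["glucose", "bmi", "age"]

real = pd.DataFrame(
    rng.normal([100, 27, 60], [15, 4, 10], size=(2000, 3)), columns=cols
)
# A faithful synthetic table is drawn from (nearly) the same distribution.
synthetic = pd.DataFrame(
    rng.normal([101, 27, 59], [15, 4, 11], size=(2000, 3)), columns=cols
)

for col in cols:
    stat, p = ks_2samp(real[col], synthetic[col])
    print(f"{col:>8}: KS statistic={stat:.3f}, p={p:.3f}")
# Large KS statistics (small p-values) flag columns where the synthetic
# data has drifted from the real marginal distribution.
```

Marginal tests are only a first pass: fidelity suites typically also compare pairwise correlations and downstream model performance, since matching marginals alone does not guarantee the joint structure is preserved.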

Generative AI Surveys

Over the course of a few weeks, Gartner completed seven business-function-specific surveys, with a total of 833 responding leaders across three continents and 21 industries, about their impressions of generative AI programs and the associated opportunities, risks, and use cases. While it remains early days for many respondents, their feedback gives significant insight into potential future attitudes toward, and applications of, these tools.

– Gartner

Is Synthetic Data the Future of AI?

Synthetic data is often treated as a lower-quality substitute, used when real data is inconvenient to obtain, expensive, or constrained by regulation. However, this view misses the true potential of synthetic data. Gartner estimates that by 2030, synthetic data will completely overshadow real data in AI models.

– Gartner

Five insights about harnessing data and AI from leaders at the frontier

What was once unknowable can now be quickly discovered with a few queries. Decision makers no longer have to rely on gut instinct; today they have more extensive and precise evidence at their fingertips. New sources of data, fed into systems powered by machine learning and AI, are at the heart of this transformation. The information flowing through the physical world and the global economy is staggering in scope. It comes from thousands of sources: sensors, satellite imagery, web traffic, digital apps, videos, and credit card transactions, just to name a few. These types of data can transform decision making. In the past, a packaged food company, for example, might have relied on surveys and focus groups to develop new products. Now it can turn to sources like social media, transaction data, search data, and foot traffic—all of which might reveal that Americans have developed a taste for Korean barbecue, and that’s where the company should concentrate.

– McKinsey

Generative AI is just a phase. What’s next is interactive AI.

DeepMind cofounder Mustafa Suleyman wants to build a chatbot that does a whole lot more than chat. In a recent conversation with MIT Technology Review’s Will Douglas Heaven, he said that generative AI is just a phase. What’s next is interactive AI: bots that can carry out tasks you set for them by calling on other software and other people to get stuff done. He also calls for robust regulation, and doesn’t think that will be hard to achieve. Suleyman is not the only one talking up a future filled with ever more autonomous software…

– MIT Technology Review

PwC 2022 AI Business Survey

AI success is becoming the rule, not the exception. In PwC’s fourth annual AI business survey, most companies working with AI report results: promising proofs of concept (PoCs) that are ready to scale, active use cases, and even widespread adoption of AI-enabled processes. But some companies stand out. They’re far more likely to be advanced in their AI usage and to be achieving valuable business outcomes — ones that produce not only a functioning AI model, but also significant ROI. What sets these companies apart, the data indicates, is that instead of focusing first on one goal, then moving to the next, they’re advancing with AI in three areas at once: business transformation, enhanced decision-making, and modernized systems and processes. Of the 1,000 respondents in our survey, 364 “AI leaders” are taking this holistic approach and reaping the rewards.

– PwC

Can synthetic data enable data sharing in financial services?

Good data underpins good decision-making. This is as true for humans as it is for computers. But what if the available data are so sensitive that processing them would be unlawful, difficult, unethical, or all of the above? And what if analysing those data could provide unique insights into critical areas of regulation, such as anti-money laundering? Should the ability to reduce criminal activity be traded off against privacy, and who should decide which is more important? To put it differently, is it possible to gain valuable data insights in critical areas of regulatory activity while respecting sensitivity and the legitimate need for protection? This question underpins the emerging debate on synthetic data.

– OECD.AI Policy Observatory

Synthetic Data For Real Insights

For AI models to be effective at modeling human behavior in business scenarios, they need to be trained on large quantities of data that are representative of reality. The financial services industry generates large amounts of data that could be very beneficial, but such data is often not available for use. This poses a fundamental challenge for researchers and developers. Real data may be challenging to access along many dimensions, including privacy, legal permissions, and technical aspects related to volume, representation, and meaning. The question, then, is how to enable innovation and the building of new products and services that depend on data. One answer is the use of synthetic data, which can share format, distributions, and standardized content with the real data while not incurring the risks of using real data. Synthetic data potentially has the added benefit of representing exploratory scenarios beyond historical data, to prepare AI algorithms and support decision making in novel situations. As such, synthetic data enables us to be more robust in our response to challenging situations. Further, synthetic data can multiply examples that may be rare in the real data, in order to train machine learning algorithms more effectively.

– J.P. Morgan AI Research
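
The closing point, that synthetic data "can multiply examples that may be rare in the real data", has a classical concrete instance in SMOTE-style oversampling, which creates new minority-class rows by interpolating between real ones. The sketch below uses the imbalanced-learn package; this is one illustrative technique among many, and the J.P. Morgan piece itself names no specific tool.

```python
# Minimal sketch: multiplying rare examples with SMOTE, which creates
# new minority-class points by interpolating between real neighbours.
# imbalanced-learn is an assumed choice; the source names no tool.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced toy dataset: ~2% "rare" class (e.g. fraudulent transactions).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)
print("before:", Counter(y))  # minority class is heavily outnumbered

# Generate synthetic minority-class rows until the classes are balanced.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # balanced via synthetic examples
```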