The Dawn of Synthetic Data
By Philipp M Diesinger

Dr Philipp M Diesinger has occupied key roles in data science since 2009. His career includes a post-doctoral position at the Massachusetts Institute of Technology, Data Science Consultant at SAP and Head of Global Data Science at Boehringer Ingelheim. Philipp’s specialisms include predictive analytics and machine learning.
In this post, Philipp examines a groundbreaking trend that’s emerged from the evolving data landscape: synthetic data. A novel approach in which artificial datasets mimic the properties of real-world data, synthetic data offers businesses an innovative way to address privacy concerns and data scarcity:

In the rapidly evolving landscape of technology and data, a groundbreaking trend is emerging: the rise of synthetic data. As data becomes the lifeblood of modern businesses, researchers and developers are looking for innovative solutions to harness its power while addressing privacy concerns and data scarcity. Synthetic data, a novel concept that generates artificial datasets with properties mimicking real-world data, is gaining momentum.

Synthetic data is information that has been created artificially by computer algorithms, as opposed to traditional real data gathered by observing real-world events. It is generated by algorithms and models that replicate the statistical properties, structures, and relationships found in real data, and it is often used as a substitute for actual data, especially where privacy, security, or limited dataset size pose challenges.

Synthetic data is not a new concept. Academic disciplines, including computational physics and engineering, have long employed synthetic data. These fields have successfully modelled and simulated complex systems, spanning from molecular structures and intercity traffic to optical devices and entire galaxies. These simulations are grounded in first principles, generating data that portrays the behaviour of these systems. Subsequently, this synthetic data is subjected to statistical analysis to derive insights and predict system properties.

Additionally, synthetic data is often generated using known statistical probability distributions of system components. This method also allows for the creation of synthetic data, even from limited datasets, by empirically measuring distributions and then sampling them to expand and augment the dataset.
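
As a minimal illustration of this distribution-based approach, the sketch below (plain NumPy; the column names, distributions, and sample sizes are invented for the example) empirically measures simple distributions from a small dataset and samples from them to produce an arbitrarily large synthetic one:

```python
# Minimal sketch: augment a small dataset by sampling from empirically
# measured distributions. All column names and values are hypothetical.
import numpy as np

rng = np.random.default_rng(42)

# A small "real" dataset: 50 observations of age and a categorical segment.
real_age = rng.normal(40, 10, size=50)
real_segment = rng.choice(["A", "B", "C"], size=50, p=[0.5, 0.3, 0.2])

# Empirically measure the distribution of each component...
age_mean, age_std = real_age.mean(), real_age.std()
segments, counts = np.unique(real_segment, return_counts=True)
segment_probs = counts / counts.sum()

# ...then sample from the measured distributions to create a much
# larger synthetic dataset.
n_synthetic = 10_000
synthetic_age = rng.normal(age_mean, age_std, size=n_synthetic)
synthetic_segment = rng.choice(segments, size=n_synthetic, p=segment_probs)

print(synthetic_age[:5], synthetic_segment[:5])
```

Note that this toy version samples each column independently; a serious generator must also capture the relationships between components, which is exactly what the model-based methods discussed below aim to do.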

Well before the advent of modern computing power, mathematicians employed analytical techniques: they derived probability distributions from first principles and propagated them to the system level, often utilising results like the central limit theorem.

While the notion of synthetic data is not a recent development, its relevance has witnessed a significant upswing in recent years, and the number of industry applications has increased dramatically. Notably, autonomous vehicles, aircraft, and drones are trained on hyper-realistic 3D-rendered data. Industry giants like Amazon have made a name for themselves by employing synthetic data to teach their warehouse robots to recognise and manage packages of diverse shapes and sizes. The healthcare sector is increasingly harnessing synthetic data to train AI systems while ensuring that patient privacy remains uncompromised.

The surge in relevance of synthetic data is aided by innovative data generation techniques, fuelled by the accessibility of cost-effective computational resources and abundant data storage. This synergy has led to the emergence of a multitude of cutting-edge approaches to synthetic data generation.

New methods for the synthesis of data include Generative Adversarial Networks (GANs): deep learning models that consist of competing generator and discriminator neural networks. The generator learns to produce data that is indistinguishable from real data, while the discriminator learns to differentiate between real and synthetic data. GANs are widely used to generate realistic data, especially in the domains of image, audio, and text data. Variational Autoencoders (VAEs) are another type of generative model that learns to encode and decode data; VAEs can be used to generate new data points that are similar to the given training data.
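
To make the adversarial setup concrete, here is a deliberately tiny sketch (assuming PyTorch is available; the architecture, hyperparameters, and target distribution are invented for illustration) in which a generator learns to mimic a one-dimensional Gaussian while a discriminator tries to tell real from synthetic samples:

```python
# Minimal GAN sketch: generator vs. discriminator on a 1-D Gaussian.
import torch
import torch.nn as nn

torch.manual_seed(0)

# "Real" data: samples from N(4.0, 1.25) that the generator should imitate.
def real_batch(n=128):
    return torch.randn(n, 1) * 1.25 + 4.0

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(),
                              nn.Linear(16, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2000):
    # Train the discriminator: real samples -> 1, generated samples -> 0.
    real = real_batch()
    fake = generator(torch.randn(128, 8)).detach()
    d_loss = (loss_fn(discriminator(real), torch.ones(128, 1)) +
              loss_fn(discriminator(fake), torch.zeros(128, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Train the generator: try to make the discriminator output 1 for fakes.
    fake = generator(torch.randn(128, 8))
    g_loss = loss_fn(discriminator(fake), torch.ones(128, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

with torch.no_grad():
    synthetic = generator(torch.randn(1000, 8))
print(f"synthetic mean={synthetic.mean().item():.2f}, "
      f"std={synthetic.std().item():.2f}")
```

After enough steps, the printed mean and standard deviation should approach the real distribution’s 4.0 and 1.25; the same adversarial principle, scaled up, is what produces realistic synthetic images, audio, and text.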

Methods for creating synthetic data still depend strongly on the type of data being generated, as well as on the industry vertical in question.

Synthetic data can be as good as, or sometimes even better than, real data for training AI systems. One of the most significant advantages of synthetic data is its potential to safeguard privacy. With the increasing awareness of data protection laws like GDPR and HIPAA, organisations must ensure that sensitive information is not exposed. Synthetic data allows for the creation of datasets that maintain statistical accuracy while eliminating any personal or sensitive information, and it can often resolve the data bottleneck created by privacy protection.

Synthetic data can be used to augment real datasets, expanding the size and diversity of the data available for machine learning and AI models. This enables more robust model training and enhances model generalisation, ultimately leading to better performance. Developers can create synthetic datasets with varied scenarios and edge cases that might be challenging to collect in the real world. This diversity, or “data variability”, is crucial for testing the resilience and adaptability of AI systems in different conditions.
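
As one possible illustration of such augmentation (a sketch with invented sensor readings; the class imbalance and perturbation scale are assumptions), a rare edge-case class can be expanded by resampling existing rows with small controlled perturbations:

```python
# Sketch: augment a rare class by jittering existing samples.
# The dataset, feature scales, and noise levels are all hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced "real" data: 1000 normal readings, only 20 rare edge cases.
normal = rng.normal(0.0, 1.0, size=(1000, 3))
rare = rng.normal(5.0, 0.5, size=(20, 3))

def jitter_augment(samples, n_target, noise_scale=0.1):
    """Resample existing rows and add small Gaussian perturbations."""
    idx = rng.integers(0, len(samples), size=n_target)
    noise = rng.normal(0.0, noise_scale, size=(n_target, samples.shape[1]))
    return samples[idx] + noise

# Expand the rare class to 1000 synthetic edge cases for balanced training.
synthetic_rare = jitter_augment(rare, n_target=1000)
print(synthetic_rare.shape)  # (1000, 3)
```

Perturbation-based augmentation is the simplest way to add data variability; model-based generators such as the GAN sketched above can go further and produce genuinely novel scenarios.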

Synthetic data is a cost-effective alternative to collecting, storing, and managing large volumes of real data. It saves time, money, and resources, making it an attractive option for startups and organisations with limited budgets.

Synthetic data can also help overcome natural limitations of real-world data collection, for example where physical constraints make certain events rare or impossible to observe.

Synthetic data can be used to overcome cold-start problems: small organisations might not have sufficient data to develop their AI models and may therefore choose to augment their existing data with algorithmically generated data.

However, synthetic data must be created with care. Artificial data must exhibit the same properties of the underlying system as real-world data would, and generating high-quality synthetic data requires sophisticated algorithms and substantial computational resources.

While synthetic data mimics real data, it sometimes cannot capture all the nuances, anomalies, or subtleties present in genuine datasets. This limitation may hurt the performance of AI models in real-world scenarios, where unanticipated events can challenge models trained purely on synthetic data.

Ensuring that synthetic data accurately represents the distribution of real data is still a significant challenge. Biases and inaccuracies can lead to models that do not perform well on real-world data.

In the data-hungry field of GenAI and machine learning, the generation of high-quality synthetic data plays an increasingly important role and can yield significant competitive advantages.

Training neural networks requires large amounts of data, and data quality and quantity are significant drivers of AI model performance. Real-world datasets often suffer from limitations and biases, which can result in biased machine-learning models. Synthetic data can often offer greater data variability and thus better training data for AI models, enabling them to learn and predict system behaviour in unusual situations. Data variability is a key driver of model performance under real-world conditions.

Training neural networks often requires large amounts of well-annotated data, and tagging or annotating data can be a significant cost driver in the development of performant neural networks. Besides reducing cost, labelling synthetic data can be even more accurate than annotating real-world data, because the generating process knows the ground truth by construction, avoiding false labels and reducing noise in the training data.
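
A minimal sketch of this “labels for free” property (an invented two-class example in plain NumPy): because we decide which distribution generates each sample, the labels are exact by construction rather than inferred by a human annotator:

```python
# Sketch: synthetic data comes with perfect labels by construction.
# The two class distributions below are invented for illustration.
import numpy as np

rng = np.random.default_rng(7)

n_per_class = 5000
# Class 0 is generated from one known distribution...
class0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_class, 2))
# ...and class 1 from another.
class1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(n_per_class, 2))

X = np.vstack([class0, class1])
# The label is simply which generator produced each row: no manual
# annotation, hence no mislabelled examples and no annotation noise.
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

print(X.shape, y.shape)  # (10000, 2) (10000,)
```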

Gartner (1) predicted that by 2030, most of the data utilised in the field of AI will be synthetic. The ability to create high-quality synthetic data may become a necessity for the development of high-performance AI systems.

The advent of synthetic data marks a promising step forward in the realm of data generation and utilisation. It offers tangible advantages, particularly in safeguarding privacy, enhancing data diversity, and optimising resources. However, the technology is not without its challenges, including authenticity, ethical considerations, and the need for sophisticated algorithms.

A fun way to explore the generation of synthetic data is SDV – the Synthetic Data Vault – a system of open-source libraries for generating synthetic data. SDV (2) offers a variety of synthesisers that can be used to create and benchmark synthetic data.
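
As a pointer to what this looks like in practice, here is a minimal sketch using SDV’s single-table API (as of SDV 1.x; the exact calls should be checked against the current documentation, and the toy DataFrame is invented):

```python
# Minimal SDV sketch: fit a synthesiser on a small table, then sample
# new artificial rows. Assumes SDV 1.x is installed (pip install sdv).
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# A small "real" table to learn from (values are hypothetical).
real_data = pd.DataFrame({
    "age": [23, 35, 46, 52, 29, 41, 38, 60],
    "income": [32_000, 58_000, 71_000, 80_000,
               45_000, 62_000, 55_000, 90_000],
    "segment": ["A", "B", "B", "C", "A", "B", "A", "C"],
})

# Describe the table so SDV knows each column's type.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a synthesiser and sample new, artificial rows.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=1000)
print(synthetic_data.head())
```

The fitted synthesiser learns the column distributions and their correlations from the real table, so the sampled rows are statistically similar to the original data without containing any actual records.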
