Synthetic Data Generation

Mithun Ghosh
6 min read · Feb 22, 2021


Introduction

As the name suggests, synthetic data is artificially generated rather than observed. It is not real data collected from the world; instead, it mimics some real data-generating process, and it has to possess statistical patterns similar to the real data. For example, the mean and mode(s) of the real and synthetic data should fall in comparable ranges. But wait, why should we even think of creating new data? Aren't we in the age of data deluge, where we create and store enormous amounts of data every day?

Yes, we are creating data, but curating usable data remains a major hurdle when starting to prototype any ML-based intelligent system. Let me explain some real use cases that motivate the need for synthetic data generation.

  • Data privacy issues- In most organizations, the majority of employees have no access to real data for security and regulatory reasons, and exchanging data between geographically separated employees raises further issues. A synthetic data source allows teams to access data and develop a system that can later be tested on real data by the few who do have access; that way we maintain the security and regulatory requirements.

Data anonymization solves the data privacy problem to some extent; however, it cannot totally eradicate the risk of reverse engineering the process and identifying personally identifiable information. Synthetic data can be of great help here: if two parties have to exchange data, they can do so using synthetic data rather than real data or an anonymised version of it.

  • Expensive Labelling Process- Sometimes data labelling can be very expensive. Imagine a scenario where only an experienced neurosurgeon can label the images; we can do the math to see how expensive it would be to collect even tens of examples.
  • Fast Development Cycles and Experimentation- Oftentimes ideas can't be prototyped in organizations because of the complexity of the real-data access process. One can build an intelligent system prototype, test it with synthetic data, and show the efficacy of the system. This helps in getting management buy-in before deploying the system.
  • Simulate Edge Cases- The data-generating process usually doesn't produce examples that can be used to test edge cases during system development. Synthetically generated edge cases can fill this gap.
  • Imbalanced datasets- Data scientists face the challenge of imbalanced datasets, where the proportion of examples across classes is grossly biased towards one or more particular class(es). One can apply a synthetic data generation process to generate samples of the rare class(es), which helps solve the imbalanced-data challenge in ML model building.
(Figure: importance of synthetic data)
(Figure: a synthetic data generation service in production)

How to generate synthetic data?

By now you must be convinced of the benefits of synthetic data generation. Below I explain how we can generate synthetic data that is useful for our purposes.

1. Imbalanced Data: Undersampling and Oversampling (SMOTE & ADASYN)


The Synthetic Minority Oversampling Technique, or SMOTE for short, was described by Nitesh Chawla et al. in their 2002 paper named for the technique, “SMOTE: Synthetic Minority Over-sampling Technique.”

SMOTE selects minority-class examples that are close in the feature space, draws a line between two such examples, and creates a new sample at a point along that line.

ADASYN (Adaptive Synthetic Sampling Method for Imbalanced Data) extends this idea. It uses k-NN to sample the minority class from each minority neighbourhood, and it draws more samples from harder-to-learn neighbourhoods.
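
Here is a minimal sketch of both techniques using the imbalanced-learn library; the toy dataset created with scikit-learn's make_classification is purely illustrative.

```python
# A minimal sketch of oversampling with SMOTE and ADASYN using the
# imbalanced-learn library; the toy dataset is just for illustration.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

# Toy imbalanced dataset: roughly 95% majority class, 5% minority class.
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    weights=[0.95, 0.05],
    random_state=42,
)
print("original:", Counter(y))

# SMOTE: interpolate new minority samples along lines between a
# minority example and one of its k nearest minority neighbours.
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)
print("after SMOTE:", Counter(y_smote))

# ADASYN: like SMOTE, but draws more synthetic samples in
# neighbourhoods that are harder to learn.
X_ada, y_ada = ADASYN(random_state=42).fit_resample(X, y)
print("after ADASYN:", Counter(y_ada))
```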

These two methods generate data for imbalanced classes. The biggest problem is that we still need real data to generate synthetic data on demand. Another drawback is that they address between-class imbalance but cannot handle within-class imbalance.

2. Bayesian Network Approach:

A Bayesian network provides a graphical structure for analysing the joint distribution of a set of variables. The graph structure helps us factor the joint distribution into simpler conditional probability distributions, making the problem tractable. The resulting probabilistic graph can then be used to generate new data.

(Figure: Bayesian network example)

Let's consider three variables, A, B, and C, with relationships as described by the directed graph.

We can compute the joint probability as P(A,B,C) = P(A) * P(B|A) * P(C|A,B).

Applying this factorization, we can draw samples from the joint distribution by ancestral sampling: sample A first, then B given A, then C given A and B. This makes the network a generative model.
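
Here is a minimal sketch of this ancestral sampling for the three-variable network, assuming binary variables and hand-specified (hypothetical) conditional probability tables; in practice these tables would be estimated from real data.

```python
# Ancestral sampling from the 3-variable Bayesian network above.
# The conditional probability tables are hypothetical, hand-specified
# values; in practice they would be learned from data.
import numpy as np

rng = np.random.default_rng(0)

p_a = 0.3                                  # P(A=1)
p_b_given_a = {0: 0.2, 1: 0.7}             # P(B=1 | A)
p_c_given_ab = {                           # P(C=1 | A, B)
    (0, 0): 0.1, (0, 1): 0.4,
    (1, 0): 0.5, (1, 1): 0.9,
}

def sample_one():
    # Sample parents before children: A, then B | A, then C | A, B.
    a = int(rng.random() < p_a)
    b = int(rng.random() < p_b_given_a[a])
    c = int(rng.random() < p_c_given_ab[(a, b)])
    return a, b, c

samples = [sample_one() for _ in range(1000)]
print(samples[:5])
```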

This method can be very expensive and challenging for sparse data. As the number of variables increases, the network becomes really complex and requires multiple assumptions, so it is very difficult to implement in practice.

3. Variational Autoencoder (VAE)

(Figure: variational autoencoder for synthetic data)

A VAE is an unsupervised method in which the encoder compresses the original dataset into a more compact structure (a latent space representation) and passes it to the decoder. The decoder then generates an output that is a representation of the original dataset. In production, only the decoder is needed: it takes random noise as input and generates synthetic samples, and the random noise ensures a different sample each time. The system is trained by minimizing a reconstruction loss between input and output together with a KL-divergence term that regularizes the latent space. At prediction time we need neither the encoder nor the original data.
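
To make this concrete, here is a minimal PyTorch sketch of a VAE, assuming a 20-dimensional tabular input; the layer sizes are illustrative assumptions, not a tuned architecture.

```python
# A minimal PyTorch sketch of a VAE; the layer sizes and input
# dimension are illustrative assumptions, not a tuned architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=20, h_dim=64, z_dim=8):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)      # encoder body
        self.mu = nn.Linear(h_dim, z_dim)       # latent mean
        self.logvar = nn.Linear(h_dim, z_dim)   # latent log-variance
        self.dec = nn.Sequential(               # decoder
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim)
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction loss plus KL divergence to a standard normal prior.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

model = VAE()
# ... train on real data ...

# At generation time only the decoder is needed:
with torch.no_grad():
    z = torch.randn(100, 8)       # random noise in latent space
    synthetic = model.dec(z)      # 100 synthetic rows
```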

4. Generative Adversarial Network (GAN)

(Figure: GAN for synthetic data generation)

In a GAN, two networks, a generator and a discriminator, are trained iteratively. The generator network takes random noise and generates synthetic data; the discriminator network tries to distinguish the synthetically generated data from real data. These two networks are adversaries that compete against each other throughout the training process: the generator tries to produce fake (synthetic) data that fools the discriminator, while the discriminator keeps setting the bar higher. This is an adversarial (two-player) game, and no statistical distribution is assumed, which makes the process very robust. After training is done we no longer need the discriminator; we use only the generator, which creates near-realistic synthetic data.
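
Here is a minimal PyTorch sketch of one adversarial training step; the network sizes are illustrative assumptions, and real_batch stands in for a batch drawn from a real-data loader.

```python
# A minimal PyTorch sketch of a GAN training step; sizes are
# illustrative, and `real_batch` stands in for a real-data loader.
import torch
import torch.nn as nn

x_dim, z_dim = 20, 8
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(),
                  nn.Linear(64, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    n = real_batch.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # 1) Train the discriminator: real -> 1, fake -> 0.
    fake = G(torch.randn(n, z_dim)).detach()
    loss_d = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Train the generator to fool the discriminator: fake -> 1.
    fake = G(torch.randn(n, z_dim))
    loss_g = bce(D(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

# After training, only the generator is needed:
with torch.no_grad():
    synthetic = G(torch.randn(100, z_dim))
```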

Quality Check or Validation of the Generated Synthetic Data

Synthetic data generation requires training a data synthesizer G on a table T and then using G to generate a synthetic table Tsyn. A table T contains Nc continuous columns {C1, . . . , CNc} and Nd discrete columns {D1, . . . , DNd}, where each column is considered a random variable. These random variables follow an unknown joint distribution P(C1:Nc, D1:Nd). One row rj = {c1,j, . . . , cNc,j, d1,j, . . . , dNd,j}, j ∈ {1, . . . , n}, is one observation from the joint distribution. T is partitioned into a training set Ttrain and a test set Ttest. After training G on Ttrain, Tsyn is constructed by independently sampling rows using G. We evaluate the efficacy of a generator along several axes:

(1) Likelihood fitness: do columns in Tsyn follow the same joint distribution as Ttrain?
(2) Machine learning efficacy: when training a classifier or a regressor to predict one column using the other columns as features, can a model learned from Tsyn achieve performance on Ttest similar to a model learned on Ttrain?
(3) Univariate fidelity: does each corresponding column in T and Tsyn follow the same univariate distribution?
(4) Divergence: we can quantify the distance between the real and synthetic distributions using KL divergence.
(5) Visual inspection: project T and Tsyn to lower-dimensional spaces using PCA or t-SNE and compare the distributions visually.
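
Here is a minimal sketch of checks (2) and (3), using a two-sample Kolmogorov-Smirnov test for per-column fidelity and a classifier comparison for ML efficacy. It assumes t_train, t_test, and t_syn are pandas DataFrames sharing a "target" column; these names are illustrative assumptions.

```python
# A minimal sketch of two validation checks, assuming t_train, t_test
# and t_syn are pandas DataFrames that share a "target" column
# (the names are illustrative assumptions).
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def univariate_fidelity(t_train: pd.DataFrame, t_syn: pd.DataFrame):
    # Two-sample KS test per numeric column: a small statistic means
    # the real and synthetic marginal distributions are close.
    return {
        col: ks_2samp(t_train[col], t_syn[col]).statistic
        for col in t_train.select_dtypes("number").columns
    }

def ml_efficacy(t_train, t_syn, t_test, target="target"):
    # Train identical models on real vs. synthetic data and compare
    # their scores on the same held-out real test set.
    scores = {}
    for name, table in [("real", t_train), ("synthetic", t_syn)]:
        model = RandomForestClassifier(random_state=0)
        model.fit(table.drop(columns=[target]), table[target])
        pred = model.predict(t_test.drop(columns=[target]))
        scores[name] = f1_score(t_test[target], pred, average="macro")
    return scores  # close scores imply high ML efficacy
```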

Up next: part 2 of this discussion will cover the implementation details.
