The Infosys Labs research blog tracks trends in technology with a focus on applied research in Information and Communication Technology (ICT)

July 31, 2020

Learning to Trust Synthetic Data...

There is a growing demand for synthetic data in today's AI revolution. Very large volumes of data are needed to train and build the AI/ML models that solve key use cases in the real world. For instance, Google's self-driving car needs up to 3 million miles of autonomous driving data per day to tune and train the algorithms that control the autonomous cars. Google also needs new scenarios based on real-world situations, such as adjusting the speeds of cars at a highway merge to check performance and accuracy. The only way to run these complex simulations is through contextual synthetic data generation.

What are the key trends driving the demand for synthetic data?

"Imputation" - a word I learned a few days ago - seems to be the origin of synthetic data. In the 90s, when data used for reporting had missing values, there was a process to replace each missing value with an estimated value. For instance, if somebody did not complete a questionnaire, the missing answers were filled in.
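To make the idea of filling in missing answers concrete, here is a minimal sketch of mean imputation, one of the simplest ways to estimate a missing value. The function name `impute_mean` and the questionnaire values are hypothetical, chosen only for illustration; real imputation methods are usually more sophisticated.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical questionnaire answers on a 1-5 scale; None marks a skipped question.
answers = [4, None, 5, 3, None]
completed = impute_mean(answers)
print(completed)
```

Every skipped answer receives the same estimate here, which preserves the average but flattens the variation. That is one reason later synthetic data techniques go well beyond this.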

  • In the last three decades, synthetic data generation has evolved and rapidly gained popularity as synthetic data has been used for advanced machine learning, especially to develop AI vision capabilities by training algorithms on camera and sensor-based data for preventive maintenance.
  • Secure offshoring - Post pandemic, remote work has become prevalent, and synthetic data enables secure offshoring, where developers and testers working remotely can build solutions without access to sensitive customer information. Synthetic data is a secure substitute for sensitive customer information whose exposure could cause a privacy breach.
  • Training data - Synthetic data can substitute for labeled data that may be very expensive or unavailable for training your algorithms. In addition, synthetic data can be modified and generated in large volumes to improve the model.

How can we learn to trust synthetic data?

At a conceptual level, synthetic data is not real data but is generated with the same structure and statistical properties as the real data. The key variable in any synthetic data set is how close it is to the actual data. The measure that quantifies this degree of closeness to the real data is called Utility.
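One way to put a number on this closeness, for a single numeric column, is to compare the distribution of the synthetic values with that of the real values. The sketch below uses the two-sample Kolmogorov-Smirnov statistic and reports 1 minus that statistic as a score. This is an assumption made for illustration, not the definition of Utility used by iEDPS or by any particular framework, and the names `ks_statistic` and `utility_score` and the sample ages are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(real, synthetic):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # The largest gap between two step functions occurs at one of the data points.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap

def utility_score(real, synthetic):
    """Illustrative score in [0, 1]; higher means the distributions are closer."""
    return 1.0 - ks_statistic(real, synthetic)

# Hypothetical customer ages: one synthetic set tracks the real data, one does not.
real_ages = [23, 35, 41, 52, 60]
close_synthetic = [25, 33, 43, 50, 61]
far_synthetic = [70, 72, 75, 80, 85]

print(utility_score(real_ages, close_synthetic))
print(utility_score(real_ages, far_synthetic))
```

A single-column score like this says nothing about relationships between columns, so a fuller assessment would also compare correlations and the accuracy of models trained on each data set.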

[Figure: iEDPS Synthetic Data Generation]

For the majority of AI/ML models, very high utility is needed to accurately predict key outcomes. For instance, when predicting customer behavior, the data could be structured (rows and columns) or unstructured, such as call notes, conversations, and chat transcripts. Data synthesis needs to be controlled through the following techniques:

  1. Leverage a Data Utility Framework - One of the best practices in data generation is to benchmark with a data utility assessment, even for a synthesized data set. Over time, the synthetic data becomes a proxy for real data, so it needs to be baselined against the real data set at regular intervals.
  2. Hybrid Synthetic Data - Sensitive data that cannot be generated from real data sets has to be manufactured as hybrid synthetic data, where specific characteristics of the actual data are manipulated. The key sensitive information is removed, and the data can also be made resistant to reconstruction through Differential Privacy.
  3. Digital Watermarking of Data - In my view, this is an edge case for a scenario where synthetic data becomes commonly used in data analytics. That would raise the concern of differentiating between real data and synthetic data. One way to differentiate is to add a unique, identifiable pattern of data skew.
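To give a feel for how Differential Privacy (item 2 above) resists reconstruction, here is a minimal sketch of the Laplace mechanism applied to a counting query. It is a textbook illustration under stated assumptions, not a description of how iEDPS works: the function names, the `epsilon` value, and the sample records are all hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of records matching predicate (Laplace mechanism)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one record changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical customer records: (customer_id, has_overdue_payment)
records = [(1, True), (2, False), (3, True), (4, True), (5, False)]

noisy = private_count(records, lambda r: r[1], epsilon=0.5)
print(noisy)  # differs from run to run because the noise is random
```

A smaller `epsilon` adds more noise and gives stronger privacy, at the cost of lower utility, which is the same trade-off the utility assessment in item 1 is meant to keep in check.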

As part of the iEDPS Product Team, our vision is to build synthetic data sets that can outperform traditional testing techniques and make your product development Secure by Design. This essentially means our focus is to build data that is safer and improves your time to market. We recommend leveraging iEDPS to manage the utility of the synthetic data for training your AI/ML models.
