

Differential Privacy: The Privacy Guarantee

We are living in times where "data" has become the driving force of our lives. Organizations, governments, and individuals continuously collect information to draw insights and provide the best possible user experience. As the world embraced intelligence derived from data as the primary tool for decision-making, concerns about the privacy of individuals grew. To keep things in check, a host of regulations and standards (GDPR, CCPA, ISO 27001) have come into effect to ensure responsible use of data. In this scenario, where both data privacy and data analytics are required, differential privacy enables data access for analysis without the risk of privacy violations.

Traditional approaches to data privacy predominantly involve removing PII (personally identifiable information) using techniques like anonymization, pseudonymization, and data obfuscation. Although effective to an extent, these techniques have limitations. An analyst performing computational analysis on a dataset in which the date of birth was anonymized could get misleading results, because anonymization does not guarantee that the original statistical properties of the dataset are retained. Conversely, when only the fields deemed PII are protected or anonymized and the other fields are left as-is to preserve the statistical value of the dataset, there is a threat of revealing a user's identity through linkage attacks that combine this data with data available in another dataset.


What is Differential Privacy 

Differential Privacy of data leverages a statistical framework for provable privacy protection against potential privacy attacks.

A differentially private algorithm is one whose output does not let us ascertain whether any particular data subject was part of the analysis. In other words, its behavior remains essentially unchanged when a single data subject is added to or removed from the dataset: an analysis performed on a dataset containing a particular data subject will give results similar to the same analysis performed after excluding that subject. This provides a formal guarantee that individual-level information about participants in the database is not leaked, which aligns with the generally perceived concept of privacy: an individual's data should not be able to be singled out by specific queries.
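This intuition has a standard formalization. A randomized algorithm M is said to be Ɛ-differentially private if, for every pair of datasets D and D' that differ in a single individual's record, and for every set S of possible outputs:

    Pr[M(D) ∈ S] ≤ e^Ɛ × Pr[M(D') ∈ S]

A small Ɛ forces the output distributions on D and D' to be nearly indistinguishable, which is exactly the "similar results" guarantee described above.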


How Differential Privacy Works

Traditional data protection techniques work on the notion that privacy is a characteristic of the result of an analysis. Differential privacy instead treats it as an attribute of the analysis itself.

Differential Privacy safeguards an individual's privacy by introducing random noise while performing the analysis. With this noise added, it is not possible to identify an individual from the outcome of any analysis. The trade-off is that the output of the analysis is an approximation rather than the exact result that would be obtained on the actual dataset. Moreover, because the noise is random, running the same differentially private analysis multiple times will likely produce a different outcome each time.

The privacy loss parameter, Ɛ (epsilon), determines the amount of noise to be introduced; in the commonly used Laplace mechanism, the noise is drawn from a Laplace distribution whose scale is inversely proportional to Ɛ. Epsilon bounds how much the computation's output can deviate when one data subject is removed from the dataset. A smaller value of Ɛ permits only smaller deviations when a user's data is excluded, and hence provides stronger data protection, but the computational results will be less accurate. No universally optimal value of Ɛ has yet been identified that guarantees both the required level of protection and accuracy. We are still in the early stages of adoption of differential privacy, and it remains a trade-off between privacy and accuracy that users must make.
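This trade-off can be seen in a small sketch of the standard Laplace mechanism for a count query (using numpy and hypothetical data; this illustrates the general technique, not any specific product):

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Differentially private count query. A count has sensitivity 1
    (adding or removing one person changes it by at most 1), so Laplace
    noise with scale 1/epsilon gives epsilon-differential privacy."""
    true_count = sum(1 for row in data if predicate(row))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical data: ages of eight survey participants.
ages = [23, 45, 31, 52, 38, 29, 61, 47]
# Smaller epsilon -> more noise -> stronger privacy, lower accuracy.
for eps in (0.1, 1.0):
    print(f"epsilon={eps}: noisy count of age>=40 =",
          round(laplace_count(ages, lambda a: a >= 40, eps), 2))
```

Running the loop repeatedly gives a different approximation of the true count (4) each time, with the epsilon=0.1 results scattering far more widely than the epsilon=1.0 results.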


Use Cases and Implementation

Differential Privacy techniques can be used to perform a wide range of statistical or computational analyses. Below are some broad categories of computation that can leverage differential privacy:

·         Count queries

·         Histograms

·         Cumulative distribution functions

·         Linear regression

·         Statistical and Machine Learning techniques that involve clustering and classification

·         Synthetic data generation and other statistical disclosure limitation techniques
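The first two categories can be illustrated together: a differentially private histogram adds Laplace noise to each bin count. Since one individual affects exactly one bin, the sensitivity of each count is 1 (a hypothetical numpy sketch):

```python
import numpy as np

def dp_histogram(values, bins, epsilon):
    """Differentially private histogram. One individual affects exactly
    one bin, so each bin count has sensitivity 1 and adding
    Laplace(1/epsilon) noise per bin yields epsilon-differential privacy."""
    counts, edges = np.histogram(values, bins=bins)
    noisy = counts + np.random.laplace(0.0, 1.0 / epsilon, size=counts.shape)
    # Clipping to non-negative counts is post-processing and does not
    # weaken the privacy guarantee.
    return np.clip(noisy, 0, None), edges

# Hypothetical data: salaries (in thousands) of ten employees.
salaries = [30, 42, 55, 61, 38, 47, 52, 70, 65, 33]
noisy_counts, bin_edges = dp_histogram(salaries, bins=4, epsilon=1.0)
```

A differentially private cumulative distribution function can be obtained from such a histogram by cumulatively summing the noisy bins, which is again pure post-processing.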


Different approaches to differential privacy have been implemented by various organizations. Some of these approaches are described below:

o   Interactive Mechanism: Users can perform analysis like custom linear regressions on a dataset and get differentially private results.

o   Non-Interactive Mechanism: Providing data that is differentially private, such as synthetic data, which can be used for performing analysis.

o   Curator Based: Assigning a database administrator to provide datasets that are differentially private.

o   Local Model: Consider the example of a survey conducted in a differentially private manner. In this method, users do not provide their personal information to a trusted third-party, but instead provide responses to questions involving their own personal information in a differentially private manner. The individual differentially private answers are not useful by themselves, but an aggregation of these responses can be leveraged to perform meaningful statistical analysis.
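The local model can be illustrated with classic randomized response, a simple local differential privacy protocol for yes/no survey questions (a hypothetical sketch, not any particular organization's implementation):

```python
import random

def randomized_response(truth):
    """Report a yes/no answer with plausible deniability: with probability
    1/2 answer truthfully, otherwise answer uniformly at random.
    This satisfies local differential privacy with epsilon = ln(3)."""
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_proportion(responses):
    """Debias the aggregate: E[reported 'yes' rate] = 0.25 + 0.5 * p,
    so the true proportion p can be recovered from noisy responses alone."""
    reported = sum(responses) / len(responses)
    return (reported - 0.25) / 0.5

# Hypothetical survey: ~30% of 100,000 respondents truthfully answer "yes".
truths = [random.random() < 0.3 for _ in range(100_000)]
responses = [randomized_response(t) for t in truths]
print(f"estimated 'yes' proportion: {estimate_proportion(responses):.3f}")
```

No individual response reveals the respondent's true answer, yet the aggregate estimate converges to the true proportion as the number of respondents grows.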


At iEDPS (the Enterprise Data Privacy Suite from Infosys), we are building an interactive mechanism for differential privacy. We will leverage differentially private algorithms to perform computations and enable users to query data without the risk of leaking personal information.


Food for Thought

It has been proven that the risk of privacy loss increases as data is analyzed more frequently. The privacy loss parameter can therefore be managed as a "privacy budget" consumed across multiple analyses of individuals' data. If only a single analysis is to be performed on a given dataset, the entire privacy budget can be allocated to that analysis. In practice, however, multiple analyses will be run on a given dataset, so the cumulative consumption of the privacy budget across all analyses must be tracked.
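A minimal sketch of such budget tracking, assuming basic sequential composition (the total privacy loss of several analyses is at most the sum of their individual epsilons; production systems often use tighter composition theorems):

```python
class PrivacyBudget:
    """Hypothetical budget accountant using basic sequential composition:
    each analysis consumes part of a fixed total epsilon, and analyses
    that would exceed the total are refused."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.4)  # first analysis
budget.spend(0.4)  # second analysis
# A third spend(0.4) would exceed the budget of 1.0 and raise an error.
```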

Differential Privacy provides robust data protection that is not usually possible with traditional data privacy techniques. However, mechanisms must be developed that ensure differential privacy meets legal requirements and that can identify a suitable privacy loss parameter Ɛ based on such regulations. Data providers and legal entities should work in synergy when choosing differential privacy tools, so that privacy implementations adhere to the mandated data privacy regulations.





