Infosys experts share their views on how digital is significantly impacting enterprises and consumers by redefining experiences, simplifying processes and pushing collaborative innovation to new levels


Differential Privacy - A milestone in data privacy

Confidentiality and integrity are two of the prime data privacy goals today. Although cryptographic mechanisms keep improving, security attacks, both 'active' and 'passive', have grown in proportion. For a statistical database, such as a big data store, the challenge is to release useful aggregate information while protecting sensitive data, whether or not it is tied to a specific individual, and without revealing the underlying PII. This is where differential privacy comes to the rescue.

The need for Differential Privacy

In 2006, Netflix released a data set of users' movie ratings for a competition, after anonymizing the PII of those users. Even so, researchers linked the anonymized Netflix training data with auxiliary data from the IMDb database and partly de-anonymized the Netflix data.
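The linking step can be illustrated with a toy sketch. This is a simplification for illustration only, not the method the analysts actually used: the records, field names and the `link_records` helper are all made up. It matches pseudonymous users to named public profiles when they share enough (movie, date) events.

```python
from collections import defaultdict

# Made-up toy data: an "anonymized" release and public, named reviews.
anonymized_ratings = [
    {"user_id": "u1", "movie": "Film A", "date": "2005-03-01", "rating": 4},
    {"user_id": "u1", "movie": "Film B", "date": "2005-03-04", "rating": 2},
    {"user_id": "u1", "movie": "Film C", "date": "2005-04-10", "rating": 5},
    {"user_id": "u2", "movie": "Film A", "date": "2005-03-02", "rating": 3},
    {"user_id": "u2", "movie": "Film D", "date": "2005-05-01", "rating": 1},
]
public_reviews = [
    {"name": "Alice", "movie": "Film A", "date": "2005-03-01"},
    {"name": "Alice", "movie": "Film B", "date": "2005-03-04"},
    {"name": "Bob", "movie": "Film D", "date": "2005-05-01"},
]


def link_records(anonymized, auxiliary, min_matches=2):
    """Map each pseudonymous user_id to a name when exactly one public
    profile shares at least `min_matches` (movie, date) events with it."""
    events_by_id = defaultdict(set)
    for row in anonymized:
        events_by_id[row["user_id"]].add((row["movie"], row["date"]))

    events_by_name = defaultdict(set)
    for row in auxiliary:
        events_by_name[row["name"]].add((row["movie"], row["date"]))

    matches = {}
    for user_id, events in events_by_id.items():
        candidates = [
            name
            for name, public in events_by_name.items()
            if len(events & public) >= min_matches
        ]
        if len(candidates) == 1:  # only accept an unambiguous match
            matches[user_id] = candidates[0]
    return matches


if __name__ == "__main__":
    print(link_records(anonymized_ratings, public_reviews))
```

Once "u1" is linked to a name, every rating in the anonymized release for "u1", including ones never posted publicly, is attributed to that person.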

The world is going through a tough time battling the COVID-19 pandemic. Governments across the world are trying to trace the sources and route maps of affected individuals to keep track of the outbreak, and they are also releasing statistical data about COVID patients to the public. At the same time, governments must make sure that the patients' PII is protected. One traditional way to do so is to anonymize the PII. But as the Netflix case shows, anonymizing the PII is not enough. Auxiliary information from other sources is available in the public domain, and it can be combined with the released statistical data to reverse engineer the actual PII. This may lead to a privacy breach, which is why Differential Privacy is needed.

What is Differential Privacy (DP)?

Differential Privacy redefines "privacy" for statistical databases. It is a mathematical framework that provides privacy guarantees for such databases. A statistical database in this sense is any database that provides large-scale information about a population without revealing individual-specific information. The sensitive data in the database is protected against privacy attacks by third parties; in other words, it is difficult to reverse engineer differentially private data. Differential privacy is already used by several organizations, including Apple, Uber, the US Census Bureau and Microsoft.

Goals of Differential Privacy

1.      To make sure that the data is not compromised while maximizing its accuracy.

2.      To eliminate potential methods that may distinguish an individual from a large set of data.

3.      To ensure the protection of an individual's PII under any circumstance.

 

The Mechanism behind Differential Privacy

The conventional way of preserving data privacy is to anonymize the data sets. Differential privacy instead shields the data set by introducing carefully tuned random noise whenever the data is used for analysis. The amount of noise is controlled by the privacy loss parameter, represented by the Greek letter ɛ. ɛ bounds the effect that any single individual's record can have on the output of an analysis, and so determines the overall privacy provided by a differentially private study. A smaller value of ɛ means more noise, better protection and lower privacy risk. Conversely, a larger value means less noise, worse protection and higher privacy risk. A value of ɛ = 0 gives complete data privacy, but the output is then of no use because it cannot depend on the data at all. The privacy loss is independent of the size of the database. The noise required for a given ɛ typically depends on the query rather than on the number of records, so the larger the database, the greater the accuracy of a differentially private algorithm.
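As a minimal sketch of this idea, the following adds Laplace noise to a counting query. The Laplace mechanism is one common way to achieve differential privacy; it is an assumption here, since the article does not name a specific mechanism, and the function names and toy data are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that satisfy `predicate`.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [23, 35, 41, 52, 67, 29, 73, 48]  # made-up toy data
    for eps in (0.1, 1.0, 10.0):
        noisy = private_count(ages, lambda a: a >= 40, eps, rng)
        print(f"epsilon={eps}: noisy count = {noisy:.2f} (true count = 5)")
```

With a small ɛ the noisy count can land far from the true count of 5, while with a large ɛ it stays close. The noise scale is the same whether the data set has eight records or eight million, which is why larger data sets give relatively more accurate answers.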

 

The privacy loss parameter is closely tied to the "privacy budget": the total privacy loss that is allowed across all the analyses performed on a data set. Each analysis consumes part of the budget, and how much to spend depends on the analysis being performed. This means one can define exactly how much of the privacy budget can be used before the data can no longer be considered anonymous.
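A privacy budget can be tracked with a small accountant like the sketch below. It assumes basic sequential composition, where the ɛ values of successive analyses simply add up; the `PrivacyBudget` class is hypothetical and not part of any particular library.

```python
class PrivacyBudget:
    """Tracks cumulative privacy loss under basic sequential composition."""

    def __init__(self, total_epsilon: float):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return self.total - self.spent

    def spend(self, epsilon: float) -> None:
        """Charge one analysis against the budget, or refuse it."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject
        # a request that exactly uses up the budget.
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    budget.spend(0.4)  # first analysis
    budget.spend(0.4)  # second analysis
    try:
        budget.spend(0.4)  # would bring the total to 1.2
    except RuntimeError as err:
        print(f"refused: {err}; remaining = {budget.remaining:.2f}")
```

Refusing further queries once the budget is spent is what keeps the overall privacy loss bounded, however many analyses are requested.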


The above picture shows how differential privacy works.

Suppose a data analyst queries two databases that differ in a single data entry. With differential privacy, the chance of seeing any particular result is almost the same for both databases: the probability can change by at most a multiplicative factor determined by ɛ. This means the analyst cannot tell, from the output alone, which of the two databases was used, and so cannot learn whether that particular entry was present.
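This statement corresponds to the standard formal definition of ɛ-differential privacy. The notation is introduced here for illustration and does not appear in the article: M is the randomized analysis, D and D' are any two databases that differ in a single entry, and S is any set of possible outputs.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

The multiplicative factor is e^ɛ. For small ɛ it is close to 1, so the two output distributions are nearly indistinguishable.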

Conclusion

Differential privacy is currently one of the most closely watched research topics in the field of data privacy, and its adoption is still in its early stages. It provides a mathematically grounded guarantee of privacy for one's data. Its limitation is that for high-dimensional data that needs strong privacy, we might end up adding a lot of noise, which may make the data unusable. Even so, compared with traditional data privacy techniques, it is a far better approach to protecting personal data against privacy breaches, including for high-dimensional data. We at the iEDPS Product Team have been analyzing the market, where our customers require data sets that are reconstruction resistant. This is one area where we feel the iEDPS Differential Privacy Module will add a lot of value in solving these business use cases.

Author: Jessmol.paul
