Why UDD using AI and NLP?
Well, whenever there is an 'un' before anything, it usually signifies uncertainty, and the same is the case with unstructured data. Here is a simple definition of unstructured data -
"Unstructured data is data which doesn't have a uniform structure or format and can't easily be modelled in some pre-defined data model."
As a matter of fact, back in 1998, Merrill Lynch observed that unstructured data makes up the vast majority of data found in organizations, with some estimates putting it as high as 80% of all data present.
That is a huge amount of data, considering how much data is generated and exchanged in any organization. With multiple data privacy laws now in place, staying compliant with those norms is genuinely a tedious task. Deciding whether a given piece of data is PII (Personally Identifiable Information) that needs to be protected from unauthorized use is a concern for all the data that exists in an organization. The solution is to discover such data and protect it as per the governing data privacy laws. But with unstructured data, discovery is an even more time- and resource-consuming process. Let us see what the difference between structured and unstructured data discovery is.
Structured Data
· Data Processing - Optimal approaches are available to efficiently read/write data.
· Searching - Indexing mechanisms are available to make searches lightning fast.
· Categorical data - Data is mostly divided into multiple columns and categories. Once a category is found to contain sensitive data, the whole category can be marked as sensitive.
· Contextualization - The data is well structured and not much context is involved, so pattern-based matching can be used efficiently.
Unstructured Data
· Data Processing - Most data is stored in files, so file processing is required for read/write operations.
· Searching - In most cases a full text scan is needed to search.
· Categorical data - There is no structure to the data, so full text processing is needed to recognize sensitive data.
· Contextualization - The data is highly contextual in nature and requires intelligence to discover sensitive data.
So it is clear that sensitive data analysis on unstructured data comes with far more challenges than on structured data, despite being crucial for compliance with the governing privacy laws.
Data discovery using regular expression patterns requires human effort to come up with the optimal pattern to find that data. It will always be error prone and will not suffice to recognize data with good accuracy. In addition, it requires a lot of human effort and resources to analyze the data. A few good examples where regex pattern-based analysis will not work -
· Name - Sherlock Holmes
· Address - 221B, Baker Street, London
· Contextual data - Sherlock Holmes went to the Apple store near his house - 221B, Baker Street, London - to buy an apple. (Looks funny, right 😊?)
In all the above examples there is not a single regex pattern which will find this data as sensitive. The last one is the most complex, as it involves identifying one Apple as an organization and the other apple as just a fruit.
That should be sufficient to answer why we cannot use traditional approaches to discover all kinds of sensitive data, which is mainly stored unstructured. The next question is: what then? The answer is simple - an approach that can recognize these patterns automatically, without human effort, even in contextual data. That is what Artificial Intelligence is for.
What can we do?
Now it is clear why we need AI to do sensitive data discovery in unstructured data. Since we are dealing with human language data, we can use Natural Language Processing (NLP), as this is exactly what NLP is designed for - processing text data.
NLP provides a vast variety of text processing capabilities like tokenization, lemmatization, syntactic analysis, semantic analysis, Named Entity Recognition (NER), Part of Speech (POS) tagging, and many more. NER in particular is very useful for our requirement, as we want to recognize sensitive entities, and POS tagging is very useful for extracting contextual signals from the data.
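As a quick illustration, here is a minimal sketch of tokenization, lemmatization, and POS tagging (shown with spaCy, one of the libraries listed below, and its small English model) -
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('He bought two apples yesterday.')

# Each token carries its lemma and part-of-speech tag
for token in doc:
    print(token.text, token.lemma_, token.pos_)
# e.g. 'bought' -> lemma 'buy', POS 'VERB'; 'apples' -> lemma 'apple', POS 'NOUN'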
There are multiple libraries available for NLP. Some of them are -
· NLTK (Natural Language Toolkit) - a scientific toolkit for dealing with human language data
· Stanford NLP - a Java-based natural language analysis package
· SpaCy - industrial-strength Natural Language Processing, etc.
Among these, SpaCy provides a very clean API and a very efficient, ready-to-use NLP package for multiple NLP tasks. It also provides a lot of general named entities out of the box, with a very good speed-to-accuracy ratio.
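Since the rest of this article relies on spaCy's pretrained pipeline, it helps to see which components it ships with (a quick check, assuming the standard en_core_web_sm model; exact component names may vary by spaCy version) -
# Setup (run once in a shell):
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']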
Now that we know what to do, the next question is how. Let us dig deeper into this NLP journey to see how sensitive data can be recognized.
How to find Sensitive data?
As we saw in the last section, we can use NLP to do text analysis and to recognize sensitive data in unstructured text. The main concern with Machine Learning models (which is what NLP NER models are) is that they cannot give 100% accuracy and cannot work on all kinds of data. So a good approach is to use a combination of NLP, RegEx patterns, and contextual pattern matching to get better accuracy and results. I will explain this through some examples -
Let us consider the following text. The highlighted fields are sensitive as per our requirement -
"Nash Smith (Person) went to Apple (Organization) store to buy an iPhone (Product). On the way back home, he also bought an apple (Product). He used his credit card 3782-8224-6310-0055 (Credit Card) for purchasing apples."
1. NLP NER -
import spacy
from spacy import displacy

text_data = 'Nash Smith went to Apple store to buy an iPhone. ' \
            'On the way back home, he also bought an apple. ' \
            'He used his credit card 3782-8224-6310-0055 for purchasing ' \
            'apples.'

# Load the small English pipeline and run it over the text
nlp = spacy.load('en_core_web_sm')
doc = nlp(text_data)

# Visualize the recognized named entities
displacy.render(doc, style='ent')
Code - GitHub Gist
Output:
Nash Smith PERSON went to Apple ORG store to buy an iPhone. On the way back home,
he also bought an apple. He used his credit card 3782-8224-6310-0055 DATE for purchasing apples.
We can see that using NLP NER we are able to recognize the name and organization correctly, which is nearly impossible for RegEx to find. But there are still a few tokens which are not recognized: the credit card is recognized, but incorrectly as DATE, and iPhone and apple are not recognized as products. A credit card number has a fixed pattern and can be recognized using RegEx patterns.
2. RegEx Matching -
Let us add a few more lines of code to the above for recognizing the credit card.
import re
import spacy

text_data = 'Nash Smith went to Apple store to buy an iPhone. ' \
            'On the way back home, he also bought an apple. ' \
            'He used his credit card 3782-8224-6310-0055 for purchasing ' \
            'apples.'

nlp = spacy.load('en_core_web_sm')
doc = nlp(text_data)

# A simple pattern for 16-digit card numbers written in four groups of four
regex = r'\d{4}[\-\,]?\d{4}[\-\,]?\d{4}[\-\,]?\d{4}'
credit_cards = re.findall(regex, text_data)
print(credit_cards)
Code - GitHub Gist
Output:
Nash Smith PERSON went to Apple ORG store to buy an iPhone. On the way back home,
he also bought an apple. He used his credit card 3782-8224-6310-0055 CREDIT CARD for purchasing apples.
We can see that using NLP NER + RegEx we are able to recognize the name, organization, and credit card correctly, which is nearly impossible for either of them to find individually. But there are still tokens that are not recognized: iPhone and apple as products. A product like this will not have a fixed RegEx pattern. This is where context-based matching comes in. We can observe that whenever a 'buy' verb is followed by a determiner (DET) and then a noun, that noun is a product.
3. Context Based Matching -
Let us add a few more lines of code to the above for recognizing the products as well.
import re
import spacy
from spacy.matcher import Matcher

text_data = 'Nash Smith went to Apple store to buy an iPhone. ' \
            'On the way back home, he also bought an apple. ' \
            'He used his credit card 3782-8224-6310-0055 for purchasing ' \
            'apples.'

nlp = spacy.load('en_core_web_sm')
doc = nlp(text_data)

# RegEx match for the credit card number, as before
regex = r'\d{4}[\-\,]?\d{4}[\-\,]?\d{4}[\-\,]?\d{4}'
credit_cards = re.findall(regex, text_data)

# Context-based rule: a 'buy' verb followed by an optional determiner
# and one or more nouns indicates a product
matcher = Matcher(nlp.vocab)
matcher.add('PRODUCT', [
    [{'LEMMA': 'buy'}, {'POS': 'DET'}, {'POS': 'PROPN'}],
    [{'POS': 'VERB', 'LEMMA': 'buy'}, {'POS': 'DET', 'OP': '?'}, {'POS': 'NOUN', 'OP': '+'}],
])

for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
Code - GitHub Gist
Output:
Nash Smith PERSON went to Apple ORG store to buy an iPhone PRODUCT. On the way back home, he also bought an apple PRODUCT. He used his credit card 3782-8224-6310-0055 CREDIT CARD for purchasing apples.
Now we can see how merging all three approaches can achieve a discovery task that is impossible to achieve with any individual approach. The next question is what comes after discovery. To protect individuals' data, the best approach is to anonymize it so that it becomes unidentifiable.
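As an illustration, here is a minimal anonymization sketch (an assumed approach for this article, not any particular product's implementation): each discovered span is replaced with a placeholder carrying its entity type -
def anonymize(text, spans):
    # spans: (start, end, label) character offsets, assumed non-overlapping
    # Replace from the end of the text so earlier offsets stay valid
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + '<' + label + '>' + text[end:]
    return text

print(anonymize('Nash Smith bought an iPhone.',
                [(0, 10, 'PERSON'), (21, 27, 'PRODUCT')]))
# -> <PERSON> bought an <PRODUCT>.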
Conclusion
Data privacy is one of the main concerns of today's information age. Every organization has a lot of data to handle, and most of that data is unstructured. Understanding which of that data is sensitive is a tedious and resource-consuming task that can be automated in multiple ways. But to get the best results, those ways need to be combined. With such a combination we can achieve very good accuracy in discovering sensitive data with fewer resources, even surpassing what manual human analysis can achieve.
Tool which offers unstructured data discovery:
Infosys Enterprise Data Privacy Suite (iEDPS) is a patented, enterprise-class data privacy and security product which enables organizations to protect and de-risk sensitive data. iEDPS has helped more than 40 enterprises to reduce the cost of privacy and improve data security. Further, it is a one-stop shop for protection of confidential, sensitive, private, and personally identifiable information within enterprise data sources.
iEDPS supports 180+ algorithms and intelligently provides data protection rules such as data masking, anonymization, or pseudonymization for databases, log files, and unstructured data sources such as text files and images. For highly complex needs, customized obfuscation logic can also be plugged into the customer's DevOps pipeline.
https://www.youtube.com/watch?v=aidUrPnmoK4
Author:
Javed Hussain
Senior Systems Engineer, INFCAT