

Why Unstructured Data Discovery using AI and NLP?

Why UDD (Unstructured Data Discovery) using AI and NLP?

Well, whenever there is an 'un' before anything, it mostly signifies uncertainty, and the same is the case with unstructured data. Here is a simple definition of unstructured data -

 

"Unstructured data is the data which doesn't have a uniform structure or format and can't modelled in some pre-defined data-model easily."

 

As a matter of fact, back in 1998 Merrill Lynch observed that unstructured data makes up the vast majority of the data in organizations, with some estimates putting it at as much as 80% of all data present.

 

That is a huge amount of data, considering how much data is generated and exchanged within any organization. And now that multiple data privacy laws are in place, staying compliant with those norms is a genuinely tedious task. Deciding whether a given piece of data is PII (Personally Identifiable Information) that needs to be protected from unauthorized use is a concern for all the data that exists in an organization. The way to address this is to discover such data and protect it as per the governing data privacy laws. With unstructured data, however, discovery is an even more time- and resource-consuming process. Let us see how structured and unstructured data discovery differ.

 

 

| Aspect | Structured data | Unstructured data |
|---|---|---|
| Data processing | Optimal approaches are available to read/write data efficiently. | Data is mostly stored in files, so file processing is required for read/write operations. |
| Searching | Indexing mechanisms make searches lightning fast. | In most cases a full-text scan is needed to search. |
| Categorical data | Data is mostly divided into columns and categories; once a category is found to hold sensitive data, the whole category is marked as sensitive. | The data has no structure, so full-text processing is needed to recognize sensitive data. |
| Contextualization | The data is well structured and little context is involved, so pattern-based matching can be used efficiently. | The data is highly contextual in nature and requires intelligence to discover sensitive items. |

 

So it is clear that sensitive data analysis on unstructured data poses far more challenges than on structured data, yet it remains crucial in order to comply with the governing privacy laws.

Discovering data using regular expression patterns requires human effort to come up with the optimal pattern for finding that data. There is always a possibility that the pattern is error prone and will not recognize data with good accuracy, and on top of that it demands considerable human effort and resources to analyze the data. A few good examples where regex pattern-based analysis will not work -

·         Name - Sherlock Holmes

·         Address - 221B, Baker Street, London

·         Contextual data - Sherlock Holmes went to the Apple store to buy an apple near his house at 221B, Baker Street, London. (Looks funny, right? 😊)

In all the above examples there is no single regex pattern that will identify this data as sensitive. The last one is the most complex, as it involves recognizing one 'Apple' as an organization and the other as just a fruit.
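To make the limitation concrete, here is a minimal sketch (the pattern and sample text are purely illustrative): a naive "capitalized word" regex flags every capitalized token as a potential name, yet it cannot tell that 'Apple' is an organization here, and it misses the lowercase 'apple' entirely.

import re

text = ('Sherlock Holmes went to the Apple store to buy an apple '
        'near his house at 221B, Baker Street, London.')

# Naive attempt: treat every capitalized word as a potential name
candidates = re.findall(r'\b[A-Z][a-z]+\b', text)
print(candidates)
# ['Sherlock', 'Holmes', 'Apple', 'Baker', 'Street', 'London']
# No regex refinement can decide that this 'Apple' is an organization
# while the lowercase 'apple' is just a fruit - that needs context.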

I think that is sufficient to answer why we cannot use traditional approaches to discover every kind of sensitive data, especially data stored in unstructured form. The next question is: what then? The answer is simple: an approach that can recognize these patterns automatically, without human effort, even in contextual data. That is what Artificial Intelligence is for.

 

What can we do?

Now we are clear on why we need AI to discover sensitive data in unstructured data. Since we are dealing with human-language data, we can use Natural Language Processing (NLP), because processing text data is exactly what NLP is designed for.

NLP provides a wide variety of text processing capabilities such as tokenization, lemmatization, syntactic analysis, semantic analysis, Named Entity Recognition (NER), Part-of-Speech (POS) tagging and many more. NER in particular is very useful for our requirement, since we want to recognize sensitive entities, and POS tagging is very useful for extracting contextual signals from the data.

There are multiple libraries available for NLP. Some of them are -

·         NLTK (Natural Language Toolkit) - a scientific toolkit for working with human-language data

·         Stanford NLP - a Java-based natural language analysis package

·         spaCy - industrial-strength natural language processing

Among these, spaCy provides a very clean API and a very efficient, ready-to-use NLP package for multiple NLP tasks. It also ships with a good set of general named entity types and a very good speed-to-accuracy ratio.
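Getting started with spaCy is straightforward. Here is a minimal sketch (using spaCy's small English model, en_core_web_sm, and a made-up sample sentence) showing the tokenization, lemmatization, POS tagging and NER capabilities mentioned above:

# Install spaCy and its small English model first:
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Sherlock Holmes lives at 221B Baker Street, London.')

# Tokenization, lemmas and POS tags are available on each token
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Named Entity Recognition results
for ent in doc.ents:
    print(ent.text, ent.label_)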

Now that we know what to do, the next question is how. Let us dig deeper into this NLP journey to see how sensitive data can be recognized.

 

How to find sensitive data?

As we saw in the last section, we can use NLP to analyze text and recognize sensitive data within unstructured text. The main concern with machine learning models (which is what NLP NER models are) is that they cannot give 100% accuracy and cannot work on every kind of data. So a good approach is to use a combination of NLP, RegEx patterns and contextual pattern matching to get better accuracy and results. I will explain this through some examples -

Let us consider the following text. The highlighted fields are sensitive as per our requirement -

"Nash Smith (Person) went to Apple (Organization) store to buy an iPhone (Product). On the way back home, he also bought an apple (Product). He used his credit card 3782-8224-6310-0055 (Credit Card) for purchasing apples."

1.      NLP NER -

import spacy
from spacy import displacy

text_data = ('Nash Smith went to Apple store to buy an iPhone. '
             'On the way back home, he also bought an apple. '
             'He used his credit card 3782-8224-6310-0055 for purchasing '
             'apples.')

nlp = spacy.load('en_core_web_sm')
doc = nlp(text_data)

# Render the named entities spaCy recognized in the text
displacy.render(doc, style='ent')


Output:

Nash Smith [PERSON] went to Apple [ORG] store to buy an iPhone. On the way back home, he also bought an apple. He used his credit card 3782-8224-6310-0055 [DATE] for purchasing apples.

We can see that using NLP NER we were able to recognize the name and the organization correctly, which is nearly impossible for a RegEx to do. But a few tokens are still wrong or missing: the credit card number is detected but incorrectly labeled as DATE, and iPhone and apple are not recognized as products. A credit card number follows a fixed pattern, so it can be recognized using RegEx patterns.
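For downstream processing we do not have to rely on the rendered view; the recognized entities can be collected programmatically (a small self-contained sketch; the exact labels may vary with the model version):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Nash Smith went to Apple store to buy an iPhone. '
          'He used his credit card 3782-8224-6310-0055 for purchasing apples.')

# Collect (text, label) pairs for every entity the model recognized
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Nash Smith', 'PERSON'), ('Apple', 'ORG'),
#       ('3782-8224-6310-0055', 'DATE')]   # note the incorrect DATE label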

 

2.      RegEx Matching -

Let us add a few more lines of code to the above for recognizing the credit card.

 

import re
import spacy

text_data = ('Nash Smith went to Apple store to buy an iPhone. '
             'On the way back home, he also bought an apple. '
             'He used his credit card 3782-8224-6310-0055 for purchasing '
             'apples.')

nlp = spacy.load('en_core_web_sm')
doc = nlp(text_data)

# RegEx for a 16-digit card number written in groups of four
regex = r'\d{4}[\-\,]?\d{4}[\-\,]?\d{4}[\-\,]?\d{4}'
credit_cards = re.findall(regex, text_data)
print(credit_cards)


Output:

Nash Smith [PERSON] went to Apple [ORG] store to buy an iPhone. On the way back home, he also bought an apple. He used his credit card 3782-8224-6310-0055 [CREDIT CARD] for purchasing apples.

We can see that using NLP NER + RegEx we are able to recognize the name, the organization and the credit card correctly, which neither approach could manage on its own. But one kind of token is still not recognized: iPhone and apple as products. A product like this has no fixed RegEx pattern, and this is where context-based matching comes in. Notice that whenever the verb 'buy' is followed by a determiner (DET) and then a noun, that noun is a product.

 

3.      Context Based Matching -

Let us add a few more lines of code to the above for recognizing the products as well.

import re
import spacy
from spacy.matcher import Matcher

text_data = ('Nash Smith went to Apple store to buy an iPhone. '
             'On the way back home, he also bought an apple. '
             'He used his credit card 3782-8224-6310-0055 for purchasing '
             'apples.')

nlp = spacy.load('en_core_web_sm')
doc = nlp(text_data)

regex = r'\d{4}[\-\,]?\d{4}[\-\,]?\d{4}[\-\,]?\d{4}'
credit_cards = re.findall(regex, text_data)

# Context rule: the verb 'buy' followed by an (optional) determiner
# and a noun marks that noun as a PRODUCT
matcher = Matcher(nlp.vocab)
matcher.add('PRODUCT', [
    [{'LEMMA': 'buy'}, {'POS': 'DET'}, {'POS': 'PROPN'}],
    [{'POS': 'VERB', 'LEMMA': 'buy'}, {'POS': 'DET', 'OP': '?'}, {'POS': 'NOUN', 'OP': '+'}],
])
products = [doc[start:end] for _, start, end in matcher(doc)]
print(credit_cards, products)


Output:

Nash Smith [PERSON] went to Apple [ORG] store to buy an iPhone [PRODUCT]. On the way back home, he also bought an apple [PRODUCT]. He used his credit card 3782-8224-6310-0055 [CREDIT CARD] for purchasing apples.

Now we can see how merging all three approaches achieves a discovery task that none of them could accomplish individually. The next question is what comes after discovery. To protect individuals' data, the best approach is to anonymize it so that it becomes unidentifiable, as sketched below.
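As a simple illustration of that last step, here is a sketch of one possible redaction scheme (the placeholder format and span-merging logic are my own illustration, not a description of any particular product): once sensitive spans have been discovered, they can be replaced with category placeholders so the text is no longer identifiable.

import re
import spacy

nlp = spacy.load('en_core_web_sm')
text = ('Nash Smith went to Apple store to buy an iPhone. '
        'He used his credit card 3782-8224-6310-0055.')
doc = nlp(text)

# Merge spans from NER (person/organization) with the credit card regex
spans = [(ent.start_char, ent.end_char, ent.label_)
         for ent in doc.ents if ent.label_ in ('PERSON', 'ORG')]
for m in re.finditer(r'\d{4}[\-\,]?\d{4}[\-\,]?\d{4}[\-\,]?\d{4}', text):
    spans.append((m.start(), m.end(), 'CREDIT_CARD'))

# Replace spans from right to left so earlier offsets stay valid
for start, end, label in sorted(spans, reverse=True):
    text = text[:start] + '<' + label + '>' + text[end:]

print(text)
# e.g. '<PERSON> went to <ORG> store to buy an iPhone.
#       He used his credit card <CREDIT_CARD>.'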

 

Conclusion

Data privacy is one of the main concerns of today's information age. Every organization handles a lot of data, and most of it is unstructured. Extracting an understanding of the sensitive information in that data is a tedious and resource-consuming task that can be automated in multiple ways, but to get the best results those ways need to be combined. Combined, these techniques can discover sensitive data with very good accuracy and far fewer resources than manual review.

 

A tool that offers unstructured data discovery:

Infosys Enterprise Data Privacy Suite (iEDPS) is a patented, enterprise-class data privacy and security product that enables organizations to protect and de-risk sensitive data. iEDPS has helped more than 40 enterprises reduce the cost of privacy and improve data security. It is also a one-stop shop for protecting confidential, sensitive, private, and personally identifiable information within enterprise data sources.

 

iEDPS supports 180+ algorithms and intelligently provides data protection rules such as data masking, anonymization and pseudonymization for databases, log files, and unstructured data sources such as text files and images. For highly complex needs, customized obfuscation logic can also be plugged into the customer's DevOps pipeline.

 

https://www.youtube.com/watch?v=aidUrPnmoK4

 

Author:

Javed Hussain

Senior Systems Engineer, INFCAT
