Infosys experts share their views on how digital is significantly impacting enterprises and consumers by redefining experiences, simplifying processes and pushing collaborative innovation to new levels


September 29, 2020

Digital Transformation Strategy: Part 1 - Find Your Customer's Voice

Recently, I came across a comment by Bill Patterson of Salesforce in a 2020 Gartner report: Salesforce is committed to constant innovation; it is helping companies quickly adapt to new challenges ...and provide the best experiences for their customers, employees and communities.


Today, as part of this series on digital transformation strategy, I take the opportunity to write about how to find your customer's voice.

For a long time, customer engagement in BFSI, automobile, or any other domain involved calling up customer service, sending email (and, long back, fax) to learn the status of queries, complaints and issues. That is changing. As the voice for digital transformation grows across domains globally, the customer portal is becoming an integrated part of digital transformation and one of the key pillars of its success. Salesforce Communities enables customers in various ways, and in this article we will focus on that.


Customers can be enabled using the following functional capabilities of Salesforce Communities:


  • Self-service capability
  • Security
  • Mobile enablement: ease of collaboration from any device
  • A branded community for the brand
  • A single view of data coming from multiple systems, for example the delivery status of a product
  • Expert groups to guide customers
  • Enabling the customer service team with knowledge articles
  • Case management
  • Community engagement management
  • And more ...


From a technology perspective, Salesforce enables digital transformation by bringing together all the business functionality needed to enable customers through Salesforce Communities, which are scalable and offer an intuitive Lightning UI.


Let me take the classic example of a laptop manufacturer who wants to engage with prospective customers. Digital transformation in such a case can involve enabling customers on a Salesforce customer portal to track the entire journey, from placing the order until the goods are received. Additionally, after receiving the laptop, the customer can ask queries and reach out to customer service or experts using the self-service capability. I will keep the use of Salesforce Commerce Cloud for placing the order as a separate discussion in a forthcoming article; Commerce Cloud gives shoppers and prospective customers a seamless experience.


The customer can track the conversation on the go using an intuitive mobile interface and get a feel of the branded community, which is more engaging and appealing.


At the back end, the laptop manufacturer fulfils the request by bringing together information from multiple applications and offering a seamless single view to the consumer.


We can engage better with consumers by creating expert groups that a consumer can qualify to join if he or she wishes. There can be various levels of such experts, scored on their contributions, and they can avail of special offers during the festival season, in order to attract and retain them as part of brand management.


We will also have the customer service team enabled on Salesforce Service Cloud, answering queries using knowledge articles; this reduces the average time required to handle a query.


Customers can also be enabled to create cases if their queries go unanswered or if they simply wish to. We can also plan escalation management for issues based on certain criteria.
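Case creation from the portal ultimately goes through the standard Salesforce REST API Case endpoint. The sketch below builds a Case body; `build_case_payload` is a hypothetical helper (not part of any Salesforce SDK), and the instance URL, API version and token shown in the comment are placeholders.

```python
import json

def build_case_payload(subject, description, origin="Web"):
    """Build the JSON body for creating a Case via the Salesforce REST API."""
    return {
        "Subject": subject,
        "Description": description,
        "Origin": origin,       # e.g. "Web" for cases raised from the community portal
        "Priority": "Medium",
    }

payload = build_case_payload(
    "Laptop not booting",
    "Device received yesterday does not power on.",
)

# The actual call (instance URL, API version and OAuth token are placeholders):
# requests.post(
#     "https://<instance>.salesforce.com/services/data/v50.0/sobjects/Case",
#     headers={"Authorization": "Bearer <token>", "Content-Type": "application/json"},
#     data=json.dumps(payload),
# )
print(json.dumps(payload, sort_keys=True))
```

Escalation rules can then act on the `Priority` field once the case lands in Service Cloud.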

Thus Salesforce brings a lot of process automation and transformation to the way the end customer is engaged.

The sky is the limit, as Salesforce Communities has matured over the years and the offerings have improved. Please stay tuned for my next article. Until then, thanks everyone.



September 25, 2020

Why Unstructured Data Discovery using AI and NLP?

Why UDD using AI and NLP?

Well, whenever there is an 'un' before anything, it mostly signifies uncertainty, and the same is the case with unstructured data. Following is a simple definition of unstructured data:


"Unstructured data is data which doesn't have a uniform structure or format and can't easily be modelled in a pre-defined data model."


As a matter of fact, back in 1998 Merrill Lynch noted that unstructured data forms the vast majority of data in organizations, with some estimating it at as much as 80% of all data present.


That is a huge amount of data, considering how much data is generated and exchanged in any organization. Now that multiple data privacy laws are in place, it is genuinely a tedious task to stay compliant with those norms. Deciding whether some data is PII (Personally Identifiable Information) that needs to be protected from unauthorized use is a concern for all the data that exists in an organization. The solution is to discover such data and protect it as per the governing data privacy laws. But with unstructured data this is an even more time- and resource-consuming process. Let us see what the difference between structured and unstructured data discovery is.



Structured Data

• Data processing: optimal approaches are available to read/write data efficiently, and indexing mechanisms make searches lightning fast.
• Categorical data: data is mostly divided into columns and categories; once a column is found to contain sensitive data, the whole column can be marked sensitive.
• Context: the data is well structured and little context is involved, so pattern-based matching can be used efficiently.

Unstructured Data

• Data processing: most data is stored in files, so file processing is required for read/write operations, and in most cases a full text scan is needed to search.
• Categorical data: there is no structure to the data, so full text processing is needed to recognize sensitive data.
• Context: the data is highly contextual in nature and requires intelligence to discover sensitive content.


It is clear that sensitive data analysis on unstructured data poses many more challenges than on structured data, yet it is crucial for compliance with the governing privacy laws.

Data discovery using regular expression patterns requires human effort to come up with the optimal pattern to find the data. It will always be error-prone and will not suffice to recognize data with good accuracy; in addition, it requires a lot of human effort and resources to analyze the data. A few good examples where regex pattern-based analysis will not work:

• Name - Sherlock Holmes

• Address - 221B, Baker Street, London

• Contextual data - Sherlock Holmes went to the Apple store to buy an apple, which was nearby his house at 221B, Baker Street, London. (Looks funny, right?)

In all the above examples there is not a single regex pattern that will identify this data as sensitive. The last one is more complex still, as it involves recognizing one "Apple" as an organization and the other as just a fruit.
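A quick stdlib sketch makes the limitation concrete: a pattern can catch the fixed-format address fragment, but a name is only found if it is hard-coded into the pattern, which defeats the purpose of discovery.

```python
import re

text = ("Sherlock Holmes went to Apple store to buy an apple "
        "which was nearby his house - 221B, Baker Street, London.")

# A pattern like this catches a UK-style house number + street fragment...
address_hits = re.findall(r"\d+[A-Z]?,\s*[A-Z][a-z]+\s+[Ss]treet", text)
print(address_hits)  # → ['221B, Baker Street']

# ...but no generic regex can tell a person's name from other capitalized words,
# or the company "Apple" from the fruit "apple". This only works if we already
# know the name we are looking for:
name_hits = re.findall(r"Sherlock Holmes", text)
print(name_hits)  # → ['Sherlock Holmes']
```

Regex therefore covers only the fixed-format fields; everything contextual needs a model.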

I think that is sufficient to answer why we cannot use traditional approaches to discover all kinds of sensitive data, most of which is stored unstructured. The next question is: what then? The answer is simple: an approach that can recognize these patterns automatically, without human effort, even in contextual data, and that is what Artificial Intelligence is for.


What can we do?

Now we are clear on why we need AI-based sensitive data discovery in unstructured data. Since we are dealing with human language data, we can use Natural Language Processing (NLP), as this is exactly what NLP is designed for: processing text data.

NLP provides a vast variety of text processing capabilities such as tokenization, lemmatization, syntactic analysis, semantic analysis, Named Entity Recognition (NER), Part-of-Speech (POS) tagging and many more. NER in particular is very useful for our requirement, as we want to recognize sensitive entities, while POS tagging is very useful for contextual analysis of the data.

There are multiple libraries available for NLP. Some of them are:

• NLTK (Natural Language Toolkit) - a scientific toolkit for dealing with human language data

• Stanford NLP - a Java-based natural language analysis package

• spaCy - industrial-strength natural language processing

Among these, spaCy provides a very clean API and an efficient, ready-to-use package for multiple NLP tasks. It also recognizes many general named entities out of the box, with a very good speed-to-accuracy ratio.

Now that we know what to do, the next question is how. Let us dig deeper into this NLP journey to find out how sensitive data can be recognized.


How to find Sensitive data?

As we saw in the last section, we can use NLP for text analysis and to recognize sensitive data within unstructured text. The main concern with machine learning models (which is what NLP NER models are) is that they cannot give 100% accuracy and cannot work on all kinds of data. So a good approach is to use a combination of NLP, regex patterns and contextual pattern matching to get better accuracy and results. I will explain this through some examples.

Let us consider the following text, in which the highlighted fields are sensitive as per our requirement:

"Nash Smith (Person) went to Apple (Organization) store to buy an iPhone (Product). On the way back home, he also bought an apple (Product). He used his credit card 3782-8224-6310-0055 (Credit Card) for purchasing apples."

1.      NLP NER -

import spacy
from spacy import displacy

text_data = 'Nash Smith went to Apple store to buy an iPhone. '\
            'On the way back home, he also bought an apple. '\
            'He used his credit card 3782-8224-6310-0055 for purchasing apples.'

nlp = spacy.load('en_core_web_sm')  # small English model with a pre-trained NER
doc = nlp(text_data)
displacy.render(doc, style='ent')   # highlight the recognized entities

Code - GitHub Gist


Nash Smith PERSON went to Apple ORG store to buy an iPhone. On the way back home, he also bought an apple. He used his credit card 3782-8224-6310-0055 DATE for purchasing apples.

Using NLP NER we are able to recognize the Name and Organization correctly, which is nearly impossible for a regex. But a few tokens are still missed or wrong: the credit card is picked up, but incorrectly labelled DATE, and iPhone and apple are not recognized as products. A credit card number follows a fixed pattern and can be recognized using regex.


2.      RegEx Matching -

Let us add a few more lines of code to the above to recognize the credit card.


import re
import spacy

text_data = 'Nash Smith went to Apple store to buy an iPhone. '\
            'On the way back home, he also bought an apple. '\
            'He used his credit card 3782-8224-6310-0055 for purchasing apples.'

nlp = spacy.load('en_core_web_sm')
doc = nlp(text_data)

# A credit card number follows a fixed pattern, so a regex handles it well
regex = r'\d{4}[\-\,]?\d{4}[\-\,]?\d{4}[\-\,]?\d{4}'
credit_cards = re.findall(regex, text_data)


Code - GitHub Gist


Nash Smith PERSON went to Apple ORG store to buy an iPhone. On the way back home, he also bought an apple. He used his credit card 3782-8224-6310-0055 CREDIT CARD for purchasing apples.

Using NLP NER + regex we are able to recognize the Name, Organization and Credit Card correctly, which is nearly impossible for either approach individually. But some tokens are still unrecognized: iPhone and apple as products. A product like this will not have a fixed regex pattern, and here context-based matching comes in. We can observe that whenever the verb "buy" is followed by a determiner (DET) and then a noun, it is a product.


3.      Context Based Matching -

Let us add a few more lines of code to the above to recognize products as well.

import re
import spacy
from spacy.matcher import Matcher

text_data = 'Nash Smith went to Apple store to buy an iPhone. '\
            'On the way back home, he also bought an apple. '\
            'He used his credit card 3782-8224-6310-0055 for purchasing apples.'

nlp = spacy.load('en_core_web_sm')
doc = nlp(text_data)
regex = r'\d{4}[\-\,]?\d{4}[\-\,]?\d{4}[\-\,]?\d{4}'
credit_cards = re.findall(regex, text_data)

# "buy" followed by an optional determiner and a noun/proper noun marks a product
matcher = Matcher(nlp.vocab)
matcher.add('PRODUCT', None,
            [{'LEMMA': 'buy'}, {'POS': 'DET'}, {'POS': 'PROPN'}],
            [{'POS': 'VERB', 'LEMMA': 'buy'}, {'POS': 'DET', 'OP': '?'}, {'POS': 'NOUN', 'OP': '+'}])
matches = matcher(doc)

Code - GitHub Gist


Nash Smith PERSON went to Apple ORG store to buy an iPhone PRODUCT. On the way back home, he also bought an apple PRODUCT. He used his credit card 3782-8224-6310-0055 CREDIT CARD for purchasing apples.

Now we can see how merging all three approaches achieves a discovery task that is impossible for any individual approach. The next question is what comes after discovery: to protect an individual's data, the best approach is to anonymize it so that it becomes unidentifiable.
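As a minimal sketch of that anonymization step (assuming discovery has already produced labelled character spans; `anonymize` is a hypothetical helper, not a library function):

```python
def anonymize(text, spans):
    """Replace each discovered (start, end, label) span with a <LABEL> placeholder.

    Spans are applied right-to-left so that earlier offsets stay valid.
    """
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text

text = "Nash Smith used his credit card 3782-8224-6310-0055 yesterday."
spans = [
    (0, 10, "PERSON"),          # "Nash Smith"
    (32, 51, "CREDIT_CARD"),    # the card number
]
print(anonymize(text, spans))
# → <PERSON> used his credit card <CREDIT_CARD> yesterday.
```

The same replacement pass works regardless of whether a span came from NER, regex or the contextual matcher.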



Data privacy is one of the main concerns of today's information age. Every organization has a lot of data to handle, and most of it is unstructured. Extracting an understanding of the sensitive information in that data is a tedious and resource-consuming task that can be automated in multiple ways; but to get the best results, those ways need to be combined. We can then achieve very good accuracy with fewer resources in discovering data, even surpassing manual review.


Tool which offers unstructured data discovery:

Infosys Enterprise Data Privacy Suite (iEDPS) is a patented, enterprise-class data privacy and security product which enables organizations to protect and de-risk sensitive data. iEDPS has helped more than 40 enterprises reduce the cost of privacy and improve data security. It is a one-stop shop for the protection of confidential, sensitive, private, and personally identifiable information within enterprise data sources.


iEDPS supports 180+ algorithms and intelligently provides data protection rules such as data masking, anonymization or pseudonymization for databases, log files, and unstructured data sources such as text files and images. For highly complex needs, customized obfuscation logic can also be plugged into the customer's DevOps pipeline.



Javed Hussain

Senior Systems Engineer, INFCAT

September 24, 2020

Graph DB - A Simpler Approach to Data

Data is growing faster and faster these days, and with it the complexity of its relationships. Since each entity is related to many other entities in some way, such data becomes cumbersome to manage. So far, the market-dominant database vendors have promised such relationship storage, but with limited capability in terms of relational multitude, and they do not scale well. Graph DB answers this concern and plays a vital role in managing dynamic relationships among entities where a traditional RDBMS does not perform well. Is today's market open for such a change? We cannot deny that the technology is gaining popularity rapidly. For example, on 3rd April 2016, in a massive leak, financial files from the dataset of Mossack Fonseca (the fourth biggest offshore law firm), with data dating back to the 1970s, were leaked anonymously to the German newspaper Süddeutsche Zeitung (SZ). The leaked data was linked to individuals and business entities using a Graph DB. The documents exposed a network of more than 200,000 offshore entities, involving people from 200 countries. The nature of a Graph DB makes it easy and quick to relate such complex relationships. This leak is well known as the "Panama Papers", and it changed the world's view of Graph DB from a novice player to an emerging technology. The game is only beginning.

What is a Graph DB?
A Graph DB is database management software which stores data as a graph, where each entity is denoted by a node and the lines connecting them are edges. What makes it different is that each node represents a certain class or real-world entity, and their interactions are represented via edges. Additional information can be stored with nodes as key-value pairs. Also, there is no meta-model defined beforehand for storing data; Graph DBs are very flexible, unlike a traditional RDBMS. We will discuss this in detail in the next section.
Below are the properties of a Graph DB:
1. Nodes or vertices: an entity, or class of entities, which can be schema-less or schema-enforced. Additional meta information is also tagged on, depending on the Graph DB vendor.
2. Edges or relationships: a directed edge connecting two nodes to represent some relationship between entities. Although stored in a specific direction, relationships can always be navigated efficiently in either direction. Some vendors allow extra information to be stored on edges, e.g. OrientDB.
3. Property: a key-value pair for each attribute of a node, stored with the node. Constraints such as must-contain or not-null can be applied on a property, depending on the chosen Graph DB vendor.
4. Index: a feature, off by default in almost all Graph DBs, which when enabled makes searching nodes by a given key-value pair superfast.
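These building blocks can be pictured with a minimal in-memory sketch (hypothetical data; a real Graph DB persists, indexes and scales this natively):

```python
# Nodes carry key-value properties; edges are directed but navigable both ways.
nodes = {
    1: {"class": "Person", "name": "TOM", "city": "London"},
    2: {"class": "Person", "name": "FEDRICK"},
}
edges = [
    {"class": "Friend", "out": 1, "in": 2, "since": 2018},  # TOM -> FEDRICK
]

def neighbours(node_id):
    """Follow edges in either direction, as a graph engine does."""
    result = []
    for e in edges:
        if e["out"] == node_id:
            result.append(e["in"])
        elif e["in"] == node_id:
            result.append(e["out"])
    return result

print(neighbours(1))          # → [2]
print(nodes[2]["name"])       # → FEDRICK
```

Note that the edge itself holds a property (`since`), exactly as OrientDB allows.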

Why Graph DB?
In the previous section we saw what a Graph DB is; now we will take a closer look at why it matters for today's business. RDBMS (relational DBMS) have dominated for a long period, up to the present day. What makes them so dominant is their capability in terms of OLAP, OLTP and ACID. However, they are verbose in meta information and very rigid in structure, i.e. once defined, the schema cannot be altered easily. We compare the features point by point below.
Efficiency: Graph DBs are very efficient at querying because they pick nodes and relationships directly from the DB and start building the result set; an RDBMS, on the other hand, always requires joins and filter predicates, which means it must pre-compute everything before it returns any result. This computation can sometimes be very time-consuming, especially in scenarios where relationships are dynamic and many-fold.
Multiplicity/Flexibility: as we have learned, a Graph DB stores every entity as a node and every relationship as an edge, so it provides a very flexible way of representing relationships and querying them at the same time. In an RDBMS, by contrast, relations must be stored in a specific column and cannot be dynamic. E.g. a person's friend relation can be stored in a Friends column as a foreign key, but if the same person also follows his friend, another column is required to store that detail; if the person later endorses some action of the friend, yet another column is needed. This makes an RDBMS far more verbose in its storage.
Meta-model/Metadata/Schema: an RDBMS is always rigid in structure; once defined, editing is tedious. A Graph DB by nature is schema-less and can be extended to any schema at any given point of time; some nodes can hold additional properties without affecting existing nodes in any way. With certain Graph DB vendors this can also be schema-oriented, but the schema binding is completely optional.
ACID: today almost all Graph DB vendors provide ACID capability, as well as OLAP and OLTP, with full control for the DB admin to configure, as in an RDBMS.
Relation-based query: this is one of the most admired capabilities of a Graph DB. It allows nodes to be queried based on the Nth degree of a relation. E.g. say a person "Tom" wants his 7th-degree mutual friends list, i.e. friends-of-friends, collecting names down to the 7th layer. A traditional RDBMS struggles badly with this type of query, and the results are not real time.
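The Nth-degree query can be pictured as a breadth-first traversal over edges. Below is a toy Python sketch with hypothetical names; a graph engine executes this natively over stored edges, without joins:

```python
# Toy friendship graph (hypothetical names).
friends = {
    "TOM": {"ANN", "BOB"},
    "ANN": {"TOM", "FEDRICK"},
    "BOB": {"TOM", "HARRY"},
    "FEDRICK": {"ANN"},
    "HARRY": {"BOB"},
}

def friends_at_degree(start, degree):
    """People whose shortest path from `start` is exactly `degree` hops (BFS frontier)."""
    seen, frontier = {start}, {start}
    for _ in range(degree):
        frontier = {f for person in frontier
                    for f in friends.get(person, ()) if f not in seen}
        seen |= frontier
    return frontier

print(sorted(friends_at_degree("TOM", 2)))  # → ['FEDRICK', 'HARRY']
```

Each step only expands the current frontier, so cost grows with the neighbourhood size rather than with the total table size, which is why deep relation queries stay fast.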

Graph DB in Action

Let's see one simple use case.

Here I will be using OrientDB as the graph engine. OrientDB is a multi-model DB engine which provides document-based, key-value-based and graph-based processing engines.

Connect via console:
Once your server is running, you can start the console with console.bat on Windows or console.sh on Unix-based systems, and connect as below:

connect remote:localhost/demodb admin admin

You are now connected to demodb.

To create vertex and edge classes, for example:

CREATE CLASS Person EXTENDS V
CREATE CLASS Friend EXTENDS E

a. Here EXTENDS V is mandatory: as OrientDB is a multi-model database, we need to be sure we are creating a vertex class. V stands for vertex and, similarly, E stands for edge.
b. Friend is thus a class of type edge; records are then added with CREATE VERTEX and CREATE EDGE statements.

Details can be viewed in table form with a plain query:

SELECT FROM Person

Now, if TOM wants to know all of his friends-of-friends, who in this example are FEDRICK and HARRY, he can traverse the Friend edges, for example (filtering TOM himself out of the result):

SELECT expand(both('Friend').both('Friend')) FROM Person WHERE name = 'TOM'

For a more detailed explanation, follow the link given in the references, where all types of query examples are available from the OrientDB community.

So far we have covered the main uses and their explanation. Advantages of using a Graph DB:
• Performance
• Flexibility
• Schema-less design
• Agility

Limitations:
• No common declarative query language like SQL
• Different vendors have different APIs
• No mature community support network, unlike RDBMS, unless you buy an Enterprise Edition with a support subscription

Final word
With the increased use of Graph DBs in the market, there is a need for a common query language like SQL. Early in 2017, the idea of a standalone Graph Query Language (GQL) was raised by ISO SC32/WG3 members, so in time we should have one GQL across all graph engine vendors. Currently, Neo4j's Cypher can be used on certain other engines, such as SAP HANA Graph and Redis Graph, via the openCypher project.

We at iEDPS (Enterprise Data Privacy Suite from Infosys) are leveraging Graph DB to give user-centric insights on data privacy assessments and an end-to-end view of how sensitive data flows within the organization.


Reference: the openCypher project

Written By : Amit Sinha

IoT Data Protection - Privacy by Design

IoT (Internet of Things) connects physical and virtual objects and can handle large amounts of data capture, transfer and storage. The global IoT market is expected to grow into a trillion-dollar business in the next few years. With applications in various fields, IoT is used to connect to remote objects, collect data, process it, transfer the captured data, and store it in a database. IoT helps in getting data from places we cannot reach physically, with a simple internet connection. Consumers can be institutions, individual researchers, or enterprises. With customers at various levels, we must protect both the data collected from consumers and the data captured by IoT devices. IoT devices can contain various sensors and components, and it is important to protect them from every angle.

The smart objects that help decrease human intervention in a task need equally smart data protection, to avoid incidents like the casino data leak or the Orvibo data leak. Let's study these incidents briefly.

Casino Data Leak:

A major leak of high-paying customer information was uploaded from a casino to a public cloud. The casino had centrally connected IoT-based temperature sensors for its aquarium, and the temperature sensor, connected to the Wi-Fi, served as the entry point for the breach.

Orvibo Data Leak:

Orvibo manages smart home IoT applications in various countries. SmartMate, the database platform used for this, had no password, leaving the smart devices of 2 million customers unprotected, with around 2 billion records exposed.

Neither of the above cases had an efficient data protection plan, which led to the exposure of billions of private records to the public. Thus, it is necessary to plan for data privacy and protection from the start.

Privacy by Design:

Privacy by Design is the concept of proactively including the required privacy-protective measures in the design of information technology systems, networked infrastructure, and business practices from the start, which helps identify the vulnerable areas.

Privacy by Design advances data protection and privacy from the start. Data privacy has become an important factor in the buying process of many consumers; they are more conscious of the importance of data protection due to the high-profile data breaches of the past. Companies investing in data privacy may win consumer trust more easily than companies that do not, and organizations prioritizing data protection will gain a significant competitive advantage. Below are some best practices for adopting Privacy by Design.

• It is important to think about and implement privacy from the early stages of the development process, considering the sensitive nature of the personal data collected

• Collection and processing of personal data must be strictly limited to the defined purpose, and personal data must not be used for any other purpose

• Avoid collecting or processing personal data that is not necessary for fulfilling the purpose, and limit the amount of personal data collected

• The application should allow data subjects to delete their personal data whenever they choose to do so

• Implement strong cybersecurity measures consistent with industry standards, as this is essential for safeguarding privacy

• Information should be provided clearly on what processing the application will perform on the personal data

Infosys offers the Enterprise Data Privacy Suite (iEDPS), which can help avoid data breach incidents and comply with data privacy regulations. A data leak from a hacked IoT device can have adverse effects, so we must protect the data stored in IoT cloud databases. Before protecting the data, it is necessary first to identify the sensitive fields; then encryption or masking should be applied to those fields. Below are some key features of iEDPS:

• Data Discovery: regular scans of the database find the presence of sensitive information and help in understanding what data should be protected.

• Data Protection via Masking: around 180+ algorithms are available to choose from; sensitive data identified during discovery is masked with suitable algorithms.

• Data Subsetting options are provided too.

• Test Data Generation, which generates realistic test data, almost comparable to genuine data, for better testing.

• Efficient data copying options.
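As an illustration of the masking idea in the list above, here is a generic format-preserving substitution sketch; it is not one of iEDPS's actual algorithms, just a minimal example of the technique:

```python
import random

def mask_preserving_format(value, seed=42):
    """Replace letters with random letters and digits with random digits,
    keeping length, case and punctuation so the masked value stays realistic."""
    rng = random.Random(seed)  # fixed seed makes masking repeatable
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(str(rng.randrange(10)))
        elif ch.isupper():
            out.append(chr(rng.randrange(26) + ord("A")))
        elif ch.islower():
            out.append(chr(rng.randrange(26) + ord("a")))
        else:
            out.append(ch)  # keep separators like '-' and '@' as-is
    return "".join(out)

print(mask_preserving_format("3782-8224-6310-0055"))  # still looks like a card number
print(mask_preserving_format("rohini@example.com"))   # still looks like an email
```

Because the shape of the value is preserved, masked records remain usable for testing while the original sensitive values are unrecoverable.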

Along with a well-planned privacy design for IoT, iEDPS provides various ways to protect customer data and help organizations abide by data privacy rules.

Written By : Rohini & Avin