Infosys experts share their views on how digital is significantly impacting enterprises and consumers by redefining experiences, simplifying processes and pushing collaborative innovation to new levels


Graph DB - A Simpler Approach to Data

Data is growing faster every day, and so is the complexity of the relationships within it. Because each entity is related to many other entities in some way, such data becomes cumbersome to manage. The relational databases that dominate the market today can store relationships, but they are limited in how many kinds of relationship they can model conveniently and do not scale well as those relationships multiply. Graph DB addresses this concern and plays a vital role in managing dynamic relationships among entities, an area where traditional RDBMS do not perform well.

Is today's market open to such a change? We cannot deny that the technology is gaining popularity faster than many expected. For example, the reporting published on 3rd April 2016 was based on a massive leak: financial files from Mossack Fonseca (the fourth biggest offshore law firm), with data dating back to the 1970s, had been leaked anonymously to the German newspaper Süddeutsche Zeitung (SZ). The leaked records were linked to individuals and business entities using a Graph DB. The documents exposed a network of more than 200,000 offshore entities involving people from 200 countries. The nature of Graph DB makes it possible to connect such complex relationships easily and quickly. This leak is well known as the "Panama Papers". It changed the world's view of Graph DB from a novice player to an emerging technology. The game is only beginning.

What is a Graph DB?
A Graph DB is database management software that stores data as a graph: each entity is denoted by a node, and the lines connecting nodes are the edges. What makes it different is that each node represents a class or real-world entity, and the interactions between entities are represented by edges. Additional information can be stored with nodes as key-value pairs. Also, no meta model has to be defined beforehand for storing data, which makes a Graph DB very flexible, unlike a traditional RDBMS. We will discuss this in detail in the next section.
Below are the main elements of a Graph DB:
1. Nodes or Vertices: A representation of an entity or class of entity, which can be schema-less or schema-enforced. Additional meta information can also be attached to a node, depending on the Graph DB vendor.
2. Edges or Relationships: A directed edge connects two nodes or vertices to represent a relationship between entities. Although edges are stored with a specific direction, relationships can always be navigated efficiently in either direction. Some vendors, e.g. OrientDB, allow extra information to be stored on edges.
3. Property: A key-value pair holding an attribute of a node or vertex, stored with that node. Depending on the chosen Graph DB vendor, constraints such as mandatory or not null can be applied to properties.
4. Index: A feature offered by almost all Graph DBs, usually turned off by default, which when enabled allows very fast lookup of nodes by a given key-value pair.
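To make the four elements above concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the model, not how OrientDB or any other vendor stores data; the class names (Node, Edge, PropertyGraph), the SINCE property and the always-on index are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    label: str                                   # class of the entity, e.g. "PERSON"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: int                                  # id of the node the edge starts from
    target: int                                  # id of the node the edge points to
    label: str                                   # relationship type, e.g. "FRIENDS"
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical toy model: nodes, directed edges, properties and an index."""

    def __init__(self):
        self.nodes = {}      # node id -> Node
        self.outgoing = {}   # node id -> edges leaving that node
        self.incoming = {}   # node id -> edges arriving at that node
        self.index = {}      # (key, value) -> set of node ids

    def add_node(self, label, **properties):
        node = Node(len(self.nodes) + 1, label, properties)
        self.nodes[node.node_id] = node
        # For simplicity the index is always maintained here.
        for key, value in properties.items():
            self.index.setdefault((key, value), set()).add(node.node_id)
        return node

    def add_edge(self, source, target, label, **properties):
        edge = Edge(source.node_id, target.node_id, label, properties)
        # Recording the edge on both endpoints lets it be walked either way.
        self.outgoing.setdefault(edge.source, []).append(edge)
        self.incoming.setdefault(edge.target, []).append(edge)
        return edge

    def find(self, key, value):
        """Look up nodes by a key-value pair through the index, without scanning."""
        ids = self.index.get((key, value), set())
        return [self.nodes[node_id] for node_id in sorted(ids)]

    def neighbours(self, node, label, direction="out"):
        """Follow edges of one relationship type in the given direction."""
        if direction == "out":
            edges = self.outgoing.get(node.node_id, [])
            return [self.nodes[e.target] for e in edges if e.label == label]
        edges = self.incoming.get(node.node_id, [])
        return [self.nodes[e.source] for e in edges if e.label == label]


graph = PropertyGraph()
tom = graph.add_node("PERSON", NAME="TOM", OCCUPATION="ENGINEER")
charles = graph.add_node("PERSON", NAME="CHARLES")
graph.add_edge(tom, charles, "FRIENDS", SINCE=2015)   # SINCE is a made-up edge property

print([n.properties["NAME"] for n in graph.neighbours(tom, "FRIENDS")])                      # ['CHARLES']
print([n.properties["NAME"] for n in graph.neighbours(charles, "FRIENDS", direction="in")])  # ['TOM']
```

Adding a new kind of relationship, say FOLLOWS, needs no structural change in this sketch: it is just another call to add_edge with a different label.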

Why Graph DB?
In the previous section we saw what a Graph DB is. Now we will take a closer look at why it is important in today's business. RDBMS (Relational DBMS) have been dominant for a long time, and what has kept them dominant is their capability in terms of OLAP, OLTP and ACID. However, they carry redundant meta information and are very rigid in structure, i.e. once a schema is defined it cannot be altered easily. We compare the two on each feature in the points below.
Efficiency: Graph DBs are very efficient at querying because they are built to pick up nodes and relationships directly from storage and start building the result set immediately. An RDBMS, on the other hand, needs joins and filter predicates, which means it must compute intermediate results before it can return anything. This computation can be very time consuming, especially where relationships are dynamic and many-to-many in nature.
Multiplicity/Flexibility: As we have already learned, a Graph DB stores every entity as a node and every relationship as an edge, so it offers a very flexible way both to represent relationships and to query them. In an RDBMS, on the other hand, relationships must be stored in specific columns and cannot be dynamic. E.g. a person's friend relationship can be stored in a Friends column as a foreign key, but if the same person also follows that friend, another column is required to store this, and if the person later endorses some action of his/her friend, yet another column is required. This makes an RDBMS more verbose in terms of storage.
Meta model/Metadata/Schema: RDBMS are rigid in structure, i.e. once a schema is defined, editing it is tedious. Graph DBs, by contrast, are schema-less by nature and can be extended at any point in time: new nodes can hold additional properties without affecting any existing node. Some Graph DB vendors also support schema enforcement, but this schema binding is completely optional.
ACID: Today almost all Graph DB vendors provide ACID capability, along with OLAP and OLTP, with full control for the DB admin to configure them as in an RDBMS.
Relation Based Query: This is one of the most admired capabilities of Graph DB. Nodes can be queried based on the Nth degree of their relationships. E.g. say a person "Tom" wants the list of his 7th-degree friends, i.e. follow friends-of-friends and start collecting names only after reaching the 7th layer. A traditional RDBMS struggles with this type of query and cannot return results in real time.
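The Nth-degree query described above can be sketched as a breadth-first walk over adjacency lists. This is a simplified illustration rather than any vendor's query engine; the function name friends_at_degree and the people ANNA and LISA are made up for the example.

```python
def friends_at_degree(friends, start, degree):
    """Return the people whose shortest friendship path from `start` is exactly `degree` hops."""
    seen = {start}
    frontier = [start]
    for _ in range(degree):
        next_frontier = []
        for person in frontier:
            # Each hop only touches the edges of the people in the current layer;
            # there is no join over the whole table of friendships.
            for friend in friends.get(person, ()):
                if friend not in seen:
                    seen.add(friend)
                    next_frontier.append(friend)
        frontier = next_frontier
    return sorted(frontier)


# Hypothetical directed FRIENDS edges: person -> list of friends.
friends = {
    "TOM": ["CHARLES", "ANNA"],
    "CHARLES": ["FEDRICK", "HARRY"],
    "ANNA": ["HARRY"],
    "FEDRICK": ["LISA"],
}

print(friends_at_degree(friends, "TOM", 1))  # ['ANNA', 'CHARLES']
print(friends_at_degree(friends, "TOM", 2))  # ['FEDRICK', 'HARRY']
print(friends_at_degree(friends, "TOM", 3))  # ['LISA']
```

Going from the 2nd to the 7th degree only changes the degree argument, whereas the equivalent relational query would typically need one more self-join per extra level.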

Graph DB in Action

Let's see one simple use case.

Here I will be using OrientDB as the graph engine. OrientDB is a multi-model DB engine that provides document-based, key-value-based and graph-based processing engines.

Connect via console:
Once your server is running, you can start your console with console.bat on Windows or console.sh on Unix-based systems.
Give the connection URL as below:
connect remote:localhost/demodb admin admin
You are now connected to demodb.
To create nodes/vertices and edges:
1. CREATE CLASS PERSON EXTENDS V
a. Here EXTENDS V is mandatory: OrientDB is a multi-model database, so we need to make sure we are creating a vertex. V stands for vertex and, similarly, E stands for edge.
2. INSERT INTO PERSON SET NAME='TOM', OCCUPATION='ENGINEER', NATIVE='AMERICAN'
graphDB1.png
3. CREATE CLASS FRIENDS EXTENDS E
a. This is a class of type edge
4. CREATE EDGE FRIENDS FROM (SELECT FROM PERSON WHERE NAME='TOM') TO (SELECT FROM PERSON WHERE NAME='CHARLES')
a. This assumes a PERSON vertex named CHARLES has already been inserted in the same way as TOM
graphDB2.png
This is how the details look in table form:

And here is the graph:
graphDB4.png
Now, if TOM wants to know all of his mutual friends (the friends of his friends), who here are FEDRICK and HARRY, he can use the query below:

SELECT FROM PERSON WHERE @RID IN (MATCH {CLASS:PERSON, AS:P, WHERE: (NAME='TOM')}-FRIENDS->{CLASS:PERSON, AS:DF}-FRIENDS->{CLASS:PERSON, AS:MF} RETURN MF)

The above query fetches the result below:
graphDB5.png
For a more detailed explanation, follow the link given in the references, where query examples of all types are made available by the OrientDB community.
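To show what the MATCH pattern in the query above is doing, the following Python sketch walks the same two FRIENDS hops over a small edge list. It is not OrientDB code and says nothing about how OrientDB executes the query; the edge data is assumed from the example (the original graph is only shown as an image), and the function name is hypothetical.

```python
# Assumed FRIENDS edges for the example graph: (from, relationship, to).
EDGES = [
    ("TOM", "FRIENDS", "CHARLES"),
    ("CHARLES", "FRIENDS", "FEDRICK"),
    ("CHARLES", "FRIENDS", "HARRY"),
]


def match_friends_of_friends(edges, name):
    """Return one row per match of the pattern P -FRIENDS-> DF -FRIENDS-> MF, where P is `name`."""
    # Group outgoing FRIENDS edges by their source vertex.
    outgoing = {}
    for source, label, target in edges:
        if label == "FRIENDS":
            outgoing.setdefault(source, []).append(target)

    rows = []
    for direct_friend in outgoing.get(name, []):           # first hop, bound to DF
        for mutual_friend in outgoing.get(direct_friend, []):  # second hop, bound to MF
            rows.append({"P": name, "DF": direct_friend, "MF": mutual_friend})
    return rows


rows = match_friends_of_friends(EDGES, "TOM")
print(sorted({row["MF"] for row in rows}))  # ['FEDRICK', 'HARRY']
```

Only the MF binding is kept at the end, which mirrors the RETURN MF clause of the query.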

Advantages
We have so far covered what a Graph DB is and how it is used. Here are the advantages of using Graph DB:
• Performance
• Flexibility
• Schema-less
• Agility

Disadvantages
• No common declarative query language like SQL
• Different vendors have different APIs
• There is not yet a mature community support network comparable to that of RDBMS; support of all types is available only if you buy an Enterprise Edition with a support subscription.

Final word
With the increasing use of Graph DB in the market, there is a need for a common query language like SQL. Early in 2017 the idea of a standalone graph query language (GQL) was raised by members of ISO SC32/WG3. So in time we should have one GQL for all graph engine vendors. Currently Cypher, from Neo4j, can be used with certain other vendors such as SAP HANA Graph and RedisGraph via the openCypher project.

We at iEDPS (Enterprise Data Privacy Suite from Infosys) are leveraging Graph DB to give user-centric insights on data privacy assessments and an end-to-end view of how sensitive data flows within the organization.

References:

openCypher project: http://www.opencypher.org/

Written By : Amit Sinha
