
October 30, 2016

Pragmatic Data Quality Approach for a Data Lake

On 26 Oct 2016, we presented our thought paper at the PPDM conference hosted at the Calgary Telus Spark Science Centre (Calgary Data Management Symposium, Tradeshow & AGM).

http://dl.ppdm.org/dl/1830

Abstract:

With the increase in the amount of data produced by sensors, devices, interactions, and transactions, ensuring ongoing data quality is a significant task and concern for most E&P companies. As a result, most source systems have deferred the task of data clean-up and quality improvement to the point of usage. Within the Big Data world, the concept of the Data Lake, which allows ingesting all types of data from source systems without worrying about their type or quality, further complicates data quality because data structure and usage are left to the consumer. Without a consistent governance framework and a set of common rules for data quality, a Data Lake may quickly turn into a Data Swamp. This paper examines the important aspects of data quality within the Upstream Big Data context, and proposes a balanced approach to data quality assurance across data ingestion and data usage, to improve data confidence and readiness for downstream analytical efforts.
 

The key points/messages that we presented were:

1. Data quality is NOT about transforming or cleansing the data to fit into particular perspectives; instead, it is about putting the right perspective on the data.


2. Data by itself is not good or bad; it is just data, pure in its most granular form.


3. Quality is determined by the perspective through which we look at the same data.


4. Take an architectural approach that abstracts the data from the perspectives or standards and builds a layer of semantics to view the same data from different points of view. We do not need to populate data into models (PPDM, PODS, etc.); instead we put models on top of the existing data, promoting the paradigm of "ME and WE" where each consumer of the data has their own viewpoint on the same data. The concept of the WELL can be viewed in reference to Completion, Production, Exploration, etc. without duplicating the data in the data lake (see the sketch after this list).


5. Deliver quick value to the business and build their trust in the data in the data lake scenario.
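To make point 4 concrete, here is a minimal sketch (in Python, with hypothetical field names and mappings) of how the same granular well record can be exposed through Completion, Production, and Exploration perspectives without being copied into separate models:

```python
# Minimal sketch of "models on top of the data": the same raw well record
# is exposed through different read-only perspectives instead of being
# duplicated per data model. All field names and values are hypothetical.

RAW_WELL = {
    "uwi": "100/01-02-003-04W5/00",
    "spud_date": "2014-06-01",
    "completion_date": "2014-08-15",
    "daily_oil_bbl": 420.0,
    "target_formation": "Cardium",
}

# Each "model" is just a mapping from a perspective's vocabulary to raw fields.
PERSPECTIVES = {
    "completion": {"well_id": "uwi", "completed_on": "completion_date"},
    "production": {"well_id": "uwi", "oil_rate_bbl_per_day": "daily_oil_bbl"},
    "exploration": {"well_id": "uwi", "formation": "target_formation"},
}

def view(record, perspective):
    """Return the record as seen through one perspective, without copying it."""
    mapping = PERSPECTIVES[perspective]
    return {alias: record[source] for alias, source in mapping.items()}

if __name__ == "__main__":
    print(view(RAW_WELL, "production"))
    # {'well_id': '100/01-02-003-04W5/00', 'oil_rate_bbl_per_day': 420.0}
```

The raw record stays untouched in the lake; each consumer only carries a lightweight mapping that expresses its own viewpoint.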


Please refer to the link below for details.

http://dl.ppdm.org/dl/1830

July 13, 2016

Industrial Internet of Things (IIoT) - Conceptual Architecture

 

The popularity of the Internet of Things (IoT) is growing rapidly. More and more devices (things) are getting connected to the internet every day. The value potential of these connected devices is enormous; we have witnessed only a fraction of it so far. Many startups are in the process of building data-driven products, solutions, or services that can disrupt traditional operational procedures. Major cloud vendors have also ventured into this space, providing IoT as a key offering in their product stacks.

Industrial IoT extends the general concept of IoT to an industrial scale. Every industry has its own set of devices and home-grown or proprietary applications with limited interfaces, and for some, even network bandwidth is a major concern. Considering these challenges and limitations, which vary from industry to industry, there is no single solution that fits all. Every industry is unique, with a varied set of use cases that require custom tailoring.

This article describes a conceptual architecture for the Industrial Internet of Things (IIoT), agnostic of any technology or solution.

Below are the key components of any typical IIoT landscape:


[Figure: IIoT conceptual architecture]

a) Industrial Control Systems (ICS)

These provide a first-hand view of events across industrial systems to the field staff who manage industrial operations. They are generally deployed at industrial sites and include Distributed Control Systems (DCS), Programmable Logic Controllers (PLCs), Supervisory Control and Data Acquisition (SCADA) systems, and other industry-specific control systems.

b) Devices

These are industry-specific components that interface with digital or analog systems and expose data to the outside digital world. They provide machine-to-machine and human-to-machine (and vice versa) capability for ICS to exchange information (real time or near real time), enabling the other components of the IIoT landscape. They include sensors, interpreters, translators, event generators, loggers, etc.

They interface with the ICS, transient data stores, channels, and processors.

c) Transient Store

This is a temporary, optional data store connected to a device or an ICS. Its primary purpose is to ensure data reliability during outages and system failures, including network failures. It includes attached storage, flash memory, disks, etc.

Transient stores generally come as attached or shared storage for the devices.

d) Local Processors

These are low-latency data processing systems located at or near the industrial sites. They provide fast processing of small data close to its source. They include data filters, rule-based engines, event managers, data processors, algorithms, routers, signal detectors, etc.

They generally feed data into the remote applications deployed at the industrial sites. At times they are integrated with the devices themselves for data processing.

e) Applications (Local, Remote, Visualization)

These are deployed on site or off site to meet business-specific needs. They provide insights into and views of the field operations in real time (for operators) and both real-time and historical (for business users and other IT staff), enabling them to make effective and calculated decisions. They include web-based applications, tools to manipulate the data, manage devices, and interact with other systems, alerts, notifications, visualizations, dashboards, etc.

f) Channels

These are the media for data exchange between the devices and the outside world. They include satellite communication, routers, network protocols (web-based or TCP), etc.

g) Gateways

These provide communication across multiple networks and protocols, enabling data interchange between distributed IIoT components. They include protocol translators, intelligent signal routers, etc.

h) Collectors

These are data gatherers that collect and aggregate data from gateways leveraging standard protocols. They can be custom built or off-the-shelf products that vary from industry to industry; examples include OPC data collectors, event stream management systems, application adapters, brokers, etc.

i) Processors

These are the core of any IIoT solution; their primary function is to cater to specific business needs. They include stream processors, complex event processing, signal detection, scoring of analytical models, data transformers, advanced analytical tools, executors for machine learning training algorithms, ingestion pipelines, etc.
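As a hedged illustration of the stream-processing and signal-detection role described above, the sketch below (plain Python, with made-up readings, window size, and deviation margin) flags any reading that deviates sharply from a rolling average of recent values:

```python
from collections import deque

# Toy event detector: flag a reading when it deviates from the rolling
# average of the last N readings by more than a set margin.
# Window size, margin, and the sample readings are illustrative only.

def detect_events(readings, window=5, margin=10.0):
    recent = deque(maxlen=window)
    events = []
    for i, value in enumerate(readings):
        if len(recent) == recent.maxlen:
            avg = sum(recent) / len(recent)
            if abs(value - avg) > margin:
                events.append((i, value, avg))
        recent.append(value)
    return events

if __name__ == "__main__":
    pressures = [101, 102, 100, 103, 101, 102, 135, 101, 100, 99]
    for index, value, avg in detect_events(pressures):
        print(f"event at sample {index}: value {value} vs rolling avg {avg:.1f}")
```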

j) Permanent Data Store and Application Data Store

These are the long-term data storage systems generally linked to an IIoT solution. They act as historians for the device data along with data from other sources, and they feed data into the processors for advanced analytics and model building. They include massively parallel processing (MPP) data stores, on-cloud/on-prem data repositories, and data lakes providing high-performance, seamless data access to both business and IT. Examples include historians, RDBMSs, open-source data stores, etc.

k) Models

There are two types of models widely used in IIoT solutions: data models and analytical models. Data models define a structure for the data, while analytical models are custom built to cater to industry-specific use cases. Models play an important role in any IIoT solution; they provide a perspective on the data. Models are generally built by leveraging the data in the permanent data stores, human experience, and industry standards. Analytical models are trained on historical data sets or through machine-based training processes. Examples of analytical models include clustering, regression, mathematical, and statistical models; examples of data models include information models, semantic models, entity-relationship mappings, JSON, XML/XSD, etc.

The models are fed back into the data stores, processors, applications, and gateways.
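As a rough sketch of how an analytical model might be trained from a permanent data store, the snippet below fits a simple regression with scikit-learn on a made-up historian extract; the variables, values, and relationship are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historian extract: choke opening (%) vs observed flow rate.
choke_pct = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
flow_rate = np.array([12.0, 25.0, 36.0, 49.0, 61.0])

# Train a simple analytical model on the historical data set.
model = LinearRegression().fit(choke_pct, flow_rate)

# The trained model can then be fed back to processors/applications
# for scoring, e.g. predicting flow for a 35% choke opening.
print(model.predict(np.array([[35.0]])))
```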

l) Security

Security is the most important aspect of any IIoT application. It runs through the entire pipeline, from the source to the point of consumption, and is critical for small, medium, and large data-driven digital enterprises dealing with their data in the IIoT world. It includes data encryption, user access, authentication, authorization, user management, networks, firewalls, redaction, masking, etc.

m) Computing Environments

These vary from industry to industry depending upon the business landscape and the nature of the business (Retail, Health Care, Manufacturing, Oil and Gas, Utilities, etc.):

  • Fog Computing - Bringing analytics near to the devices/source

  • Cloud Computing - Scaling analytics globally across the enterprise

  • On-Prem Computing - Crunching data in existing high performance computing centers

  • Hybrid Computing - A mix of on-cloud, on-prem, and fog computing, optimizing operations tailored to specific industrial business needs




 



April 11, 2016

How to make a 'Data Lake'

The Data Lake has become a buzzword these days, and we see enterprises actively investing to build their own.

As part of the digital agenda of most enterprises, the Data Lake is one of the most prominent focus areas. Investments are being made in data acquisition, storage (cloud or on-prem), and analytics. Yet the success rate for most enterprises is dismal. The reason is not a lack of capability or technology, but the absence of the right direction and a focus on value.

The hype of landing all the data in one place and thinking about its usage later has created more Data Swamps than valuable Data Lakes.

My article in the Digital Energy Journal (Issue 60, Apr-May 2016) is a first step toward giving some structure to the concept of the Data Lake.

Below is an image from the article, reproduced with the permission of the editor.

[Figure: Data Lake]

November 13, 2015

Analytics Funnel

  Mining "VALUE" from data is an art of science commonly referred as Data Science. The value lies in the data hidden deep within the fabric of the enterprise. Extracting the value requires skills, tools and techniques. If these are combined with the right methodology governed by principles and standards the process becomes simpler. In one of my published articles, I have tried to depict this methodology in the form of an Analytics Funnel.

http://www.digitalenergyjournal.com/n/fbdf03f6.aspx

October 27, 2014

OPC UA and High Speed Data streaming- Enabling Big Data Analytics in Oil and Gas Industry

Role of OPC UA in Oil and Gas Business

OPC was primarily a Microsoft-based solution for interfacing with manufacturing and production systems. It was an abbreviation of "Object Linking and Embedding" (OLE) for Process Control (OPC). With time, this concept has evolved into Open Platform Communications (OPC), today also commonly expanded as Open Productivity and Connectivity. The classic Microsoft version of OPC relies on COM and DCOM objects for interfacing with machine data. COM and DCOM have their own limitations, mostly related to platform dependence, security, and data portability across firewalls and other non-Microsoft applications.

The OPC Foundation has come up with an architecture framework and standards to overcome the limitations of the classic flavor of OPC. This new standard, or framework, is referred to as OPC Unified Architecture (OPC UA). It provides standards for data extraction, security, and portability, independent of any proprietary technology or platform, with APIs available in Java, .NET, and ANSI C. European energy companies are among the early adopters of these standards and are funding vendors to develop products around them. Vendors are already in the market offering product stacks for OPC data streaming, web services, visualization tools, and real-time data analytics. OPC UA is not going to replace existing OPC Classic implementations in the short term, but it can act as a mechanism for exposing this data to the enterprise in a much more effective fashion.

The OPC Foundation certifies the products developed by these vendors and exposes APIs to manipulate the data. OPC UA provides a more standard mechanism for representing data types, ranging from simple types to complex data structures. Models can be created to cater to different needs of the business, whether the data is in motion (real-time data analytical models) or at rest (staged data analytical models), and business rules can easily be configured to process the information in a more time- and cost-effective manner. The various data streaming tools enable simultaneous streaming of information to multiple applications and data stores by using high-speed data transfer protocols. OPC UA provides a robust security model for data transfers and also enables custom application development by leveraging the APIs, helping enterprises get the most value out of their data.
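For illustration only, the open-source python-opcua client library (one of several community and vendor SDKs; the article itself mentions Java, .NET, and ANSI C APIs) can read a node value from a UA server in a few lines. The endpoint URL and node id below are placeholders, not a real deployment:

```python
from opcua import Client  # python-opcua (FreeOpcUa) client library

# Placeholder endpoint and node id; substitute real values for an actual server.
ENDPOINT = "opc.tcp://localhost:4840/freeopcua/server/"
NODE_ID = "ns=2;i=2"

client = Client(ENDPOINT)
try:
    client.connect()
    node = client.get_node(NODE_ID)
    value = node.get_value()  # read the current value of the node
    print("current value:", value)
finally:
    client.disconnect()
```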

Most Oil and Gas Exploration & Production companies rely on proprietary products that cover the deficiencies of OPC Classic by creating their own wrappers to expose this data to the enterprise. This creates a strong dependency on these product vendors and their lines of products to cater to different business needs, and it leads to high licensing and infrastructure costs for upstream businesses. Due to the proprietary nature of the data, data extraction, transformation, and integration add further cost for these enterprises. There is not only a cost impact but also an impact on business operations: by the time these enterprises get a chance even to look at the data, they have already lost most of the value it has to offer, and operational real-time risks that could have been avoided, or even converted into opportunities, have already materialized.

From a performance perspective, OPC DA is good for simple data types, while OPC UA is designed for complex data types, which are more relevant for upstream enterprises. The address space concept in OPC UA makes it more attractive for enterprise data management systems. OPC UA currently supports a secure and firewall-friendly high-speed binary TCP data transport as well as web-based protocols, and given the openness of these standards, custom protocols can be used where higher data transfer speeds are needed. There are various other protocols in the market, such as FASP©, a proprietary protocol developed by Aspera®, now an IBM® company. FASP byte streaming APIs can address the limitations of TCP in relation to data packet tracking and packet loss; FASP is largely independent of geographical distance and can transmit data at very high speeds.

Upstream, midstream, and downstream operations rely heavily on PLCs, DCSs, PACs, data recorders, and control systems. These OPC-enabled devices produce data every second, and the data is typically managed using popular proprietary OPC Classic servers. Using OPC UA, the data can be exposed and ported to the enterprise in a much more cost-effective and timely fashion.

OPC UA has opened the door for enterprises to gain a real-time view of their operations and to devise new business process models that leverage the benefits of Big Data analytics, real-time predictive analytics, and much more.

Benefits to Oil and Gas Operations:

Enterprises can benefit from the new open standards in many ways, saving costs and running more efficient operations across well life cycle management, distribution, refining, and trading.

Areas of focus:

Action and Response Management

OPC UA provides context and content in real time to the enterprise. Alerts and events generated by various SCADA systems can be made accessible to the system operator and the enterprise at the same time. System operators need not rely solely on their instincts and experience; they have the support of the entire enterprise. Notifications can be sent to multiple stakeholders, and efficient response strategies can be implemented for each event. It also enables analysts to visually interpret the data on web and mobile devices in order to respond to incidents in real time. Data movement need not rely on proprietary systems to cross multiple network layers, and custom visualization models give a different perspective on the data flowing out to the enterprise.

Decision Management

Making the right decisions at the right time can save enterprises millions. Decisions are based on insights generated from data across the enterprise, and much of that is OPC data generated by devices operating at remote locations. The faster we analyze the data, the more value we get from the insights. For example, exploration sensor data can help upstream teams decide whether to proceed with drilling at a particular site, understand the geology of the site and procure the right equipment, choose well trajectories that maximize production, optimize drilling parameters for safe and efficient drilling, optimize refining operations based on hydrocarbon analysis of the oil, determine the shortest routes for transporting oil and gas, schedule oil and gas operations, and make better decisions when executing large oil and gas trades.

Risk Management

The oil and gas industry is highly prone to risks, whether related to deep-water drilling operations or the transportation of oil and gas across challenging terrain. A small incident can lead to losses of billions of dollars; handled well, on the other hand, it can open the door to tremendous opportunities. It is about understanding the risks and their consequences and leveraging the right strategy to handle them. Most assets and equipment are OPC enabled and generate tons of data every second. If that data is tapped at the right time, organizations can not only deal with risk with confidence but also exploit the opportunities. Analytical models can crunch the data in its most granular form by leveraging OPC UA, providing ammunition for the enterprise to optimize its operations.

Health and Safety

OPC UA data can be streamed directly from drilling assets to the enterprise, with the ability to perform in-flight analytics. The data can feed analytical models designed to predict outcomes and prescribe actions that ensure the safety of the operational landscape. Data is the new currency of the modern world; it can provide insights to improve the health and safety aspects of an oil and gas enterprise and help meet the legal and regulatory requirements of the region.

 

Future of Oil and Gas Enterprises

With the latest technological advancements, the right investments, and the capacity to accept change, the day is not far off when oil and gas enterprises will step into a new era of intelligent field operations. As the quote attributed to the OPC Foundation goes: "There will be a need for only two living beings on the oil field, a dog and a human, where the human ensures the dog gets its food on time and the dog ensures the human does not interfere with the field operations."

March 4, 2014

The Rising Bubble Theory of Big Data Analytics

Big data analytics has gained much importance in recent times, but the concept of analyzing large data sets is not new. Astronomers in olden days used large sets of observational data to predict planetary movements, and even our forefathers used years of experience to devise better ways of doing things. If we look through our history, data has played a key role in the evolution of modern medicine, advancements in space research, the industrial revolution, and the financial markets. The only difference today is the speed with which data gets processed, stored, and analyzed.

With the availability of high-performance computing and cheaper data storage, the time needed to process information has gone down drastically. What once took years of experience and multitudes of human effort, machines can now do in a split second, and supercomputers are breaking the barriers of computing power day after day. A classic example is weather forecasting: by statistically modelling the data and using the computational power of modern machines, we can today predict the weather with hourly accuracy.

The concept of big data analytics has also spread to the financial markets, where it is used to predict stock prices based on thousands of parameters and to build financial models that can forecast the economies of entire countries. We can find examples of big data analytics in any field of modern civilization: whether it is medicine, astronomy, finance, retail, robotics, or any other science known to man, data has played a major role. It is not only the time aspect but also the granularity of the data that determines the richness of the information it brings.

The Rising Bubble Theory of Big Data Analytics is a step toward understanding data based on its movement through the various layers of an enterprise. It is based on an analogy with a bubble generated at the bottom of the ocean and the journey it makes to reach the surface, coalescing with other bubbles, disintegrating into multiple bubbles, or getting blocked by various obstructions in the turbulent waters. Data can take multiple paths through the varied applications of an enterprise, and its granularity changes as it moves through the different application layers. The objective is to tap the data in its most granular form to minimize the time to analysis. The data undergoes losses due to filtering, standardization, and transformation as it percolates through the application layers. The time aspect refers to the transport mechanisms or channels used to move data from its source to its destination. When we combine the analysis of data granularity with the time aspect of its movement, we can understand the value the data brings.

Data Value (dv) ∝ Granularity (g) / Time (t)

Data granularity can be associated with data depth, which is linked to the data sources: the granularity of the data increases as we move closer to the sources. At times, due to the complex nature of proprietary data producers, it becomes difficult to analyze the data; it needs to be transformed into a more standard format before it can be interpreted as meaningful information. Tapping the data early in its journey can add great value for the business.

Data can move both horizontally and vertically. Horizontal movement involves data replication, while vertical movement involves aggregation and further data synthesis.
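A toy calculation (arbitrary units, purely illustrative numbers) makes the dv ∝ g/t intuition concrete: per-second readings that reach the analyst within minutes carry far more relative value than daily aggregates arriving a week later.

```python
# Toy illustration of dv ∝ g/t, using arbitrary units.
# granularity: samples per hour; latency: hours from capture to analysis.

def relative_value(granularity, latency_hours):
    return granularity / latency_hours

raw_stream = relative_value(granularity=3600, latency_hours=0.1)      # per-second data, ~6 min delay
daily_rollup = relative_value(granularity=1 / 24, latency_hours=168)  # one value per day, a week later

print(f"raw stream: {raw_stream:.1f}, daily roll-up: {daily_rollup:.6f}")
```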

 

Real-time vs. Non-Real-time Data Analytics and its relevance to Oil and Gas Industry

With recent technological advancements, cheaper data storage options, the higher processing power of modern machines, and the availability of a wide range of toolsets, data analytics has gained much focus in the Energy domain. Enterprises have started looking into newer ways to extract maximum value from the massive amount of data they generate in their back yards. Unlike other domains (Retail, Finance, Healthcare), Energy companies are still struggling to unleash the full potential of data analytics. The reasons could be many, but the most common are:

  • High capital costs with low margins, limiting their investments

  • Dependency on legacy proprietary systems with limited or restricted access to the raw data in a readable format

  • Limited network bandwidth at the exploration and production sites for data crunching and effective transmission

With the advent of new standards like OPC UA, WITSML, PRODML, and RESQML, the evolution of network protocols, and powerful visualization tools, the barriers to Exploration and Production data analytics are breaking down. Oil and Gas companies have already started looking to reap the benefits of the massive data lying dormant in their data stores. Massive amounts of data are created every second: OPC data related to assets, remote devices, and sensors; well core and seismic data; drill logs; production data; and other common data categories in the Exploration & Production (E&P) domain. The new data standards and readable formats (XML) have enabled these enterprises to interpret and transform this data into more meaningful information in the most cost-effective manner. They only need to tap into this vast repository of data (real time or staged) by plugging in some of the leading data analytics tools available in the market. These tools have enabled enterprises to define and implement new data models that cater to the needs of the business by customizing information for different stakeholders (geoscientists, geologists, system operators, trading departments, etc.).
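As a hedged example of how readable XML formats make such data easier to interpret, the sketch below parses a simplified, made-up log fragment with Python's standard library; it is loosely inspired by drilling logs and is not the actual WITSML schema:

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical XML fragment; real WITSML documents use a
# richer, versioned schema.
DOC = """
<logs>
  <log well="WELL-001">
    <reading depth_m="1500" rop_m_per_hr="22.5"/>
    <reading depth_m="1510" rop_m_per_hr="19.8"/>
  </log>
</logs>
"""

root = ET.fromstring(DOC)
for log in root.findall("log"):
    well = log.get("well")
    for reading in log.findall("reading"):
        depth = float(reading.get("depth_m"))
        rop = float(reading.get("rop_m_per_hr"))
        print(f"{well}: depth {depth} m, ROP {rop} m/h")
```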

Broadly, Exploration and Production (E&P) data analytics can be classified into two categories:

1. Real-Time Data Analytics

2. Staged Data Analytics

 

[Figure: Real-time vs. staged data analytics]

Need of Real Time Data Analytics

Real-time analytical solutions cater to mission-critical business needs, such as predicting the behavior of a device under a specific set of conditions (real-time predictive analytics) and determining the best action strategy. They can help in detecting threshold levels of temperature and pressure for generators, compressors, or other devices, and in mitigating the impact of fault conditions. Alert-based custom solutions can be built on top of the real-time data analytical models. Today, most critical monitoring is done on site using proprietary tools such as SCADA systems. It can be very challenging to provide large computing capacity and skilled human resources at these remote and hazardous locations, and network bandwidth is a limiting factor for transporting massive amounts of data to the enterprise data centers. Most of the information is therefore limited to on-site system operators with limited toolsets. Enterprises get a much-delayed view of this data, creating too much dependency on system operators to manage the systems, and the current approach to tackling problems has become more reactive than proactive.

Real Time Data Analytics

Exploration and Production data streams can be tapped and mapped to real-time analytical models for in-flight data analytics. These models can help operators formulate response strategies that mitigate the impact of fault conditions more effectively. System operators can focus on their job rather than worrying about the logistics, and they gain wider access to the enterprise knowledge base.

The data is streamed in real time to the enterprise data centers, where live monitoring can be performed using more advanced computing techniques. Multiple data streams can be plugged together and analyzed in parallel, and data modelling techniques enable enterprises to design cost-effective data integration solutions. The advantages of real-time analytics are huge: implementations of fuzzy logic and neural networks, real-time predictive analytics, and applications of advanced statistical methods are a few to mention. It has opened the door to limitless benefits for E&P organizations.

Staged Data Analytics

The data streamed from remote locations can be stored in high-performance databases for advanced staged data analytics, where complex statistical models and data analysis tools can work their magic. Staged data analytics is performed on historical data sets to identify data patterns and design more effective business solutions. It also helps enterprises improve the performance of their systems, identify gaps, optimize existing systems, identify the need for new processes, and a lot more. Models can be created to simultaneously analyze massive amounts of data from other data sources (related or unrelated) using leading industry analytical tools. Generally, E&P companies use these tools for reporting purposes, to cater to the varied needs of stakeholders across the enterprise. The full potential of staged data analytics is still to be explored in the Energy domain; it can bring benefits ranging from business process optimization and the identification of process bottlenecks to more effective and safer operating conditions and the forecasting of outcomes using simulation techniques. It can create a totally new perspective on a business scenario.
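As a rough sketch of staged analytics on historical data, the snippet below uses pandas to roll a made-up set of stored minute-level pressure readings up to hourly averages and flag the hours above an illustrative threshold; the data layout and threshold are hypothetical:

```python
import pandas as pd

# Hypothetical staged data: timestamped compressor pressure readings.
readings = pd.DataFrame(
    {
        "timestamp": pd.date_range("2014-01-01", periods=240, freq="min"),
        "pressure_kpa": 500 + pd.Series(range(240)) % 50,
    }
).set_index("timestamp")

# Aggregate to hourly means, then flag hours above an illustrative threshold.
hourly = readings["pressure_kpa"].resample("h").mean()
flagged = hourly[hourly > 522]

print(flagged)
```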
