Enterprise architecture at Infosys works at the intersection of business and technology to deliver tangible business outcomes and value in a timely manner by leveraging architecture and technology innovatively, extensively, and at optimal costs.


Data Analytics: Doing it right!

Author: Ramkumar Dargha, AVP and Senior Principal Technology Architect, Enterprise Architecture 

Companies across industries are realizing that there is enormous data within and outside their organizations that can be leveraged for high performance, turning data into a strategic asset. Data analytics is the cornerstone of making this happen: data is of no use unless we convert it into insights, and those insights are of no use unless they help us solve a business need or a business problem. Simply put, data analytics translates raw data into insights, and in doing so makes data a strategic asset. Getting data analytics right is therefore key to successfully leveraging data as a strategic asset. So, how do we get data analytics right?

Here are the salient points that, I believe, one has to follow to get this most important thing right: the data analytics.

1.       Always link data to a business decision, and the business decision, in turn, to a business KPI. That is, data has to result in an insight, that insight has to result in a business decision, and that business decision should positively impact one or more business KPIs. Analyze every step in the business process map and evaluate how data can make that step or process simpler, more efficient, more intelligent, or even redundant. Think end-to-end business process. The process can be an operational process, an IT process, a marketing process, or a sales process. Every step in it may have the potential to leverage data.

2.       Do not underestimate data preprocessing and exploratory data analysis (called EDA in analytics terminology). EDA is a step performed before applying algorithms to data; it helps us understand the data better in terms of its distribution, missing values, and any inherent relationships or correlations. That understanding helps us clean the data so that it is in the right shape to be used as input to subsequent algorithms. In fact, this is the most important step in the data analytics process and can take nearly 70% of the time spent. Wrong data leads to wrong analytics, which leads to wrong predictions, wrong business decisions, and degraded KPIs - garbage in, garbage out! Models and algorithms, however great or sophisticated, will not come to your rescue if we feed them inappropriate data. Give them the right data in the right form, and apply the models/algorithms to the right business situation - then they become precious; on their own, they are not.
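A minimal sketch of what EDA and cleaning look like in practice, using plain Python on a few hypothetical records (real work would typically use a library such as pandas): profile missing values, summarize each field, then impute with a simple rule.

```python
from statistics import mean, median

# Hypothetical raw records; None marks missing values.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": 98000},
    {"age": 29, "income": None},
    {"age": 52, "income": 120000},
]
fields = ["age", "income"]

# 1. Profile missing data per field.
missing = {f: sum(1 for r in records if r[f] is None) for f in fields}

# 2. Summarize each field's distribution, ignoring missing values.
summary = {}
for f in fields:
    vals = [r[f] for r in records if r[f] is not None]
    summary[f] = {"mean": mean(vals), "median": median(vals)}

# 3. Clean: impute missing values with the field median (one simple choice).
cleaned = [
    {f: (r[f] if r[f] is not None else summary[f]["median"]) for f in fields}
    for r in records
]

print(missing)  # {'age': 1, 'income': 1}
```

Even this toy pass surfaces the decisions that dominate the 70% figure above: what counts as missing, what the distribution looks like, and which imputation rule is defensible for the business problem at hand.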

3.       Data analytics is always a probability game. Do not expect or wait for perfection in data analytics. What does this mean? If we make our decisions based on insights from data analytics, following the right process, it means we are more likely to succeed - it increases our probability of success, sometimes multi-fold. But it does not mean the insights will exactly predict the future; data analytics is not meant for that. It makes us more likely to succeed, and that is a great value-add in itself!

4.       Measurement and evaluation of models. There are multiple techniques available to evaluate the effectiveness of a model: for example, accuracy, sensitivity, false positive rate, lift charts, and gain charts for classification models; MSE, SSE, and R-squared for estimation/regression models. Use them to check the accuracy and usefulness of models before employing them in production.
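The classification metrics named above all fall out of a confusion matrix. A minimal sketch with hypothetical labels and predictions (1 = positive class):

```python
# Hypothetical ground truth and model predictions for 10 records.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# Confusion-matrix cells.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(actual)        # overall fraction correct
sensitivity = tp / (tp + fn)              # a.k.a. recall: positives caught
false_positive_rate = fp / (fp + tn)      # negatives wrongly flagged

print(accuracy, sensitivity, false_positive_rate)
```

Libraries such as scikit-learn provide these metrics (and lift/gain charts) ready-made; the point of the sketch is only to show that each metric answers a different business question, so pick the one that matches the decision the model supports.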

5.       Ensemble. In simple words, an ensemble leverages multiple similar models to improve prediction - for example, CART, C5.0, logistic regression, and neural networks for the same classification problem. The principle of collective intelligence applies. For a classification problem, create a group (an ensemble) of models as mentioned above, check the classification output from each model, and apply an averaging mechanism such as majority voting or propensity averaging. In almost all cases, ensembles increase the predictive capability of models.
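The majority-voting mechanics can be sketched in a few lines. The three "models" below are hypothetical stand-ins (simple threshold rules on one feature) so the voting logic is visible; in practice each would be a trained classifier such as those listed above.

```python
from collections import Counter

# Three stand-in classifiers: each maps a feature value to class 0 or 1.
model_a = lambda x: 1 if x > 0.4 else 0
model_b = lambda x: 1 if x > 0.5 else 0
model_c = lambda x: 1 if x > 0.6 else 0

def ensemble_predict(x, models=(model_a, model_b, model_c)):
    """Return the class that the majority of models vote for."""
    votes = [m(x) for m in models]
    return Counter(votes).most_common(1)[0][0]

print(ensemble_predict(0.55))  # votes [1, 1, 0] -> 1
print(ensemble_predict(0.45))  # votes [1, 0, 0] -> 0
```

Propensity averaging works the same way, except each model contributes a probability rather than a hard vote and the ensemble averages the probabilities before thresholding.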

6.       If a model is too good to be true, it actually is. For example, if you come across a classification model that claims 90% accuracy in prediction, be suspicious: either the model has memorized the training data set (model over-fitting) or there are other issues in the training data (such as auto-correlation). Such a model works well for already-known cases but may lack generality, so it is of limited use in predicting results for new data points - which is the real purpose of data analytics. Prefer a classification model that is 75% accurate (rather than 90%) but has better generality, i.e., predicts better on unseen or new data points. Consider this example: 100 data records, of which only 5 are fraudulent transactions. The first classification model can predict every record as non-fraud and still have an accuracy of 95%. The second model predicts 25 of the 100 records as fraud, including all 5 actual frauds; its accuracy is only 80%, yet it is more useful than the first!
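The fraud example above can be checked directly. The sketch below encodes the two hypothetical models' predictions (1 = fraud) and compares accuracy against recall, the fraction of real frauds caught:

```python
# 100 records, of which the first 5 are actual frauds.
actual = [1] * 5 + [0] * 95

# Model A: predicts everything as non-fraud.
pred_a = [0] * 100
# Model B: flags 25 records as fraud, including all 5 real frauds
# (predictions arranged to match the counts in the text).
pred_b = [1] * 25 + [0] * 75

def accuracy(actual, pred):
    return sum(a == p for a, p in zip(actual, pred)) / len(actual)

def recall(actual, pred):
    caught = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 1)
    return caught / sum(actual)

print(accuracy(actual, pred_a), recall(actual, pred_a))  # 0.95 0.0
print(accuracy(actual, pred_b), recall(actual, pred_b))  # 0.8 1.0
```

Model A's 95% accuracy catches zero frauds, while Model B's "worse" 80% accuracy catches all of them - which is why accuracy alone is a misleading yardstick on imbalanced data.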

7.       Do not believe in models that do not make logical, intuitive sense. Data analytics gives us mathematical proof and a way to represent and codify inherent relationships, classifications, predictions, and so on. But even if a model predicts with high accuracy on the test data set, if the relationship it uncovers does not make logical, intuitive sense, it makes sense to discard that model.

8.       Data analytics is mainstream. Nearly 80% of data analytics problems can be solved with straightforward algorithms such as classification, regression, clustering, and collaborative filtering (recommendation systems). Multiple tools, frameworks, and technologies are available today to solve such problems, so algorithm knowledge is no longer a barrier: the nuances and complexities of the algorithms are largely packaged in libraries available in standard languages like Scala, R, and Python. What we need to know is how to apply the algorithms/models to the new data and new scenarios we come across. Do not get overwhelmed by assuming that data analytics (and thus AI, ML, data science, etc.) is only for a select few with PhDs - it is already being adopted as mainstream. That said, note the next point. (By the way, how is data analytics related to AI, ML, and data science? That will be the topic of my next blog.)

9.       Self-service BI. It is useful, but do not get carried away by it. Sometimes it makes sense to let business users do BI themselves through self-service tools - for instance, for slicing and dicing or what-if analysis. But predictive analytics is not just traditional BI; it is a discipline in itself and needs people who are knowledgeable and experienced in data analytics. Remember, data analytics is easy to get wrong. Provide self-service BI where appropriate, but do not forget that there is a lot more value to be gained beyond it.

10.   Time is of the essence. Once you find a model that has good accuracy and makes logical, intuitive sense, proceed quickly and put it into production. As mentioned earlier, no model is perfect; as long as every new model, or the next version of an existing model, is better than the status quo, proceed - and iterate to improve as you go along. This brings us to the last point.

11.    Continuous improvement. Every model has a shelf life. Fraudsters are as intelligent as data analysts and data scientists, if not more so. A fraud prediction model that is good today will not be good tomorrow, because the inherent relationships will change. Keep measuring the accuracy of models in production on an on-going basis. Once the current model is no longer serving its purpose, it is time to go back to the drawing board and rework the model or create a new one.
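One minimal way to operationalize this on-going measurement is a sliding-window monitor: track whether each production prediction turned out correct, and flag the model for rework when recent accuracy drops below a threshold. The window size and threshold below are hypothetical choices, not prescriptions.

```python
from collections import deque

class ModelMonitor:
    """Track recent prediction outcomes and flag accuracy decay."""

    def __init__(self, window=100, threshold=0.75):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(1 if predicted == actual else 0)

    def needs_rework(self):
        # Wait until the window is full before judging the model.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return sum(self.outcomes) / len(self.outcomes) < self.threshold

monitor = ModelMonitor(window=10, threshold=0.75)
for predicted, actual in [(1, 1)] * 7 + [(1, 0)] * 3:  # 70% recent accuracy
    monitor.record(predicted, actual)
print(monitor.needs_rework())  # True -> time to revisit the model
```

In a real pipeline the "actual" labels arrive with a delay (e.g. confirmed fraud cases), so the monitor would be fed as ground truth trickles in rather than at prediction time.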

I have identified some of the most important points here. There could be more. Let me know what you think.
