Big Data: Is 'Developer Testing' Enough?
A lot has been said about the what, the why and the how of Big Data. Given how technical these implementations are, can they be made production-ready with developer testing alone? As I probed deeper into the testing requirements, it became clear that independent testers have a greater role to play in testing Big Data implementations. All the arguments in favor of 'Independent Testing' hold equally true for Big Data implementations. Beyond 'Functional Testing', the areas where 'Independent Testing' can add real value are:
· Early Validation of Requirements
· Early Validation of Design
· Preparation of Big Test Data
· Configuration Testing
· Incremental Load Testing
In this post, I touch on each of these areas and on what 'Independent Testing' should focus on in each.
Early Validation of Requirements
In the real world, Big Data implementations are mostly integrated with existing 'Enterprise Data Warehouse (EDWH)' or 'Business Intelligence' systems. Clients want to see business value from both the data sources they already explore and the data sources they have never explored before.
In the requirements validation stage, the tester should verify that the requirements are mapped to the right data sources and that all feasible data sources for the customer's business have been considered in the Big Data implementation. If a certain data source has not been considered, this should be raised as a probable defect. It would be resolved either as 'Not a Defect', because the data source in question offers no cost-effective way of analysis, or as 'A New Requirement' to be implemented in a future version of the system.
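As a rough illustration of this check, here is a minimal Python sketch that compares a requirement-to-source mapping against the list of feasible data sources. The requirement IDs, source names and both function names are hypothetical, made up only for this example; in practice the mapping would come from the requirements document.

```python
# Hypothetical example data: none of these names come from a real project.
def find_unmapped_sources(requirement_map, candidate_sources):
    """Return feasible data sources that no requirement refers to."""
    mapped = set()
    for sources in requirement_map.values():
        mapped.update(sources)
    return sorted(set(candidate_sources) - mapped)


def find_unknown_sources(requirement_map, candidate_sources):
    """Return (requirement, source) pairs that point at an unlisted source."""
    known = set(candidate_sources)
    return sorted(
        (req, src)
        for req, sources in requirement_map.items()
        for src in sources
        if src not in known
    )


requirement_map = {
    "REQ-1 churn report": ["edwh.customers", "web_clickstream"],
    "REQ-2 sentiment trend": ["social_feed"],
}
candidate_sources = [
    "edwh.customers",
    "web_clickstream",
    "social_feed",
    "call_center_logs",
]

# Each unmapped source is a candidate 'probable defect' to raise.
print(find_unmapped_sources(requirement_map, candidate_sources))
# Each unknown source suggests a requirement mapped to the wrong place.
print(find_unknown_sources(requirement_map, candidate_sources))
```

Every source reported by the first function would then be triaged as either 'Not a Defect' or 'A New Requirement', as described above.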
Early Validation of Design
In the context of Big Data, it is important that the implementation is 'Data based'. This means that both storage and analytics are done using the right components, chosen according to the nature of the data. For example, there is no advantage in copying structured data from the EDWH to the Hadoop Distributed File System (HDFS) and querying it with HiveQL while the data continues to reside in the EDWH. Similarly, there is no advantage in analyzing a few MB of new structured data within HDFS.
In the design validation stage, the tester should verify that each external data source is mapped to the right internal data source. Any concerns here have to be resolved early in the development cycle, because choosing the wrong internal data sources for storage and analytics would defeat the purpose of a cost-effective Big Data implementation.
Another area where the tester can add value is checking for data duplicated between the EDWH and HDFS, and asking whether there is any real business benefit in duplicating that data. Any lapse in synchronizing the duplicated data across sources during data maintenance would lead to misleading analytics and could prove detrimental to the customer's business.
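The duplicate check can be sketched in a few lines of Python. This is only an illustration under simple assumptions: both copies have already been exported as lists of dictionaries, records share a key column called `id`, and the function names are my own.

```python
import hashlib


def fingerprint(record):
    """Hash a record's fields in a fixed key order so equal records match."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_copies(edwh_rows, hdfs_rows, key="id"):
    """Report keys stored in both places and keys whose copies differ."""
    edwh = {row[key]: fingerprint(row) for row in edwh_rows}
    hdfs = {row[key]: fingerprint(row) for row in hdfs_rows}
    shared = edwh.keys() & hdfs.keys()
    return {
        "duplicated": sorted(shared),
        "out_of_sync": sorted(k for k in shared if edwh[k] != hdfs[k]),
    }


# Made-up sample rows standing in for an EDWH extract and an HDFS extract.
edwh_rows = [
    {"id": 1, "name": "Ann", "city": "Pune"},
    {"id": 2, "name": "Raj", "city": "Delhi"},
]
hdfs_rows = [
    {"id": 2, "name": "Raj", "city": "Mumbai"},  # stale copy of record 2
    {"id": 3, "name": "Lee", "city": "Chennai"},
]

print(compare_copies(edwh_rows, hdfs_rows))
# {'duplicated': [2], 'out_of_sync': [2]}
```

The 'duplicated' list feeds the question of business benefit, while the 'out_of_sync' list points at synchronization lapses. For real volumes the comparison itself would have to run as a distributed job rather than in memory.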
Preparation of Big Test Data
Whether the input is 1,000 files or 100,000 files makes no difference to the developers when they write the MapReduce code that analyzes the data. But testing with near-real volumes of data is very important for checking the inherent scalability of the scripts, any inadvertent hardcoding of paths and data sources in the scripts, how the scripts handle erroneous data, and so on.
To ensure production confidence, the tester should intelligently replicate data files, include some files with an incorrect schema or erroneous data, and verify that the MapReduce code handles all possible variations of the input data.
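Hardcoded paths in particular can be hunted down before any large test run. The sketch below is a simple assumption on my part of what such a scan could look like: it flags quoted string literals that look like URIs or absolute paths. The pattern and the directory prefixes it looks for are illustrative and would need tuning for a real code base.

```python
import re

# Quoted literals that look like a storage URI or an absolute data path.
# The prefixes (/user, /data, /tmp) are illustrative guesses, not a standard.
HARDCODED = re.compile(
    r"""["']((?:hdfs|file|s3a?)://[^"']+|/(?:user|data|tmp)/[^"']+)["']"""
)


def find_hardcoded_paths(script_text):
    """Return (line_number, path) pairs for suspicious string literals."""
    findings = []
    for line_no, line in enumerate(script_text.splitlines(), start=1):
        for match in HARDCODED.finditer(line):
            findings.append((line_no, match.group(1)))
    return findings


# A made-up three-line job script used as input for the scan.
sample = "\n".join(
    [
        "input_dir = sys.argv[1]",
        'lookup = open("/user/etl/lookup.csv")',
        'out = "hdfs://namenode:8020/results/run1"',
    ]
)

for line_no, path in find_hardcoded_paths(sample):
    print(f"line {line_no}: {path}")
```

A scan like this only narrows the search. Paths assembled at run time will slip through, which is why running against a different environment with near-real volumes remains the real test.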
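One way to replicate data with controlled errors is sketched below, assuming a simple three-column record. The schema, the four kinds of corruption and all function names are hypothetical choices for the example. The generator remembers which rows it damaged, so the tester knows exactly what the code under test should reject.

```python
import random

FIELDS = ["id", "timestamp", "amount"]  # hypothetical schema


def make_valid_row(i):
    """Build one well-formed record for index i."""
    return [str(i), f"2024-01-{(i % 28) + 1:02d}", f"{(i * 7) % 500}.00"]


def corrupt(row, rng):
    """Damage a row in one of four ways a real feed might be damaged."""
    kind = rng.choice(["drop_field", "bad_number", "extra_field", "empty"])
    if kind == "drop_field":
        return row[:-1]
    if kind == "bad_number":
        return row[:2] + ["not-a-number"]
    if kind == "extra_field":
        return row + ["unexpected"]
    return []


def build_test_rows(count, error_rate, seed=0):
    """Return (rows, bad_indexes); a fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    rows, bad_indexes = [], []
    for i in range(count):
        row = make_valid_row(i)
        if rng.random() < error_rate:
            row = corrupt(row, rng)
            bad_indexes.append(i)
        rows.append(row)
    return rows, bad_indexes


def is_valid(row):
    """Reference check that the code under test is expected to agree with."""
    if len(row) != len(FIELDS):
        return False
    try:
        float(row[2])
    except ValueError:
        return False
    return row[0].isdigit()


rows, bad_indexes = build_test_rows(count=1000, error_rate=0.1, seed=42)
rejected = [i for i, row in enumerate(rows) if not is_valid(row)]
print(rejected == bad_indexes)  # the reference check finds every damaged row
```

The same idea scales up by writing the rows out to many files. The count of records the job under test rejects can then be compared with the length of `bad_indexes`.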
Configuration Testing
The Hadoop ecosystem is highly configurable. The tester should identify the configurable parameters in the Hadoop ecosystem and determine their default values and acceptable customization ranges for the Big Data implementation being tested. Specific tests then have to be created around these configurable parameters to check how the system behaves.
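To make this concrete, here is a small sketch that turns a table of configurable parameters into boundary-value test cases. The property names are real Hadoop properties, and the defaults shown are the stock values in recent Hadoop releases as far as I know, so verify them against your distribution. The minimum and maximum values are purely hypothetical project limits.

```python
# Property names are real Hadoop settings; 'min' and 'max' are made-up
# project limits, and defaults should be verified for your distribution.
PARAMETERS = {
    "dfs.replication": {"default": 3, "min": 1, "max": 5},
    "dfs.blocksize": {
        "default": 134217728,  # 128 MB
        "min": 67108864,       # 64 MB
        "max": 268435456,      # 256 MB
    },
    "mapreduce.job.reduces": {"default": 1, "min": 1, "max": 64},
}


def boundary_values(spec):
    """Values just outside, on, and inside the accepted range."""
    low, high = spec["min"], spec["max"]
    return sorted({low - 1, low, spec["default"], high, high + 1})


def build_config_cases(parameters):
    """Return (name, value, expected) tuples for every boundary value."""
    cases = []
    for name, spec in parameters.items():
        for value in boundary_values(spec):
            in_range = spec["min"] <= value <= spec["max"]
            cases.append((name, value, "accept" if in_range else "reject"))
    return cases


for name, value, expected in build_config_cases(PARAMETERS):
    print(f"{name}={value} -> {expected}")
```

Here 'accept' and 'reject' describe what the implementation under test is expected to do with the value according to its own design, not what Hadoop itself enforces.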
Incremental Load Testing
The tester should plan for testing with clusters (at least one) added and removed dynamically. This testing should be done along with configuration testing to ensure that the system design has taken care of node/cluster scalability with appropriate configuration parameters.
While the functional requirements of a Big Data implementation can be tested to a certain extent through 'Developer Testing', overall production confidence in the system can be achieved only with focused 'Independent Testing' that caters to both 'Conformance to Requirements' and 'Fitness for Use'.
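Combining the two kinds of testing amounts to building a small test matrix. The sketch below is one possible way to plan it, assuming the unit being added and removed is counted as a number of nodes and that `dfs.replication` is the configuration parameter being varied. The function names and the rule of skipping combinations where replication exceeds the node count are my own assumptions.

```python
from itertools import product


def scaling_sequence(baseline, step, rounds):
    """Grow the cluster `rounds` times by `step`, then shrink it back."""
    up = [baseline + step * i for i in range(rounds + 1)]
    down = up[-2::-1]  # the same sizes in reverse, without repeating the peak
    return up + down


def build_load_matrix(node_counts, replication_values):
    """Pair every cluster size with every replication value that fits it."""
    return [
        {"nodes": nodes, "dfs.replication": replication}
        for nodes, replication in product(node_counts, replication_values)
        # Assumption: skip cases where blocks could never be fully replicated.
        if replication <= nodes
    ]


sizes = scaling_sequence(baseline=4, step=1, rounds=2)
print(sizes)  # [4, 5, 6, 5, 4]

for case in build_load_matrix(sorted(set(sizes)), [1, 3, 5]):
    print(case)
```

Each entry in the matrix is one run of the load test. The sequence itself matters too, since removing capacity exercises rebalancing in a way that adding it does not.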