Three Stages of Functional Testing of the 3 Vs of Big Data
By now, everyone has heard of big data. The term is used widely in every IT organization and across different industry verticals. What is needed, however, is a clear understanding of what big data means and how it can be applied in day-to-day business. Big data refers to enormous amounts of data, on the order of petabytes. With ongoing technology changes, data has become an important input for making meaningful decisions.
When data arrives at this scale, it poses a number of testing challenges. Much of it is in less structured formats (website links, emails, Twitter responses, pictures/images, free text on various platforms), which makes analysis more difficult.
The three Vs of big data need to be kept in mind when validating any big data application:
- Volume: A huge amount of data flows through the system and must be tested and validated for quality.
- Velocity: The speed at which new data is generated. Generally, the faster an organization can analyze its data, the greater the benefit it gains.
- Variety: Big data comprises large data sets that may be structured, semi-structured, or unstructured.
Three Stages of Functional Validation of the 3 Vs
Many organizations are finding it difficult to define a robust testing strategy and set up an optimal test environment for big data. Big data involves processing a huge volume of structured and unstructured data across different nodes, using frameworks like MapReduce and scripting languages like Hive and Pig. Traditional testing approaches on Hadoop are based on sample data record sets, which is fine for unit testing activities. The challenge, however, lies in determining how to validate an entire data set consisting of millions, or even billions, of records.
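As a point of reference, the MapReduce model mentioned above can be illustrated with a toy word count in plain Python (this is not Hadoop itself; the input lines and the map/shuffle/reduce structure are purely illustrative): a map step emits key/value pairs, a shuffle groups them by key, and a reduce step aggregates each group.

```python
from collections import defaultdict
from functools import reduce

lines = ["big data testing", "big data volume", "testing velocity"]

# Map: each line emits (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group values by key (Hadoop does this between map and reduce)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: sum the counts for each word
counts = {word: reduce(lambda a, b: a + b, values)
          for word, values in groups.items()}
print(counts["big"], counts["testing"])  # 2 2
```

A tester validating a real MapReduce job faces exactly this structure at scale: the per-record map logic is easy to unit test on samples, while verifying the aggregated reduce output requires reasoning about the entire data set.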
Three stages of functional testing of big data:
Figure 1: Three stages of big data testing
To successfully test a big data analytics application, the test strategy should include the following testing considerations.
1. Data Extraction Testing
Data from various external sources, such as social media and web logs (unstructured) and sourcing systems such as RDBMS (structured), should be validated to ensure that the proper data is pulled into the big data store (e.g., a Hadoop system). The data should be compared from the source (unstructured) to the big data store (structured). This can be achieved by following the two specific test approaches described below:
- Comparing source data with the data landed in the big data store to ensure they match.
- Validating the business rules used to transform data (MapReduce validation). This is similar to data warehouse testing, in which a tester verifies that the business rules are applied to the data. Here, however, the test approach differs slightly, since the big data store must also be tested for volume, variety, and velocity.
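The first approach above can be sketched as follows. Since a record-by-record diff is infeasible at billions of rows, a common tactic is to compare row counts plus an order-independent digest of a key column between the source extract and the landed data. The in-memory CSV text and column names below are illustrative assumptions standing in for real source and HDFS extracts:

```python
import csv
import hashlib
import io

def profile(csv_text, key_column):
    """Return (row_count, order-independent digest of the key column)."""
    count, digest = 0, 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        count += 1
        # XOR of per-row hashes ignores record order, so data landed
        # in a different order than the source still matches.
        digest ^= int(hashlib.sha256(row[key_column].encode()).hexdigest(), 16)
    return count, digest

source = "customer_id,amount\n101,250\n102,90\n103,40\n"
landed = "customer_id,amount\n103,40\n101,250\n102,90\n"  # reordered copy

print(profile(source, "customer_id") == profile(landed, "customer_id"))  # True
```

In practice the same profile would be computed on the source side and inside the store (for example via a Hive aggregate query), and only the two small summaries compared.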
2. Data Quality Analysis
Data quality analysis is the second test stage, following data extraction testing. It is performed in the big data store once the data has been moved from the source systems. The data is measured for:
- Referential integrity checks
- Constraints check
- Metadata analysis
- Statistical analysis
- Data duplication check
- Data correctness / consistency check
To verify data quality, sample tables and small amounts of data are copied to temporary tables, and validations are performed on this minimal set. The following tests are applied to the sample data:
- Deletion of a parent record to check whether child records are also deleted, verifying referential integrity
- Validation of all foreign key and primary key constraints of the tables
- Metadata analysis, checking the connections between metadata variables and actual records
- Data duplication checks, inserting similar records into a table that has unique key constraints
- Data correctness or integrity checks, inserting alphabetic characters into a field that accepts only numbers
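Three of the sample-data checks above (referential integrity, duplication, and data correctness) can be exercised with an in-memory SQLite table; the table names, columns, and values are illustrative assumptions, not from the article:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    amount INTEGER CHECK (typeof(amount) = 'integer'))""")
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders VALUES (10, 1, 250)")

# Referential integrity: deleting a parent with live children must fail
try:
    conn.execute("DELETE FROM customer WHERE id = 1")
    ref_ok = False
except sqlite3.IntegrityError:
    ref_ok = True

# Duplication check: inserting a record with an existing unique key must fail
try:
    conn.execute("INSERT INTO orders VALUES (10, 1, 99)")
    dup_ok = False
except sqlite3.IntegrityError:
    dup_ok = True

# Correctness check: alphabetic value in a numeric-only column must be rejected
try:
    conn.execute("INSERT INTO orders VALUES (11, 1, 'abc')")
    type_ok = False
except sqlite3.IntegrityError:
    type_ok = True

print(ref_ok, dup_ok, type_ok)  # True True True
```

On a real big data store, the same assertions would typically be expressed as queries against the temporary sample tables rather than as engine-enforced constraints.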
3. Reports and Visualization Testing
Reports and visualization testing forms the end-user part of the testing, where output is validated against actual business requirements and design. Reports are the basis for many decisions, and they are also critical components of the organization's control framework.
The reports, dashboards and mobile outputs are validated using two approaches:
I. Visualization Approach
In this approach, the output is visually compared against predefined formats or templates designed by users or data architects.
II. Attributes validation
In this approach, the attributes and metrics, which are part of the reports, are validated and checked for correctness.
Testing big data is a challenge, and a clear test strategy must be in place to validate its 3 Vs. The stages described above can serve as a starting point for understanding the different validation stages and ensuring that data is tested as early as possible in the data workflow.
As more and more organizations move to big data implementations, testers need to start thinking about strategies for testing these complex systems.