Testing Services provides a platform for QA professionals to discuss and gain insights into the business value delivered by testing, the best practices and processes that drive it, and the emergence of new technologies that will shape the future of this profession.


Hadoop-based gold copy approach: An emerging trend in Test Data Management

Author: Vikas Dewangan, Senior Technology Architect


The rapid growth in volumes of both structured and unstructured data in today's enterprises is leading to new challenges. Production data is often required for testing and developing software applications in order to simulate production-like scenarios.

In most cases, enterprises maintain a gold copy of this data, where it is stored and masked before being moved to non-production environments. A gold copy makes it possible to mine the data and choose an optimal subset, or specific records, for migration to the target non-production development and testing environments. With increasingly large volumes of diverse data in production environments, the volume of data stored in these gold copies is becoming correspondingly massive. As enterprise-class storage media (e.g. SAN hard disks) is quite expensive, IT departments are finding it increasingly difficult to justify investments in additional non-production data storage such as a gold copy, which the business may regard as non-critical. Enter Hadoop.

Hadoop benefits

Hadoop provides a highly scalable and cost-effective framework for distributed data storage and processing. Because Hadoop runs on commodity hardware rather than enterprise-class storage media, various industry estimates peg its disk storage savings at anywhere between 3 and 10 times. Given this saving, it makes a lot of sense to consider leveraging Hadoop as a solution. Further, Hadoop can provide a centralized gold copy for the entire application portfolio of the enterprise. Another key consideration when selecting a gold copy platform is a future-proof architecture that can support various types of test data, including RDBMS, semi-structured (e.g. flat files) and unstructured data, with high scalability. This is an area where Hadoop shines.

Mechanisms for data loading, refresh and mining

Some key requirements of a gold copy from a test data management perspective are: (1) the ability to ingest data from production into the gold copy, (2) the ability to carry out a full or selective refresh from the gold copy to the target data stores and (3) the ability to mine the data and identify the records required for various test cases.

Unstructured and semi-structured data can be directly copied onto HDFS (using file copy operations), as there is generally little need to mine or query this data. For structured data, where mining and full or selective extraction (sub-setting) are needed, we need an effective mechanism for storing this data on Hadoop. While several technological alternatives are available in the Hadoop ecosystem, let us look at a popular approach being adopted by the industry: using Hive. Hive enables users to write SQL-like statements in HiveQL (Hive Query Language) to analyze and query a Hadoop data store. To load data into Hive from an RDBMS, we recommend leveraging Sqoop, which supports both import and export for a large variety of RDBMS types. Using Sqoop, records can be selectively refreshed into the relevant test or development environments; alternatively, an optimal subset of data can be refreshed using appropriate filtering criteria. For refreshing unstructured and semi-structured data into the test and development environments from the Hadoop-based gold copy, HDFS file copy operations may be used.
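As an illustrative sketch of these ingestion and mining steps, assuming a configured Hadoop cluster with Sqoop and Hive installed (the JDBC URL, credentials, table and database names below are hypothetical placeholders):

```shell
# Ingest a masked production table from an RDBMS into a Hive gold copy using Sqoop
sqoop import \
  --connect jdbc:mysql://prod-db.example.com/ordersdb \
  --username tdm_user -P \
  --table customer_orders \
  --hive-import \
  --hive-table gold_copy.customer_orders \
  --num-mappers 4

# Copy semi-structured / unstructured files directly onto HDFS
hdfs dfs -mkdir -p /gold_copy/files
hdfs dfs -put /staging/masked_logs/*.log /gold_copy/files/

# Mine the gold copy with HiveQL to identify a subset of records for a test case
hive -e "SELECT order_id, customer_id, order_total
         FROM gold_copy.customer_orders
         WHERE order_date >= '2015-01-01' AND order_total > 1000
         LIMIT 500;"
```

The `-P` flag prompts for the database password interactively, which avoids leaving credentials in shell history; in scheduled jobs, a password file or credential store would be used instead.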

A conceptual figure of a traditional data store and a Hadoop-based intermediate data store is presented below:


While Hadoop itself has been around for some time, its adoption by enterprises as a test data store is an emerging trend.

Implementation steps
The key steps in implementing a Hadoop-based gold copy approach are:

  1. Analyze: Includes understanding the current application and environment landscape in terms of type and volume of data, database technologies involved, test data needs, reusability of test data, refresh frequency required etc.
  2. Design: Define the high-level solution architecture of the Hadoop-based gold copy data store. This will include the mechanisms for data ingestion, mining and refresh of the target environments. The design should consider all types of data, including unstructured, semi-structured and structured data, and needs to cover how the data will be maintained and kept current.
  3. Setup and configure: Includes provisioning the hardware, setting up the Hadoop cluster and configuring the key aspects (like data ingestion using Sqoop and tables using Hive).
  4. Roll out: Will involve the initial data load of the Hadoop cluster and provisioning data. It is advisable to start with a pilot for a few applications to iron out potential issues.
  5. Expansion: After the initial roll out, the solution can be expanded to other applications and portfolios across the enterprise. In addition, a strong focus should be kept on continuous improvement of the solution.
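Once the cluster is rolled out, a selective refresh from the gold copy into a target test environment could be sketched as below; the Hive database, staging directory and target JDBC details are assumptions for illustration only:

```shell
# Materialize the selected subset from the Hive gold copy into an HDFS directory
hive -e "INSERT OVERWRITE DIRECTORY '/tmp/refresh_subset'
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         SELECT * FROM gold_copy.customer_orders
         WHERE order_date >= '2015-01-01';"

# Push the subset into the target test environment's RDBMS using Sqoop export
sqoop export \
  --connect jdbc:mysql://qa-db.example.com/ordersdb \
  --username tdm_user -P \
  --table customer_orders \
  --export-dir /tmp/refresh_subset \
  --input-fields-terminated-by ','

# Refresh unstructured test data via a plain HDFS file copy
hdfs dfs -get /gold_copy/files/*.log /qa/test_data/logs/
```

Staging the subset in a delimited HDFS directory before the Sqoop export keeps the filtering logic in HiveQL, where it is easy to adjust per test cycle.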

Conclusion
In summary, this solution works by loading production data from diverse platforms into Hadoop, which serves as the gold copy for data storage. Identified records are migrated to the target test and development environments, which have the same technology platform as the production environments. The key benefits of this solution are that it is cost effective, highly scalable and can handle a wide variety of test data types. All the major solution components mentioned here including Hadoop, Sqoop and Hive are open source. Considering the benefits, several enterprises today are evincing keen interest in leveraging Hadoop as a test data store. There is undoubtedly a strong value proposition to embrace this emerging trend and move to a next generation test data management (TDM) solution.
