Big data and performance testing approach
Author: Shreya Banerjee, Project Manager
Big data arrived with a big bang. In today's information-driven world, we are constantly bombarded with data, and we want that data processed into information. However, with the overload of data from multiple sources, less and less of it arrives in a structured form.
According to a report by the analyst firm International Data Group (IDG), 70 percent of enterprises have either deployed, or plan to deploy, big data projects and programs in the coming years due to the increase in the amount of data they need to manage. Traditional databases are inadequate for holding such an avalanche of data.
More and more organizations are looking for ways to streamline complexity and get more out of their data-related investments. At the same time, these companies are realizing the power of big data and how it can be harnessed for expansion and growth.
Data growth challenges and opportunities are now being defined as three-dimensional -- increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, use this 3V model for describing big data.
The emphasis is not only on managing data complexity and integrity, but also on the performance of the system that makes the data useful. Hence, a large chunk of the investment is being put into failover, performance testing of the framework, and data rendition.
In the ideal scenario, even before the performance testing starts, architectural testing is considered crucial to ensure success. Inadequate or poorly designed systems may lead to performance degradation.
The following points need to be considered for a performance-testing strategy:
1. Data ingestion: The process of absorbing data into the system, either for storage or for immediate use. The focus is not only on validating files from various sources, but also on routing them to the correct destination within a given time frame.
2. Data processing: Once gathered, data has to be processed or mapped within a framework, usually in batches due to sheer volume. The focus is on the system's scalability and reliability.
3. Data persistence: Irrespective of the storage option (relational database management system, data mart, data warehouse, etc.), the focus is on the data structure, which needs to remain constant or easily adaptable across storage options.
4. Reporting and analytics: The process of examining large data sets containing a variety of data types to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful business information. The focus is on applying the correct algorithms and reporting useful information within a defined SLA.
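To make the ingestion consideration concrete, the following is a minimal sketch of a timed validation-and-routing check. The record fields, validation rule, and SLA value are illustrative assumptions, not part of any specific framework.

```python
import time

SLA_SECONDS = 2.0  # assumed time budget for ingesting one batch

def validate(record):
    # A record is accepted only if it names a source and carries a payload.
    return bool(record.get("source")) and bool(record.get("payload"))

def ingest(batch):
    """Validate each record and route it; return (routed, rejected, elapsed)."""
    start = time.perf_counter()
    routed, rejected = [], []
    for record in batch:
        (routed if validate(record) else rejected).append(record)
    elapsed = time.perf_counter() - start
    return routed, rejected, elapsed

batch = [
    {"source": "web", "payload": "click"},
    {"source": "", "payload": "orphan"},   # fails validation: no source
    {"source": "app", "payload": "view"},
]
routed, rejected, elapsed = ingest(batch)
assert elapsed < SLA_SECONDS, "ingestion exceeded its SLA"
print(len(routed), len(rejected))
```

A real test would replace the in-memory loop with the system's actual ingestion path, but the shape of the check, counting valid and invalid records against a time budget, stays the same.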
Due to this data complexity, the approach to performance testing is also different: huge volumes of both structured and unstructured data have to be handled. The following depicts a five-step approach to performance testing:
[Figure: Five key steps in performance testing a big data platform]
Performance testing covers job-completion time, CPU and memory utilization, data throughput, and similar system metrics. Two things need to be kept in mind while defining the approach:
1. The speed with which a system is able to consume data -- the data insertion rate
2. The speed with which queries are processed while data is being read -- the data retrieval rate
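The two rates above can be sketched with a simple benchmark. This example uses an in-memory SQLite database purely as a stand-in for the data store under test; the row count and schema are arbitrary assumptions.

```python
import sqlite3
import time

N = 10_000  # assumed batch size for the benchmark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

# Data insertion rate: rows written per second.
start = time.perf_counter()
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    (("event-%d" % i,) for i in range(N)),
)
conn.commit()
insert_rate = N / (time.perf_counter() - start)

# Data retrieval rate: rows read back per second while querying.
start = time.perf_counter()
rows = conn.execute("SELECT payload FROM events").fetchall()
retrieval_rate = len(rows) / (time.perf_counter() - start)

print(f"inserted {N} rows at {insert_rate:,.0f} rows/s")
print(f"read {len(rows)} rows at {retrieval_rate:,.0f} rows/s")
```

Against a real big data store the same two timers would wrap the system's bulk-load path and its query path, and the run would be repeated under increasing concurrency to find where the rates degrade.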
The system is usually made of multiple components; hence, it is wise to first test them in isolation, at the component level, before testing them together. Testers have to be well-versed in big data technologies and frameworks, such as Hadoop, NoSQL, messaging queues, etc. Some of the tools gaining popularity in the market include:
1. Yahoo! Cloud Serving Benchmark (YCSB): A cloud service benchmarking client that reads, writes, and updates according to specified workloads
2. Sandstorm: An automated performance testing tool that supports big data performance testing
3. Apache JMeter: Provides plug-ins for testing the Cassandra database
Big data performance testing is challenging, but the right mix of tools, skill sets, and a robust strategy will go a long way toward driving a successful project.