Monitoring Load Generators during Performance Testing
Is monitoring of load generators required during performance testing? The usual answer is yes, but in practice it is rarely done. The importance of understanding the load generation process is often overlooked in performance testing. This post discusses a case study in which the load generators could not generate the expected load, and how the problem was resolved.
We were working on a Proof of Concept (PoC) to study the performance of a sample application under consistent peak load. The load tests were executed for a single business transaction at several load levels, starting from 100 users. Every test included a ramp-up period so that the application could handle a gradual increase in load. The load scenarios used a minimal think time of one second and ran for a short period of 15 minutes, because the objective was to find the peak load that results in 80% CPU utilization.
The application handled up to 300 users, but many errors were thrown when the load reached about 330 users during the ramp-up of the 500-user test. The test results log showed HTTP 500 Internal Server Errors along with exceptions such as java.net.BindException, java.net.ConnectException and java.net.SocketException. Analysis of the server logs showed that the HTTP 500 errors were caused by wrong values being passed for some of the request parameters. The script was designed to capture these dynamic values from the response to the previous request and use them in the subsequent request. It turned out that the first request itself had failed, so the second request was sent without updated parameter values; this resulted in a java.lang.NumberFormatException, and an HTTP 500 error was returned for the second request.
Because the primary cause of the failures was the first request, the next step was to find out why that request had failed. The error responses for those requests were related to socket exceptions, but none of those socket exceptions were logged in the server logs, which suggested that the failures originated on the load generator side. To understand this in more detail, the load generator was monitored to see how socket connections were established from the testing tool to the server. This was done using the netstat command on Microsoft Windows. The netstat output showed far more socket connections to the server than the number of users being simulated by the testing tool, and most of these sockets were in the TIME_WAIT state.
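The kind of netstat check described above can be automated. The following is a minimal Python sketch, not the procedure used in the case study: it counts TCP sockets per state for one remote host from netstat-style text. The function name `count_tcp_states`, the sample output and the IP addresses are all hypothetical, and the parser assumes the usual four-column layout (protocol, local address, foreign address, state) of numeric netstat output.

```python
from collections import Counter


def count_tcp_states(netstat_output, remote_host):
    """Count TCP connection states for sockets whose remote address is remote_host."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed row layout: Proto, Local Address, Foreign Address, State
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        # Drop the trailing ":port" to get the remote host part.
        foreign_host = fields[2].rsplit(":", 1)[0]
        if foreign_host == remote_host:
            states[fields[3]] += 1
    return states


# Hypothetical sample output; the addresses are made up for illustration.
SAMPLE = """\
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    10.0.0.5:49201         10.0.0.9:8080          ESTABLISHED
  TCP    10.0.0.5:49202         10.0.0.9:8080          TIME_WAIT
  TCP    10.0.0.5:49203         10.0.0.9:8080          TIME_WAIT
  TCP    10.0.0.5:49204         10.0.0.7:445           ESTABLISHED
"""

if __name__ == "__main__":
    print(count_tcp_states(SAMPLE, "10.0.0.9"))
```

On a real load generator the text could come from something like `subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout`, sampled periodically during the test. A TIME_WAIT count that keeps growing well beyond the number of simulated users would be the warning sign described above.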
Once it was clear that the issue lay in how connections were established from the load generator to the server, further investigation was needed into how the HTTP calls were being made. This led to an analysis of the testing tool plug-in used to simulate HTTP requests, and it was found that the default plug-in caused the issue because it did not reuse HTTP connections. This also explained why so many sockets were in the TIME_WAIT state. Based on the tool's documentation, the plug-in was replaced with another plug-in that supported connection reuse. The problem was resolved only once the script was updated to use the new plug-in. The test was then executed again, and the load was generated as expected without any socket issues.
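To illustrate the difference that connection reuse makes, here is a small self-contained Python sketch. It is not the testing tool or plug-in from the case study; it only uses the standard library to start a local HTTP server that counts accepted TCP connections, then sends the same number of requests with and without reusing the client connection. The names `ConnectionCountingHandler` and `connections_used` are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ConnectionCountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allows the server to keep connections open
    connections = 0                # TCP connections accepted so far
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted TCP connection, not once per request.
        super().setup()
        with type(self).lock:
            type(self).connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def connections_used(request_count, reuse):
    """Send request_count GETs and return how many TCP connections the server saw."""
    ConnectionCountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), ConnectionCountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    conn = None
    try:
        for _ in range(request_count):
            if conn is None:
                conn = http.client.HTTPConnection(host, port, timeout=5)
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body so the socket can be reused
            if not reuse:
                conn.close()  # one socket per request, as with a non-reusing client
                conn = None
    finally:
        if conn is not None:
            conn.close()
        server.shutdown()
        server.server_close()
    return ConnectionCountingHandler.connections


if __name__ == "__main__":
    print("without reuse:", connections_used(20, reuse=False))
    print("with reuse:   ", connections_used(20, reuse=True))
```

Without reuse, every request costs a new socket on the client, and each closed socket can linger in TIME_WAIT afterwards. With reuse, the same requests travel over a single connection.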
This whole exercise made one thing very clear: close monitoring of the load generators during performance testing, at least for basic system-level metrics, should be part of the performance testing process. It helps uncover load generation issues during the testing cycles. It also helps ensure that the load generator can generate the expected load, and it reduces the time and effort spent on unnecessary application analysis.
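As one example of a basic load generator check, the rough Python sketch below estimates how many sockets would sit in TIME_WAIT if every request opened a new connection, and compares that with the number of ephemeral ports available. Only the 330 users and the one-second think time come from the test described here. The 0.5 second response time, the 120 second TIME_WAIT duration and the 16,384-port range are illustrative assumptions; the real values depend on the application, the operating system and its configuration. All function names are hypothetical.

```python
def request_rate(users, think_time_s, response_time_s):
    """Approximate requests per second generated by a fixed number of users."""
    return users / (think_time_s + response_time_s)


def steady_state_time_wait(new_connections_per_second, time_wait_s):
    """Approximate sockets in TIME_WAIT at steady state (arrival rate x hold time)."""
    return new_connections_per_second * time_wait_s


def port_budget_exceeded(new_connections_per_second, time_wait_s, ephemeral_ports):
    """True if lingering TIME_WAIT sockets alone would exceed the port budget."""
    return steady_state_time_wait(new_connections_per_second, time_wait_s) > ephemeral_ports


if __name__ == "__main__":
    # 330 users and 1 s think time are from the test; everything else is assumed.
    rate = request_rate(users=330, think_time_s=1.0, response_time_s=0.5)
    lingering = steady_state_time_wait(rate, time_wait_s=120)
    print("new connections per second (no reuse):", rate)
    print("estimated sockets in TIME_WAIT:", lingering)
    print("exceeds assumed port budget:", port_budget_exceeded(rate, 120, 16384))
```

Under these assumed numbers the estimate exceeds the port budget, which would be consistent with the bind and connect exceptions seen on the load generator, although the case study itself does not confirm the exact mechanism. With connection reuse, the rate of new connections drops close to zero and the same check passes.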