Proper monitoring is fundamental to the operation of any high-performance eCommerce site. The reason is that it is very difficult to validate that code is correct before going live, given the limitations of the testing process and the limited time available to test properly. As a result, we test a little, and then throw the code over the wall to the user.
In an ideal world, the competition would test their code better, thereby giving us enough time to test ours. This may become the norm in ten or fifteen years, but for now, everyone is running as fast as they can, and very few eCommerce businesses would feel comfortable slowing down.
An alternative is to accept the fact that you are not going to get the time to test prior to going live. All is not hopeless, as there is no reason that you can't keep testing a system after it goes live. Normally, this type of testing is called monitoring, but it resembles testing in that data is gathered that indicates where the system is having problems. The fact that the transactions are real rather than synthetic is actually a positive, if the user frustration is not intolerable.
For example, if you go live with your new version of your eCommerce site in August, it may be fine for today's loads. By monitoring and continuously improving, you may be able to get it ready for the holidays, even if it is not ready on the first day in production.
There are several types of monitoring:
· 24x7 Health Monitoring - examining performance and behavior metrics on an ongoing basis with a goal of spotting problems before they become serious.
· Incident Detection and Notification - notifying the correct resource whenever a critical metric deviates from an acceptable range, so that remedial action can be taken.
· Rapid Triage - finding a way to get the site back up as quickly as possible.
· Root Cause Analysis - investigating a problem with a goal of finding the root cause and designing a permanent solution.
· Continuous Improvement - Code written by vendor partners is very difficult to assess for quality of construction. The best practice for ensuring that quality code is being delivered is to proactively examine all aspects of the site after each release. The goal of the examination is to identify poorly coded sections, improve the response time (as experienced by the user), increase the throughput, forecast future hardware needs, increase the site's stability, etc.
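To make "deviates from an acceptable range" concrete, here is a minimal sketch of a range check that routes an alert to an owner. The metric names, limits, and owner labels are hypothetical placeholders, not values from any particular monitoring product; real limits would come from the site's own baselines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricRange:
    """Acceptable range for one metric (names and limits are hypothetical)."""
    name: str
    low: float
    high: float
    owner: str  # who is notified when the metric leaves the range


def check_metric(rng: MetricRange, value: float) -> Optional[str]:
    """Return a notification message if value is outside the range, else None."""
    if rng.low <= value <= rng.high:
        return None
    if value < rng.low:
        direction, limit = "below", rng.low
    else:
        direction, limit = "above", rng.high
    return f"[ALERT -> {rng.owner}] {rng.name}={value} is {direction} limit {limit}"


# Illustrative ranges only.
RANGES = [
    MetricRange("checkout_response_ms", 0, 2000, "level1-support"),
    MetricRange("orders_per_minute", 5, 10000, "level2-support"),
]


def evaluate(samples: dict) -> list:
    """Check every sample that has a configured range and collect the alerts."""
    alerts = []
    for rng in RANGES:
        if rng.name in samples:
            message = check_metric(rng, samples[rng.name])
            if message is not None:
                alerts.append(message)
    return alerts


if __name__ == "__main__":
    for alert in evaluate({"checkout_response_ms": 3500, "orders_per_minute": 42}):
        print(alert)
```

In practice the message would go to a pager or ticketing system rather than to standard output; the point is only that each metric carries both its acceptable range and the resource responsible for it.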
Here are my recommendations for how to monitor to achieve each goal:
· 24x7 Health Monitoring
    · Install a "Quality of Service" monitor for every hardware and software system whose performance is critical to maintaining the site's scalability.
    · Create alerts that fire whenever a "Quality of Service" metric falls below its minimum threshold.
    · Create a series of dashboards for Level 1 and Level 2 support that show the overall health of the environment.
    · Provide Standard Operating Procedures for the support team to follow for any unhealthy condition.
· Incident Detection
    · Install a heartbeat monitor for every hardware and software component whose availability is required for the site to be selling.
    · Create alerts that fire whenever a critical heartbeat is lost.
    · Create a series of dashboards for Level 1 and Level 2 support to see what is up and what is down.
    · Provide Standard Operating Procedures for the support team to follow when addressing a server failure.
· Rapid Triage
    · Install an application-level profiler such as Dynatrace in both the Pre-Production and Production environments.
    · Train the Level 2 support staff to use the profiler to quickly discover which part of the code is performing badly.
    · Provide Standard Operating Procedures for the support team to follow, depending on where the problem is localized.
· Root Cause Analysis
    · Install an application-level profiler in both the Pre-Production and Production environments.
    · Use Tealeaf to isolate a transaction that triggered the undesirable result.
    · Train Level 3 support to use the profiler to evaluate the application's behavior during the unsuccessful user experience.
    · Isolate the root cause and submit it to the architecture team to design a fix for the problem.
· Continuous Improvement
    · Install an application-level profiler in both the Pre-Production and Production environments.
    · Profile every new code drop both before and after it is put into production to detect components that are performing poorly.
    · Isolate the root cause and submit it to the architecture team to design a fix for the problem.
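As an illustration of profiling a code drop before and after release, the sketch below compares two sets of per-component response times and lists the components that slowed down. It assumes the profiler can export average timings per component as a simple mapping; the component names, the timings, and the 20% tolerance are hypothetical.

```python
def find_regressions(before: dict, after: dict, tolerance: float = 0.20) -> list:
    """Return (component, before_ms, after_ms) for every component that slowed
    down by more than `tolerance` (a fraction), worst slowdown first."""
    regressions = []
    for component, new_ms in after.items():
        old_ms = before.get(component)
        if old_ms is None or old_ms <= 0:
            continue  # new component, or no usable baseline to compare against
        if (new_ms - old_ms) / old_ms > tolerance:
            regressions.append((component, old_ms, new_ms))
    # Sort by relative slowdown so the worst offender is reviewed first.
    regressions.sort(key=lambda r: (r[2] - r[1]) / r[1], reverse=True)
    return regressions


if __name__ == "__main__":
    # Hypothetical average response times in milliseconds.
    previous_release = {"search": 100, "cart": 200, "checkout": 400}
    new_release = {"search": 105, "cart": 300, "checkout": 1000, "wishlist": 50}
    for name, old_ms, new_ms in find_regressions(previous_release, new_release):
        print(f"{name}: {old_ms} ms -> {new_ms} ms")
```

Components without a baseline are skipped here rather than flagged, which is a design choice; a team might prefer to list them separately so new code is reviewed as well.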
If you instrument your site this well, you will be able to maintain it at the high level that your customers desire but rarely experience.
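For the heartbeat recommendation under Incident Detection, a minimal sketch of detecting a lost heartbeat follows. The class name, the component names, and the 30-second timeout are assumptions made for illustration; a real deployment would typically rely on its monitoring product's own heartbeat checks.

```python
import time
from typing import Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat per component and reports the ones that went quiet."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}  # component name -> timestamp of last heartbeat

    def beat(self, component: str, now: Optional[float] = None) -> None:
        """Record a heartbeat; `now` can be injected to make testing deterministic."""
        self._last_seen[component] = time.monotonic() if now is None else now

    def lost(self, now: Optional[float] = None) -> list:
        """Return the components whose last heartbeat is older than the timeout."""
        current = time.monotonic() if now is None else now
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if current - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=30)
    monitor.beat("web-01", now=0)
    monitor.beat("db-01", now=0)
    monitor.beat("web-01", now=40)  # web-01 keeps reporting, db-01 does not
    for component in monitor.lost(now=45):
        print(f"ALERT: heartbeat lost for {component}")
```

Only components that have reported at least once are tracked in this sketch, so each critical component would need an initial heartbeat (or an explicit registration step) before its absence can be detected.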