First off, there's a new episode of The Testing Show up over at Qualitest and on Apple Podcasts. This month Matt and I are talking with Lisa Crispin of mabl and Jessica Ingrassellino of Salesforce.org about Advanced Agile and DevOps and how the two are both similar and different. I think we did a good interview this month. Well, I almost always think we do good interviews, but I especially enjoyed doing this one, so I hope you will go and give it a listen.
Anyway, on with the "Thirty Days of Testability". As you might notice, today is the fifteenth. This is entry five. Catchup in full effect. You have been warned ;).
What monitoring system is used for your application? Do the alerts it has configured reflect what you test?
Again, we are going to be considering Socialtext as it is currently implemented as a standalone product because that's all I can really talk about. Before this week, I had a partial understanding of what we actually do for monitoring, with some holes in that knowledge. I'm fortunate in that I have a couple of great Ops people who are associated with our team. I should also mention that Socialtext can be deployed in two ways. The first is a hosted SaaS option and the second is a local installation. We leave the monitoring of local installations to the small percentage of our customers who prefer that setup. The majority of our customers use our SaaS option, and therefore we host their servers. To that end, we use the following (I'm pretty sure I'm not spilling any trade secrets here, so this should be OK. If this post changes dramatically between when I post it and tomorrow, well, I'll have learned differently ;) ). Anyway:
For monitoring the systems (CPU, disk space, HTTPS, HTTP), we use a tool called Nagios.
For monitoring site uptime, HTTPS lookup, and time to respond, we use an external app called Alertra.
In addition to those two tools, we also have a variety of hand-rolled scripts that scan all of our servers, looking at specific aspects of Socialtext instances and services to see if there are any issues that need attention. Examples include the IP address, the hostname as viewed by the public and whether it is accessible, which version that particular server is running, and whether certain key services are running (search, replication, cron, mail, ntp, our scheduler, etc.).
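To give a feel for what a hand-rolled check like this can look like, here is a minimal sketch in Python. To be clear, this is not our actual tooling. The service names are borrowed from the list above, but the function names, the port, and the way the list of running services gets collected are all assumptions made up for illustration.

```python
import socket

# Hypothetical set of services a healthy server is expected to run.
REQUIRED_SERVICES = {"search", "replication", "cron", "mail", "ntp", "scheduler"}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def missing_services(running, required=REQUIRED_SERVICES):
    """Return the required services absent from `running`, sorted by name."""
    return sorted(set(required) - set(running))


def check_host(host, running, port=443):
    """Collect the issues found on one host as a list of readable strings.

    `running` is assumed to be gathered elsewhere (for example, by a
    script on the server itself); how that happens is out of scope here.
    """
    issues = []
    if not port_is_open(host, port):
        issues.append(f"{host}: port {port} not reachable")
    for name in missing_services(running):
        issues.append(f"{host}: service '{name}' not running")
    return issues
```

As a quick example, missing_services(["search", "cron"]) returns ['mail', 'ntp', 'replication', 'scheduler'], which is the sort of list that would get flagged for attention.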
The second part of the question deserves a legitimate answer, and that answer is "yes" and "no". Yes, in that some of the alerts map to what we test, but no, in that there are a lot of areas we don't actively test as consistently as we should. The chat with our ops team was definitely enlightening and has given me some ideas of what I can do to improve on that front. What are those things? Well, other than verifying that we are actively doing things that affect and trigger those alerts, I will have to ask that you respect the fact that I am now veering into trade secret territory and I kinda' like the idea of keeping my job ;).