Wednesday, September 18, 2013

Adventures in "Epic Fail": Live From Wikimedia Foundation

Ahhh, there is nothing quite like the feeling of coming off a CalTrain car at the San Francisco terminus, walking onto the train platform, breathing that cool air, sighing, and saying "yeah, I remember this!"


Palo Alto is nice during the day, a great place to walk around in shirt sleeves, but there's something really sweet about the cool that eastern San Francisco gives you as you wind your way through SoMa.


OK, enough of the travelogue... I'm here to talk about something a bit more fun. Well, more fun if you are a testing geek. Today, we are going to have an adventure. We're going to discuss failure. Specifically, failure as it relates to Selenium. What do those failure messages mean? Is our test flaky? Did I forget to set or do something? Or... gasp... maybe we found something "interesting"! The bigger question is, how can we tell?

Tonight, Chris McMahon and Željko Filipin are going to talk about some of the interesting failures and issues that can be found in their (meaning Wikimedia's) public test environments, and what those obscure messages might actually be telling us.

I'll be live blogging, so if you want to see my take on this topic, please feel free to follow along. If you'd like a more authoritative real-time experience, well, go here ;) :

https://www.mediawiki.org/wiki/Meetings/2013-09-18


I'll be back with something substantive around 7:00 p.m. Until then, I have pizza and Fresca to consume (yeah, they had one can of Fresca... how cool is that?!).

---
We started with a public service announcement. For anyone interested in QA-related topics around Wikimedia, please check out the QA mailing list at lists.wikimedia.org, and if you like some of the topics covered tonight, consider joining in on the conversations.


Chris got the ball rolling immediately with the point that browser tests are fundamentally different from unit tests. Unit tests deal with small components, so when one fails we can get right to the component at fault; with browser tests, a failure could have any variety of causes.


Chris started out by telling us a bit about the environment that Wikimedia uses to do testing.

While the diagram is on the board, it might be tough to see, so here's a quick synopsis: Git and Gerrit are called by Jenkins for each build. Tests are constructed using Ruby and Selenium (with additional components such as Cucumber and RSpec). Test environments are spun up on Sauce Labs, which in turn spins up a variety of browsers (Firefox, Chrome, IE, etc.), which then point to a variety of machines running live code for test purposes (whew, say that ten times fast ;) ).
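
To make that stack a bit more concrete, here's a minimal sketch (my own, not Wikimedia's actual code) of what a browser test in that style might look like, using the selenium-webdriver and rspec gems; the Sauce Labs credentials, environment variable names, and test wiki URL are hypothetical placeholders.

```ruby
# A minimal sketch of a browser test in the stack described above:
# RSpec + selenium-webdriver, pointed at a remote browser on Sauce Labs.
# SAUCE_USER / SAUCE_KEY and the wiki URL are hypothetical placeholders.
require 'rspec'
require 'selenium-webdriver'

describe 'Main page' do
  before(:each) do
    caps = Selenium::WebDriver::Remote::Capabilities.firefox
    @driver = Selenium::WebDriver.for(
      :remote,
      url: "http://#{ENV['SAUCE_USER']}:#{ENV['SAUCE_KEY']}@ondemand.saucelabs.com/wd/hub",
      desired_capabilities: caps # era-appropriate keyword for this gem
    )
  end

  after(:each) { @driver.quit }

  it 'loads and has a title' do
    @driver.get 'http://test.example.org/wiki/Main_Page' # placeholder test server
    expect(@driver.title).not_to be_empty
  end
end
```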


The problem with analyzing browser test failures is figuring out what the root cause actually is. Is the issue with the system? Is it a timeout? Or is there an actual, legitimate bug to be seen in all this?

System Failures

Chris brought up an example of what was considered a "devastating failure": a build with 30 errors. What is going on?! Jenkins is quite helpful if you dig in and look at the Console Output and the upstream/downstream processes. By tracing the tests and looking at the screen captures taken when tests failed, the reason in this case turned out to be very simple... the lab server was just not up and running. D'oh!!! On the bright side, the failure and the output make clear what the issue is. On the down side, Chris lamented that, logically, it would have been much better to have a test that ran earlier in the process and confirmed whether that key server was up and running at all. Ah well, something to look forward to making, I guess :).
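
For what it's worth, the kind of early check Chris was wishing for could be as simple as a plain HTTP request that runs before any browser is launched. Here's a hedged sketch, assuming the net/http standard library and a made-up lab server URL; a dead server then fails one obvious spec instead of thirty obscure ones.

```ruby
# Hedged sketch of an early "is the server even up?" check: a plain HTTP
# request before any browser is launched. The URL is a placeholder, not
# the actual Wikimedia lab server.
require 'rspec'
require 'net/http'

describe 'test environment' do
  it 'responds over HTTP before we spin up any browsers' do
    uri = URI('http://test.example.org/wiki/Main_Page') # hypothetical lab server
    response = Net::HTTP.get_response(uri)
    expect(response.code.to_i).to be < 500
  end
end
```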

Another build, another set of failures... what could we be looking at this time? In this case, they were testing against their mobile applications. The error returned was "unable to pick a platform to run". Whaaah?!!! Even better, what do you do when the build is red, but the test results report no failures? Here's where the Console Output is invaluable. Scrolling down to the bottom, the answer comes down to... "execute shell returned a non-zero value". In other words, everything worked flawlessly, but that last command, for whatever reason, did not complete correctly. Yeah, I feel for them; I've seen something similar quite a few times. All of these are examples of "system problems", but the good news is that all of these issues can be analyzed via Jenkins or your CI server of choice.
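
To illustrate how a build can go red with zero test failures, here's a hypothetical Ruby sketch (not Wikimedia's actual build step): the tests pass, but a trailing housekeeping command fails, and that last exit status is what the "execute shell" step reports.

```ruby
# Hypothetical build-step wrapper: the tests pass, but a trailing
# housekeeping command fails, and its non-zero exit status is what the
# CI server sees -- so the build goes red with zero test failures.
system('bundle exec rspec spec/')           # suppose this passes (exit 0)
system('scp report.html archive:/reports')  # suppose this fails (exit 1)
exit $?.exitstatus                          # last status wins; build turns red
```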


Network Failures


Another fact of life, and one that can really wreak havoc on automated test runs, is the test that requires a lot of network interaction to do its job. The curse of a tester, and (in my experience) the most common challenge I face, is the network timeout. It's aggravating mainly because it makes almost all tests susceptible to random failures that, try as we might, we can never replicate. It's frustrating at times to run tests and see red builds, go run the very same tests, and see everything work. Still, while it's annoying, it's something that we can easily diagnose and view.
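
One common way to blunt (though never fully cure) the timeout problem is to lean on explicit waits rather than assuming the page is ready instantly. A minimal sketch with the selenium-webdriver gem follows; the URL, element id, and 30-second timeout are illustrative guesses, not values from Wikimedia's suite.

```ruby
# A minimal sketch of one mitigation: an explicit wait, so a slow network
# round-trip becomes a longer wait rather than an instant failure.
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.get 'http://test.example.org/wiki/Main_Page' # placeholder URL

wait = Selenium::WebDriver::Wait.new(timeout: 30)   # seconds; illustrative value
search_box = wait.until { driver.find_element(id: 'searchInput') }
search_box.send_keys 'Selenium'

driver.quit
```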


Application Failures


Sauce has an intriguing feature that allows tests to be recorded. You can go back and not only see the failed line in a log, but also watch a playback of the actual failed run. That's a valuable service and a nice thing to review to prove that your test can find interesting changes (the example Chris displayed was actually an intentional change that hadn't yet filtered down to the test to reflect the new code, but the fact that the test caught the difference and had a play-by-play to review was quite cool).


There's an interesting debate about what to do when tests are 100% successful. We're talking about the ones that NEVER fail. They are rock solid; they are totally and completely, without fail, passing... Chris says that these tests are probably good candidates to be deleted. Huh? Why would he say that?


In Chris' view, the kind of error that would cause a test like that to fail typically would not provide legitimate information. Because it takes such a vastly strange situation to make the test fail, and under normal usage it never, ever fails, the test is likely to be of very little value and provides little in the way of new or interesting information. To take a slightly contrarian view... a never-failing test may mean that we are getting false positives or near misses. IOW, a perpetually passing test isn't what I would consider "non-valuable"; instead, it should be a red flag that maybe the test isn't failing because we have written it in a way that it cannot fail. Having seen those over the years, those tests are the ones that worry me the most. I'd suggest not deleting a never-failing test, but exploring whether we can re-code it to make sure it can fail, as well as pass.
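
To show what I mean, here's a hedged before/after sketch in RSpec: the first spec is written so it literally cannot fail (errors are swallowed and the assertion is trivially true), while the second asserts on something real. The URL and expected title are placeholders I made up.

```ruby
# Hedged before/after sketch: a spec that can never fail versus one that can.
require 'rspec'
require 'selenium-webdriver'

describe 'Sandbox page' do
  before(:each) { @driver = Selenium::WebDriver.for :firefox }
  after(:each)  { @driver.quit }

  it 'loads (written so it cannot fail)' do
    begin
      @driver.get 'http://test.example.org/wiki/Sandbox' # placeholder URL
    rescue StandardError
      # any navigation error is silently swallowed
    end
    expect(true).to be true # trivially true; this spec tells us nothing
  end

  it 'loads (written so it can actually fail)' do
    @driver.get 'http://test.example.org/wiki/Sandbox'   # placeholder URL
    expect(@driver.title).to include('Sandbox')
  end
end
```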

Another key point... "every failed automated browser test is a perfect opportunity to develop a charter for exploratory testing". Many of the examples pointed to in this section relate to the "in beta" Visual Editor, a feature Wikimedia is quite giddy about seeing get ready to go out into the wild. Additionally, a browser test failure may not just point to an exploratory test charter; it might also be an indication of an out-of-date development process that time has caught up with. Chris showed an example of a form submission error that demonstrated how an API had changed, and how that change had been caught minutes into the testing.

So what's the key takeaway from this evening? There's a lot of infrastructure that can help us determine what's really going on with our tests. We have a variety of classes of issues, many of which are out of our control (environmental, system, network, etc.), but there are also application errors that can be examined and traced down to actual changes or issues in the application itself. Gaining the experience to tell these categories apart is key to helping us zero in on the interesting problems and, quite possibly, genuine bugs.

My thanks to Chris, Željko, and Wikimedia for an interesting chat, and a good review of what we can use to help interpret the results (and yeah, Sauce Labs, I have to admit, that ability to record the tests and review each failure... that's slick :) ).

---
Thanks for joining me tonight. Time to pack up and head home. Look forward to seeing you all at another event soon.
