02 July 2013

Dealing with a screwup: The story of Firestorm 4.4.2

Monday was Canada Day, eh? It was also a big, busy, pain in the ass of a day for the Firestorm team.

The first we heard of the problem was when Jessica Lyon told us in the developer chat that there was a big problem with Firestorm 4.4.1 and she'd convinced LL to give us until Tuesday to put out a new release before 4.4.1 was blocked.

To explain what the problem was, I need to first explain the statistics system. Linden Lab keeps statistics on a wide variety of performance measurements in the viewer. You may know that there's a crash statistics list that TPV developers in the Directory get if their viewer reaches a usage threshold, and that determines the order viewers are listed in the Directory each week.

There's far more than that, though. When Oz Linden says that 75% of users can run with Advanced Lighting Model turned on and get acceptable performance, that number didn't come from a Ouija board. It comes from actual reports of system configurations and measured performance sent in by every viewer directly to LL in the form of a statistics message.

That statistics message is supposed to be sent in once every 10 minutes. It contains performance and resource utilization numbers. It does not contain any personally identifying information. It's fed directly into LL's statistics system, where it gets crunched.

The problem was that Firestorm 4.4.1 was sending that message every 30 seconds instead of every 10 minutes. The statistics servers were getting hammered badly by having to deal with a sudden 20-fold spike in data. Oz discovered this when he went to generate the statistics report for last week.

There's a clause, 2.f, in the Third Party Viewer Policy that says:
You must not impose an unreasonable or disproportionately large load on our infrastructure or interfere with our providing the normal functionality of Second Life.
(The statistics message itself is required by 2.h of that same policy.) Guess what we did? LL was completely justified in telling us to fix it or else.

The reason we were spamming the server turned out to be a leftover from our helping to test server-side baking as it got close to its rollout next week. LL needed a big test case, and not just with their own viewer. We were in the final stages of doing QA on 4.4.1, so we made a change to enable enhanced logging of the viewer's responses to appearance update messages so that anyone who encountered a problem would have already collected the information needed to debug it, and LL would have a leg up on actually finding the problem. Part of that change was enabling a debug setting that sped up the viewer statistics message.

We handed out builds with that change made to folks to use in the test. The test was a rousing success by all reports, and LL was quite happy with it.

The problem was that we never undid that change in the release branch of the code. Oops. (All right, it was more of an "aw, SHIT!".)

So we had this problem to deal with. We talked for a bit about how to deal with it. There was no actual code change needed. All we had to do was change the default of one debug setting and, for the sake of completeness, the contents of an XML file used to control how the viewer logs data. Unfortunately, there is no good way to get the userbase to make this change en masse. Not only would we miss many who ignore messages of the day and such, making the changes would be difficult for many of them. (Users of SL are, by and large, not techies. This is a Good Thing.) It quickly became clear that we were going to have to spin a new release.

We knew right up front that we needed to own up to the problem and be open and public about dealing with it. Our users expect, and deserve, nothing less.

We also chose to block 4.4.1 ourselves, rather than have LL do it. We have the ability because Firestorm downloads information from our servers at startup, and one thing it loads is a list of blocked releases. When a release is started that's on the blocked list, it puts up a message, with some explanation text, and then exits when the user clicks OK. This does not require that any information about the user be collected at all, let alone sent to the Firestorm Project servers. If LL were to block it, on the other hand, the user would get a message about how they were not allowed to log on to Second Life with that viewer - with no further explanation. We felt this would cause more harm from user confusion than the very limited benefit that might come from having LL be the boogey man would give. Unfortunately, while the statistics message spam probably does not hurt OpenSim servers, our method blocks 4.4.1 for OpenSim users, as well.

There's a bug in the GPU table that causes problems for folks using ATI Radeon 6000M, 7000M, and 8000 chipsets. We have a fix for that. There was a fair amount of discussion about including that fix in 4.4.2. We decided eventually not to do so, because it would have taken much more QA than the change that was forcing the update, and that would have taken time we just didn't have. Jess had gotten LL to extend the deadline for the block to the next day; they weren't going to sit still for a 2-day QA cycle, especially pushing up against the July 4 holiday.

Finally, we decided to pull 4.4.1 from distribution immediately. We decided that there was no benefit in allowing people to download a release we knew was broken and that we knew we were going to have to block. We rolled back the version check on the website to 4.4.0 so people wouldn't be nagged to install a version that they couldn't get, and Jess wrote the first of two blog posts telling people we were going to have to force an update.

Then we waited while 4.4.2 was updated and built. We had to do that twice because of an error in the first build that omitted the very SSB compatibility flag that was the reason we pushed out 4.4.1 when we did. Still, we got the builds done, and the QA team turned around a very quick smoke test. We pushed 4.4.2 out the door, published the second blog post, waited a few minutes, and blocked 4.4.1.

Then we all joined the support groups and dealt with the flood of questions and complaints from the users. Oy.

The questions themselves weren't so bad. We knew we were going to have to deal with that. There was a massive volume of them, to be sure, but that's why we all piled in. Chat lag in the Firestorm Support English group was ferocious. (Why, oh why, did LL have to fix SVC-7031? :-) ) The questions were repetitive, and many of them were answered in the blog postings that we'd asked people to read. That wasn't the annoying part, though. Not even the guy who refused to read the blog post and demanded that we answer in 10 words or less why 4.4.1 wasn't good enough any more was truly annoying. (By now, you should understand just how impossible that request is to satisfy.)

There was a very common reaction of "Is this some kind of a joke?" (Or a hoax?) We wish it was, and that people were asking the question says good things about the users' perceptions of the quality of our code.

There was a fair amount of unhappiness that were were pushing out a new release so soon after 4.4.1. We expected that, and deserved it. We screwed up; this is the price of that screwup.

No, the annoying part was the tinfoil hat brigade. There were people saying "ZOMG, they're capturing our personal information all over again! Just like Emerald!" Uh, no. There was even one guy who was convinced that we'd stolen his credit card info, though we did get him calmed down eventually.

The wait to get 4.4.2 built, tested, and pushed out to the download servers was interminable, but it finally got out there. That was when people discovered that installing 4.4.2 over the top of 4.4.1 was a non-event. They didn't even have to uninstall 4.4.1 first, though many did. A full clean install with manual clearing of caches and the like wasn't needed for those upgrading from 4.4.1. The install was universally reported to be painless and take about 5 minutes or so. They even loved the performance, though there shouldn't be much reason for performance to improve much. We'll take it.

Now, it's all over but the cussing. People will continue to be surprised over the next several days that they can't log in any more with 4.4.1, and the support folks will have to explain over and over and over. (As I write this, 4.4.2 has been downloaded just over 26000 times; we have far more users than that.) But the LL servers aren't getting hammered any more, and, I hope, people will forgive us for the madness.

We did learn a lesson about our release process: Once we branch for release, nothing goes in without being tracked and verified before we actually spin the release code. We had such a process in place, but it broke down. That won't be allowed to happen again.

We didn't like doing this. It was a lot of work, and a lot of hassle, and made our users' lives harder than they should have been. Forcing folks who'd upgraded on our strong recommendation to do so again, five days or less later, is not particularly user-friendly. We had no choice, though, and I don't know that there's much we could have done differently once the problem became apparent.

2 comments:

  1. There was a problem.
    You guys tried to help resolve it.
    Something went wrong.
    You guys owned the mistake and worked hard to resolve it.
    And you did your best to get the word out about it.
    You've done everything right.

    And anybody who has a problem with that needs to go to the bathroom, look in the mirror, and call themselves a moron.

    I figure y'all are looking over the process to improve the development, QA, and communications on it.

    If only the Lab had product owners who were so merciless with their errors.

    thanks,
    -ls/cm

    ReplyDelete