Earlier today I posted a question on a mailing list trying to find out how other IT Solution Providers are dealing with the increasingly unreliable and costly Microsoft security patches.
Please don’t turn this into a security issue; it’s a business question:
I am depressed with Microsoft patching to the point that I might have to drop my SLA on all Windows-based servers at Own Web Now. Even on a day when the patch causes no problems at all, the reboots don’t happen as they should. Vanilla configurations simply do not start all of their services. Make up the weirdest thing you can get a Windows server to do and we’ve seen it. Remember, this is on a good day, not on a bad day when the security patch locks out Blackberries one month, Macintoshes the next, and crashes Dell boxes the month after that.
I am considering dropping all Windows servers into an automatic 8-hour maintenance cycle during the Microsoft patch day to compensate for Microsoft’s lack of QA. We can no longer minimize issues through testing, because even identical boxes (hardware and software; remember, we virtualize the crap out of things) do not behave the same. Reboots before the patch are fine; reboots after the patch… poof.
How is everyone else handling this? Drop the SLA? Lower your confidence in Microsoft (and who does that help)? Extend the maintenance cycle?
Second Tuesday of the month is becoming a religious holiday at Vladville…
The Process
Our process and our ingredients are pretty simple. We do a flash backup every Tuesday afternoon (EST); those backups are generally complete by 10 PM. We do a flash reboot just to make sure there are no hardware or software issues, then proceed with the patches that passed quality control / quality analysis earlier that day. We push using a collection of tools: WSUS plus other bits and pieces. The other bits and pieces come in when we want to apply hotfixes to critical infrastructure systems without a reboot.
Either way, pretty standard stuff. Most Windows servers run a similar configuration (actually, most are identical in both software and hardware, as they are mostly Virtual Server systems), so there is little reason to expect one to work while the others fail.
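For what it’s worth, the post-reboot sanity check really boils down to one thing: confirm that every service you expect is actually running before you call the box healthy. Here is a minimal sketch of that idea in Python, assuming you script against the standard Windows “sc query” command; the service names below are placeholders for illustration, not our actual configuration:

```python
import subprocess
import sys

# Hypothetical list of services we expect to be running after a
# patch-cycle reboot; substitute whatever your servers actually run.
EXPECTED_SERVICES = [
    "W3SVC",         # IIS web publishing
    "MSExchangeIS",  # Exchange information store
    "LanmanServer",  # file and printer sharing
]


def service_is_running(name: str) -> bool:
    """Ask the Windows service control manager for a service's state."""
    result = subprocess.run(
        ["sc", "query", name], capture_output=True, text=True
    )
    # 'sc query' prints a STATE line such as '4  RUNNING' when healthy.
    return "RUNNING" in result.stdout


def main() -> int:
    stopped = [s for s in EXPECTED_SERVICES if not service_is_running(s)]
    if stopped:
        print("Post-reboot check FAILED, not running:", ", ".join(stopped))
        return 1  # non-zero exit so a scheduled job can raise an alert
    print("Post-reboot check passed: all expected services are running.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Schedule something like this to run after the patch-night reboot and alert on a non-zero exit code, and at least the “vanilla box didn’t start all its services” failure mode gets caught before a customer does.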
The Costs
Do not let Microsoft WSUS and “Secure by Default, Design, Description…” fool you: patching is expensive, very expensive. There is no alternative to patching; we have to do it, and with critical updates we have to do it ASAP. No complaints there, though; it’s just a part of business.
My complaint is with the unplanned costs related to patching: costs that my customers and I have to pay because Microsoft produces unreliable and unstable patches. Let me explain my definition of that: “If a patch causes unexpected downtime or adversely impacts my system’s performance, I do not consider it stable or reliable.” Simple as that. A patch is supposed to close a security hole in the software without affecting the rest of the system.
This is no longer the case. A few months ago a Microsoft patch kept Macintosh systems (Entourage) from connecting to Exchange. The month after that, another patch stopped Blackberry from operating properly. And you remember my post about the Dell boxes.
My actual complaint is that I am on the verge of losing confidence in Microsoft’s ability to reliably and predictably patch the problems in their software. It is costing me a small fortune, both financially and in terms of reputation. If I cannot stand behind my SLA (Service Level Agreement), which states just how often the server will be up, then what value am I providing? If I am put in the position of having to apologize for things that are not my fault to begin with, where does that leave my reputation with my customers? Forget about the cost of overtime for employees, support calls, graveyard shifts, and the near-cottage industry built around patching tools, the preparation process, reporting, and follow-up, all just to make sure that the software we paid for continues to behave the way it was sold to us.
Forget about me
Now, this is simply a blog post that will change… nothing. But it is an opportunity to review your SLA and consider how you deal with unreliable partners whose products and services you support. I am on the verge of having to rewrite my SLA to put Microsoft patches into a maintenance cycle without any assurance on the time period. Here is one of the more intriguing answers I got:
Vlad, we ran into the same issues as we started to scale. We eventually had to build a lab for testing: once patches were approved there, they went onto our corporate network, and once they proved themselves there, we rolled them out to the clients. To resolve the reboot problems we put “lights out” cards in all our servers. I agree, it is not for the faint of heart.
Anyhow, something to consider…