Friday, March 21, 2014

Fixing healthcare.gov

This is the second post I have done in response to material published by Steven Brill.  My first post is at http://sigma5.blogspot.com/2013/02/medical-costs.html.  Both of Brill's articles were published in Time magazine.  His more recent article is in the March 10, 2014 issue and is titled "Code Red".  Brill is an astute observer of the health care industry.  But he is not a techie.  That did not get in the way of his previous article.  With his substantial business background he was able to dissect the hospital business to devastating effect.  I recommend both his original story and my analysis of it.  And, unfortunately, all of the defects he found then are still with us.

His lack of computer expertise shows in his recent outing.  But he still has a lot of useful things to say, so it is a worthwhile endeavor on his part.  And don't forget the news environment.  The problems with healthcare.gov were wretchedly over-covered by all parts of the media.  The "it's fixed" story has been covered, although perhaps less than it should have been given the overkill that preceded it.  But Brill's story is unique in my experience in that he addresses the "how it got fixed" story.  If you are tempted to say "that's not news", check out any sports section from any paper on any day.  Every one is jammed full of "how we won/lost" stories.  On to Brill's piece.

The healthcare.gov saga can be broken down into four eras.  Before October 1 the story was "It will work great from the get-go on October 1.  Trust me."  Era two ran from October 1 through October 17.  This was the "It's broken.  Nothing serious.  We'll have it fixed any minute now." era.  The third era (October 17 through December 23) is the one Brill focuses most of his attention on.  This is the period when the site went from seriously broken to working well enough for the moment.  The fourth era (December 24 onward) is the modern "It's working in the main.  There's more to fix but we have things under control." era.

Brill throws in some "tech talk" tidbits to spice things up but does not get into much of the nitty-gritty.  Instead he puts more focus on a management-level perspective.  This turns out to be very useful.  At a low level all tech projects are different.  But at the management level there is a lot of commonality and a lot that can be usefully applied to the next tech project.  So let's take a look at what Brill found.

And let me start by picking something out of the middle.  I believe it's the most important thing Brill reports.  Three rules for behavior were quickly adopted.  1.  Meetings are for solving problems.  Blame games can be played somewhere else.  2.  The ones who should be talking are the ones who know the most, not the ones with the highest rank.  3.  We need to stay focused on the most urgent issues.

If you take the opposite of these rules you will have the rules politics are usually played by.  But these rules are used over and over to pull big technical projects off successfully.  Politics has a big effect in the short run.  Technical changes have an even bigger effect in the long run.  Political effects tend to cancel each other out fairly quickly.  Technical changes tend to change everything and endure.  As an aside, it would be nice if the media, particularly the beltway media, stopped reporting everything through the lens of political rules and instead reported things through the lens of technical rules.  This change would help fix the mess our national politics are currently in.

The bottom line is that by adopting the above behavior (and some other things) a small team was able to fix the web site in six weeks or less.  Brill focuses on the contributions of the small team of techie superstars that the Obama Administration assembled, once they figured out they had a big problem.  And, as Brill points out, the Obama people had another big problem that happened to exactly coincide with era number two.  This is the time when the government shutdown was in effect.  Imagine the hue and cry if the administration had diverted its efforts away from getting the government shutdown ended and had instead focused on the "minor" (certainly how it would have been characterized) problem represented by the web site.  I don't think that it is a coincidence that the administration took major action to fix the site immediately after the shutdown crisis was resolved.

If you want lots more details of how the site was fixed, read the article.  But let me cherry pick some more.  A lot of time has been spent on the "blame the contractors" game.  This game has been played all over the place, not just with healthcare.gov.  But the first thing Brill reports is that the technical people employed by the site's contractors were not "defensive or hostile".  In fact the small team of fixers and the bigger teams of contractor people quickly developed a good working relationship and worked together well.  And once the fixers took over management of the project the contractor management also quickly fell into line.  In fact one of the contracting companies solved a big administrative problem by putting the fixers on their payroll.  This made them kosher with respect to various government rules and regulations.  The same turned out to be true of the Obama administration types.  No one started playing politics or putting up unnecessary roadblocks.  This broad cooperation up and down the line was one of the most important keys to getting the site fixed quickly.

The only administration-side management flaw Brill found was that there seemed to be no single individual who was in overall charge.  Responsibility was diffuse and it was unclear who was responsible for what.  On the other hand, Obama himself comes off very well.  Early and often he asked everyone if things were on track.  He was repeatedly reassured that they were.  Once the site started failing he held daily meetings to try to get a handle on the situation.  And as soon as the shutdown was over he tasked a single individual, the one person in his inner circle who actually had technical expertise, to get 'er fixed.  Things immediately started moving quickly from there.

The fixer team consisted of the best.  And they had deep and broad expertise in big technology projects.  Several were Silicon Valley heavyweights.  The fixer team immediately set out to fix things.  Some of this involved hands-on work by members of the team, like fixing code.  But they also applied their hard-earned knowledge of how you do this kind of thing.  The first thing they did was put together something called a "dashboard".  It performs for software what the dashboard in your car does for your car.  It tells you what's working and what isn't (e.g. "check engine").  It also tells you things like how fast you are going or, in the case of the web site, how many people are accessing the site.  Then there are the general health indicators (gas gauge, engine temp).  In the case of the web site these were things like the access times of the several databases the site depended on.  This allowed the team to translate "it's broken" into "these specific things are not working right".  The more detailed information from the dashboard allowed engineers to investigate specific components looking for specific problems.  And that in turn led to specific fixes that made the site work better.  To boil it all down, six weeks worth of the right specific fixes resulted in a site that worked pretty well and did all the "needs to be working right now" things correctly.
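To make the car-dashboard analogy concrete, here is a minimal sketch in Python of what one screen of such a dashboard might boil down to.  This is my own illustration, not the fix team's actual tooling: the component names, the warning thresholds, and the simulated probes are all hypothetical, and a real dashboard would be polling live services and databases rather than sleeping for random intervals.

```python
# Hypothetical sketch of a "dashboard" check: measure each component,
# compare against a threshold, and show what's healthy and what isn't.
import random
import time
from dataclasses import dataclass

@dataclass
class Gauge:
    name: str          # what is being measured, e.g. a database's response time
    value_ms: float    # most recent measurement in milliseconds
    warn_ms: float     # threshold for a "check engine" style warning

    @property
    def status(self) -> str:
        return "OK" if self.value_ms <= self.warn_ms else "SLOW"

def probe(component: str) -> float:
    """Stand-in for a real probe: time a round trip to the component.

    Here the work is simulated with a random sleep; a real probe would
    issue a query or HTTP request and time the response.
    """
    start = time.perf_counter()
    time.sleep(random.uniform(0.01, 0.15))   # simulated latency
    return (time.perf_counter() - start) * 1000.0

def render_dashboard(gauges: list[Gauge]) -> None:
    """Print a one-screen summary: what's working, what isn't, and how fast."""
    print(f"{'component':<28}{'latency (ms)':>14}{'status':>10}")
    for g in gauges:
        print(f"{g.name:<28}{g.value_ms:>14.1f}{g.status:>10}")

if __name__ == "__main__":
    # Made-up components and thresholds for a site of this general shape.
    components = {
        "identity-check database": 200.0,
        "plan-pricing database": 150.0,
        "enrollment queue": 300.0,
    }
    gauges = [Gauge(name, probe(name), warn) for name, warn in components.items()]
    render_dashboard(gauges)
```

The value of even something this simple is exactly what Brill describes: once every component has a number and a status next to it, "it's broken" turns into "these specific things are not working right", and engineers know where to dig.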

Now let me back off and apply my technical expertise to fill in some blanks.  I'm sure the fix team fixed a lot of stuff.  But I suspect that most of the fixes were actually done by the engineers employed by the contractors.  One of the fix team members observed "these guys want to fix things".  I suspect that the problem was not a lack of talent on the part of the contractors.  I suspect that the main problem was poor project management.  I have been involved in enough IT projects to know from personal experience that any IT project can be screwed up by managing it badly enough.  And it doesn't matter what anyone else does.  The best management has a good understanding of the technology (in this case web sites and databases) and a good understanding of the business (in this case health insurance).  Given enough good will (and time and money) shortcomings in one or both areas can be overcome.  In this case the basic technology side did not look that hard to me.  But, as I will explain below, the business side was actually a lot harder than it would appear.

Plenty of money was spent on the site.  Brill reports the amount was $300 million.  That should have been more than enough.  One problem that was identified early was that the hardware was inadequate.  But that problem was identified almost immediately.  Hardware was added in early October, well before the 17th.  And more hardware kept being added well into December.  It may be that more hardware is still being added.

Now let me provide some broader perspective, perspective that is beyond the scope of the subject matter of Brill's article.  I am going to take a quick look at two state sites, Washington and Oregon.  Both of these state sites had a simpler task.  They only had to deal with the idiosyncrasies of one state.  The federal site had to deal with the idiosyncrasies of more than 20 states.  Both states are blue states that are in decent shape from a budget perspective.  So there presumably was enough money in each case and there were no political roadblocks.  But the experience of the two states couldn't be more different.

The Washington site went up on time.  It has had a few glitches. But for the most part it has worked from the beginning and worked well.  Oregon's site has been a disaster.  And it has been a far bigger disaster than the federal site.  It never worked at all during the entire October - November time period.  The last time I saw any news coverage about the site (a couple of months ago now) reports were that it was still completely broken.  Oregon gave up and declared some time in October that they were going to an all manual system.  So why the differing outcomes?  Well, there were apparently some differences between how each state behaved that turned out to be critical.

Washington decided a couple of years ago that the state systems and databases that would be needed to support the state web site were not up to the task.  So they rolled out new, modern state systems before October 1.  This meant that the interfaces between the state web site and the state systems could be simple and that the state systems would be able to deliver exactly what the state site needed.  The web site would not have to "crutch around" inadequacies in the state systems.  Washington also bid out the web site in the usual manner but was able to provide substantial state supervision "in house".  The result, as I said, was a site that worked pretty well pretty much from day one.  A hypothetical sketch of what this difference looks like in code follows below.
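Here is that sketch, in Python.  It is purely my own illustration of the "crutch around" idea, not the actual architecture of the Washington site, the federal site, or any state system; the class names, the eligibility question, and the made-up cutoffs are all hypothetical.  The point is just this: when the backend can answer exactly the question the web site needs answered, the interface is trivial, and when it can't, the web site has to carry an extra layer that guesses its way around the gaps.

```python
# Hypothetical illustration of clean vs. "crutch around" interfaces.
from abc import ABC, abstractmethod

class EligibilityBackend(ABC):
    """What the web site needs from a state system: one simple question."""
    @abstractmethod
    def is_medicaid_eligible(self, household_income: float, household_size: int) -> bool: ...

class ModernStateSystem(EligibilityBackend):
    """A modernized backend answers the question directly."""
    def is_medicaid_eligible(self, household_income: float, household_size: int) -> bool:
        # Made-up cutoff purely for illustration.
        return household_income <= 16_000 + 5_000 * (household_size - 1)

class LegacyStateSystem:
    """An older system that only returns a raw record the caller must interpret."""
    def lookup(self, household_income: float, household_size: int) -> dict:
        return {"inc_band": "B" if household_income < 20_000 else "C",
                "hh": household_size}

class LegacyAdapter(EligibilityBackend):
    """The 'crutch around' layer: translate the legacy record into an answer."""
    def __init__(self, legacy: LegacyStateSystem) -> None:
        self.legacy = legacy

    def is_medicaid_eligible(self, household_income: float, household_size: int) -> bool:
        record = self.legacy.lookup(household_income, household_size)
        # Interpretation rules the web site team would have to reverse-engineer.
        return record["inc_band"] == "B" and record["hh"] >= 1

if __name__ == "__main__":
    backends: list[EligibilityBackend] = [ModernStateSystem(),
                                          LegacyAdapter(LegacyStateSystem())]
    for backend in backends:
        print(type(backend).__name__, backend.is_medicaid_eligible(18_500.0, 3))
```

One adapter like this is manageable.  The federal site, by my reading, would need something like it for every state system it touches, each with its own quirks, which is where the complexity piles up.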

As you might expect, Oregon did things differently.  I don't know whether Oregon did a major upgrade of their state systems like Washington did.  I certainly haven't heard that they did.  The other thing Oregon did was outsource pretty much the whole thing, including project management, to Oracle Corporation.  I have a lot of personal experience dealing with Oracle.  They have good people and they have bad people.  The company I worked for was not big enough to justify always getting the good people so we frequently got the bad people.  Oracle missed out on a big business opportunity by so mismanaging a project (an outsourcing project, as it happens) that we cancelled the project and kept that functionality in-house.  I don't know anything about the specifics of the Oregon project.  I definitely know nothing about who was or was not assigned to the project by Oracle.  But I do note that things failed.  And they failed spectacularly and in a very public manner.

So how does this relate to the federal project?  First, as I noted above, the federal project has to interface with a large number of state systems.  That is much more complicated than what Oregon had to pull off.  And the federal system has always worked better than the Oregon system, even in those bad old days of early October.  The only thing you can say in favor of the Oregon experience is that the site worked so badly that it was easy to decide early on to just scrap it.  So the Oregon experience is apt: the task was harder to pull off and easier to screw up than it would appear from the outside.  But what does the Washington experience tell us?

Remember the part where I pointed out that Washington decided to do a major upgrade to the state systems its web site would depend on?  Many of the states that depend on the federal site are southern states.  These are states that are generally technology-averse and poor.  What kind of shape do you think the state systems are in that have to tie into the federal site?  My guess is that they are in very poor shape.  This means that the federal site would have to have a lot of "crutch around" capability.

Add to this the fact that many of the states that depend on the federal site are red states.  The GOP has been adamantly opposed to Obamacare.  I think the anti-Obamacare vote count in the U.S. House is now up to 50.  I pointed out above that it is easy for bad management to mess up IT projects.  It is also easy for hostile state governments to put all kinds of roadblocks in the way of a successful web site implementation.  So the federal site has to deal with the complexity of servicing multiple states.  I don't know for sure, but it is a good guess that it has to deal with state systems that are more or less inadequate.  And it has to deal with state administrations that are hostile to Obamacare, and therefore to the web site, for political reasons.  Finally, I will note that some states made the decision not to do their own site very late in the game.  This meant that the federal site had to be reconfigured to handle those states very late in the game.  In short, the federal site had a number of things to contend with that neither Washington (success) nor Oregon (failure) did.

The site should have worked on October 1.  It seems apparent in retrospect that the biggest cause of failure was that the project was poorly managed by the Obama administration.  One specific cause was that no single individual was in charge of getting the web site right.  But, if the site could be fixed in six weeks, it wasn't really all that broken.  And there were a number of contributing factors.  The government shutdown and the long run-up to the shutdown were very disruptive.  And the knee-jerk hostility of Republicans definitely contributed.  As just one example, how much time was spent by senior administration officials trying to move other senior officials through the Senate confirmation process?  The time they spent on this necessary but essentially useless task could perhaps have been spent doing a better job of managing the web site rollout instead.  And it is important to remember that the Oregon experience shows us that it was easier to bungle this process than it appeared.
