Yeah, Starliner failed, but that’s what’s supposed to happen before using it

Boeing and NASA are going to run a full code review on the Starliner after uncovering some terrifying problems embedded in the spacecraft’s operational code. Most people tend to think of code as some monolithic thing, put together by a sharp team whose members all know what each other is doing. Nothing could be further from the truth. Large codebases like Starliner’s are written in components, and those components usually receive some pretty heavy QA, if only so the teams writing them don’t take a hit to their reputations.

Something the industry has really gotten right is code quality within these components: the actual craftsmanship of the code itself. This used to be the biggest source of errors, but with unit testing, various agile techniques, and automated component testing, the components work better than ever…by themselves. In other words, the Lego blocks are nearly perfect.
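To make that concrete, here’s a minimal sketch of the kind of component-level testing I mean. The clamp_thrust function and its tests are entirely hypothetical, not anything from Starliner’s codebase; the point is just that a small, isolated unit like this can be tested nearly to perfection on its own.

```python
import unittest

def clamp_thrust(requested_pct: float) -> float:
    """Clamp a requested thrust percentage to the valid 0-100 range.

    Purely hypothetical component logic; it stands in for any small,
    well-isolated unit that a team can test to death by itself.
    """
    return max(0.0, min(100.0, requested_pct))

class ClampThrustTest(unittest.TestCase):
    def test_value_in_range_passes_through(self):
        self.assertEqual(clamp_thrust(42.5), 42.5)

    def test_negative_request_clamps_to_zero(self):
        self.assertEqual(clamp_thrust(-10.0), 0.0)

    def test_overdrive_request_clamps_to_full(self):
        self.assertEqual(clamp_thrust(150.0), 100.0)

if __name__ == "__main__":
    unittest.main()
```

Tests like these pass in isolation all day long. They say nothing about what happens when this block gets snapped onto the next one.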

The problem I see with doing only a full code review is that even the best reviews rarely find the issues that come up in distributed computing, where components have to work with other components. End-to-end testing of those interactions can be hella expensive, and most sponsors won’t tolerate it. They want the stack to “just work” without ever being tested as an integrated whole. Sometimes it does, but most of the time, nope.

The strange thing is that the business and technical environment around the internet and its associated development tools have another way to find out how well those components work together: put all that final e2e testing onto the users. As a development team it’s liberating to be able to push updates to just a handful of the servers you’ve got running, knowing that most of the servers are still on older code that works as expected. Then you watch 1-3% of your users flush out the gremlins and fix them as they pop up.
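A rough sketch of what that canary-style rollout looks like, in Python. The CANARY_FRACTION value and the serve_version helper are invented for illustration, not any particular company’s rollout system; the idea is just that a small, stable slice of users gets routed to the new build while everyone else stays on the proven one.

```python
import hashlib

CANARY_FRACTION = 0.02  # hypothetical: roughly 2% of users see the new build

def bucket(user_id: str) -> float:
    """Map a user id to a stable value in [0, 1) so canary assignment is sticky per user."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x1_0000_0000

def serve_version(user_id: str) -> str:
    """Route a small, consistent slice of users to the canary build; everyone else gets stable."""
    return "canary" if bucket(user_id) < CANARY_FRACTION else "stable"

if __name__ == "__main__":
    # Quick check: count how many of 10,000 fake users land on the canary build.
    hits = sum(serve_version(f"user-{i}") == "canary" for i in range(10_000))
    print(f"{hits} of 10,000 users routed to the canary build")
```

The users in that 2% slice are, in effect, the final integration test.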

So hey, 90% of all software can afford to be glitchy for a short while, and we users put up with it because we gain so much, so cheaply. And what the hell, if Amazon botches a checkout basket I can always bail and come back a few hours later to find it working just fine. Good ol’ continuous releases!

It’s a totally different story with systems that kill people when they fail. High-risk systems are either hooked up to physical systems that do things like fly us around at 600 MPH and 35,000 feet, or, in Boeing and NASA’s case, send humans into orbit to rendezvous with the ISS. High risk is also always associated with systems that count and track our money, because money is the fuel of our well-being, and people who lose money to a bug have recourse for recovering it, often to the detriment of the entity that owns the system.

Another thing about high-risk systems is that all of their components really need to work, and subsystems really need to interact correctly, passing correct information, events, commands, etc. The impact of even a tiny failure in these systems gets amplified. Physics rears its head and there is no stopping physics. Boom, another big smoking hole in the ground.

So basically it is really risky to use something like A/B testing on a high-risk system, but we have a generation of coders and engineers who ship ever faster for sponsors who only expect that fewer than 1-3% of users on any given day hit the bugs and that the bugs get fixed ASAP. If that happens, then all is fine.

Before 2000 most systems had some kind of end-to-end testing, but now, in most cases, it’s treated as unnecessary drag on the continuous improvement of the system. As a result there’s also a generation of execs who constantly challenge their engineers and pressure them to go faster with fewer people, since time and people are money. From what I’ve heard about the backgrounds of some of the execs at Boeing, many of them picked up the tactics websites use to bring down costs and tried to apply them to the command-and-control systems for the Starliner and for their commercial airplanes.

They broke up the teams that used to share the same building and could talk in real time. Now most components are created by teams scattered around the world, and integration is a problem with physical systems because all you can do is emulate the components that *your* component talks with. End-to-end testing on spacecraft is hella expensive, since the craft is destroyed when there are inter-component failures, so who could blame folks for trying other ways to do the same thing?

Maybe NASA is looking at the inter-component interactions as well as the code quality. I hope so. This can be expensive if they choose to write a bunch of emulators of real physical components, almost as expensive as the code being tested, but hey, fewer smoking holes in the ground.
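To show the shape of that idea in miniature, here’s a hypothetical Python sketch. The FlightComputer class, its clock interface, and the mission-elapsed-time numbers are all invented for illustration; real avionics obviously aren’t written like this, but the pattern of standing up an emulated neighbor component so you can exercise the interaction is the same.

```python
import unittest
from unittest import mock

class FlightComputer:
    """Hypothetical component under test: decides when to start a scheduled burn."""

    def __init__(self, clock):
        # The clock is a *different* component; in flight it would be real hardware.
        self.clock = clock

    def should_start_burn(self, burn_start_met: float) -> bool:
        """Start the burn once mission elapsed time reaches the scheduled value."""
        return self.clock.mission_elapsed_time() >= burn_start_met

class FlightComputerInteractionTest(unittest.TestCase):
    def test_burn_waits_for_scheduled_time(self):
        # Emulate the neighboring clock component instead of wiring up real hardware.
        clock = mock.Mock()
        clock.mission_elapsed_time.return_value = 1_000.0
        fc = FlightComputer(clock)
        self.assertFalse(fc.should_start_burn(burn_start_met=1_800.0))  # too early

        clock.mission_elapsed_time.return_value = 1_800.0
        self.assertTrue(fc.should_start_burn(burn_start_met=1_800.0))   # right on time

if __name__ == "__main__":
    unittest.main()
```

The value of the emulator is that it lets you test the conversation between components, including what happens when the neighbor is wrong, slow, or silent, without buying or blowing up the real hardware.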

Fortunately the Starliner didn’t leave a smoking hole in the ground like the Dragon did a few months ago, but as NASA implied, they were incredibly lucky. Boeing was not so lucky with the 737 MAX, where sloppy management decisions led to the loss of 300+ lives. We may be in a time when some more of the old-fashioned physical testing done in the 1960s, 70s, 80s, and 90s is appropriate again. In the past few months we’ve seen this happen with both Boeing and SpaceX. Good, keep at it, don’t back off, and don’t introduce last-minute changes, as those can introduce huge risk. Oh, and please eat the cost of retraining pilots in the 737 MAX case. It’s a lot cheaper than passengers losing trust.