Good requirements are key to the success of your apps. We all know that, and we spend a ton of time trying to get them right, from gathering the requirements to the User Acceptance Testing done by either your test team or your customers. To be sure that your sponsors and stakeholders are really happy, you’ll want to make sure that you fulfill all of their requirements. All of them, including the unstated ones.
So what might be missing, and why might you have missed it? I’ve noticed that the requirements most commonly missed, and the ones that cost big money and market share when they are missed, come from a major and often overlooked stakeholder in all of this: the Operations Engineer. Here are some of their common requirements:
- Scripted and well-documented installation and configuration aids. The best are the ones that run “headless”, kicked off from other scripts or programs. Think “elastic scaling”.
- A list of the app components that must be running healthy* for the app to be healthy, ranked from “Critical” to “Trivial”
- A definition of a unit of scale for each component. I bunch these under the general term “transactions”, but they could be actual transactions, request/response events, state changes, or whatever else fits. It just needs to be something that can be counted and used to understand the component’s useful performance.
- A list of fewer than 10 readable metrics** for each component that give the operator a read on its health and on its headroom, or upper capacity. What are those metrics at saturation and beyond? Since the component could be running on many different kinds of servers, we also need the characteristics that affect capacity:
  - OS services (threads, processes, etc.), CPU, number of cores, disk I/O, memory, network bandwidth and latency, etc.
  - Upstream service responsiveness, reliability, and availability
- Prescriptive actions to take to restore a saturated component after it fails, not just inside the failed component but also what to do with the upstream & downstream components
- Impact on downstream components of the app when each upstream component fails
- Impact of failed upstream services that the component needs
- The optimal proportion of component instances running in the app so there are no bottlenecks or wasted excess capacity
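
That last item comes down to simple arithmetic once each component has a unit of scale. The sketch below is only an illustration under stated assumptions: the component names, the per-instance capacities, the number of component transactions per app request, and the 25% headroom are all made-up values, not measurements from any real app.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    tx_per_instance: float  # assumed: sustainable transactions/sec for one instance
    tx_per_request: float   # assumed: component transactions caused by one app request


def instances_needed(components, requests_per_sec, headroom=0.25):
    """Return {component name: instance count} for a target app load.

    Each component is sized to carry its share of the load plus a
    fractional headroom, so that none of them becomes the bottleneck.
    """
    plan = {}
    for c in components:
        demand = requests_per_sec * c.tx_per_request * (1 + headroom)
        plan[c.name] = max(1, math.ceil(demand / c.tx_per_instance))
    return plan


# Hypothetical three-tier app; every number here is invented for illustration.
components = [
    Component("web", tx_per_instance=200, tx_per_request=1),
    Component("api", tx_per_instance=50, tx_per_request=2),
    Component("db", tx_per_instance=400, tx_per_request=4),
]

plan = instances_needed(components, requests_per_sec=100)
print(plan)  # {'web': 1, 'api': 5, 'db': 2}
```

The resulting ratio (1 : 5 : 2 in this made-up case) is the "optimal proportion" for that load. It only holds if the per-instance capacities were measured on the same kind of server the instances will actually run on.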
* – Defining “Healthy”: a state of a component where it is functioning within the capacity of its supporting environment and is returning to a known nominal state after every transaction
** – Defining “Metric”: a measurement with stated values that indicate when a component is heading into trouble, is in trouble, or is recovering from trouble, plus a range of values where the component is “healthy”
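
To make the “Metric” definition concrete, here is a minimal sketch of a metric that carries its own stated ranges. It assumes a metric where higher values are worse, and it tells “recovering” apart from “heading into trouble” by looking at the previous state. The `Metric` class, its field names, and the queue-depth thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    unit: str
    healthy_max: float  # at or below this value: healthy
    warning_max: float  # above healthy_max, up to this value: the in-between band
    # anything above warning_max counts as "in trouble"

    def classify(self, value, previous_state=None):
        """Map a reading to one of the four states from the definition."""
        if value > self.warning_max:
            return "in trouble"
        if value > self.healthy_max:
            # In the in-between band, direction depends on where we came from.
            if previous_state in ("in trouble", "recovering"):
                return "recovering"
            return "heading into trouble"
        return "healthy"


# Hypothetical metric and thresholds, invented for this example.
queue_depth = Metric("queue_depth", "messages", healthy_max=100, warning_max=500)

states = []
state = None
for reading in [40, 180, 620, 300, 80]:
    state = queue_depth.classify(reading, previous_state=state)
    states.append(state)

print(states)
# ['healthy', 'heading into trouble', 'in trouble', 'recovering', 'healthy']
```

The point of the sketch is that the thresholds ship with the metric, so the operator does not have to guess what a raw number means.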