William Lieurance's Tech Blog

Liveness and Readiness are Not Health Checks


A change in live-site ownership between Dev and Ops has flipped inputs and outputs

In the world before every conference had at least two talks about the relationship between "dev" and "ops", live-site success was owned by the operators of the platform that ran the application. In order to know what actions to take to keep the site up, ops tooling would ask the app if it was "healthy", meaning that no ops action needed to take place. Developers were neither expected nor encouraged to understand the platform.

If the app responded "unhealthy" or failed to respond at all, the platform would execute some amount of business logic that figured out what needed to happen.

                         Veil of mystery
    Operations                  |                     Developers
                                |
 Platform                       |                           App
                                | 

    Live-site ownership
       means querying   ----------------------------->  healthy?
     and using business
      logic to respond

Over the last decade there has been a push, "DevOps", in part to empower the developers of the application with ownership of the success of that application on the live-site. At the same time, and for exactly the same reason, we're now seeing platforms (like Kubernetes) and app libraries (like MicroProfile Health) that replace the single "health check" with a set of checks. These checks do not report more nuanced aspects of application state; they report nuanced responses that the platform should take.

                         Veil of mystery
    Operations                  |                     Developers
                                |
 Platform                       |                           App
                                | 

                                              Live-site ownership
                                              means using business
 live/ready  <-----------------------------   logic to signal the
                                              application's needs

The most common are a pair of checks called "Liveness" and "Readiness". Responding negatively to a "liveness" check tells the platform that the instance of the application should be restarted. Importantly, it tells the platform that the problem the app is having is one that is likely to be fixed by a restart. "Readiness" responses tell the platform whether or not the app should receive traffic. Importantly again, a negative response tells the platform that the problem the app is having is one that is not likely to be fixed by a restart.
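As a rough illustration of that split, here is a minimal Python sketch of an app answering the two checks separately. Everything in it is an assumption made for the example: the `AppState` fields, the `/livez` and `/readyz` paths, and the choice of 503 for a negative answer. The only platform fact it leans on is that a Kubernetes HTTP probe treats a status code from 200 up to (but not including) 400 as success.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler


@dataclass
class AppState:
    """Hypothetical internal state; the field names are illustrative only."""
    worker_stuck: bool = False        # a restart is likely to fix this
    database_reachable: bool = True   # a restart is unlikely to fix this
    warmed_up: bool = True            # still loading; a restart would not help


def liveness_status(state: AppState) -> int:
    """Return 200 unless the app is asking the platform for a restart."""
    return 503 if state.worker_stuck else 200


def readiness_status(state: AppState) -> int:
    """Return 200 only when the app is asking to receive traffic."""
    ok = state.database_reachable and state.warmed_up
    return 200 if ok else 503


STATE = AppState()


class ProbeHandler(BaseHTTPRequestHandler):
    """Serves the two checks on hypothetical paths /livez and /readyz."""

    def do_GET(self):
        if self.path == "/livez":
            code = liveness_status(STATE)
        elif self.path == "/readyz":
            code = readiness_status(STATE)
        else:
            code = 404
        self.send_response(code)
        self.end_headers()
```

Note that an unreachable database fails readiness but not liveness: the app is saying "stop sending me traffic", not "restart me". To try it locally, the handler can be passed to `http.server.HTTPServer` on a port of your choosing.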

Making the checks more nuanced is itself a form of empowering development teams. The business logic needed to evaluate whether or not a restart would be helpful is something that operations teams have frequently implemented poorly, because all they had to work with was the single "up/down" response of a health check. Moving that nuance and the logic it needs into the app lets the experts on the possible states of the application make the determination.

Liveness and Readiness check results are not about the state of the application. They are instructions to the systems around the application.

Seen from the opposite perspective: Kubernetes as a platform cares about an application being "up" only insofar as there is an action that Kubernetes should take when the application is "down". Without such an action, no information is being transmitted, and the internal state of the application is meaningless to the platform.

An interesting implementation detail is that in Kubernetes, liveness and readiness are pull-based only: the kubelet polls the application. While applications can try to push instructions to the Kubernetes control plane with regard to restarts and traffic shaping, pushing that data also requires authenticating and authorizing those instructions. Liveness and readiness checks instead use the authorization of the kubelet to communicate back to the control plane. Either way, the signal itself is based on logic in the app, with an instruction ending up in the platform.

This does not mean that health checks no longer exist. Indeed, if the application cannot respond to a liveness check at all, it falls to the business logic of the platform to decide what to do. The most common answer is to destroy the instance that cannot respond, and then either spin up a new copy somewhere else, roll back a version and try again, or page a human to deploy more complex business logic.
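To make the platform's side of the pull model concrete, here is a small Python sketch of how a poller might turn probe results, including no response at all, into an action. It is not how the kubelet is implemented. `ProbeTracker`, the action strings, and the threshold of 3 are all assumptions for illustration, loosely modeled on the idea of a failure threshold.

```python
from dataclasses import dataclass
from typing import Optional


def probe_succeeded(status: Optional[int]) -> bool:
    """None stands for a timeout or refused connection: no answer at all."""
    return status is not None and 200 <= status < 400


@dataclass
class ProbeTracker:
    """Counts consecutive failures for one probe on one app instance."""
    action: str                  # what the platform does at the threshold
    failure_threshold: int = 3   # assumed value for this sketch
    consecutive_failures: int = 0

    def observe(self, status: Optional[int]) -> Optional[str]:
        """Record one poll result; return the action to take, if any."""
        if probe_succeeded(status):
            self.consecutive_failures = 0
            return None
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            return self.action
        return None


liveness = ProbeTracker(action="restart instance")
readiness = ProbeTracker(action="stop routing traffic")

# Three polls in a row: an explicit 503, then two timeouts.
results = [liveness.observe(s) for s in (503, None, None)]
print(results)  # [None, None, 'restart instance']
```

The explicit 503 and the timeouts are counted identically here. The difference is in who made the decision: the 503 is the app's own instruction, while the timeout is the platform falling back on its own logic because the app could not give one.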