Herding Azure WebApps : Health checks

Posted 24 Sep 2021

When running more than a few web and function apps in Azure, you need a generic way to check in on them. Kubernetes has a nice convention for this using a couple of health endpoints: one for a liveness check and one for a readiness check. Applying the same pattern to Azure Web Apps provides some really nice operational insights.

Health checks mostly come in three flavors:

Liveness
Mostly intended to detect application hangs (for example due to thread pool starvation) and outright application crashes.
Readiness
Verifies that your application is correctly configured and ready to do work.
Startup
Can be used to allow an app to warm up for a slot swap on Azure Web Apps.

By default, health checks in ASP.NET Core return one of the following statuses:

Status	HTTP Status	Description
Healthy	200	Everything is fine
Degraded	200	The app is not functioning as it should, but the failures are recoverable. For example: a queue is filling up
Unhealthy	503	The app is missing a required dependency, is not configured correctly or has gone into a faulted state
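
The table above can be written down as a small lookup. This is a language-neutral sketch in Python, not the ASP.NET Core API; the names `HealthStatus` and `http_status_for` are made up for illustration.

```python
from enum import Enum


class HealthStatus(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


# Default mapping from the table above: only Unhealthy yields a non-2xx response.
DEFAULT_HTTP_STATUS = {
    HealthStatus.HEALTHY: 200,
    HealthStatus.DEGRADED: 200,
    HealthStatus.UNHEALTHY: 503,
}


def http_status_for(status: HealthStatus) -> int:
    """Return the HTTP status code a health endpoint would send for this status."""
    return DEFAULT_HTTP_STATUS[status]
```

Note that Degraded and Healthy both map to 200, so a probe that only looks at the status code cannot tell them apart.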

Liveness check

The liveness check is a very basic check that should return HTTP 200 OK as quickly as possible. Its main purpose is to check if the app is running and can handle requests. It can evaluate additional metrics like memory consumption or remaining disk space, as long as it takes very little effort to collect these metrics.
You could for example return a Degraded status when available disk space drops below a threshold.

The liveness check helps detect applications that malfunction under pressure, for example due to thread pool starvation.

The liveness check should explicitly not call dependent services or perform other costly operations.

In the worst case, calling dependencies from a liveness check could end up circling back to the service, creating an infinite loop.
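
As an illustration of the disk space example, a liveness check could look roughly like the sketch below. It is written in Python to stay language-neutral; the function name `liveness_check` and the 1 GiB threshold are assumptions, not values from any platform.

```python
import shutil
from typing import Tuple

# Hypothetical threshold: report Degraded below 1 GiB of free space.
MIN_FREE_BYTES = 1 * 1024**3


def liveness_check(path: str = "/", min_free_bytes: int = MIN_FREE_BYTES) -> Tuple[str, int]:
    """Cheap liveness check: no network calls, only a local disk-space lookup."""
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        # Still HTTP 200: the app is alive, just short on disk.
        return "Degraded", 200
    return "Healthy", 200
```

The check touches nothing outside the local machine, so it stays fast even when dependencies are struggling.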

Ready check

The ready check is intended to verify that the service is configured correctly and able to access all its essential dependencies. It should, for example, touch all (cloud) services like databases, storage accounts and messaging infrastructure. The ready check is also a good place to verify that data models are compatible with the software version.
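
A ready check along these lines can be sketched as a function that runs one small check per dependency and reports the worst result. The sketch is in Python to stay language-neutral; `readiness_check`, `check_database` and `check_storage` are hypothetical names, and the two dependency checks are stubs standing in for real calls.

```python
from typing import Callable, Dict, Tuple

# Severity order used to pick the worst result.
SEVERITY = {"Healthy": 0, "Degraded": 1, "Unhealthy": 2}


def readiness_check(checks: Dict[str, Callable[[], str]]) -> Tuple[str, Dict[str, str]]:
    """Run every dependency check; return the worst status plus per-dependency detail."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception:
            # A check that throws counts as an unhealthy dependency.
            results[name] = "Unhealthy"
    overall = max(results.values(), key=SEVERITY.__getitem__, default="Healthy")
    return overall, results


def check_database() -> str:
    # Hypothetical stub: a real check would run a trivial query.
    return "Healthy"


def check_storage() -> str:
    # Hypothetical stub: simulates an unreachable storage account.
    raise ConnectionError("storage account unreachable")


overall, details = readiness_check({"database": check_database, "storage": check_storage})
```

Keeping the per-dependency results next to the overall status makes it possible to report which dependency is failing, not just that something is.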

Synchronous connections

In distributed systems we usually try to avoid direct, synchronous connections between services. Sometimes you just have to, and you may be tempted to have the readiness check invoke the liveness check of a downstream service. While it is possible to do so, you should consider alternatives, because every hit on a dependency is load on that service and we want to minimize that, especially when a service is already under stress.

To avoid calling health endpoints, consider implementing a circuit breaker on the connection. When the circuit breaker trips, you know the dependency is down. There's no need to check it with a network call that is going to time out anyway. In your ready check, use the circuit breaker status instead of the result of a direct liveness check.
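
The idea can be sketched as follows. This is a deliberately minimal circuit breaker in Python, not a production library; the class, its parameters (`max_failures`, `reset_after`) and the `downstream_ready` helper are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; allows a retry after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After the cool-down the breaker is half-open: the next call may go through.
        return self._clock() - self._opened_at < self.reset_after

    def call(self, func: Callable[[], object]) -> object:
        """Invoke the downstream call through the breaker."""
        if self.is_open:
            raise RuntimeError("circuit open: dependency assumed down")
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit again.
        self._failures = 0
        self._opened_at = None
        return result


def downstream_ready(breaker: CircuitBreaker) -> str:
    """Readiness contribution for the downstream service: no network call involved."""
    return "Unhealthy" if breaker.is_open else "Healthy"
```

The ready check only reads state that regular traffic has already produced, so it adds no load to a dependency that is already in trouble.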

Rules

Health checks should be implemented with a couple of rules in mind:

Health checks should either execute fast or fail fast. Use a timeout to limit duration to 10 seconds or less.
Caching health check results is fine for a short duration, maybe 30 seconds or a minute. It will help limit the number of calls made to dependencies.
Prevent chaining health checks across your entire dependency graph. It will make the call expensive and introduce the risk of the call looping back to the service, forming an infinite loop.
Rely on the infrastructure to check if your service is running by having it call your health endpoints.
Health end-points that expose information about your infrastructure must be secured.
Don't log healthy responses, only log failed checks. Logging costs money and you're generally not that interested in stuff that's working as it should.

Consider what will happen if the health check doesn't return, for example when the service is drowning in requests or suffering from other issues that prevent it from responding in time, or from responding at all.
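
The timeout and caching rules can be combined in a small wrapper. The sketch below is in Python and uses the 10 second and 30 second figures from the rules above as defaults; `run_with_timeout` and `CachedCheck` are hypothetical names, and the cache is not thread-safe.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Optional, Tuple

# Shared worker pool so a hanging check does not block the caller.
_executor = ThreadPoolExecutor(max_workers=4)


def run_with_timeout(check: Callable[[], str], timeout_s: float = 10.0) -> str:
    """Run a check, reporting Unhealthy if it raises or exceeds the timeout."""
    future = _executor.submit(check)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread keeps running in the background; we just stop waiting.
        return "Unhealthy"
    except Exception:
        return "Unhealthy"


class CachedCheck:
    """Caches the last result for `ttl_s` seconds to limit calls to dependencies."""

    def __init__(self, check: Callable[[], str], ttl_s: float = 30.0,
                 timeout_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl_s = ttl_s
        self._timeout_s = timeout_s
        self._clock = clock
        self._cached: Optional[Tuple[float, str]] = None

    def __call__(self) -> str:
        now = self._clock()
        if self._cached is not None and now - self._cached[0] < self._ttl_s:
            return self._cached[1]  # still fresh: skip the dependency call
        status = run_with_timeout(self._check, self._timeout_s)
        self._cached = (now, status)
        return status
```

A timed-out check is reported as Unhealthy here, which is one possible answer to the question of what should happen when a check never returns.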

Health checks in ASP.NET Core

ASP.NET Core introduced a standardized way to implement health checks, which makes it easy to add checks of your own.

Health checks in Azure Functions

The default health check implementation uses a hosted service, which is not allowed within a function host up to version 3. Functions that target .NET 5 run out-of-process, which should enable the normal health middleware to run.

There is a hack to work around the issue up on GitHub.

Monitoring

Azure WebApp health probe

Azure WebApps can monitor your health end-points. If you have scaled out to 2 or more instances, Azure will automatically take an unhealthy instance out of the load balancer and recycle it or restart the container.

It makes perfect sense to hook this functionality into the liveness check, since that will start to fail when your app locks up.

If you want to secure your health checks, make sure you follow the guidelines in the documentation to enable Azure to access the health end-points.

Slot swaps

When using slots to reduce downtime during deployment, you can have the platform use the readiness check to assess whether your app is configured correctly and ready to start receiving requests. To do so, set the WEBSITE_SWAP_WARMUP_PING_PATH app setting to the path of your readiness check, or of your startup check if you have implemented one.

Log health failures

ASP.NET Core health checks support pushing health check results to APM and log analytics services like Application Insights, Datadog, Prometheus or Seq.

The AspNetCore.Diagnostics.HealthChecks open-source project includes plugins to enable this.

This solution also logs details about which dependency is failing, which makes it a good source of information for problem analysis.
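
To make the "only log failed checks" rule concrete, here is a rough sketch of what a result publisher does with a health report. It is written in Python and is not the API of the project mentioned above; `log_health_report` and the logger name `health` are assumptions.

```python
import logging
from typing import Dict

logger = logging.getLogger("health")


def log_health_report(results: Dict[str, str]) -> int:
    """Log one entry per non-healthy dependency; return how many entries were logged."""
    logged = 0
    for name, status in results.items():
        if status == "Healthy":
            continue  # healthy results are not logged
        level = logging.ERROR if status == "Unhealthy" else logging.WARNING
        # Include the dependency name so the log shows what is failing.
        logger.log(level, "health check %s reported %s", name, status)
        logged += 1
    return logged
```

Calling it with `{"database": "Healthy", "queue": "Degraded", "storage": "Unhealthy"}` writes a warning for the queue and an error for the storage account, and nothing for the database.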

Using probes

It's usually a good idea to probe your application and see if there is anything in the network preventing traffic from reaching your (public) end-points. If you only have an API, use a liveness or ready check as your monitoring URL.

I strongly advise not to point probes at all services in a distributed services landscape. Instead, have each service log failing dependencies and build dashboards on the logs. This will give much more accurate insights into actual problems.

Use active probing only for the ready check of services in the critical path, and keep the frequency low (minutes between calls rather than seconds).

In a distributed landscape, lots of stuff can and will fail, all the time. Actively probing all your services exposes all these little glitches. This will result in lots of false positives because a failure on synthetic traffic is not a proper signal to alert on.

Developer Notes