Observability and the One Legged Stool in Clojure
Imagine that you've built a small web application. It's generally well designed, and it seems to be working well. You have great test coverage. Are you ready to start collecting users?
The big question here is: what will happen when something goes wrong? Will you be able to take whatever crumbs of information some customer gives you (or ideally, doesn't need to give you) and reconstruct the story of what happened and why in order to fix it? What happens when the unexpected happens?
In some cases, maybe an exception is thrown, and the stacktrace might be enough to reconstruct what happened. But sometimes it's not.
Logs might not be enough either. Maybe the thing that's going wrong isn't "a user is running into an error", but instead "half the users hitting /some/endpoint are facing very slow response times."
The answer here is telemetry. The word "telemetry" is derived from Greek: "tele" means "remote," while "metron" means "measure." So telemetry is the ability to remotely monitor what's going on with your system. In other words: debugging in production!
The Pillars of Telemetry
Traditionally, there are three pillars of telemetry:
- Logs - the oldest form of telemetry, originally designed to be text messages read by a human. Today we have structured logging, which makes logs a bit more machine friendly.
- Metrics - numeric data about the state of the system at some point in time, like "how many requests are we receiving?" or "how long is the typical request taking?" or "how long is the 99th percentile SQL query?"
- Traces - the newest, fanciest pillar on the block, tracing is all about tracking a single request through your entire system. It's especially useful for distributed systems.
Note that there can be plenty of overlap between the different pillars of telemetry. In particular, the difference between a log, a metric, and a trace is less about the data itself, and more about the interpretation of the data.
You can imagine a log as the base case: it's an event that occurred in the system. With structured logging, you can include all sorts of additional information - for example, you might include the user who made a request, or a request id.
You could log a message like database query finished with an attached context, containing information like duration_ms, request_id, or even containing_span_id.
So you might end up with logs like this:
http request duration_ms=1000 request_id=1234 containing_span_id=null span_id=foobar
database query finished duration_ms=900 request_id=1234 containing_span_id=foobar span_id=binbaz
Now, in one sense, these are just log messages - your telemetry stool is missing two of its legs (er, pillars).
But from these logs, you can quite easily:
- construct metrics, e.g. what was the average duration of an http request?
- construct traces, e.g. why did the http request with request_id=1234 take 1000ms to complete? Because the database query it made took 900ms.
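To make that concrete, here's a minimal sketch (the data and helper names are hypothetical, not from any particular library) of reading the same two structured log events as a metric and as a trace:

;; The two log events above, as plain Clojure maps.
(def events
  [{:event "http request"            :duration_ms 1000 :request_id 1234
    :containing_span_id nil          :span_id "foobar"}
   {:event "database query finished" :duration_ms 900  :request_id 1234
    :containing_span_id "foobar"     :span_id "binbaz"}])

;; Read as a metric: average duration of http requests.
(defn avg-http-duration [events]
  (let [durations (->> events
                       (filter #(= "http request" (:event %)))
                       (map :duration_ms))]
    (/ (reduce + durations) (count durations))))

;; Read as a trace: all the spans belonging to one request.
(defn request-spans [events request-id]
  (filter #(= request-id (:request_id %)) events))

(avg-http-duration events)   ;=> 1000
(request-spans events 1234)  ;=> both spans for request 1234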
Now, don't get me wrong - if you try to just use logs for telemetry, you will have a bad time. Why? Because logs are not designed for this. Yes, a service like Sumologic will allow you to create a dashboard based on your logs, graphing out your average request time. But when a crisis happens and you need to rapidly create different charts to rule out various hypotheses about what's going wrong, a system like this is usually painfully slow.
And traces are similar. Yes, you could have a root_span_id on your logs, search for that log, and then walk through the logs looking for a slow span. But a tool designed for ingesting traces will help you answer questions much more quickly.
muLog
Over time, I've developed a policy of, mostly, choosing boring technology.
I'm less interested in the new and shiny than I used to be, and more interested in the bog-standard, battle-tested library that has seen production use for the past 900 years (computer years; that's about 9 human years).
The boring technology for logging, in Clojure, would probably be slf4j plus logback. slf4j is a logging facade: a single interface that can be plugged into whatever logging framework you want to use (like logback, or java.util.logging). The logging framework is in charge of actually sending those logs wherever you configure it to send them.
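As a sketch of that boring stack, assuming clojure.tools.logging is a dependency alongside org.slf4j/slf4j-api and ch.qos.logback/logback-classic (clojure.tools.logging picks up slf4j automatically when it's on the classpath, and logback reads its appenders, levels, and output from logback.xml):

;; Log through the facade; logback decides where the message actually goes.
(require '[clojure.tools.logging :as logging])

(logging/info "database query finished" {:duration_ms 900 :request_id 1234})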
muLog is less boring, but to be fair - it's not really just a logging library.
Essentially, it takes our insight above and runs with it.
For good telemetry, you don't need logs, metrics, or traces - not exactly.
What you need is events, and a system for translating those events into logs, metrics, and traces.
The difference between a log and a metric isn't about the underlying event - it's about how you process the underlying event data, and where you send the processed result.
After adding it to your dependencies, you use it like this:
(require '[com.brunobonacci.mulog :as mu :refer [log]])

(log ::an-event-name :key :val ... )

(mu/trace ::a-trace-name
  {}
  (do-a-thing))
By itself, this does nothing. You need to call mu/start-publisher! with a config map to start a publisher (there's a sketch of this after the list below). Which publisher? There are a lot - my particular favorites are:
- the console (for logging in development)
- elasticsearch (for logging in production)
- prometheus (for metrics)
- zipkin (for tracing) [1]
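Starting one looks like this sketch (the console publisher ships with the core library; the elasticsearch, prometheus, and zipkin publishers live in separate mulog add-on modules and take publisher-specific options documented there):

;; Start the console publisher (good for development). Other publishers use
;; the same call with a different :type (e.g. :elasticsearch, :prometheus,
;; :zipkin) plus their own options.
(def stop-publisher
  (mu/start-publisher! {:type :console}))

;; start-publisher! returns a no-argument function that stops the publisher.
(comment
  (stop-publisher))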
The Tradeoff
Of course, there is a significant tradeoff to muLog: namely, logs written by existing libraries won't go through muLog. This is mainly a problem with dependencies. For example, hikari-cp, a connection pool for JDBC connections, logs information about its pool.
One option is to just configure slf4j and logback and plug those logs into elasticsearch so they can be searched as well.
It's also possible to configure slf4j to use muLog as a backend, so your existing logs become muLog "events." In general, though, many libraries write logs for humans, not for machines (e.g. a message like "Started ServerConnector@611ce57a{HTTP/1.1, (http/1.1)}{0.0.0.0:1337}" is useful, but not machine readable), so I generally prefer to just send the logs to elasticsearch.
Where to add telemetry
You should have a "trace" event wrapping:
- an entire request/response cycle
- an entire job
- almost any interaction with external systems, including caches, databases, or other services (your own, or someone else's)
- anything else that seems unusually likely to fail or take a long time!
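For example, a hedged sketch of wrapping a database call (fetch-user and db here are hypothetical placeholders, not part of any library):

;; Placeholder standing in for a real JDBC query.
(defn fetch-user [db user-id]
  {:id user-id :name "example"})

;; The trace reports the query's duration and whether it succeeded or threw,
;; with :user-id attached as an extra pair.
(defn find-user [db user-id]
  (mu/trace ::db-fetch-user
    [:user-id user-id]
    (fetch-user db user-id)))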
The mu/trace macro is very nice. It:
- automatically reports whether the operation completed successfully, or had an error. If an error occurred, it includes it in the trace.
- allows you to capture details from the result of the operation, and include them in the trace. For example, you can extract the status returned by an HTTP handler (sketched below).
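A hedged sketch of that second point, assuming a Ring-style handler whose response map has a :status key (handle-request is a placeholder):

;; Placeholder standing in for the real handler.
(defn handle-request [request]
  {:status 200 :body "ok"})

;; :capture receives the body's return value and yields extra pairs to attach
;; to the trace - here, the HTTP status of the response.
(defn traced-handler [request]
  (mu/trace ::http-request
    {:pairs   [:uri (:uri request) :method (:request-method request)]
     :capture (fn [response] {:http-status (:status response)})}
    (handle-request request)))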
In general, you'll also want to wrap entire handlers and jobs with a call to mu/with-context. This sets a thread-local context that will be included in every event emitted within its scope. For example, for a request handler, you'd probably want a context that includes the request method and route that you're responding to. This way, individual logs deep within the handler will also include this context.
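For a Ring handler, that might look like this sketch (wrap-mulog-context is my name for a hypothetical middleware, not something the library ships):

;; Everything logged or traced while handling the request - however deep in
;; the call stack - will carry the method and URI in its context.
(defn wrap-mulog-context [handler]
  (fn [request]
    (mu/with-context {:http-method (:request-method request)
                      :uri         (:uri request)}
      (handler request))))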
[1] I haven't actually used Zipkin itself. Instead, I'll use the OpenTelemetry Collector, which you can very easily configure to ingest Zipkin-formatted data and export OTLP-formatted data to Honeycomb. I have a very strong love for Honeycomb - it's kind of a magical product. The best way to describe it is that it allows you to use traces as if they're metrics. So you can view a heatmap of how long your requests take, and graph the average or the median or the P999, and then drill down from there into individual requests, individual database queries, etc.