Grafana - correlating metrics with logs

rosanna · December 30, 2021, 7:33pm

Hi, I was wondering which features Grafana has to correlate metrics with logs. Suppose you have Loki and Prometheus configured as data sources in Grafana. Some incident occurs, and you are using Grafana to investigate the root cause. What I’ve done so far in the Explore view is start with metrics in panel #1, then open a second panel (Split view) and use Grafana’s “sync all views” feature to at least synchronize the time range. Does Grafana have any other features to correlate metrics and logs? I suppose the basic challenge is that the labels are dissimilar. The namespace, container and pod labels seem to exist in both metrics and logs data, but for everything else the label keys differ, unless I manually tweak promtail (and the ServiceMonitors) to produce equal labels.

danielF · December 30, 2021, 8:34pm

You’d need to enable exemplar collection.
https://grafana.com/docs/grafana/latest/basics/exemplars/

rosanna · December 30, 2021, 8:48pm

Thanks for pointing out this concept (I didn’t know about it). Right now, my observability stack does not include distributed tracing. It seems that exemplars are a way to jump between metrics and traces, though. The figure at the bottom of the blog post https://grafana.com/blog/2021/03/31/intro-to-exemplars-which-enable-grafana-tempos-distributed-tracing-at-massive-scale/ mentions that jumping between logs and traces is possible since Loki 1/2, but I don’t really see how to do that jump. Am I missing something?

rosanna · December 30, 2021, 10:47pm

I suppose this video (https://www.youtube.com/watch?v=qVITI34ZFuk) sums up the current capabilities regarding metrics <-> logs. The basic idea is to start in a metrics dashboard, then press “x” to switch to the Explore view, Split the view (and sync the time range), and when switching the data source from Prometheus to Loki, Grafana will make sure that compatible parts of the PromQL query “survive” when converting to a LogQL query.
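To illustrate the idea of query parts “surviving” the switch, here is a minimal Python sketch. It is not Grafana’s actual conversion logic; the function name `promql_to_logql_selector` and the `shared_labels` set are hypothetical, and the sketch simply assumes that only label matchers whose keys exist in both data sources are carried over into the LogQL stream selector.

```python
import re

# Matches one label matcher such as namespace="prod" or pod=~"api-.*".
# The two-character operators are listed first so that "=" does not win early.
MATCHER_RE = re.compile(r'(\w+)\s*(=~|!~|!=|=)\s*"((?:[^"\\]|\\.)*)"')


def promql_to_logql_selector(promql, shared_labels):
    """Hypothetical helper: keep only matchers whose keys are in shared_labels."""
    braces = re.search(r'\{([^}]*)\}', promql)
    if braces is None:
        return "{}"  # no label matchers in the query, nothing to carry over
    kept = [
        f'{key}{op}"{value}"'
        for key, op, value in MATCHER_RE.findall(braces.group(1))
        if key in shared_labels
    ]
    return "{" + ", ".join(kept) + "}"


# Assumption: namespace, pod and container exist in both Prometheus and Loki.
shared = {"namespace", "pod", "container"}
query = 'rate(http_requests_total{namespace="prod", pod=~"api-.*", job="api"}[5m])'
print(promql_to_logql_selector(query, shared))
# prints {namespace="prod", pod=~"api-.*"}
```

In this sketch the `job` matcher is dropped because it is not in the assumed shared set, which mirrors the label mismatch problem from the first post.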

danielF · December 30, 2021, 11:45pm

Right, since metrics are an aggregation over a time slice, the only real correlation to your logs is lining up time spans. Even then you’ve got to be careful, because depending on the metric, the values may not correspond directly to anything in your logs. For example, if you’re looking at a duration histogram, you may spin your wheels looking for requests that match the p99 value, but since that value is calculated from the bucket boundaries, it may not align with what you see recorded in your nginx logs.
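A small Python sketch of why the p99 can be a value that no request ever had. It imitates the linear interpolation that Prometheus-style histogram quantiles use, in simplified form (no `+Inf` bucket, lowest bucket assumed to start at 0). The bucket bounds and durations are made-up illustration data, and `histogram_quantile` here is a local function, not the Prometheus implementation.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Simplified sketch: assumes 0 < q <= 1, buckets sorted by upper bound,
    and a lowest bucket that starts at 0.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]


# Made-up request durations in seconds, as they might appear in access logs.
durations = [0.12] * 98 + [0.9] * 2
bounds = [0.1, 0.5, 1.0]  # hypothetical bucket upper bounds
buckets = [(b, sum(d <= b for d in durations)) for b in bounds]

estimate = histogram_quantile(0.99, buckets)
print(estimate)               # 0.75
print(estimate in durations)  # False
```

The estimate falls halfway through the 0.5 to 1.0 bucket, although the logged requests took either 0.12 s or 0.9 s, so searching the logs for a 0.75 s request would find nothing.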