We have a Postgres table that stores all of our server’s HTTP requests/responses with external services, allowing us to review the request/response body in case of a reported issue. We’re also starting to build Grafana panels off of this table, as it is essentially a time series. Question: I’m trying to get a sense for how people typically model something like this in more typical time-series databases. I think that storing the entire request/response body in a TS database event would be overkill, and that you’d typically want to keep the data down to just simple numeric/string values, but preserving the request/response for potential later review is an important use case. So if I wanted to use a more robust TS database to build graphs around, how would I model this? Thinking maybe: continue to use the Postgres table for storing req/resp (and potentially prune old rows since they’re pretty large), and in addition write simpler statistical HttpInteraction events with duration/status_code/error to a more seasoned TS database that can keep the historical data around much longer for graphing purposes?
What you’re describing sounds a lot like a log indexing problem. You can take a look at how Loki works; it covers pretty much the use case you have.
Sorry, I missed that you’re talking about request/response bodies. I would probably still index the logs and include in each log line a reference to the ID of the req/res bodies. The bodies can then be stored under that ID either in a DB or in object storage (S3 / GCS).
That makes sense to me. In my case, there wouldn’t be a log line, but rather a UUID of the log, that I could stash as part of the TS event.
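A minimal sketch of that split, assuming Python on the server side. The class names, field names, and the rule that a status of 500 or above counts as an error are all made up for illustration; nothing here is tied to a particular TS database.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class HttpInteractionEvent:
    """Slim record intended for the time-series store (hypothetical shape)."""
    interaction_id: str   # UUID that points at the full bodies
    timestamp: datetime
    duration_ms: float
    status_code: int
    error: bool


@dataclass(frozen=True)
class HttpInteractionBody:
    """Large record that stays in Postgres or object storage (hypothetical shape)."""
    interaction_id: str
    request_body: str
    response_body: str


def split_interaction(request_body, response_body, duration_ms, status_code):
    """Return an (event, body) pair that shares one freshly generated UUID."""
    interaction_id = str(uuid.uuid4())
    event = HttpInteractionEvent(
        interaction_id=interaction_id,
        timestamp=datetime.now(timezone.utc),
        duration_ms=duration_ms,
        status_code=status_code,
        error=status_code >= 500,  # assumption: only 5xx counts as an error
    )
    body = HttpInteractionBody(interaction_id, request_body, response_body)
    return event, body
```

The event stays small enough to keep for a long time, while the body rows can be pruned on their own schedule. The UUID is the only thing the two records need to share.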
Followup question: Grafana graphs are typically aggregated/averaged values over time; if I see a problematic region in the graph, how do I drill down and ultimately find the problematic UUIDs so that I can look up the req/resp bodies in our PG database?
(I think this is a very general question that could apply to lots of Grafana use cases/scenarios: 1. how do you de-aggregate the data and get specific events, and 2. how would you add links that take you to another (non-Grafana) site to inspect the contents of the req/resps?)
So in terms of problematic exemplars there are a couple of things to be said:
- Metrics (as in some series of (ts, value) pairs) are not de-aggregatable. This is a design limitation, since storing every individual event behind each value would be prohibitively expensive.
- The answer is usually sampling: Prometheus has exemplars, which have some support in Grafana 7.4.
- In terms of links, Grafana has support for data links (https://grafana.com/docs/grafana/latest/linking/data-links/).
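To make the drill-down concrete, here is a rough sketch in Python. It assumes the slim per-request events are still available as individual rows somewhere (for example the Postgres table), since aggregated metrics cannot be expanded back into events. The dict keys, the `slow_ms` threshold, and the viewer URL are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical viewer URL; replace with wherever the bodies can be inspected.
VIEWER_URL = "https://internal.example.com/http-interactions/{interaction_id}"


def problematic_links(events, start, end, slow_ms=1000.0):
    """Return viewer links for errored or slow events with start <= timestamp < end."""
    links = []
    for event in events:
        in_window = start <= event["timestamp"] < end
        is_bad = event["error"] or event["duration_ms"] >= slow_ms
        if in_window and is_bad:
            links.append(VIEWER_URL.format(interaction_id=event["interaction_id"]))
    return links
```

The time window would come from the problematic region on the graph. A Grafana data link plays the same role as `VIEWER_URL` here: it substitutes a field value from the selected row into a URL template, with the exact variable syntax described in the data links documentation.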
Here’s a blog post that might be relevant - https://grafana.com/blog/2020/11/09/trace-discovery-in-grafana-tempo-using-prometheus-exemplars-loki-2.0-queries-and-more/.
I’m reading that blog and the glossary and I still don’t really understand what an exemplar is.
An exemplar is essentially a sampled measurement in a time series that carries extra labels. A common use case is to use those labels to hold IDs of traces or logs, so that you can cross-reference a metric measurement against a trace or a log line. This video expands on the idea: https://www.youtube.com/watch?v=TzNZIEvhAdA
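If it helps, here is a toy model of the idea in Python. It is not how Prometheus implements exemplars; the class, its per-bucket (non-cumulative) counts, and the `sample_rate` parameter are invented purely to show that the aggregate keeps only counts plus one labelled sample per bucket.

```python
import random


class HistogramWithExemplars:
    """Toy bucketed histogram that keeps at most one exemplar per bucket."""

    def __init__(self, upper_bounds):
        # A final +inf bucket catches everything above the largest bound.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.counts = [0] * len(self.upper_bounds)
        self.exemplars = [None] * len(self.upper_bounds)

    def observe(self, value, labels, sample_rate=1.0):
        """Count the value; with probability sample_rate, keep it as the bucket's exemplar."""
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.counts[i] += 1
                if random.random() < sample_rate:
                    # Overwrites the previous exemplar for this bucket.
                    self.exemplars[i] = {"value": value, "labels": dict(labels)}
                break
```

With labels such as `{"interaction_id": "<uuid>"}`, a slow bucket ends up pointing at one concrete request whose bodies can be looked up, even though every other individual measurement in that bucket is gone.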