Hello colleagues, I’m trying to create a Loki alert in Grafana 9.3.1. The alert should fire when a certain substring is found in the roadrunner container log, e.g. {container="roadrunner", namespace="dev"} |= 502. I’ve tried different reduce and threshold options, but I keep getting the error “input data must be a wide series but got type long (input refid)”. Is there a good document on how to configure alerts triggered by the presence or count of a substring or a regular expression in the Loki logs?
Step 4’s 2nd bullet is the root of your issue:
> • Enter a PromQL or LogQL expression to query. The rule fires if the evaluation result has at least one series with a value that is greater than 0. An alert is created for each series.
I imagine your query is something along the lines of:
{job="myjob"} |= "mysubstring"
That is not valid as an alert query, because it returns log lines rather than a numeric series, but you could do something like:
count_over_time({job=~"myjob"} |= `mysubstring` [15m]) > 0
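To make the idea behind that query concrete, here is a minimal Python sketch of what counting matching lines over a window and comparing against 0 amounts to. It is only an illustration: the function name mirrors the LogQL one, but the sample log lines, the timestamps, and the exact boundary handling of the window are assumptions, and real Loki returns one value per log stream rather than a single number.

```python
from datetime import datetime, timedelta


def count_over_time(entries, substring, now, window):
    """Count log lines containing `substring` whose timestamp is in (now - window, now].

    Hypothetical stand-in for the LogQL function of the same name; the
    half-open window boundary is an assumption made for this sketch.
    """
    start = now - window
    return sum(1 for ts, line in entries if start < ts <= now and substring in line)


# Made-up sample data: (timestamp, log line) pairs.
now = datetime(2023, 1, 1, 12, 0, 0)
entries = [
    (now - timedelta(minutes=20), "GET /api 502 Bad Gateway"),  # older than 15m
    (now - timedelta(minutes=10), "GET /api 200 OK"),           # no match
    (now - timedelta(minutes=3), "GET /api 502 Bad Gateway"),   # counted
]

count = count_over_time(entries, "502", now, timedelta(minutes=15))
fires = count > 0  # same comparison as the "> 0" at the end of the query
print(count, fires)
```

The point is that the alert condition needs a number per evaluation, and a bare log filter only yields lines.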
I think I have read this document, but somehow I thought that Grafana would count the series automatically. There is no mention of count_over_time() or any other operation in the docs. Is there really no other way with logs?
BTW, what interval should I set for the count_over_time() operation?
Yeah I found this quite confusing and insufficiently documented as well. Google results are littered with people running into this and being confused.
My impression is that you do not have to use count_over_time, but you do have to use something that results in a wide series rather than a long one. What all of the options are is unclear to me.
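For readers unfamiliar with the terms, the following Python sketch shows the generic difference between a long and a wide table shape. It is not Grafana's internal frame implementation; the series names and values are invented, and the `to_wide` helper is hypothetical.

```python
# Long format: one row per (time, series name, value).
long_rows = [
    ("12:00", "pod-a", 3),
    ("12:00", "pod-b", 0),
    ("12:01", "pod-a", 1),
    ("12:01", "pod-b", 2),
]


def to_wide(rows):
    """Pivot long rows into a wide table: one row per time, one column per series."""
    series_names = sorted({name for _, name, _ in rows})
    by_time = {}
    for ts, name, value in rows:
        by_time.setdefault(ts, {})[name] = value
    header = ["time"] + series_names
    table = [[ts] + [by_time[ts].get(n) for n in series_names] for ts in sorted(by_time)]
    return header, table


header, table = to_wide(long_rows)
print(header)
for row in table:
    print(row)
```

In the wide shape each series has its own numeric column, which is what a reduce or threshold step can operate on. Judging by the error message quoted above, raw log results are classified as long, whereas a metric query such as count_over_time yields numeric series.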
The interval is how big a window it will use when evaluating the query and is dependent on your use case.
> The interval is how big a window it will use when evaluating the query and is dependent on your use case
The issue is that there are several intervals and they should be consistent (see the screenshot). Is it correct to use the $__interval macro, or should it be the $__range macro?
Your UI looks slightly different from mine, but my understanding is as follows (given the data in your screenshot):
• The rule will evaluate once every 1m.
• The rule must be true for 5m for the alert to be fired.
• Every time the rule is evaluated, your query is performed. The time range the query looks at may be the same size as either or both of the prior bullet points, but it is not required to be. For example, your query could look at the last 1 hour of logs, or the last 15 seconds. In the latter case (15s), for the alert to fire you would need an error log to occur every 15 seconds for 5 minutes (or rather, within 15 seconds of every rule evaluation, which occurs once a minute, I suppose). In the former case (1h), a single error log would be enough: after 5 minutes the alert would fire, because the log line still falls within the 1h window of the query.
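The interplay of the three durations above can be sketched as a small simulation. This is a simplified model based only on the description in this thread, not Grafana's actual scheduler: the function name, the minute-based time units, and the assumption that the alert fires once the condition has held for the full "for" duration are all illustrative.

```python
def first_fire_minute(error_minutes, query_range_min,
                      eval_every_min=1, for_min=5, horizon_min=120):
    """Return the minute at which the alert first fires, or None if it never does.

    error_minutes:   times (in minutes) at which a matching log line occurred
    query_range_min: how far back each evaluation of the query looks
    """
    pending_since = None
    for t in range(0, horizon_min + 1, eval_every_min):
        # Condition: at least one matching line inside (t - range, t].
        matching = any(t - query_range_min < e <= t for e in error_minutes)
        if not matching:
            pending_since = None  # condition broke, so the pending timer resets
            continue
        if pending_since is None:
            pending_since = t
        if t - pending_since >= for_min:
            return t
    return None


# One error at minute 10, query looks back 1 hour: fires after 5 minutes.
print(first_fire_minute([10], query_range_min=60))
# Same single error, query looks back only 15 seconds: never stays true.
print(first_fire_minute([10], query_range_min=0.25))
# An error at every evaluation from minute 10 to 15, 15-second range: fires.
print(first_fire_minute([10, 11, 12, 13, 14, 15], query_range_min=0.25))
```

Under this model a short query range requires a fresh match at every evaluation, while a long range lets a single match keep the condition true for the whole pending period.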
Honestly, in that UI context I’m not sure what $__interval would evaluate to (my best guess is the last 15 minutes, reflecting the top red box). I enter my queries as raw text in a UI that looks different, so I can’t say for sure. In my case I don’t use that variable in this context, but provide a hard-coded range depending on the use case (15m, etc.).
But if the count_over_time() range is larger than the top red box, the count_over_time() function will never see a full 1 hour range, and thus will never be over 0?
The video attached to the docs is completely useless too. It just shows some very trivial things which are quite clear without any video, and does not cover the really difficult topics.