
Prometheus Backfilling: Recording Rules and Alerts 

For many Prometheus users who rely on recording rules and alerts, a known limitation is that both are only evaluated on the fly at runtime. This limitation has two downsides. First, any new recording rule will not be applied to your historical data. Second, and even more troubling, you cannot test your rules and alerts against your historical data.

There is active work inside Prometheus to change this, but it’s not there yet. In the short term, to meet this requirement we created a simple utility to produce OpenMetrics data to fill in the gaps. I will cover the following topics in this blog post:

  • Generating OpenMetrics from Prometheus
  • Backfilling alerts and recording rules

Introduction

While Prometheus can load existing data, it does not backfill recording rules and alerts. Starting with Prometheus release 2.25, it is possible to backfill Prometheus using OpenMetrics files.

To better understand the problem, consider the following case: one of our customers wanted tighter, more granular alerts on their p99 latencies.

Before applying those new rules, they wanted to understand what impact those alerts would have. In other words, if those rules had been in place, would they have fired too often, or not often enough? To make things even more interesting, the recording rules are also used in dashboards.
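As an illustration, a rule file for this kind of scenario might contain a recording rule and an alert like the following. The metric name, rule names, and threshold here are hypothetical examples, not the customer's actual rules:

```yaml
groups:
  - name: latency
    rules:
      # Hypothetical recording rule: p99 request latency over a 5m window
      - record: job:request_latency_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(request_latency_seconds_bucket[5m])) by (le, job))
      # Hypothetical alert on the recorded series
      - alert: HighP99Latency
        expr: job:request_latency_seconds:p99 > 0.5
        for: 10m
        labels:
          severity: warning
```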

Since the customer was running a production system, there was no problem obtaining a week's worth of Prometheus data. All the experiments were then done on a separate Prometheus server with a copy of that data, not on the production one.

I used promutil.py, a small Python utility script that you can download from our repository here.

Setup

While I use the Scylla Monitoring Stack for the testing because it's easier for me, you don't need to; in general, I suggest using the Prometheus Docker container.

  1. Create a data directory — we’ll assume it’s called data.
  2. Place your Prometheus data in that directory.
  3. Download promutil.py.
  4. Save your Prometheus rule file as prometheus.rules.yml.
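The steps above can be sketched as shell commands. The source path in step 2 is only an example, and the download URL is omitted since it lives in the repository linked above:

```shell
# 1. Create the data directory
mkdir -p data
# 2. Copy your Prometheus data into it (source path is an example)
#    cp -r /var/lib/prometheus/data/. ./data/
# 3. Download promutil.py from the repository into the current directory
# 4. Save your rule file next to it as prometheus.rules.yml
ls -d data
```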

Run Prometheus with Docker

docker run -d  -v "$PWD/data:/prometheus/data" -p 9090:9090 --name prom prom/prometheus:v2.25.0

This will run Prometheus using the data in its data directory; you can connect to the server over HTTP on port 9090 (e.g., http://{ip}:9090, where “ip” is your server’s IP address).

You can check that your data is there by looking at a known metric. Typically you will need to look back a day or two, depending on how old your data is.

Run promutil.py

The promutil.py utility can generate an OpenMetrics output file from Prometheus. You can run ./promutil.py help to see the different options.

We will use a range query. You can supply a specific query as a parameter, but instead we will use the prometheus.rules.yml file, which adds each of the metric rules in that file to the output file.

A range query needs a start and an end; promutil.py accepts any two of start, end, and duration.

Start and end can be either absolute (example: 2021-01-29T23:58:55.980Z) or relative (examples: 8s, 10h, 3d). For example, this is how we generate three days of metrics ending two days ago:
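To make the relative forms concrete, here is how the 3d duration and 2d end resolve to absolute timestamps, using GNU date purely as an illustration (promutil.py does its own parsing; this only shows the arithmetic):

```shell
# --end 2d means "2 days before now"; with a 3d duration,
# the start is 5 days before now (2d end + 3d window).
end=$(date -u -d '2 days ago' +%Y-%m-%dT%H:%M:%SZ)
start=$(date -u -d '5 days ago' +%Y-%m-%dT%H:%M:%SZ)
echo "range: $start .. $end"
```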

./promutil.py --rules prometheus.rules.yml rquery -d 3d --end 2d --out-file /data/metrics.txt

Note that we placed the output file inside the data directory. This is important because we can access it from inside the Prometheus Docker container.

Generate the Prometheus Blocks

Run promtool inside the container:

docker exec -it prom promtool tsdb create-blocks-from openmetrics /prometheus/data/metrics.txt /prometheus/data/

Restart the Prometheus server and that’s it — your rules are there!

Alerts

For testing, promutil.py can also be used to handle alerts. While it will not generate alerts for historical data, it will create a metric for each alert, named alert:{alert_name}.

Labels from the alert will be added to the generated metric. You can now look at the graph of that metric and see the points that match the alert criteria.
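For example, an alert named HighP99Latency with a severity label might show up in the generated OpenMetrics file as samples like these (the names, label values, and timestamps are illustrative; the exact series promutil.py emits may differ):

```
alert:HighP99Latency{severity="warning",instance="node1"} 1 1612137600
alert:HighP99Latency{severity="warning",instance="node1"} 1 1612137615
# EOF
```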

Bonus – Ad Hoc Alerts

Unrelated to backfilling: if you regularly keep track of systems, you know the feeling of looking at a system and trying to figure out what’s going on. You can use promutil.py for such cases, too. Keep a rule file with as many alerts as you want. Running promutil with this file will return only the metrics that match those alerts. In other words, it’s like asking, “have any of these conditions been met in the last few days (or hours, etc.)?”
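An ad hoc rule file of this kind might look like the following sketch; the alert names and expressions are just examples of throwaway conditions you might keep around:

```yaml
groups:
  - name: adhoc
    rules:
      # Throwaway conditions to check against recent history
      - alert: DiskAlmostFull
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
      - alert: InstanceDown
        expr: up == 0
```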

Take Away

Prometheus finally supports backfilling metrics through the OpenMetrics format. You can use our promutil.py to generate historical values for recording rules and alerts from a running Prometheus server. You can also use promutil.py for ad hoc alerts, making it a handy new tool when diagnosing a system.

DISCOVER MORE IN OUR SCYLLA MONITORING HANDS-ON LAB


About Amnon Heiman

Amnon has 15 years of experience in software development of large-scale systems. Previously he worked at Convergin, which was acquired by Oracle. Amnon holds a BA and MSc in Computer Science from the Technion - Israel Institute of Technology and an MBA from Tel Aviv University.