Scylla Monitoring Stack enables teams to gain better visibility into their high-performance Scylla cluster by providing real-time and historical trend information.
Scylla Monitoring Stack is a bundle of three components (a Prometheus metric collector, alert manager, and Grafana 6 dashboards) that can be deployed as containers or directly onto a host. It collects aggregated metrics and events through Scylla Manager. The stack empowers DevOps, Infrastructure Operations, and Database Administrators to quickly find and fix issues impacting the performance of their Scylla cluster. Teams can drill down from high-level dashboards to detailed metrics to determine next steps.
Scylla Monitoring Stack includes a set of pre-built dashboards to monitor your Scylla cluster in real time. Hundreds of different metrics populate dashboard components for your team to review historical trends and identify anomalous behavior in your cluster.
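Because the metrics live in Prometheus, the same historical data that feeds the dashboards can also be pulled directly for ad hoc trend analysis. The sketch below only builds a URL for Prometheus's range-query HTTP endpoint and does not send it. The host and port, the metric name `scylla_reactor_utilization`, and the helper `build_range_query_url` are assumptions made for illustration, so substitute values that match your deployment.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, start, end, step_seconds):
    """Return a Prometheus /api/v1/query_range URL for a PromQL expression.

    start and end are Unix timestamps in seconds; step_seconds is the
    resolution of the returned time series.
    """
    params = {
        "query": promql,
        "start": start,
        "end": end,
        "step": step_seconds,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Hypothetical server address and metric name; adjust for your cluster.
url = build_range_query_url(
    "http://localhost:9090",
    "avg(scylla_reactor_utilization)",
    1700000000,
    1700003600,
    60,
)
print(url)
```

Fetching the resulting URL with any HTTP client returns the series as JSON, one sample per step across the requested hour.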
Get complete visibility into your Scylla database cluster so you can quickly find and fix issues affecting performance.
The CQL dashboard helps teams identify query issues, poor data models, and unexpected driver behavior. Teams can quickly see, for example, whether their cluster is being hit by many heavy queries that perform full table scans using ALLOW FILTERING.
Quickly identify queue latency and performance for the commitlog, compaction, memtable, and more. The dashboard supports dynamic classes added to Enterprise releases.
Quickly identify nodes in your cluster and drill down to detailed OS-level metrics such as CPU utilization, I/O, and errors. Teams can quickly decide whether nodes need to be rebooted or whether a rolling upgrade is needed on nodes running old versions.
Set conditional alerts for your Scylla cluster within the alert manager so your team knows when incidents arise. Out-of-the-box alert triggers are included for conditions such as:
Database administrators can annotate the start and finish times of heavy tasks such as backup or repair. This helps cross-functional teams see at a glance why latency may be higher or throughput lower during particular periods.
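Annotations like these can also be created from a script, for example at the start and end of a scheduled repair job. The sketch below builds the kind of JSON body that Grafana's annotations HTTP API (`POST /api/annotations`) accepts for a time range. The helper name `build_annotation`, the tags, and the timestamps are hypothetical, and the field names should be verified against the API documentation for your Grafana version, since the stack's own annotation workflow may differ.

```python
import json
from datetime import datetime, timezone


def build_annotation(text, start, end, tags):
    """Build a region annotation body; times are epoch milliseconds."""
    if end < start:
        raise ValueError("end must not precede start")
    return {
        "time": int(start.timestamp() * 1000),
        "timeEnd": int(end.timestamp() * 1000),
        "tags": list(tags),
        "text": text,
    }


# Hypothetical maintenance window: a 90-minute nightly repair.
start = datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 15, 3, 30, tzinfo=timezone.utc)
payload = build_annotation("Nightly repair", start, end, ["repair", "maintenance"])
print(json.dumps(payload, indent=2))
```

Sending this body with an authenticated POST request would mark the window on the dashboards, so a latency bump during the repair is visibly tied to its cause.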