Apr 25

Announcing Scylla Manager 1.1, a Production-Ready Release of Scylla Manager



The Scylla team is pleased to announce the release of Scylla Manager 1.1, a production-ready release of Scylla Manager for Scylla Enterprise customers.

Scylla Manager adds centralized cluster administration and recurrent task automation to Scylla Enterprise. Scylla Manager 1.x automates periodic repair; future releases will add rolling upgrades, recurrent backup, and more. Over time, Scylla Manager will become the focal point of Scylla Enterprise cluster management, including a GUI frontend. Scylla Manager is available to all Scylla Enterprise customers. It can also be downloaded from scylladb.com for a 30-day trial.


New features in Scylla Manager 1.1

 

Scylla Manager Metrics, Dashboard, and Alerts

Scylla Manager now reports metrics over the Prometheus protocol. You can use Scylla Manager metrics directly or with the Scylla Manager dashboard from the Scylla Monitoring Stack.


With the latest addition of alerts to the Scylla Monitoring Stack, an alert is triggered when a repair fails or when Scylla Manager exits for any reason.

Full List of Scylla Manager metrics:

Parameter | Description
--- | ---
cluster | Unique cluster identifier (string)
host | Host IP address
shard | Shard (core) number
unit | Repair unit
quantile | Histogram quantile

 

Metric | Type | Description
--- | --- | ---
repair_duration_seconds | summary | Time the repair has been running, in seconds
repair_segments_success | gauge | Number of segments repaired successfully
repair_segments_error | gauge | Number of segments that failed to repair
repair_segments_total | gauge | Total number of segments to repair
ssh_open_streams_count | gauge | Number of active (multiplexed) connections to a Scylla node
log_error_total | counter | Total number of ERROR messages
log_info_total | counter | Total number of INFO messages

Where repair_segments_error + repair_segments_success = repair_segments_total.
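As a rough illustration of consuming these metrics directly, the following Python sketch parses a small sample in the Prometheus text format and sums the three repair_segments_* gauges. The sample text is made up for this example (cluster name, host address, unit name, and numbers are all assumptions), and a real Scylla Manager response may name or label the metrics differently.

```python
import re

# Illustrative sample only: label values and numbers are invented, and the
# exact metric names in a real response may differ (for example, a prefix).
SAMPLE = '''\
# TYPE repair_segments_total gauge
repair_segments_total{cluster="c1",host="192.168.100.11",shard="0",unit="u1"} 100
repair_segments_success{cluster="c1",host="192.168.100.11",shard="0",unit="u1"} 97
repair_segments_error{cluster="c1",host="192.168.100.11",shard="0",unit="u1"} 3
'''

# name, optional {labels}, value, optional integer timestamp
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, one per sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def segment_counts(samples):
    """Sum each repair_segments_* gauge across all label sets."""
    totals = {
        "repair_segments_success": 0.0,
        "repair_segments_error": 0.0,
        "repair_segments_total": 0.0,
    }
    for name, _labels, value in samples:
        if name in totals:
            totals[name] += value
    return totals


counts = segment_counts(parse_metrics(SAMPLE))
```

With this sample, the success and error counts add up to the total, which matches the relationship stated above. The parser is deliberately simplified and does not handle escaped quotes inside label values.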

Repair Retries

Starting with Scylla Manager 1.1, if a segment fails during a repair for any reason, Scylla Manager skips that segment and continues with the remaining segments. Once it has gone through all of them, it returns to the skipped segments and tries to repair them again. Each repair attempt is considered a retry. The number of retries is configurable in scylla-manager.yaml (see below).

You can follow the progress of retries using the sctool repair progress command or the Scylla Manager dashboard in Grafana.
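The skip-and-retry behavior described above can be sketched in Python as follows. This is not Scylla Manager's implementation: repair_segment is a hypothetical callback that returns True on success, and max_retries stands in for the configurable retry count. The sketch also assumes the first pass is not counted against the retry limit, which the release notes do not spell out.

```python
def repair_with_retries(segments, repair_segment, max_retries):
    """Repair every segment, deferring failed ones to later passes.

    repair_segment(segment) -> bool is a hypothetical callback.
    Returns (repaired, failed), where failed holds the segments that
    still could not be repaired after all retry passes.
    """
    repaired = []
    pending = list(segments)
    # One initial pass, plus up to max_retries passes over skipped segments.
    for _attempt in range(1 + max_retries):
        skipped = []
        for segment in pending:
            if repair_segment(segment):
                repaired.append(segment)
            else:
                skipped.append(segment)  # skip for now and keep going
        pending = skipped
        if not pending:
            break  # nothing left to retry
    return repaired, pending
```

A failing segment never blocks the segments after it; it is only revisited once the current pass has finished.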

Repair Configuration

The following new configuration parameters are now available:

sctool updates

  • New date format across the tool. The new format includes time zone (UTC) information and is easier to read, for example: 13 Apr 18 00:00 UTC
  • Start/End/Duration info in the repair progress command
  • Task list can be used without a cluster argument to see tasks on all the clusters
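For scripts that need to produce timestamps in the same shape as the new sctool date format, a minimal Python sketch is shown below. The format string is an assumption inferred from the single example above (13 Apr 18 00:00 UTC); it is not taken from the sctool source, and the month abbreviation depends on the active locale.

```python
from datetime import datetime, timezone

# Assumed equivalent of the sctool display format, inferred from the
# example "13 Apr 18 00:00 UTC"; not taken from sctool itself.
SCTOOL_STYLE = "%d %b %y %H:%M %Z"


def format_utc(ts):
    """Render a timezone-aware datetime in UTC using the sctool-like layout."""
    if ts.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return ts.astimezone(timezone.utc).strftime(SCTOOL_STYLE)


example = format_utc(datetime(2018, 4, 13, 0, 0, tzinfo=timezone.utc))
```

Converting to UTC before formatting keeps the output consistent regardless of the time zone the input was created in.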

REST API updates

  • New /ping service for testing manager availability
  • New /metrics service exposes Manager metrics (see also Scylla Manager Metrics above)
  • New /progress/{run_id} returns the status of a specific repair
  • REST API change:
    /cluster/{cluster_id}/repair/unit/{unit_id}/progress is now /cluster/{cluster_id}/repair/unit/{unit_id}/progress/{run_id}, making the repair run ID mandatory.
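Clients that built the old progress path need to add the run ID. A minimal Python sketch of a path builder for the new layout follows; the function name and the ID values are hypothetical, and the host, port, and any API prefix are left out because they are not given here.

```python
from urllib.parse import quote


def repair_progress_path(cluster_id, unit_id, run_id):
    """Build the 1.1-style repair progress path, which requires a run ID."""
    if not run_id:
        raise ValueError("run_id is mandatory in the new progress path")
    # Percent-encode each path segment so unexpected characters stay inside it.
    cluster, unit, run = (quote(str(part), safe="") for part in (cluster_id, unit_id, run_id))
    return f"/cluster/{cluster}/repair/unit/{unit}/progress/{run}"


# Hypothetical IDs, for illustration only.
path = repair_progress_path("c1", "u1", "r1")
```

Failing early on a missing run ID mirrors the API change: the shorter path without a run ID is no longer the documented form.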

Other improvements

  • A new script, scyllamgr_ssh_test, performs a quick test of SSH connectivity between Scylla Manager and Scylla nodes

Noteworthy bug fixes in Scylla Manager 1.1

  • All Time types are now represented as UTC (Coordinated Universal Time)
  • sctool: The task list command now displays duration instead of end time. In addition, a task list command run without a cluster name argument reports tasks from all clusters.
  • Scylla Manager log level is now configurable in the manager.yaml file
  • Scylla REST API was reporting SSH errors with confusing error messages; this has been fixed.
  • dist: A Restart directive was added to the scylla-mgmt service

About Tzach Livyatan

Tzach Livyatan has a B.A. and MSc in Computer Science (Technion, Summa Cum Laude), and has had a 15-year career in development, system engineering, and product management. In the past he worked in the Telecom domain, focusing on carrier-grade systems, signalling, policy, and charging applications.


Tags: cluster management, manager, release, task automation