Migrating to ScyllaDB Cloud

In the past we wrote about getting on board with ScyllaDB Cloud. Great! Now you have a cluster up and running, and you've connected your app through VPC peering. Then what? Unless you are creating a greenfield application, you probably need to migrate your existing data into ScyllaDB Cloud (NoSQL DBaaS).

We’ve written a pretty comprehensive blog about migrating to ScyllaDB in general, but not all of those migration strategies apply to a managed NoSQL DBaaS solution. So let’s drill down into specifics and step-by-step instructions targeted directly to your ScyllaDB Cloud success.

Also, we’re going to focus today’s migration considerations solely on SSTable-compatible databases, such as ScyllaDB Open Source NoSQL Database, Cassandra or DataStax Enterprise (DSE). Migrating from other databases requires remodeling your data, which we’ll cover in future articles.

Migrating Your Data

To perform a data migration in an orderly fashion, you must assess the requirements and considerations, choose the best tool for the job, and create a migration plan. This process comprises several stages, as described here:

  1. Assessment of:
    1. Cluster sizing (total dataset, data per node)
    2. Throughput and latency requirements
    3. Data migration considerations (source cluster schema, features in use, etc.)
  2. Taking a snapshot of the source cluster’s data and uploading it to an S3 bucket
  3. Downloading the data into a ScyllaDB mediator node and preparing the data for population
  4. Deploying the ScyllaDB Cloud cluster (including VPC peering setup, if needed)
  5. Populating ScyllaDB Cloud cluster with the source data (Data migration)
  6. Pointing your clients/drivers to the new database

Let’s dive into each stage and see what’s actually needed to make it happen.

Assessment Questions (Migration Considerations)

Generic Questions (covered in the sizing form)

  1. Where are you currently deployed? Are you running on the cloud now, in a managed environment (dockerized, DC/OS or Kubernetes), or on bare metal?
  2. Current cluster node type and count per DC vs. ScyllaDB Cloud cluster node type and count (instance type and size) per DC?
  3. Total dataset / How much data per node (inc. Replication Factor)?
  4. Current Throughput / Latency numbers? Expectations?

Cloud-Specific Questions

  1. What is your current Cassandra/DataStax Enterprise or ScyllaDB version? This will determine your SSTable file format (ka, la, mc).
  2. What is the Replication Factor (RF) per keyspace / datacenter? (ScyllaDB Cloud currently supports RF=3/4/5, should you need a different RF, please contact us)
  3. Do you require, or are you using any of the following ScyllaDB/Cassandra features?
    1. Does your source DB schema use Secondary Indexes (SI) or Materialized Views (MV)? (Reminder that ScyllaDB uses Global Secondary Indexes, while Cassandra’s are local.)
    2. Does your source DB use Hinted Handoff (HH)?
    3. Does your source DB schema use Counters created prior to Cassandra 2.1?
  4. Which Region(s) should the cluster be deployed in? How many (and which) AZs are used in the Source cluster?
  5. Multi region / multi DC cluster?
  6. Do you have large partitions (over 100MB)? We have tooling to identify them; running the following command will tell you: nodetool cfstats | grep -i "Compacted partition maximum"
  7. What partitioner is used in the source cluster? (check your scylla.yaml/cassandra.yaml file)
  8. Do you require VPC peering, or will direct access using allowed IPs (specific public IPs or ranges) suffice?
  9. Do you have any network routing constraints (NAT, etc.)?

Snapshot your source cluster data

Run the nodetool snapshot command on each of the source cluster nodes and upload all snapshots to an S3 bucket of your choice, with each node in a separate folder. If all keyspaces are replicated to all datacenters, a snapshot from all nodes in just one of the datacenters will suffice.
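For illustration, here’s a minimal shell sketch of this step, assuming the default ScyllaDB data directory (adjust for Cassandra) and the AWS CLI; the snapshot tag and bucket name are placeholders:

    # Run on each source node: snapshot all keyspaces under one tag
    nodetool snapshot -t migration_snapshot

    # Upload every snapshot directory to S3, one top-level folder per node
    NODE=$(hostname)
    for d in /var/lib/scylla/data/*/*/snapshots/migration_snapshot; do
        # Keep the keyspace/table portion of the path in the bucket layout
        ks_table=$(echo "$d" | awk -F/ '{print $(NF-3)"/"$(NF-2)}')
        aws s3 sync "$d" "s3://my-migration-bucket/$NODE/$ks_table/"
    done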

We will need the access info to your S3 bucket. Alternatively, we can provide you with an S3 bucket to which we already have access (open a request with us via the ScyllaDB Cloud Zendesk support link).

Once the snapshots are taken, any new data ingested into your source cluster will not be migrated. This can be handled by modifying your client code to perform dual writes to both clusters, starting prior to taking the snapshots.

Note: for more info on live migrations, see our documentation or read this migration blog.

Download the data from S3

If you are using your own S3 bucket, we need you to provide us with the following details:

  1. S3_bucket_name = <customer to provide>
  2. aws_access_key_id = <customer to provide>
  3. aws_secret_access_key = <customer to provide>

Note: Downloading from S3 (to EBS volumes) can take a while (hours or days) depending on the data size.
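On our end, the download itself boils down to an AWS CLI sync of the node folders onto the mediator node, roughly like this (the bucket name and local path are placeholders):

    # Pull all node folders from the customer's bucket onto the mediator node
    aws s3 sync s3://my-migration-bucket/ /data/migration/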

Preparing the data for population

The data downloaded from S3 contains duplicates, assuming RF>1, so we first handle any SSTable naming issues (file names can collide across nodes), and then unify the data to create a single copy of it (effectively RF=1). This is done by various scripts on our end.
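Those scripts are internal, but conceptually the renaming step looks roughly like this hypothetical sketch for a single table, which gives every SSTable a unique generation number so files from different node folders can coexist in one directory (the paths, names and “mc” format are assumptions):

    # Hypothetical sketch: merge each node's SSTables for one table into
    # a single folder, renaming generation numbers to avoid collisions
    shopt -s nullglob
    mkdir -p /data/unified/my_keyspace/my_table
    gen=1
    for node_dir in /data/migration/*/my_keyspace/my_table-*; do
        for data in "$node_dir"/mc-*-big-Data.db; do
            old_gen=$(basename "$data" | cut -d- -f2)
            # Move all components of this SSTable under the new generation
            for comp in "$node_dir"/mc-"$old_gen"-big-*; do
                new=$(basename "$comp" | sed "s/^mc-$old_gen-/mc-$gen-/")
                mv "$comp" "/data/unified/my_keyspace/my_table/$new"
            done
            gen=$((gen + 1))
        done
    done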

Deploying ScyllaDB Cloud Cluster

The cluster can be either single or multi-region, and single or multi-datacenter, per your needs.
In each region where the cluster is deployed, we utilize all the Availability Zones (AZs) evenly, per the replication factor you select. For example, when setting RF=3, each node will reside in a different AZ and the cluster size can be 3/6/9 nodes, while for RF=5 in a region with 5 AZs, the cluster size can be 5/10 nodes.

The RF of the ScyllaDB system tables will equal the number of AZs, but for the data tables you can set a different RF on the relevant keyspaces when you create your schema.
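For example, a data keyspace with RF=3 could be created like this (the keyspace name, datacenter name, node IP and credentials are placeholders; check your cluster’s actual DC name with nodetool status):

    # Create a data keyspace with RF=3 in one datacenter
    cqlsh <node_ip> -u scylla -p '<password>' -e "
      CREATE KEYSPACE my_keyspace
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'AWS_US_EAST_1': 3};"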

Populating ScyllaDB Cloud Cluster with data

Once we have a single copy of the data, the VPC peering setup is done, and the ScyllaDB Cloud cluster is up, all that’s left is to decide on the best-fit strategy for data migration. Once the schema is created, we are ready to fire up the loaders and push the data into the new cluster.

This can be done using various tools such as sstableloader and/or the ScyllaDB Spark Migrator, an Apache Spark-based application developed in-house. Essentially, both tools read the source data and perform CQL INSERTs into the destination cluster; the Spark Migrator does it more efficiently and with greater parallelism. We recently posted two blog posts on the Spark Migrator, Part 1 (high-level) and Part 2 (deep-dive).
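As a rough illustration, a typical sstableloader invocation against the new cluster looks like this (the node IP, credentials and path are placeholders; the source directory must end in <keyspace>/<table>):

    # Stream one table's unified SSTables into the ScyllaDB Cloud cluster
    sstableloader -d <scylla_cloud_node_ip> \
        -u scylla -pw '<password>' \
        /data/unified/my_keyspace/my_table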

Another option is to perform a full table scan on your Source data and dump it into ScyllaDB Cloud using CQL.

There’s also an option to move the SSTable files themselves into ScyllaDB Cloud, place them under the relevant keyspace -> table -> upload folder, and then use the nodetool refresh command to load them. This, of course, will be done by ScyllaDB support staff, who will make sure it’s done properly.
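For reference, the load itself amounts to something like the following (the keyspace, table and UUID are placeholders; again, in ScyllaDB Cloud this is performed by our support staff):

    # Copy the SSTables into the destination table's upload folder...
    cp /data/unified/my_keyspace/my_table/* \
       /var/lib/scylla/data/my_keyspace/my_table-<uuid>/upload/
    # ...then load them into the running node, no restart needed
    nodetool refresh my_keyspace my_table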

Pointing the clients/drivers to the new DB

Time for your clients to start writing and reading from ScyllaDB Cloud. Let us know if everything is working as you expect and especially if you hit any issues. We are here to help!

About Tomer Sandler

Tomer Sandler joined ScyllaDB as a solution architect after a 12-year career in software quality engineering, mostly in the storage and telecom lawful interception domains. Prior to ScyllaDB, Tomer held various QA management roles at Dell EMC, leading a group of QA engineers and information developers for ScaleIO storage.