In the past we wrote about getting onboard with Scylla Cloud. Great! Now you have a cluster up and running. You’ve connected your app through VPC peering. Then what? Unless you are creating a greenfield application, you probably need to get your existing data migrated into Scylla Cloud.
We’ve written a pretty comprehensive blog about migrating to Scylla in general, but not all of those migration strategies apply to a managed cloud solution. So let’s drill down into specifics and step-by-step instructions targeted directly to your Scylla Cloud success.
Also, we’re going to focus today’s migration considerations solely on SSTable-compatible databases, such as Scylla Open Source, Cassandra or DataStax Enterprise (DSE). Migrating from other databases requires remodeling your data, which we can cover in future articles.
Migrating Your Data
To perform a data migration in an orderly fashion, you need to assess the requirements and considerations, choose the best tool for the job and create a migration plan. The process consists of several steps, as described here:
- Assessments of
  - Cluster sizing (total dataset, data per node)
  - Throughput and latency requirements
  - Data migration considerations (source cluster schema, features in use, etc.)
- Taking a snapshot of the source cluster’s data and uploading it to an S3 bucket
- Downloading the data into a Scylla mediator node and preparing the data for population
- Deploying the Scylla Cloud cluster (including VPC peering setup, if needed)
- Populating Scylla Cloud cluster with the source data (Data migration)
- Pointing your clients/drivers to the new database
Let’s dive into each stage / step and see what’s actually needed to make it happen.
Assessment Questions (Migration Considerations)
Generic Questions (covered in the sizing form)
- Where are you currently deployed? Are you running on the cloud now, in a managed environment (dockerized, DC/OS or Kubernetes), or on bare metal?
- Current cluster node type and count per DC vs. Scylla Cloud cluster node type and count (instance type and size) per DC?
- Total dataset / How much data per node (inc. Replication Factor)?
- Current Throughput / Latency numbers? Expectations?
- What is your current Cassandra/DataStax Enterprise or Scylla version? This will determine your SSTable file format (ka, la, mc)
- What is the Replication Factor (RF) per keyspace / datacenter? (Scylla Cloud currently supports RF=3/4/5. Should you need a different RF, please contact us.)
- Do you require, or are you using any of the following Scylla/Cassandra features?
- Which Region(s) should the cluster be deployed in? How many (and which) AZs are used in the Source cluster?
- Multi region / multi DC cluster?
- Do you have large partitions (over 100MB)? We have tooling to identify them. Running the following command will tell you that:
nodetool cfstats | grep -i "Compacted partition maximum"
- What partitioner is used in the source cluster? (check scylla/cassandra.yaml file)
- Do you require VPC Peering, or will direct access using allowed IPs (specific public IPs or ranges) suffice?
- Do you have any network routing constraints? (NAT etc.)
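For the large-partition question in the list above, the grep output still has to be matched back to tables by eye. Here is a minimal sketch that parses saved `nodetool cfstats` output and lists the tables whose largest compacted partition exceeds 100MB. The function name, the sample text and the exact line layout are assumptions for illustration; the output format differs slightly between versions, so check it against your own output.

```python
import re

LARGE_PARTITION_BYTES = 100 * 1024 * 1024  # the 100MB threshold from the checklist


def find_large_partitions(cfstats_text, threshold=LARGE_PARTITION_BYTES):
    """Return (keyspace, table, max_partition_bytes) for tables over the threshold."""
    keyspace = table = None
    hits = []
    for raw in cfstats_text.splitlines():
        line = raw.strip()
        m = re.match(r"Keyspace\s*:\s*(\S+)", line)
        if m:
            keyspace, table = m.group(1), None
            continue
        m = re.match(r"Table(?: \(index\))?\s*:\s*(\S+)", line)
        if m:
            table = m.group(1)
            continue
        m = re.match(r"Compacted partition maximum bytes\s*:\s*(\d+)", line)
        if m and int(m.group(1)) > threshold:
            hits.append((keyspace, table, int(m.group(1))))
    return hits


# Hypothetical, heavily trimmed cfstats output used only for the demo.
SAMPLE = """\
Keyspace : shop
\t\tTable: orders
\t\tCompacted partition maximum bytes: 268650950
\t\tTable: users
\t\tCompacted partition maximum bytes: 1916
"""

if __name__ == "__main__":
    for ks, tbl, size in find_large_partitions(SAMPLE):
        print(f"{ks}.{tbl}: largest partition is {size} bytes")
```

To use it on a real node, save the output first (for example `nodetool cfstats > cfstats.txt`) and pass the file contents to `find_large_partitions`.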
Snapshot your source cluster data
Run the nodetool snapshot command on each of the source cluster nodes and upload all snapshots to an S3 bucket of your choice, with each node in a separate folder. If all keyspaces are replicated to all datacenters, a snapshot from all nodes in one of the datacenters will suffice.
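As a rough illustration of the snapshot-and-upload step, the sketch below only builds the two shell commands for each node and prints them; it does not run anything. The node names, keyspace, snapshot tag, bucket name and data directory are all placeholders (Cassandra's default data directory is assumed; Scylla's is typically `/var/lib/scylla/data`), and the AWS CLI is just one possible way to upload.

```python
import shlex


def snapshot_and_upload_commands(node, keyspace, tag, bucket,
                                 data_dir="/var/lib/cassandra/data"):
    """Build (but do not run) the commands for one source node."""
    # Take a named snapshot of the keyspace on this node.
    snapshot = ["nodetool", "snapshot", "-t", tag, keyspace]
    # Snapshots land under <data_dir>/<keyspace>/<table>-<id>/snapshots/<tag>.
    # The exclude/include pair uploads only files that belong to this tag,
    # and the node name in the destination keeps each node in its own folder.
    upload = [
        "aws", "s3", "sync",
        f"{data_dir}/{keyspace}",
        f"s3://{bucket}/{node}/{keyspace}",
        "--exclude", "*",
        "--include", f"*/snapshots/{tag}/*",
    ]
    return [shlex.join(snapshot), shlex.join(upload)]


if __name__ == "__main__":
    for node in ("node1", "node2", "node3"):  # hypothetical node names
        for cmd in snapshot_and_upload_commands(
                node, "shop", "migration", "my-migration-bucket"):
            print(f"[{node}] {cmd}")
```

Keeping the node name in the S3 prefix matters later, because files from different nodes can share the same name.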
We will need the access info to your S3 bucket. Alternatively, we can provide you with an S3 bucket to which we already have access (open a request with us via the Scylla Cloud Zendesk support link).
Once the snapshots are done, any new data ingested in your source cluster will not be migrated. This can be handled by modifying your client’s code to perform dual writes to both clusters, prior to taking the snapshots.
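What dual writes look like depends entirely on your application, but the idea can be sketched in a few lines. The wrapper below assumes two session-like objects that expose `execute(query, params)`, as the common CQL drivers do; the class name, the error callback and the stand-in session used for the demo are hypothetical.

```python
class DualWriter:
    """Send every write to both clusters; reads stay on the source until cutover."""

    def __init__(self, source_session, target_session, on_target_error=None):
        self.source = source_session
        self.target = target_session
        self.on_target_error = on_target_error or (lambda exc, query, params: None)

    def execute_write(self, query, params=()):
        # The source cluster is still the system of record: its errors propagate.
        result = self.source.execute(query, params)
        try:
            self.target.execute(query, params)
        except Exception as exc:
            # A failed write to the new cluster should not break production
            # traffic, but it must be recorded so it can be replayed later.
            self.on_target_error(exc, query, params)
        return result


class RecordingSession:
    """Stand-in for a driver session, used only to make the demo runnable."""

    def __init__(self):
        self.writes = []

    def execute(self, query, params=()):
        self.writes.append((query, params))
        return "applied"


if __name__ == "__main__":
    old_cluster, new_cluster = RecordingSession(), RecordingSession()
    writer = DualWriter(old_cluster, new_cluster)
    writer.execute_write("INSERT INTO shop.orders (id, total) VALUES (?, ?)", (1, 9.99))
    print(len(old_cluster.writes), len(new_cluster.writes))
```

In a real deployment you would also want both writes to carry the same client-side timestamp, so that the loaded snapshot data and the live writes resolve to the same result on both clusters.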
Download the data from S3
If you are using your own S3, we need you to provide us the following details:
S3_bucket_name =<customer to provide>
aws_access_key_id =<customer to provide>
aws_secret_access_key =<customer to provide>
Note: Downloading from S3 (EBS volumes) can take a while (hours or days) depending on the data size.
Preparing the data for population
The data downloaded from S3 is duplicated, assuming RF>1, so we will first handle any sstables naming issues (names can be duplicated), and then unify the data to create a single unified copy of it (RF=1). This is done by various scripts on our end.
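To make the naming problem concrete: two nodes can each hold a file called `mc-1-big-Data.db` for the same table, so the files cannot simply be copied into one folder. The sketch below is not the script we run on our end; it is a hypothetical illustration that assigns a new, unique generation number to every SSTable of a single table, assuming the la/mc naming style `<version>-<generation>-<format>-<component>`.

```python
import re

# Matches la/mc-style component names such as "mc-12-big-Data.db".
SSTABLE_NAME = re.compile(
    r"^(?P<version>[a-z]{2})-(?P<gen>\d+)-(?P<fmt>\w+)-(?P<component>.+)$")


def plan_renames(files_by_node):
    """Map (node, old_name) -> new_name so generations are unique across nodes."""
    next_gen = 1
    assigned = {}  # (node, old generation) -> new generation
    renames = {}
    for node in sorted(files_by_node):
        for name in sorted(files_by_node[node]):
            m = SSTABLE_NAME.match(name)
            if m is None:
                continue  # not an SSTable component, e.g. manifest.json
            key = (node, int(m["gen"]))
            if key not in assigned:
                assigned[key] = next_gen
                next_gen += 1
            # All components of one SSTable keep sharing a single generation.
            renames[(node, name)] = (
                f'{m["version"]}-{assigned[key]}-{m["fmt"]}-{m["component"]}')
    return renames


if __name__ == "__main__":
    files = {  # hypothetical listing of one table's snapshot, per node folder
        "node1": ["mc-1-big-Data.db", "mc-1-big-Index.db"],
        "node2": ["mc-1-big-Data.db", "mc-1-big-Index.db", "manifest.json"],
    }
    for (node, old), new in sorted(plan_renames(files).items()):
        print(f"{node}/{old} -> {new}")
```

The older ka format puts the keyspace and table name in front of the version, so it would need a different pattern.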
Deploying Scylla Cloud Cluster
The cluster can be single or multi-region, and single or multi-datacenter, per your needs.
In each region where the cluster is deployed we will utilize all the Availability Zones (AZs) evenly, per the replication factor you select. For example, when setting RF=3 each replica will reside in a different AZ and the cluster size can be 3/6/9 nodes, while for RF=5 in a region with 5 AZs, the cluster size can be 5/10 nodes.
The RF will be equal to the number of AZs for Scylla system tables, but for the data tables you can set a different RF on the relevant keyspaces when you create your schema.
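The sizing rule in the examples above boils down to simple arithmetic: if nodes are spread evenly over as many AZs as the RF, the cluster size is a multiple of the RF. The helper names below are made up for illustration, and the rule itself is a generalization from the two examples rather than a statement of every configuration Scylla Cloud offers.

```python
def valid_cluster_sizes(rf, max_nodes):
    """Cluster sizes up to max_nodes that spread evenly over rf AZs."""
    if rf < 1:
        raise ValueError("rf must be a positive integer")
    return list(range(rf, max_nodes + 1, rf))


def nodes_per_az(cluster_size, rf):
    """Nodes placed in each AZ, assuming one AZ per replica."""
    if cluster_size <= 0 or cluster_size % rf != 0:
        raise ValueError(
            f"{cluster_size} nodes cannot be spread evenly over {rf} AZs")
    return cluster_size // rf


if __name__ == "__main__":
    print(valid_cluster_sizes(3, 9))   # sizes matching the RF=3 example
    print(valid_cluster_sizes(5, 10))  # sizes matching the RF=5 example
    print(nodes_per_az(6, 3))
```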
Populating Scylla Cloud Cluster with data
Once we have a single copy of the data, the VPC peering setup is done, and the Scylla Cloud cluster is up, all that’s left is to decide on the best-fit strategy for data migration. Once the schema is created, we are ready to fire up the loaders and push the data into the new cluster.
This can be done using various tools such as sstableloader and/or the Scylla Spark Migrator, an Apache Spark-based application developed in-house. Essentially, both tools read the source data and perform CQL INSERTs into the destination cluster. The Spark Migrator does it more effectively and with greater parallelism. We recently published two blog posts on the Spark Migrator, Part 1 (high-level) and Part 2 (deep-dive).
Another option is to perform a full table scan on your Source data and dump it into Scylla Cloud using CQL.
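A full table scan is usually done by splitting the token ring into sub-ranges and reading each one with its own query, so the work can run in parallel. The sketch below only generates the ranges and the CQL text; it assumes the source cluster uses Murmur3Partitioner (see the partitioner question in the assessment list), and the function names and the keyspace, table and column in the demo are placeholders.

```python
MIN_TOKEN = -2**63      # lower bound of the Murmur3 token ring
MAX_TOKEN = 2**63 - 1   # upper bound of the Murmur3 token ring


def token_ranges(splits):
    """Split the token ring into `splits` contiguous (start, end] ranges."""
    if splits < 1:
        raise ValueError("splits must be at least 1")
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // splits for i in range(splits)]
    bounds.append(MAX_TOKEN)
    return list(zip(bounds[:-1], bounds[1:]))


def scan_query(keyspace, table, partition_key):
    """CQL for one sub-range; bind a range's start and end to the two markers.

    Ranges are start-exclusive, so MIN_TOKEN itself is never read. This
    assumes no partition hashes to that value.
    """
    return (f"SELECT * FROM {keyspace}.{table} "
            f"WHERE token({partition_key}) > ? AND token({partition_key}) <= ?")


if __name__ == "__main__":
    print(scan_query("shop", "orders", "order_id"))
    for start, end in token_ranges(4):
        print(start, end)
```

Each row read this way is then written to Scylla Cloud with a regular INSERT. More splits mean smaller queries and more parallelism, at the cost of more round trips.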
There’s also an option to move the SSTable files themselves into Scylla Cloud, place them under the relevant keyspace -> table -> upload folder, and then use the nodetool refresh command to load them. This, of course, will be done by Scylla support staff to make sure it’s done properly.
Pointing the clients/drivers to the new DB
Time for your clients to start writing and reading from Scylla Cloud. Let us know if everything is working as you expect and especially if you hit any issues. We are here to help!