Another week, another Spark and Scylla post! This time, we’re back with the Scylla Spark Migrator; we’ll take a short tour through its innards to see how it is implemented. Read why we implemented the Scylla Spark Migrator in this blog. Overview: When developing the Migrator, we had several design goals in mind. First, the Migrator should be highly efficient in terms of resource usage. Resource efficiency in the land of Spark applications usually translates to avoiding data shuffles between nodes. Data shuffles are destructive to Spark’s performance, as they incur more I/O costs. Moreover, shuffles usually get slower […]
Welcome to a whole new chapter in our Spark and Scylla series! This post will introduce the Scylla Migrator project – a Spark-based application that will easily and efficiently migrate existing Cassandra tables into Scylla. Over the last few years, ScyllaDB has helped many customers migrate from existing Cassandra installations to a Scylla deployment. The migration approach is detailed in this document. Briefly, the process consists of several phases: create an identical schema in Scylla to hold the data; configure the application to perform dual writes; snapshot the historical data from Cassandra and load it into Scylla; configure the […]
Following up on our previous post on saving data to Scylla, this time we’ll discuss using Spark Structured Streaming with Scylla and see how streaming workloads can be written to ScyllaDB. This is the fourth part of our four-part series.
Last time, we discussed how Spark executes our queries and how Spark’s DataFrame and SQL APIs can be used to read data from Scylla. That concluded the data-querying segment of the series; in this post, we will see how data from DataFrames can be written back to Scylla.
In part 2 of our Scylla and Spark series, we will delve more deeply into the way data transformations are executed by Spark, and then move on to the higher-level SQL and DataFrame interfaces.
Welcome to part 1 of an in-depth series of posts revolving around the integration of Spark and Scylla. In this series, we will delve into many aspects of a Spark and Scylla solution: from the architectures and data models of the two products, through strategies to transfer data between them and up to optimization techniques and operational best practices.