Greetings to all Scylla and Big Data aficionados! Today, we have a special treat for you. We know you're also fans of Apache Parquet for storing columnar data in data lakes. Given Scylla's incredible resource efficiency and low-latency queries, and Parquet's efficient storage format, these two great technologies clearly belong together. So today, we're happy to announce that the Scylla Migrator can import your Parquet files directly into Scylla tables! Read on for the full details.
The Scylla Migrator
As a reminder for readers who have missed our previous posts (see here for the intro and here for a deep dive), Scylla Migrator is an open source, Apache Spark-based project that efficiently loads data into Scylla. Using Spark's distributed execution model, the Migrator inserts data into Scylla in parallel, taking full advantage of Scylla's shard-per-core architecture.
Until now, you could use the Migrator to load tables from Cassandra or DynamoDB into Scylla, or even to migrate from one Scylla cluster to another, such as from Scylla Open Source to Scylla Cloud, by writing up your configuration file:
source:
  type: cassandra
  host: cassandra-server-01
  port: 9042
  keyspace: stocks
  table: stocks
  preserveTimestamps: true
  splitCount: 256
  connections: 8
  fetchSize: 1000
target:
  host: scylla
  port: 9042
  keyspace: stocks
  table: stocks
  connections: 16
savepoints:
  path: /app/savepoints
  intervalSeconds: 300
Submitting the Migrator job to your favourite Spark cluster:
spark-submit --class com.scylladb.migrator.Migrator \
  --master spark://spark-master:7077 \
  --conf spark.scylla.config=config.yaml \
  scylla-migrator.jar
And kicking back and relaxing while your data is swiftly transferred into Scylla. So that’s all great, but what about other data sources?
Loading Parquet into Scylla
With Spark's handy DataFrame abstraction, we can load data from any source that can represent the data as a DataFrame. We're kicking off this new feature with support for Parquet files stored on AWS S3. Here's how you would configure the Migrator's source section to load data from Parquet:
source:
  type: parquet
  path: s3a:///acme-data-lake/stocks-parquet
  credentials:
    accessKey:
    secretKey:
If you're running on an EC2 machine with an instance profile that has permissions to access the S3 bucket and objects, you can skip the credentials section entirely. Otherwise, supply an access key and secret key directly. (Assuming an IAM role is currently not supported.)
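To make the two credential options concrete, here is a minimal Python sketch that builds the source section either with or without credentials and prints it as YAML. The helper names (parquet_source, to_yaml) and the example bucket path are hypothetical and are not part of the Migrator; only the keys type, path, credentials, accessKey and secretKey come from the configuration shown above.

```python
import json


def parquet_source(path, access_key=None, secret_key=None):
    """Build a `source` section for a Parquet input (hypothetical helper).

    Omit both keys when an EC2 instance profile already grants S3 access.
    """
    if (access_key is None) != (secret_key is None):
        raise ValueError("supply both accessKey and secretKey, or neither")
    source = {"type": "parquet", "path": path}
    if access_key is not None:
        source["credentials"] = {"accessKey": access_key, "secretKey": secret_key}
    return source


def to_yaml(mapping, indent=0):
    """Render a nested dict of scalars as block-style YAML.

    Scalars are written with json.dumps, so strings come out double-quoted,
    which is valid YAML and safe for characters such as ':' or '#'.
    """
    pad = "  " * indent
    lines = []
    for key, value in mapping.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 1))
        else:
            lines.append(f"{pad}{key}: {json.dumps(value)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Instance-profile case: no credentials section is emitted.
    # The bucket and prefix are placeholders, not real locations.
    print(to_yaml({"source": parquet_source("s3a://example-bucket/example-prefix")}))
```

Passing both access_key and secret_key adds the nested credentials section; passing only one raises an error, since a half-filled credentials section would not be usable.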
The rest of the configuration file stays the same! We’re still writing to Scylla, just from a different source. With that configuration in place, we can execute the Migrator using spark-submit as before. Keep in mind that all of the columns in the Parquet file need to be present on the target table.
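Since every Parquet column must exist on the target table, it can be worth checking this before submitting the job. The following is a minimal sketch of such a pre-flight check, not something the Migrator provides: the function name and the sample column names are hypothetical, and it assumes column names are compared exactly as written.

```python
def missing_columns(parquet_columns, table_columns):
    """Return the Parquet columns that are absent from the target table.

    The result keeps the order of the Parquet file's columns. An empty
    list means every Parquet column has a counterpart on the table.
    """
    known = set(table_columns)
    return [name for name in parquet_columns if name not in known]


if __name__ == "__main__":
    # Hypothetical column names for illustration. In practice the Parquet
    # names could come from pyarrow.parquet.read_schema(path).names, and the
    # table's names from the system_schema.columns table in the target cluster.
    parquet_cols = ["symbol", "day", "close", "volume"]
    table_cols = ["symbol", "day", "close"]

    missing = missing_columns(parquet_cols, table_cols)
    if missing:
        print("Add these columns to the target table first:", ", ".join(missing))
    else:
        print("All Parquet columns are present on the target table.")
```

Extra columns on the target table are not reported, because the requirement stated above only runs in one direction: from the Parquet file to the table.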
We're committed to adding more data sources over time. If you're interested in trying your hand at implementing another source for the Migrator, check out the source code on GitHub and send us a pull request.
Another interesting direction we’re considering is making the Migrator a more generic database-to-database data migration tool. That means adding more target types. Stay tuned for more developments in that area soon!