Migrate Parquet Files with the Scylla Migrator

Greetings to all Scylla and Big Data aficionados! Today, we have a special treat for you. We know you’re also all fans of Apache Parquet for storing columnar data in data lakes. Given Scylla’s incredible resource efficiency and low-latency queries, and Parquet’s efficient storage format, it’s obvious that these two great technologies belong together. So today, we’re happy to announce that the Scylla Migrator can import your Parquet files directly into Scylla tables! Read on for the full details.

The Scylla Migrator

As a reminder for readers who missed our previous posts (see here for the intro and here for a deep dive), the Scylla Migrator is an open-source, Apache Spark-based project that efficiently loads data into Scylla. Using Spark’s distributed execution model, the Migrator inserts data into Scylla in parallel, taking full advantage of Scylla’s shard-per-core architecture.
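
To give you a feel for what happens under the hood, here is a conceptual sketch based on the Spark Cassandra Connector’s documented two-cluster pattern. It is not the Migrator’s actual code: the hostnames, keyspace and table are taken from the example configuration below, and the real Migrator layers savepoints, timestamp preservation and fine-grained parallelism tuning on top of this idea.

import org.apache.spark.SparkContext
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

// Conceptual sketch only: copy one table from a Cassandra cluster to Scylla.
def copyTable(sc: SparkContext): Unit = {
  val sourceCluster = CassandraConnector(
    sc.getConf.set("spark.cassandra.connection.host", "cassandra-server-01"))
  val targetCluster = CassandraConnector(
    sc.getConf.set("spark.cassandra.connection.host", "scylla"))

  // Read the source table; Spark splits the scan into token ranges that
  // the executors process in parallel.
  val rows = {
    implicit val c: CassandraConnector = sourceCluster
    sc.cassandraTable("stocks", "stocks")
  }

  // Write each partition to the target cluster in parallel, letting Scylla
  // spread the load across its shards.
  {
    implicit val c: CassandraConnector = targetCluster
    rows.saveToCassandra("stocks", "stocks")
  }
}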

Until now, you could use the Migrator to load tables from Cassandra or DynamoDB into Scylla, or even to migrate from one Scylla cluster to another, such as from Scylla Open Source to Scylla Cloud. All it took was writing a configuration file:

source:
  type: cassandra
  host: cassandra-server-01
  port: 9042
  keyspace: stocks
  table: stocks
  preserveTimestamps: true
  splitCount: 256
  connections: 8
  fetchSize: 1000

target:
  host: scylla
  port: 9042
  keyspace: stocks
  table: stocks
  connections: 16

savepoints:
  path: /app/savepoints
  intervalSeconds: 300

Submitting the Migrator job to your favourite Spark cluster:

spark-submit --class com.scylladb.migrator.Migrator \
  --master spark://spark-master:7077 \
  --conf spark.scylla.config=config.yaml \
  scylla-migrator.jar

And kicking back and relaxing while your data is swiftly transferred into Scylla. So that’s all great, but what about other data sources?

Loading Parquet into Scylla

With Spark’s handy DataFrame abstraction, we can load data from any source that Spark can represent as a DataFrame. We’re kicking off this new feature with support for Parquet files stored on AWS S3. Here’s how you would configure the Migrator’s source section to load data from Parquet:

source:
  type: parquet
  path: s3a://acme-data-lake/stocks-parquet
  credentials:
    accessKey:
    secretKey:

If you’re running on an EC2 machine with an instance profile that has permissions to access the S3 bucket and objects, you can skip the credentials section entirely. Otherwise, supply an access key and secret key directly. (Assuming IAM roles is currently not supported.)
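
As a side note, if you’d like to poke at the same files from a plain Spark shell before running the Migrator, Spark’s DataFrame API makes that a one-liner. The sketch below is not the Migrator’s own code, just an illustration: it assumes the hadoop-aws module is on the classpath (which provides the s3a:// filesystem), reuses the bucket path from the example above, and uses placeholder credentials.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-read-sketch")
  .getOrCreate()

// Static credentials for the s3a filesystem; skip these two lines if an
// instance profile already grants access to the bucket.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<access-key>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<secret-key>")

// Every Parquet file under the prefix is loaded into a single DataFrame.
val stocks = spark.read.parquet("s3a://acme-data-lake/stocks-parquet")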

The rest of the configuration file stays the same! We’re still writing to Scylla, just from a different source. With that configuration in place, we can execute the Migrator using spark-submit as before. Keep in mind that all of the columns in the Parquet file need to be present on the target table.
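
Since the schemas have to line up, it can be worth double-checking the Parquet columns before kicking off a run. A quick, purely illustrative way to do that from spark-shell, reusing the path from the example above:

// printSchema() lists every column and its type; compare the output against
// the target Scylla table (for example, DESCRIBE TABLE stocks.stocks in cqlsh).
spark.read.parquet("s3a://acme-data-lake/stocks-parquet").printSchema()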

Looking Forward

We’re committed to adding more data sources over time. If you’re interested in trying your hand at implementing another source for the Migrator, check out the source code on GitHub and send us a pull request.

Another interesting direction we’re considering is making the Migrator a more generic database-to-database data migration tool. That means adding more target types. Stay tuned for more developments in that area soon!