
Migrate Parquet Files with the ScyllaDB Migrator

Greetings to all ScyllaDB and Big Data aficionados! Today, we have a special treat for you. We know you’re also all fans of Apache Parquet for storing columnar data in data lakes. Given ScyllaDB’s incredible resource efficiency and low-latency queries, and Parquet’s efficient columnar storage format, it is obvious that these two great technologies belong together. So today, we’re happy to announce that the ScyllaDB Migrator can import your Parquet files directly into ScyllaDB tables! Read on for the full details.

The ScyllaDB Migrator

As a reminder for readers who have missed our previous posts (see here for the intro and here for a deep dive), the ScyllaDB Migrator is an open-source, Apache Spark-based project that efficiently loads data into ScyllaDB. Using Spark’s distributed execution model, the Migrator inserts data into ScyllaDB in parallel, taking full advantage of ScyllaDB’s shard-per-core architecture.

Until now, you could use the Migrator to load tables from Cassandra or DynamoDB into ScyllaDB, or even to migrate from one ScyllaDB cluster to another, such as from ScyllaDB Open Source NoSQL Database to ScyllaDB Cloud (NoSQL DBaaS), by writing a configuration file:

source:
  type: cassandra
  host: cassandra-server-01
  port: 9042
  keyspace: stocks
  table: stocks
  preserveTimestamps: true
  splitCount: 256
  connections: 8
  fetchSize: 1000

target:
  host: scylla
  port: 9042
  keyspace: stocks
  table: stocks
  connections: 16

savepoints:
  path: /app/savepoints
  intervalSeconds: 300

Submitting the Migrator job to your favorite Spark cluster:

spark-submit --class com.scylladb.migrator.Migrator \
  --master spark://spark-master:7077 \
  --conf spark.scylla.config=config.yaml \
  scylla-migrator.jar

And kicking back and relaxing while your data is swiftly transferred into ScyllaDB. So that’s all great, but what about other data sources?

Loading Parquet into ScyllaDB

With Spark’s handy DataFrame abstraction, we can load data from any source that can represent the data as a DataFrame. We’re kicking off this new feature with support for Parquet files stored on AWS S3. Here’s how you would configure the Migrator’s source section to load data from Parquet:

source:
  type: parquet
  path: s3a://acme-data-lake/stocks-parquet
  credentials:
    accessKey:
    secretKey:

If you’re running on an EC2 machine with an instance profile that has permissions to access the S3 bucket and objects, you can skip the credentials section entirely. Otherwise, supply an access key and secret key directly. (Assuming IAM roles is currently not supported.)
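For example, when relying on an instance profile, the source section shrinks to just the type and path. Here’s a minimal sketch, reusing the hypothetical bucket path from above:

source:
  type: parquet
  # No credentials section: the instance profile attached to the
  # EC2 machine grants access to the bucket and its objects.
  path: s3a://acme-data-lake/stocks-parquet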

The rest of the configuration file stays the same! We’re still writing to ScyllaDB, just from a different source. With that configuration in place, we can execute the Migrator using spark-submit as before. Keep in mind that all of the columns in the Parquet file need to be present in the target table.
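For instance, if the Parquet file held columns named symbol, trade_date, and price (a purely hypothetical schema, just for illustration), the target table would need matching columns along these lines:

-- Hypothetical target table; every column in the Parquet file
-- must have a corresponding column here.
CREATE TABLE stocks.stocks (
  symbol     text,
  trade_date date,
  price      decimal,
  PRIMARY KEY (symbol, trade_date)
);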

Looking Forward

We’re committed to adding more data sources over time. If you’re interested in trying your hand at implementing another source for the Migrator, check out the source code on GitHub and send us a pull request.

Another interesting direction we’re considering is making the Migrator a more generic database-to-database data migration tool. That means adding more target types. Stay tuned for more developments in that area soon!