
Greetings to all ScyllaDB and Big Data aficionados! Today, we have a special treat for you. We know you’re also all fans of Apache Parquet for storing columnar data in data lakes. Given ScyllaDB’s incredible resource efficiency and low-latency database queries, and Parquet’s efficient storage format, it is obvious that these two great technologies belong together. So today, we’re happy to announce that ScyllaDB’s Migrator can import your Parquet files directly into ScyllaDB tables! Read on for the full details.
The ScyllaDB Migrator
As a reminder for readers who missed our previous posts (see here for the intro and here for a deep dive), ScyllaDB Migrator is an open source, Apache Spark-based project that efficiently loads data into ScyllaDB. Using Spark’s distributed execution model, the Migrator inserts data into ScyllaDB in parallel, taking full advantage of ScyllaDB’s shard-per-core architecture.
Until now, you could use the Migrator to load tables from Cassandra or DynamoDB into ScyllaDB, or even to migrate from one ScyllaDB cluster to another, such as from ScyllaDB Open Source NoSQL Database to ScyllaDB Cloud (NoSQL DBaaS), by writing a configuration file:
source:
  type: cassandra
  host: cassandra-server-01
  port: 9042
  keyspace: stocks
  table: stocks
  preserveTimestamps: true
  splitCount: 256
  connections: 8
  fetchSize: 1000
target:
  host: scylla
  port: 9042
  keyspace: stocks
  table: stocks
  connections: 16
savepoints:
  path: /app/savepoints
  intervalSeconds: 300
Submitting the Migrator job to your favourite Spark cluster:
spark-submit --class com.scylladb.migrator.Migrator \
  --master spark://spark-master:7077 \
  --conf spark.scylla.config=config.yaml \
  scylla-migrator.jar
And kicking back and relaxing while your data is swiftly transferred into ScyllaDB. So that’s all great, but what about other data sources?
Loading Parquet into ScyllaDB
With Spark’s handy DataFrame abstraction, we can load data from any source that can represent its data as a DataFrame. We’re kicking off this new feature with support for Parquet files stored on AWS S3. Here’s how you would configure the Migrator’s source section to load data from Parquet:
source:
  type: parquet
  path: s3a://acme-data-lake/stocks-parquet
  credentials:
    accessKey:
    secretKey:
If you’re running on an EC2 machine with an instance profile that has permissions to access the S3 bucket and objects, you can skip the credentials
section entirely. Otherwise, supply an access key and secret key directly. (Assuming IAM roles is currently not supported.)
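For example, when relying on an instance profile, skipping the credentials block leaves you with a source section as short as this (using the same bucket path as above):

source:
  type: parquet
  path: s3a://acme-data-lake/stocks-parquet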
The rest of the configuration file stays the same! We’re still writing to ScyllaDB, just from a different source. With that configuration in place, we can execute the Migrator using spark-submit as before. Keep in mind that all of the columns in the Parquet file need to be present in the target table.
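If you’re curious what that DataFrame-based flow looks like, here’s a minimal sketch in plain Spark (not the Migrator’s actual code), assuming the Spark Cassandra Connector is on the classpath and a ScyllaDB node is reachable at the hostname scylla used in the example configuration:

import org.apache.spark.sql.SparkSession

// Minimal sketch of the DataFrame-based flow described above.
// The real Migrator adds parallelism tuning, savepoints, and
// timestamp preservation on top of this basic pattern.
object ParquetToScyllaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-to-scylla-sketch")
      // ScyllaDB speaks CQL, so the Spark Cassandra Connector works as-is.
      .config("spark.cassandra.connection.host", "scylla")
      .getOrCreate()

    // Any source that Spark can represent as a DataFrame plugs in the same way.
    val stocks = spark.read.parquet("s3a://acme-data-lake/stocks-parquet")

    // Column names in the DataFrame must match columns in the target table.
    stocks.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "stocks", "table" -> "stocks"))
      .mode("append")
      .save()

    spark.stop()
  }
}

The Migrator builds on this same pattern while handling the parallelism, savepoints, and timestamp settings you saw in the YAML configuration.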
Looking Forward
We’re committed to adding more data sources over time. If you’re interested in trying your hand at implementing another source for the Migrator, check out the source code on GitHub and send us a pull request.
Another interesting direction we’re considering is making the Migrator a more generic database-to-database data migration tool. That means adding more target types. Stay tuned for more developments in that area soon!