Migrating SQL Schemas for ScyllaDB: Data Modeling Best Practices

13 minutes

In This NoSQL Presentation

To maximize the benefits of ScyllaDB, you must adapt the structure of your data. Data modeling for ScyllaDB should be query-driven based on your access patterns– a very different approach than normalization for SQL tables. In this session, you will learn how tools can help you migrate your existing SQL structures to accelerate your digital transformation and application modernization.

Pascal Desmarets, Founder & CEO, Hackolade

Pascal is the Founder and CEO of Hackolade, the pioneer for data modeling of NoSQL databases, storage formats, REST APIs, and JSON in RDBMS.

Hackolade is the only data modeling tool for ScyllaDB, Cassandra, MongoDB, Neo4j, ArangoDB, Couchbase, Cosmos DB, Databricks, DocumentDB, DynamoDB, Elasticsearch, HBase, Hive, BigQuery, Firebase/Firestore, JanusGraph, MarkLogic, Neptune, Redshift, Snowflake, Synapse, TinkerPop, etc. It also applies its visual design to Avro, JSON Schema, Parquet, Protobuf, Swagger, and OpenAPI, and is rapidly adding new targets for its physical data modeling engine.

Video Transcript

Hello, everyone. In the next 20 minutes, you will learn how to do the data modeling required to be successful with ScyllaDB. Most ScyllaDB solution architects and project sponsors will tell you that proper data modeling is a key success factor when using ScyllaDB. Our tool, Hackolade, has pioneered data modeling for NoSQL databases in general and ScyllaDB in particular so companies could modernize applications while also ensuring proper data management and governance.

Hackolade was built from the ground up specifically to handle embedded structures and schema evolution of ScyllaDB. This is done by extending visual entity relationship diagram capabilities to accommodate the representation of hierarchical structures with nested objects and arrays. The tool automatically generates CQL scripts including create statements and outer statements during schema evolution. We’re featured in the reference book on Cassandra and on the Apache Cassandra website. We do this data modeling for all the popular modern technologies for data at rest, databases as well as data in motion with model-driven API generation, Avro for Kafka event-driven architectures, data pipelines with data lakes, data warehouses and self-service analytics. Let me introduce myself. I’m Pascal Desmarets, CEO and founder of Hackolade. I’ve been in data my entire career. Designing the proper architecture and proper handling for data has been my passion. Even when I’m racing sailboats for pleasure, I’m sort of the CIO for the boat, integrating weather information, tides and currents as well as competitive data to feed the tactics for the race. So why is data modeling a key success factor when using ScyllaDB? ScyllaDB provides dynamic schema evolution. Plus, we’re in an era of agile development, so why would we need data modeling? Isn’t that outdated? Well, the answer is simple: technical debt. If you don’t build the foundations correctly, it will be hard to build the house and make it evolve over time. But don’t take my word for it. The most successful companies on the Web today and that includes the GAFAM are all doing it, and they’re writing about it too, stating that data modeling is perhaps the most important part of developing software if you want to be successful. Like any state-of-the-art technology, it is possible to fail when using it. Sorry to say, but this could also happen with ScyllaDB, that is if you want to use it for the wrong use case or if data modeling is not done properly. There is a sweet spot for ScyllaDB where it shines the most, and that’s not in a use case where you need to change and update the same data points over and over again or if you need to do lots of joins when reading the data. You can see here a list of nonexhaustive list of excellent use cases for ScyllaDB. This includes the types of applications where you collect enormous amounts of data, particularly when globally distributed, and when you know ahead of time the types of questions you will want to be answered. A second reason that could be a challenge when implementing ScyllaDB has to do with data modeling. If like me you come from years and years of relational databases, it’s very tempting to want to apply our vast experience and do it again in ScyllaDB. What we were told to do with the rules of normalization, and a contributing factor is that CQL and ScyllaDB tables look so much like SQL, but don’t get fooled by this. The underlying characteristics of the technology are so different that you need to use it in a different manner. You just don’t drive a stick shift Ferrari like you would drive an automatic transmission pickup truck, so you need to recognize these differences and leverage the strong advantages that ScyllaDB provides. You must force yourselves to think differently. Again, forget what you learned about the normal form and the usual process of conceptual, logical and physical data modeling. As you will see, you must approach the problem from a different angle, the angle of the user experience and the application workflow. But let’s first go back to basics. The terminology is, again, fairly similar to relational databases, but the details are very different. For example, while there are rows and columns, each row does not need to contain the same column information as other rows. The primary key is generally made of two segments, a partition key and a clustering key. The purpose of the partition key is to distribute data across the nodes where you store information while the purpose of the clustering key is to sort the rows for fast access. The data types are specific to ScyllaDB, and they include nonscalar collection data types. While secondary indexes may be tempting to avoid data duplication, you should be careful as they are recommended only in special circumstances. Besides, storage is cheap these days, and data duplication is okay. Finally, as stated before, you should not do joins when reading the data. Possibly, you could handle joins at the application level, but even that’s not ideal. Instead, you should perform your joins when writing the data so data retrieval is fast and easy. Data modeling for ScyllaDB can be an art as much as a science. It is a science in the sense that it should be data driven, but it is an art because it is hard to foresee accurately what the application volumes are going to be. Plus, you need to balance contradictory needs, namely the fact that partitions should be fairly equal in size while read operations should ideally retrieve all the data from a single partition. Now let’s go through a data modeling process. ScyllaDB has provided a simple example to illustrate the data modeling process. Let’s go through each of the steps listed here. The first step is to go through the application design. Here, you should create screen mock-ups and reports, so every screen, every report should be drawn and prototyped. Plus, you should draw the application workflow as is shown on the screen here. Each screen and process step will require a query on the database, and you should design in detail each one of these queries and the structure of the expected results. That means identifying each entity and its attributes plus the relationships and cardinality between these entities. Some people are tempted to call this a conceptual model, but that’s not exactly true because it’s not a high-level abstract exercise like an enterprise model. This is … Instead, it must be driven by the access patterns and queries of the application running on ScyllaDB. This should lead to the creation of ScyllaDB tables. Remember that in these tables, your primary key can be made of just a partition key, or your table can have a primary key which is made of the combination of both a partition key and a clustering key, so this is used to create your tables. A tool such as Hackolade lets you do all of this easily with a few clicks of a mouse, and mostly it lets you collaborate between team members, the stakeholders of your application. You can do all of this without writing a line of code, so you can go through what-if scenarios and shaping your data according to the exact needs of the application. As noted earlier, the primary key exercise is of particular importance. You should document the metrics that you use to justify your design. Later on as you’re operating your application, you may realize that the access patterns don’t have the volume or velocity that you originally envisioned, and this may lead to the evolution of your schema and of your primary keys, and having documented your original thinking for your original design will help you shape the evolution of your schema as it meets the reality of being run in your application. ScyllaDB gives you a vast choice of data types. They’re all listed here. These data types are used to shape your data correctly and enhance the data quality. Data types includes very powerful and flexible nonscalar data types. These are extremely useful when denormalizing the data to satisfy the fast access patterns of your ScyllaDB application. Finally, with the application, you generate the CQL scripts to create keyspaces, tables and columns and their data types, constraints and indexes and of course the partition keys and clustering keys optimized for your access patterns. Of course, developers can type all this up, but if you have an application that generates this automatically, this will lead to higher quality and more efficient processes. To succeed with ScyllaDB, it’s critical that you design your model so each query in your application is served by a single table. That way, you don’t have to create joins when reading the data, and we all know that joins can be very expensive, and they impact performance. So this approach implies that your joins are made or created through denormalization when storing the data. Your partitions must be balanced while letting you retrieve the data from a minimal number of partitions, ideally a single one. Tools should allow you to easily denormalize the schema of your relational databases as you modernize your applications to run them on ScyllaDB. The benefits of data modeling are at all levels, in the perspective of end users because the delivered application will more closely match their expectations, for management because it reduces risks and is more productive and efficient and for the developers because collaboration brings clearer requirements, less frustrating rework and better performance. There was a lot of information in this short session, so this should encourage you to learn more about the details to be successful in your ScyllaDB implementation. There are plenty of great sessions to attend during this ScyllaDB Summit. Also, don’t hesitate to contact us directly if you have any questions. Thank you.

Read More