An Odyssey to ScyllaDB and Apache Kafka

36 minutes

Register for access to all 30+ on demand sessions.

Enter your email to watch this video and access the slide deck from the ScyllaDB Summit 2022 livestream. You’ll also get access to all available recordings and slides.

In This NoSQL Presentation

At Confluent we focus on helping our customers move to an event driven data in motion architecture. This is not something achieved in a vacuum but highly interwoven with an ecosystem of data technologies that fit the problems our customers are solving. The variety of data technologies in active use today are astonishing to those of us who grew up in the RDBMS hammer days. Helping our customers make this transition means touching and working with all of these. In my own database odyssey, ScyllaDB has shown itself to be particularly well suited for the modern event driven architecture. In this presentation, I will cover why ScyllaDB is a good fit for people using Apache Kafka in event driven architectures, review customer examples, and discuss my usage of ScyllaDB in a high velocity data sharing effort.

Will LaForest, Public Sector CTO, Confluent

Will is the Public Sector CTO for Confluent. In his current position, Will evangelizes the benefits of Apache Kafka, event-driven data in motion architecture, and open-source software is addressing mission challenges in the Government. He has spent 25 years wrangling data at massive scale.


His technical career spans diverse areas from software engineering, NoSQL, data science, cloud computing, machine learning, and building statistical visualization software but began with code slinging at DARPA as a teenager. Will holds degrees in mathematics and physics from the University of Virginia.

Video Transcript

Hey, everyone. I’m excited to have the opportunity to participate along with all of you here at ScyllaDB Summit. For my session, I’m going to talk about what an event-driven architecture is and why ScyllaDB and Kafka are a very good combination for that approach along with a little bit about my path to this perspective. But first, I have to admit I am just giddy with enthusiasm over naming my talk because I’m a huge classics nerd, and, in fact, my last four cats have all been named after characters from the “Odyssey,” thankfully for my four kids.

Before I scintillate you with that subject, let me take a moment to introduce myself. To my great regret and probably yours, I am not Tim Berglund. I am Will LaForest, and currently I am working at Confluence, the company behind Apache Kafka, as CTO for Americas and Public Sector. I’ve had the pleasure of working here for 5 years now, but this is not my first technology startup around data. Previously, I was an early-stage employee at MongoDB when it was above the nail salon and on through some pretty amazing growth, and before that, at MarkLogic prior to people getting hit with these NoSQL trends. I also had a brief stint at Red Hat where I learned to appreciate the works of Avi and Dor around KVM, but aside from that, my professional career has really revolved around managing and using data. Part of that has been working independently to advise various companies and government agencies on solving data problems, and, in fact, I’ve run into ScyllaDB in both capacities. So with that out of the way, let me quickly talk about event-driven architectures, and I’m going to start with our beloved database, an age-old way of handling data. So for many decades, the focus of IT has really been on how to build a given app or a capability centered around some sort of data store where data sits at rest and is either a queried post-facto or analyzed periodically in batches. And for a long stretch, this was mostly just relational databases with varying schemas optimized or, at least, hopefully, optimized for the specific problems they were trying to solve. Eventually, we rediscovered that there wasn’t just one kind of data storage technology that could be used but actually many different kinds of databases or data stores that were, frankly, better than relational, many of them for specific problems. And, nowadays, there are countless different kinds of databases in use. But the focus is still largely the same. They accumulate sets of data and efficiently answer questions against that data at some point in time or do some sort of analytics on a set of data. And this approach is referred to as data at rest because date gets stored, and then the way I put it is, the queries, the questions, the analysis are actually brought to this setup state, these stores, and run against ones currently there and then returns. And this has worked remarkably well for a long time, and as everyone listening knows, we still need this. But there has been sort of a fundamental shift in the role of software and data, and it’s no longer ancillary to the business but central or, in many cases, the business itself. And in this new reality, data really needs to flow across all the functions of an organization, and the have to be able to rapidly react to the events in their business. It’s no longer really sufficient to treat data as disconnected silos with makeshift integration sort of stitching it together. In the last slide, I just dropped in the term events. Probably a good portion of you know what I meant, at least from a technical level, but if you don’t, events are really .. . They’re the things that happen in your business. And, of course, this differs depending upon what the organization actually does and cares about. So for transportation, events could be sensor measurements, vehicles that you want to analyze in real time in order to warn about malfunctions. So you can think predictive maintenance. For banking, example events could be financial transactions, retail, could be orders and shipments, and all businesses .. . They all have customers, and customers have actions. They buy stuff. They view things. They go place. So truthfully, an event is really any piece that states with a time stamp, “At this day and time, we recognize this to be true.” A data in motion platform thinks about data as streams, so you can think of sequences of events, and focuses on acting upon, operating, analyzing all these events as they occur. The act of doing this on Event Stream is called stream processing, and the results itself, of course, is more events, which, in turn, can be consumed by other stream processors or picked up by any number of technologies like microservices or sent to an awesome database like ScyllaDB. But let me take a closer look with the hypothetical sales transaction process just to sort of illuminate this. The first event could be a purchase request. So an inventory process handles this event and then, in turn, produces an item reservation events, probably after verifying that there’s actually inventory. And account service is waiting to handle reservation events, which it does by taking funds from the customer and then producing a purchase event. And there may now be three or more separate services all waiting to handle this purchase event regardless of where it came from. So a UI service knows to update the UI based upon this event, and logistics knows to ship the products, and a notification service will send an e-mail. And, of course, they themselves probably produce events, which could then, in turn, be consumed by others. So a few observations around this, you could, of course, avoid thinking about events at all, right? There’s others ways of doing this. People have done it for many years. So instead, you could think of this as a number of commands that are fulfilled by services. So a purchase request command is handled by a purchase service that knows that it needs to issue a command to an inventory, management service, to an account management service, then another command to a logistics service, maybe finally a separate command to a notification service. When you conceptually move to an event-oriented approach, however, you naturally start thinking in a persistent manner. At Timex, a purchase reservation event occurred, and I want to record this happening for it to be persistent. If the service crashes, machines, power goes out, it’s durably persistent, and you’re fine. It follows that in order to make that actually work effectively, you need to handle events asynchronously. Any processes that need to handle these events and act on them will do so when and at the rate that they’re able. If they go down, they will just pick back up by reading the next event that they haven’t yet handled, and if they can’t keep up with the rate of events, then they won’t crash or become overwhelmed with the number of incoming connections. They just continue to work as fast as they can. And if you need to, you should be able to bring up more services that can help handle these events. Asynchronous approaches are also vastly more efficient, as well. Don’t allocate and hold onto resources synchronously waiting for other things to happen. Use what you need only when you need it. And as many of you know, this is one of the principles that actually makes ScyllaDB so efficient. It uses asynchronous approach wherever it can. In the C++ library, a use for this is Seastar, which is also open-source, so if you’re coding in C++ and interested in doing asynchronous, that’s just sort of the way programming languages are going in general. Check that out. In the case of the event-driven approach, you’re, of course, demanding that the persistence can keep up with the event rates. But that’s a better and a more controllable known, especially when you use something like Kafka that is designed for event streaming. You can’t easily know how fast all of the downstream receivers of commands are actually going to process them. They do very different things and are written to different standards and are likely to change, or new ones come on board. And this sort of leads to the last important observation, and that is, when you move to a model where you’re asynchronously handling events, it’s natural to move to a paradigm that decouples event producers from event consumers. And with Kafka, for instance, that’s a publish-and-subscribe model. Or maybe the end-node association seems natural to me, but regardless, decoupling could, in fact, be one of the most important aspects of the event-driven approach. An event producer’s job is to produce events. It shouldn’t need to know or care who receives them with the exception of some sort of security requirements not tied to specific consumers. With a command pattern, my purchase processor needs to know what other services to issue commands to. With an event-driven approach, if you want to add some sort of fraud detection, or maybe it’s just an analytic, the producers of the events .. . They don’t need to change. They’ve done their job. I can simply add another consumer for the right kind of event. In a coupled approach, I would have to additionally change at least one upstream service, maybe more, depending on if my feature was going to be used in other purchasing processes or other processes in general. And that’s just at the technical level. Then, there’s sort of the organizational drag you avoid from being decoupled. I’m the group building the analytic. I don’t have to have several meetings and coordinate with one or more upstream develop teams and then make sure new versions are rolled out at the same time and coordinate that, et cetera. It just adds up tremendously. So going from there, I want to step back to the bigger picture. A modern architecture, one that’s event-driven, really uses both data in motion and at rest. And combining them effectively is important. All those event changes affects the accumulated state that’s needed for the business to answer questions and analyze the big picture. In this diagram, I’ve sort of superficially separated out the data production and consumption tiers, but clearly there are many cases where something is both a producer and a consumer. Pretty much any application is, really, if you think about it. But the point is that the data is produced from a variety of sources like sensors or applications, external data APIs, OIT devices, whatever, and these events either flow into a database or into an event streaming platform. In the database world, this takes the form of updates or inserts. In the event streaming world, these are typically published in, natively, a raw form of how the data was produced, right? So, often, it’s JSON records, but we see Protobuf and AVRO, as well, or whatever. Each of these two platforms are optimized for entirely different usage patterns. The Data in Motion platform focuses on handling events, so event-driven actions, for instance. It also enables trending analytics, so in real time, I want a 30-second window to risk assessment to trades based on a given stock, or I want to know when I see more than X number of authentication failures across all users in a 5-minute period. These sorts of real-time approaches are very powerful, but at the end of the day, you still need to query or analyze large sets of data. For a long time, before event streaming platforms, if you wanted to handle sequences of events sort of as a special case, you would effectively roll your own solution to support these real-time needs. So it would mean potentially leveraging a database or even using some of the old-school messaging. And, frankly, I know that, right, because before I discovered Kafka, that’s what I did for these sorts of situations. And that’s a good segue. If I’m going to be honest, my guess is, a good deal of the people attending this session are already familiar with Kafka. I heard from my colleagues from ScyllaDB, it’s actually more common to see ScyllaDB with Kafka than without it, but for those of you who may not be familiar, just a quick review, it was originally created at LinkedIn by the founders of Confluent in 2010. It was open-sourced in 2011 sometime and now has become pretty ubiquitous in industry. And so if people did have a way to piece together ways of handling events, what’s actually different about Kafka that makes it really well-suited for this workload? Why was it created? First, event data records are arranged into topics. You can think of this as our equivalent of a column family. And the records can easily be in any format you want, so from a Kafka perspective, a record is just a sequence of bites, a key, a time stamp and some headers. Kafka may not care what the bites are, but you probably do, right? And this is where Confluent adds something called the Schema Registry, also an open source, which provides you with a catalog of topics, the serialization formats you use then and associated schemas, but I’m not really going to cover that here in this talk. Needless to say, it is typical to serialize data into JSON AVRO or Protobuf, but you can plug in whatever format you really wanted to. Each topic is then broken into partitions, which are distributed across the cluster of brokers. When new records are published into Kafka, they are mutable and distributed across partitions and appended to the end, so it’s append-only rights. Immutable means, once it’s in there, you can’t change it, right, because all the rights are append-only. There are, and there are no indexes. Performance doesn’t change as a function of total data size, and this is really, really important. Subscribers consume data by sequentially moving through the records in partitions based upon an offset, and often consumers .. . They’re processing data in real time, so their offsets are appointing to the end of the partitions, which is the most recent data. But they can start any way they want, right? They can start at the beginning or at the middle at some point in time. If there’s a failure, and you want to restart processing for an amount, a window of time, you can do that. If you’re consuming in real time, when a producer adds more data, the consumer just increments their offsets and reads the latest to keep up with that. The offsets themselves are persistent in Kafka in special topics, and this is really important, as it allows consumers to sort of fail and pick back up where they left off, never missing any records, any events. But also it enables Kafka consumers to scale horizontally with Kafka automatically handling rebalancing of partitions across consumer groups as they grow or shrink, or if one of them fails, it can be rebalanced against active ones. And while it’s not depicted here in this picture, these partitions are actually replicated end times, whatever you want, for fault tolerance and high availability. Of course, Confluent continues to be the primary engineering force behind Apache Kafka but has developed an entire Data in Motion platform with Kafka at its core. It’s available on premium or as a fully managed cloud offering. Again, not going to dive into the details of the entire platform, as it’s not central to the talk, but, of course, I encourage you to visit our website for more details. So just looking at this diagram again, how do we connect Kafka and the data stores? Well, there is nothing like an arrow to magically substantiate my commentary, right? But for this sort of architecture to work, getting that connection, that line between the event-streaming platform and your data stores, is actually .. . It’s important. The method of doing so kind of depends upon the entry point for events. If the entry point for the events are Kafka, then the most typical way is to use a Kafka Connect Sync Connector. And there is well over 100 of those for various different stores, and on top of that, actually, lots of databases and stores out there actually have native support for receiving data from Kafka anyways, right? They’re just built. They know how to pull it out of us, and that’s just one of the benefits we get from being, as I mentioned, fairly ubiquitous. Events into Kafka First tends to be really the easiest path, I guess you could say, from an engineering perspective and a net new system, but, of course, you’re still going to find events coming in through the databases. That’s bound to happen. And also, the truth is, the vast majority of the pre-existing systems are built around transacting directly with a database or multiple databases. So in these cases, right, the data store is the entry point. And Kafka needs to be made aware of changes so that downstream consumers can pick up on those events, as well, and do stream processing on them or handle those events, and you get all the goodness of the asynchronous, decoupled nature. The ideal way to do this is through something called CDC, change data capture. But a couple things to be aware of is that, one, not all data stores have a CDC connector. And, two, there’s an honestly huge variance in features and quality across CDC connectors. I would say, thankfully, ScyllaDB has a true CDC connector, and it works well. I’m not going to cover this, either, as this was actually the subject of Tim’s talk last year at ScyllaDB Summit, so if you’re interested, go back, and we’ll talk about how that works. But in the event there is no CDC connector for a store, or a CDEC connector doesn’t do what it’s required, sometimes a traditional JDBC source connector will work. And then there is finally always the old standby of, build your own connector, which the connect framework allows, or build your own service to produce events if it’s easier. And then another way that I’ve actually seen this done in these architectures, it happens at the application tier, is through dual emission, right? So an event occurs. Now, I want to both send to my database, but I also want to log it into Kafka, as well. And this approach isn’t without its pitfalls, since there is no easy-to-make dual-emission atomic. One thing that can be done to make this easier is to use item potent consumer patterns. There’s a whole lot that could be talked about just on the subject of connecting them, but I’ve got to keep on moving along, and actually I want to get to what probably most of you are here to hear about, and that is ScyllaDB. So as you can see from my background, I’ve spent many a year working with NoSQL databases, more than just the two I worked at and many other ones, as well, and, of course, plenty of relational databases, as well, and at Confluent, we touched countless databases across our customers. It’s just sort of the nature of the fact that we sort of tend to make this central nervous system, and we distribute these events across all the different data stores that they’d use. And so many of them are very good and work well, but a couple observations as to why I believe ScyllaDB is great and an event-driven architecture, first, something that I think is probably obvious to many of you attending, which is that ScyllaDB is ultra-performant and scalable. Okay, so, yeah, that’s great, but why does that matter for event-driven? It’s because real-time event-driven systems in organizations often tend to see much higher data rates, and the most to an event-driven architecture tends to result in actually more discrete transactions occurring. I’ve experienced firsthand databases actually struggling to keep up with the data coming out of Kafka. Of course, the truth is that most horizontally scalable databases can, right, with the right configuration and design decisions, scale to meet whatever the requirements are. But how much compute is required to do so? How easy is it to expand? And for me, that sort of is one of the things that makes ScyllaDB special, is how incredibly efficient it is, and this makes a difference in what is actually feasible, especially with resource and financial constraints, and let’s be honest. Who doesn’t have that, right? Still, another factor is, I actually work with a lot of customers that are still not in the cloud, right, especially in the public sector, and I think efficiency in these cases is even more appreciated there, right, because it’s not as easy to just bring up a few more EC2 instances. Also, latency is often really, really important when you’re building a system that needs to rapidly react to events. We have plenty of customers who require high throughput and low latency, right? They need to receive events rapidly and then act on them rapidly, and that can sometimes mean queries to a database for each and every event, and having a couple-millisecond latency from ScyllaDB makes a big difference over time. Another thing that ScyllaDB gets is really event-driven approach, right? If I’m going to be honest, there are plenty of database vendors out there that are sort of reluctant to actually develop an effective CDC source connector, since their job is to get the data in, right, not to get the data out. But as mentioned earlier, ScyllaDB has done just that, and there is a Debezium connector with really the mentality of making it easy to exchange data, and that’s sort of central to this concept and really modern architecture in general, right? There’s not just one big, honking relational database that everyone is grouped around. There’s lots of different technologies in play, so that’s really, really important. So, okay. At this point, what I want to do is, I want to talk about a specific project that I worked on recently that combined Kafka and ScyllaDB into an event-driven architecture. It was a government program designed to improve the capabilities for a broad set of analysts and leaders, something in our space called an all-domain system, which basically means that it combined a wide variety of different information and data from a number of different source types. The task was to create a new system that more effectively supported the analysts that needed to piece together data and information from all these different domains to help make a decision. And the existing system could really only handle a fraction of the data that’s now available, right? When it was built, there wasn’t as much, and now, there’s a lot more. The volumes are higher. The frequency is higher. And then the other problem was that the lag between the data being captured and actually making it into the hands of the analysts was way too high if it made it to them at all, right, because the other problem was that the existing system was pretty brittle and suffered frequent outages. There was also a lot of manual processes involved for users in working with the existing system, so in any event, the idea was to automate as much as possible and enable rapid, data-driven decision making and action. And while we were building something from scratch, a new system from scratch, it actually still had a dependency on data coming from a couple of existing legacy databases that were not easily replaced based upon other external dependencies, other people that were using it and were not going to stop using it any time soon. And, oh, by the way, this is yet another reason why decoupling is great, right? That sort of problem goes away, but anyways, we inherited it. One of the complicating factors in this sort of ambitious project was really the variety of the analysis and queries required to support the analysts. And so from the get-go, for me, it was clear we were going to need to have a polyglot persistence approach, as no single database, I felt, would really be optimal for all the different workloads. But at the same time, I wanted to minimize operational complexity. So we had to build for multiple ones, but we didn’t want to .. . as few as possible, clearly. We knew that graph was going to be important from the beginning, just because a large set of the requirements, and we knew search indexing was going to be important because there was a lot of unstructured content, but also for the usual reasons. And actually, they needed to build a data catalog, as well, since there were so many different data sets, and they needed to support rapid onboarding of new ones. So a new data set comes on board. How do we integrate it and make these analysts aware and start finding the important feeds and pieces of information they needed? Kafka, an event-driven approach was obvious since, really, it aligned with mission goals and would help data flow across the stores including the legacy databases, which are still being actively used. So the new event data generally was going to come into Kafka from novel sources, but events were also going to be produced by transactions to existing legacy databases. Before I came onboard to advise the project, they had actually already begun a bunch of sort of independent prototyping efforts to address the various requirements. For graph questions, which, as I mentioned earlier, was pretty important to this, for graph questions and analysis, they actually started with with Neo4J like many people do. The results weren’t good. Queries took too long. Ingestion rates were poor, and, in fact, actually, the data sets were really a fraction of what they were going to see in production. And then they actually had another group that was attempting to satisfy some of the same requirements using PostgreS actually with a specialized schema for graph. And this can actually work for several requirements where there’s no deep traversal, but that just simply wasn’t the case here, and that didn’t really go anywhere, as well. So, and then another piece was that Apache Atlas was brought up as a possible approach for data lineage, so they had been running them some proto efforts with this, as well, and that requires a graph, as well. So for those of you who weren’t aware, Apache Atlas depends upon JanusGraph. JanusGraph itself depends upon a storage back end and an indexing back end, and the diagram you see here, I just ripped off from the JanusGraph site. But they sort of found Atlas very difficult to stand up and use. I had actually seen JanusGraph itself used with success at a banking customer, and my suggestion was, “Jettison Atlas, and just build exactly what you need from a lineage perspective and nothing more,” right? You don’t need every single bells and whistle. You just need it to satisfy your requirements, and you need it to work well. When we started focusing on JanusGraph for both the efforts, general graph requirements and the data lineage, as well, ones as well, it was initially set up with Cassandra. And we actually decided to take a sprint to test out ScyllaDB instead, and immediately we saw a noticeable difference, even at our small prototyping scale. Queries resolved faster. Ingestion rates were better. It just worked well, and honestly, we really had to do minimal tuning to achieve that. With the success we were seeing there, we decided to also start using it for transactional workloads alongside PostgreS, and this was particularly important for an event-handling service that needed to perform queries based upon events. I talked about that earlier. This is one of those cases, right, and we needed to both update lineage information. We needed to look up other information when we were enriching events, and ScyllaDB was great for that. Originally, there was a thought on the program to use MongoDB and PostgreS, but as of the time I rolled off the project, there was really no need, and ScyllaDB and PostgreS really was satisfying all the functional requirements, and performance and efficiency was very good. And, by the way, as it turned out, ScyllaDB worked very well for supporting our Spark jobs, as well. So in a sense, in this case, ScyllaDB was being used for three different kinds of workloads, and this actually enabled a simplified architecture because, again, I mentioned this, but at one point, I think the vision was to have another couple databases in the mix. This diagram here is sort of my recreated, sanitized version using the ScyllaDB Summit color palette, no less, and that gives you an idea of what it looked like in each of the environments. Data collection mostly entered the system through Kafka either by producers or through restful API, got published into Kafka. Multiple applications were being built by independent teams to support specific analyst requirements rather than one massive application unit and unite them all. But regardless of which approach they would have taken, they would have been going, accessing the data through APIs, which would then use the appropriate data stores. Data made its way into data stores mostly through Kafka Connect syncs, but novel data was also created in the databases based upon the analyst interaction with the applications, as well. And some portion of that, not all of it, at least, when I was there, was getting pulled back into Kafka for other downstream consumers. So if I’m an analyst, I draw a conclusion. I join some things. I make a correction. That was an event that then needed to be propagated for the event stream processors, the event handlers, et cetera. Legacy applications continue to be used, as it was going to take some time, honestly, to replace their existing capabilities, and really the focus was on bringing new ones to bear. And those changes from the original legacy ones were getting sucked into this architecture via Connect, as well. So in summary, Kafka and ScyllaDB combined very nicely on this project. As I mentioned earlier, there is a lot of other examples of the two being used in conjunction. I’m sure there’s some other case studies out there that you can probably read up on, but that’s what I want to talk about today, and I’d like to thank you for your time. Since you’re attending the ScyllaDB Summit, I’m confident you know where to go for the scoop on ScyllaDB. But for those of you who like to learn more about Kafka and Confluent, I encourage you to head over to our developer site at developer.confluent.io. Thank you very much.

Read More