
FLiP Into Apache Pulsar Apps with ScyllaDB

20 minutes


In This NoSQL Presentation

In this session, Timothy Spann, Developer Advocate at StreamNative, will introduce you to the world of Apache Pulsar and how to build real-time messaging and streaming applications against ScyllaDB with a variety of OSS libraries, schemas, languages, frameworks, and tools. He will show you all the options, from MQTT, WebSockets, Java, Golang, Python, NodeJS, Apache NiFi, Kafka-on-Pulsar, the Pulsar protocol, and more. You will FLiP your lid over how much you learn in a short time. He will share instructions on the few steps you need to get an environment ready to start building awesome apps. You will also learn how to quickly deploy an app to a production cloud cluster with StreamNative.

Timothy Spann, Developer Advocate, StreamNative

Tim is a Developer Advocate at StreamNative where he works with Apache Pulsar, Apache Flink, Apache NiFi, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming.

 

Previously, he was a Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData, and a Senior Field Engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, DataWorks Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.

Video Transcript

Hi. I’m Timothy Spann, Developer Advocate at StreamNative. Today I’m going to talk to you about FLiP into Apache Pulsar apps with ScyllaDB. I’ve been working on what I call the FLiP stack, or variants of it, for a little while. That is a combination of Flink, Pulsar, ScyllaDB, NiFi, and Spark: some things in modern data development that work really well together.

In the past, I was a principal engineer at Cloudera, a senior engineer at Hortonworks, and a field engineer at Pivotal, always working with data, streaming, messaging, things like that. Today everything’s in the cloud, so we’re going to be talking about cloud streaming and something that enables that really well: an open-source project, Apache Pulsar. This gives us a number of features that work really well when you need to do real-time messaging, things that are very scalable. It includes some features that make this really powerful and pretty straightforward. There’s a built-in serverless computing framework, and today in my demo, I’m using that to run sentiment analysis as data enters topics and goes into our output before it gets landed in ScyllaDB. What’s nice is, we’ve got unbounded storage. That’s important because, in cloud use cases, it’s really hard to decide how big these applications are going to be or how much data you’ll get, especially when you start bringing in things like live data, logs, and sensor data. Lots of data out there fills the storage up pretty quickly. We’re multitiered, so we can scale things independently wherever they need to run. Our storage layer is isolated from the compute layer for our messaging, and that lets us scale things out, and as well as storing things in these individual nodes, we’ve got a tiered storage system, so we can store data out to AWS, Google Cloud, whatever cloud storage you have out there. To support all the different messaging types, we support streaming like you’re used to with Kinesis, and pub/sub messaging like you might be doing now with things like RabbitMQ or ActiveMQ. You can pick and choose how you want to interact with the messaging system, and we’ve got support for other protocols out there depending on what you’re using already, so you don’t have to rewrite everything to move over to Apache Pulsar.

Now if that wasn’t enough, there are some features underneath there that are really important for modern data applications, especially once you’re talking cloud and hybrid cloud, where you’re in different regions. Having this unified messaging platform means you don’t have to have 12 different ways to do your messaging. We can support JMS. We support Kafka. We support MQTT. Whether you need streaming-style exactly-once delivery or you’re using your messaging system as a work queue, all of that’s supported. We guarantee that all your messages get delivered. They never get lost, and they can be stored forever, so you don’t have to worry about it. It’s a very resilient architecture with those multiple tiers: keep things isolated, keep things alive, very easy to scale up and down automatically with things like Kubernetes, and as large as you need to make these clusters, wherever they need to run. Geo-replication is built in, so if I need to run tens of thousands of nodes spread around the world, not a problem. There are large Pulsar users in Asia now. It’s scaled up to tens of petabytes without a problem, thousands of nodes, so your scale is as big as you need it to be. There are a lot of use cases where that applies pretty commonly, and these are use cases often connected with ScyllaDB. Having that unified messaging platform is nice, so you don’t have to worry about, “Oh, I have JMS over here and RabbitMQ over here and MQTT over here and Kafka over here and Kinesis over here, and I’ve got to use different libraries, and they can’t talk to each other, or I’ve got legacy apps talking this way or another.” Having one system that handles them all scales out modern data infrastructure, runs on all the clouds, runs in Kubernetes, all those cloud-native features you need. AdTech is a huge one, where you need low latency. You’ve got a lot of data coming in very quickly, and you need different clients to access it. AdTech is an easy one for us.
Fraud detection: while data is coming in, I need to make decisions on it right away. I need to look at data that I have stored and combine that with data that’s happening now. Very good one. Connected car: you’ve got a ton of data feeds, and you may need decisions made right away. Maybe those decisions need to be made at the edge. That’s something we do with IoT analytics, because I can have my Pulsar clusters run small. I can have an edge one running in a small edge gateway, have that geo-replicate automatically to my larger clusters hosted in various cloud regions, and not have to do any complex setup or management to make that work, and work scalably and solidly. Very easy. And to support things like microservices, very easy; we’re using that today in the demo.

This is the architecture I commonly use in a FLiP app, and it’s the one I’m using today in my demo. I’ve got some sources of data. I’ve got an interactive single-page web application, and however many people are using the app can type in questions and send them via WebSockets to Pulsar. Pulsar supports WebSockets natively, so it’s very easy for you to make that communication without custom libraries, or I could have communicated via Kafka or MQTT or REST or a Pulsar-native application. Lots of options there. I also have an edge device pushing in sensor data, and this could also run a CDC-type application in the near term, so if I need to get data in and out of ScyllaDB that way, I could do that. Data comes in, and like I said for the FLiP part, Pulsar is usually backed up by Flink, because Flink gives us some extra horsepower for computing to do things like continuous queries and continuous ETL, and whether it’s batch or stream, it’s the same paradigm, very easy to combine these together. Pulsar can automatically offload that data to cheap, massive cloud storage like S3. And then today, it’s very easy for me, with a couple lines of configuration, to stream that data via a Pulsar sink, which is a built-in connector you can enable, and go right into a ScyllaDB table. I could also do things like run Spark against anything while it’s still on a topic, or run queries against it with Pulsar SQL so that you can look at the data in your topics right away.
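That page-to-Pulsar hop uses Pulsar’s WebSocket API. Here is a minimal, dependency-free sketch of the JSON frame the WebSocket producer endpoint expects; the broker address, topic path, and the `user` property are illustrative, not taken from the talk.

```python
import base64
import json

# Pulsar exposes a WebSocket producer endpoint shaped like:
#   ws://<broker>:8080/ws/v2/producer/persistent/<tenant>/<namespace>/<topic>
# Each frame sent to it is JSON carrying a base64-encoded payload.

def make_producer_frame(text, user="anonymous"):
    """Build one producer frame for a chat question. The 'user' property
    is illustrative; properties are free-form key/value metadata."""
    return json.dumps({
        "payload": base64.b64encode(text.encode("utf-8")).decode("ascii"),
        "properties": {"user": user},
    })

def read_payload(frame_json):
    """Decode the payload back out of a frame (round-trip check)."""
    return base64.b64decode(json.loads(frame_json)["payload"]).decode("utf-8")
```

With any generic WebSocket client library, the page opens that endpoint and sends these frames; no Pulsar-specific library is needed, which is the point being made here.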
But look at the connectors here: lots of different options, so it’s not just sitting in this message queue. It’s live data, and as soon as an event or a message or a log or a record comes in, regardless of where it comes from, regardless of the protocol at the beginning, once it’s in a topic, I can have it execute lightweight code. This is that serverless compute layer I talked about, and it supports Java, Python, and Go. I’ve got a Python app that’s doing NLP to analyze the text as it comes in, a great way to do functionality as a microservice that’s not too hard to deploy. It gets deployed to a cluster or to Kubernetes on its own and then gets each message as it happens, very nice. Then we have connectors. We’re using one today: the connector acts as a sink. I put in a little connectivity, and data goes right to ScyllaDB as it enters that topic, very easy for you to get that data into something like ScyllaDB right away. We mentioned there are different protocol handlers, so if I need to get data in and out of Kafka, Pulsar will just look like it’s Kafka, no extra headache there. Once the data is in Pulsar, it supports any way in and out. I could use Kafka in and MQTT out; it makes no difference to Pulsar. We’ve got those processing engines I mentioned. Pulsar SQL is our way to look at a topic and see what data is there right now. That’s done with Presto/Trino, and it’s pretty easy to do: a simple select star, or select whatever fields you want. And we mentioned those data offloaders, so I can tier out that data when I get to a certain size or time, age that data out, but still have it available for things like Pulsar SQL or Flink or Spark or just regular consumers. Just because it’s not in the immediate storage of Apache BookKeeper doesn’t mean it’s not available.
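The serverless function doing the NLP can be sketched roughly like this. A Python Pulsar Function receives each message through a process(input, context) method; to keep the sketch runnable on its own, it uses a plain function and a tiny keyword scorer as a stand-in for a real NLP library (the talk doesn’t name the one used; something like NLTK’s VADER would be typical).

```python
import json

# Tiny keyword scorer standing in for a real NLP library; the word lists
# are illustrative, not from the talk.
POSITIVE = {"awesome", "great", "good", "love", "fast"}
NEGATIVE = {"bad", "slow", "hate", "broken", "awful"}

def score(text):
    """Classify text as positive/negative/neutral by keyword balance."""
    words = {w.strip("?!.,").lower() for w in text.split()}
    balance = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if balance > 0 else "negative" if balance < 0 else "neutral"

def process(input, context=None):
    """Mirrors the process(input, context) signature of a Python Pulsar
    Function. In a real deployment this would be a method on a class
    subclassing pulsar.Function, and the runtime would route the returned
    value to the output topic (chatresult2 in this demo)."""
    return json.dumps({"question": input, "sentiment": score(input)})
```

A function like this gets deployed with `pulsar-admin functions create --py sentiment.py ...` along with its input and output topics, and then runs against each message as it arrives.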

So as I mentioned, we have that sink there. It’s very easy to set up, so we can build and run an example very easily. This is all I need to connect to ScyllaDB. Obviously I’d need different configuration for a login, password, or different login mechanisms, but these are the basics. I’m running ScyllaDB in a single Docker container just as an easy-to-use demo that I’ll give you the source code for. I put in my connection, and it’s going to look pretty familiar to you, with that familiar port of 9042. I’ve got a keyspace and a column there. I could do it differently depending on what my schema is, but I keep it pretty simple for these examples. I created a keyspace just for this demo, and then from ScyllaDB, this is what I could do from the command line, or I could do it from the UI; it’s pretty easy to do it from the command line and then connect it with my DevOps tool. I’m going to create this in a different tenant and namespace. With the short talk, I didn’t mention that: one of the nice features of Pulsar is that it supports multitenancy, and it does this by letting you create tenants and, underneath them, namespaces, so you keep all your applications, different developers, different things together. So I’ve got a tenant, a namespace, and then my topic. It keeps things separate and lets you scale up to millions of topics without having to worry about creating massive, weirdly named things. For this, it’s just chatresult2, and you’ll see how we get data into there. So this is all I have to do: I point to that config file, give it a name, give it a sink type, and press Return. In a few seconds, it’s built and deployed. You’ll see that in the logs while it’s running. Now, I mentioned these functions. It could be good enough to just throw data into a topic and have it automatically land in that ScyllaDB table. That’s great.
But I wanted to do a little work on it, so I have a function here that does sentiment analysis as that chat data comes in in real time, drops it into chatresult2, and then drops it into ScyllaDB, pretty straightforward.
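As a rough sketch of what that setup can look like, assuming Pulsar’s built-in Cassandra sink (which speaks the same CQL protocol on port 9042 that ScyllaDB does) and illustrative keyspace, table, and sink names that are not taken from the talk:

```shell
# CQL run against ScyllaDB beforehand (names illustrative):
#   CREATE KEYSPACE chat WITH replication =
#     {'class': 'SimpleStrategy', 'replication_factor': 1};
#   CREATE TABLE chat.chatresult2 (key text PRIMARY KEY, col text);

# Config for Pulsar's built-in Cassandra sink, pointed at that
# familiar port of 9042:
cat > scylla-sink.yml <<'EOF'
configs:
  roots: "localhost:9042"
  keyspace: "chat"
  keyname: "key"
  columnFamily: "chatresult2"
  columnName: "col"
EOF

# Point at the config file, give it a name and a sink type, press Return:
bin/pulsar-admin sinks create \
  --name scylla-chat-sink \
  --sink-type cassandra \
  --sink-config-file scylla-sink.yml \
  --inputs chatresult2
```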

I’ve got a single-page web app, very simple. I built this so I can do things at a conference or at a meetup: give everyone access to the URL, and everyone can ask questions. Right now, the only thing it does is run that function that does the sentiment analysis, but these functions are chainable, so I can add one that maybe tries to answer the question using something like BERT Q&A, or maybe does a lookup in the documentation, or maybe posts to Slack so someone from our engineering community can interactively push something back, or does some other NLP to break it up and maybe supply some best-case lookups, or maybe results of a web search, pointers to different news feeds, a Twitter search. You get the idea, pretty easy to do. Again, nothing in the app is Pulsar-specific. This app just talks to WebSockets and gets data back from a different WebSocket. That just happens to be native Pulsar, and Pulsar uses those calls to push data into a topic and pull something out of another topic, which gives you a very easy way to have live data happening on a web page that doesn’t have any back end, pretty straightforward.
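On the receive side, the page reads from a consumer WebSocket. Here is a stdlib-only sketch of decoding one frame from Pulsar’s WebSocket consumer endpoint and building the acknowledgment it expects; the endpoint path is from Pulsar’s WebSocket API, while the field values are illustrative.

```python
import base64
import json

# Pulsar's WebSocket consumer endpoint is shaped like:
#   ws://<broker>:8080/ws/v2/consumer/persistent/<tenant>/<namespace>/<topic>/<subscription>
# Each received frame carries a base64 payload plus a messageId that the
# client echoes back to acknowledge the message.

def parse_consumer_frame(frame_json):
    """Return (decoded payload, messageId) from one consumer frame."""
    frame = json.loads(frame_json)
    text = base64.b64decode(frame["payload"]).decode("utf-8")
    return text, frame["messageId"]

def make_ack(message_id):
    """Acknowledgments are just the messageId sent back as JSON."""
    return json.dumps({"messageId": message_id})
```

This is all the web page needs: parse the frame, show the text, send the ack. Nothing in it mentions Pulsar.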
Now I’ll show you a little bit in a live demo since we have a couple minutes here, but I want to tell you about something we’re looking at in the community for the future. We want to make that connector faster. It’s pretty fast now, but we want to take advantage of some of the performance enhancements that ScyllaDB has, so we’re working on a shard-aware Pulsar connector. If you’re interested in this, if you’ve got some cycles and want to help out with documentation, testing, or engineering, definitely reach out. I hope more people are interested in this more advanced connector, but for now, that’s future work. If you’re interested and you like what you’re going to see in the demo in Pulsar, we’ve got free training online. We’ve got a nice webinar that shows you three different ways that we built microservices and then adds some Flink SQL at the end. That’s prerecorded, easy to do. If you’re especially interested in that connector, definitely join our Slack. It’s a pretty active developer community, and we’re always eager to help new people, or help people who maybe want to get involved in working on things like that connector.

Thank you, but before we go, we have time to look at the demo, so let’s pop into that. This is that single-page application, and it’s using some simple libraries here. I am not a front-end developer, so maybe this isn’t the prettiest thing you’ve ever seen, but what’s cool is it works, so I like that, and it’s got some nice features in this data-table library, so it can do things like search, find some of the latest messages, do a sort, maybe ask another question like, “Is ScyllaDB awesome?” I don’t think I need to ask that. You know the answer to that. And then we get the result for that, and we have different things here, so if we had a bunch of different people doing this and everyone was asking questions, we’d get them back live, pretty straightforward. Underneath the covers, that’s talking to our topics, and as you see here, we’ve got full ability to see everything going on. What’s going on beneath the covers is pretty advanced. It’s breaking up data into segments, so it’s stored in different layers. It’s keeping track of the different people who are consuming this data. You can see there’s our ScyllaDB sink that is grabbing this data really fast as it’s coming in and sending it through, and I’ve got a couple of other consumers out there reading this data. I don’t have any policies on here, but we could make different people able to see different parts of it. I could break up this topic into partitions, lots of different things we could do here. I could add new subscriptions to see what’s going on. If there’s a schema with the data, I can see that schema, and there’s a built-in schema registry, which makes that pretty straightforward for different things we’re doing. Here we can see our data came back from a topic, so what happened is, I sent it over WebSockets, it went into a regular Pulsar topic, and that activated the microservice.
It ran some sentiment analysis and pushed the result into another topic, and this consumer is subscribed to that topic, getting new messages as they come in, and I can sort by the time each one got published, pretty straightforward. At the same time, that same result got pushed into ScyllaDB. I hate to show the command line, but I want to show you that the data is there as it’s coming through. It’s very nice being able to talk to you about these combinations. If you’re interested in some of the stuff under the covers, we could dive a little deeper. Like I said, with that multitenancy, we can support multiple tenants in one cluster, so I set one up for meetups. I have a public one that’s there by default, where everyone can put common stuff, and there are some internal ones that are used. In my meetup tenant, I just have one namespace for my New Jersey meetup that I’m doing pretty soon, and then just the topics that are there. I only have one topic, and I’ve got two subscribers here, and you can see what type they are. We mentioned different subscriptions. They can be things like exclusive, which means only I get the messages, or they can be shared, where multiple people can have them. It depends on how you want to get the data. Do you need it exactly once? Do you want it just fast? Do you need it to work like a work queue? You pick your different subscription types and go from there, pretty straightforward, not the most difficult thing in the world. What else can we show you real quick? That’s pretty much it. What’s kind of cool here is, this is a way to consume your data from a live message queue from a single-page application where there’s no mention of Pulsar. If you look through the source code, which I’ll give you the link to at the end of this along with the slides, you can see it’s just a standard WebSocket call from jQuery. You make that call, you get back data as JSON, parse it out, and you’re ready to go, easy to apply that to something like a live data table or something else.
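The subscriptions and their types can also be inspected from the pulsar-admin CLI; the tenant, namespace, and topic names here are illustrative, not the exact ones from the demo.

```shell
# List the subscriptions on a topic:
bin/pulsar-admin topics subscriptions persistent://meetup/newjersey/chatresult2

# Full topic stats include each subscription's type:
# Exclusive (only one consumer), Shared (work-queue style),
# Failover, or Key_Shared.
bin/pulsar-admin topics stats persistent://meetup/newjersey/chatresult2
```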

So I’m Timothy Spann, and thanks for coming to my talk. Hopefully you enjoyed it and enjoy the rest of your conference. Thank you.
