See all blog posts

Streaming Real-Time Chat Messages into ScyllaDB with Apache Pulsar

At Scylla Summit 2022, I presented “FLiP Into Apache Pulsar Apps with ScyllaDB”. Using the same content, in this blog we’ll demonstrate step-by-step how to build real-time messaging and streaming applications using a variety of OSS libraries, schemas, languages, frameworks, and tools utilizing ScyllaDB. We’ll also introduce options from MQTT, Web Sockets, Java, Golang, Python, NodeJS, Apache NiFi, Kafka on Pulsar, Pulsar protocol and more. You will learn how to quickly deploy an app to a production cloud cluster with StreamNative, and build your own fast applications using the Apache Pulsar and ScyllaDB integration.

Before we jump into the how, let’s review why this integration can be used for speedy application build. ScyllaDB is an ultra-fast, low-latency, high-throughput, open source NoSQL platform that is fully compatible with Cassandra. Populating ScyllaDB tables utilizing the Scylla-compatible Pulsar IO sink doesn’t require any complex or specialized coding, and the sink makes it easy to load data to ScyllaDB using a simple configuration file pointing to Pulsar topics that stream all events directly to ScyllaDB tables.

Now, let’s build a streaming real-time chat message system utilizing ScyllaDB and Apache Pulsar!

Why Apache Pulsar for Streaming Event Based Applications

Let’s start the process to create a chat application that publishes messages to an event bus anytime someone fills out a web form. After the message is published, sentiment analysis is performed on the “comments” text field of the payload, and the result of the analysis is output to a downstream topic.

Event-driven applications, like our chat application, use a message bus to communicate between loosely-coupled, collaborating services. Different services communicate with each other by exchanging messages asynchronously. In the context of microservices, these messages are often referred to as events.

The message bus receives events from producers, filters the events, and then pushes the events to consumers without tying the events to individual services. Other services can subscribe to the event bus to receive those events for processing (consumers).

Apache Pulsar is a cloud-native, distributed messaging and event-streaming platform that acts as a message bus. It supports common messaging paradigms with its diverse subscription types and consumption patterns.

As a feature required for our integration, Pulsar supports IO Connectors. Pulsar IO connectors enable you to create, deploy, and manage connectors utilizing simple configuration files and basic CLI tools and REST APIs. We will utilize a Pulsar IO Connector to sink data from Pulsar topics to ScyllaDB.

Pulsar IO Connector for ScyllaDB

First, we download the Cassandra connector to deploy it to my Pulsar cluster. This process is documented at the Pulsar IO Cassandra Sink connector information.

Next, we download the pulsar-io-cassandra-X.nar archive to our connectors directory. ScyllaDB is fully compatible with Cassandra, so we can use that connector to stream messages to it.

When using a Pulsar IO connector like the ScyllaDB one I used for my demo, you can specify the configuration details inside a YAML file like the one shown below.

configs:
roots: "172.17.0.2:9042"
keyspace: "pulsar_test_keyspace"
columnFamily: "pulsar_test_table"
keyname: "key"
columnName: "col"

The main configuration shown above is done in YAML format and lists the root server with port, a keyspace, a column family, keyname, and column name to populate.

First, we will need to create a topic to consume from.

bin/pulsar-admin topics create persistent://public/default/chatresult2

When you deploy the connector you pass in these configuration properties by command line call as shown below.

bin/pulsar-admin sinks create --tenant public --namespace default --name "scylla-test-sink" --sink-type cassandra --sink-config-file conf/scylla.yml --inputs chatresult2

For new data, create a keyspace, table and index or use one of your existing ones.

CREATE KEYSPACE pulsar_test_keyspace with replication = {‘class’:’SimpleStrategy’, ‘replication_factor’:1};
CREATE TABLE pulsar_test_table (key text PRIMARY KEY, col text);
CREATE INDEX on pulsar_test_table(col);

Adding ML Functionality with a Pulsar Function

In the previous section, we discussed why Apache Pulsar is well-suited for event-driven applications. In this section, we’ll cover Pulsar Functions–a lightweight, serverless computing framework (similar to AWS Lambda). We’ll leverage a Pulsar Function to deploy our ML model to transform or process messages in Pulsar. The diagram below illustrates our chat application example.

Keep in mind: Pulsar Functions give you the flexibility to use Java, Python, or Go for implementing your processing logic. You can easily use alternative libraries for your sentiment analysis algorithm.

The code below is a Pulsar Function that runs Sentiment Analysis on my stream of events. (The function runs once per event.)

from pulsar import Function
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import json

class Chat(Function):
    def __init__(self):
        pass

    def process(self, input, context):
        logger = context.get_logger()
        logger.info("Message Content: {0}".format(input))
        msg_id = context.get_message_id()

        fields = json.loads(input)
        sid = SentimentIntensityAnalyzer()
        ss = sid.polarity_scores(fields["comment"])
        logger.info("Polarity: {0}".format(ss['compound']))
        sentimentVal = 'Neutral'
        if ss['compound'] == 0.00:
            sentimentVal = 'Neutral'
        elif ss['compound'] < 0.00:
            sentimentVal = 'Negative'
        else:
            sentimentVal = 'Positive'
        row = { }

        row['id'] = str(msg_id)
        row['sentiment'] = str(sentimentVal)
        row['userInfo'] = str(fields["userInfo"])
        row['comment'] = str(fields["comment"])
        row['contactInfo'] = str(fields["contactInfo"])
        json_string = json.dumps(row)
        return json_string

Here, we use the Vader Sentiment NLP ML Library to analyze the user’s sentiment on the comment. We enrich our input record with the sentiment and then write it in JSON format to the output topic.

I use the Pulsar context to do logging. I could also push data values to state storage or record some metrics. For this example, we will just do some logging.

Deploy Our Function

Below is the deployment script where you can find all of the options and tools in its github directory. We have to make sure we have our NLP library installed on all of our nodes.

bin/pulsar-admin functions create --auto-ack true
--py pulsar-pychat-function/src/sentiment.py --classname "sentiment.Chat" --inputs "persistent://public/default/chat2" --log-topic "persistent://public/default/chatlog2" --name Chat --namespace default --output "persistent://public/default/chatresult2" --tenant public

pip3 install vaderSentiment

Let’s Run Our Chat Application

Now that we have built our topic, Function, and sink, let’s build our application. The full web page is in the github directory, but I’ll show you the critical portions here. For this Single Page Application (SPA), I am using JQuery and DataTables that are included from their public CDNs. Datatable.html

<form action="/datatable.html" method="post" enctype="multipart/form-data" id="form-id">
<div id="demo" name="demo"></demo>
<p><label>User: </label><input name="user" type="text" id="user-id" size="75" value="" maxlength="100"/></p>
<p><label>Question: </label><input type="text" name="other-field" type="text" id="other-field-id" size="75" maxlength="200" value=""/></p>
<p><label>Contact Info: </label><input name="contactinfo" type="text" id="contactinfo-id" size="75" maxlength="100" value=""/></p>
<p><input type="button" value="Send to Pulsar" onClick="loadDoc()" /></p>
</form>

In the above HTML Form, we let users add a comment to our chat.

Now we are using JavaScript to send the form data as JSON to a Pulsar topic via WebSockets. WebSockets are a supported protocol for Apache Pulsar. The WebSocket URL is ws://pulsar1:8080/ws/v2/producer/persistent/public/default/chat2.

Where ws is the protocol, pulsar1 is the Pulsar server, port 8080 is our REST port, producer is what we are doing, persistent is our type of topic, public is our tenant, default is our namespace and chat2 is our topic: We populate an object and convert it to a JSON String and encode that payload as a Base64-encoded ASCII string. Then, we add that encoded String as the payload in a new JSON string that includes payload, properties and context for our Pulsar Message. This format is required for the WebSocket protocol to convert to a regular message in our Pulsar topic.

​​<script>
function loadDoc() {
cxvar xhttp = new XMLHttpRequest();
  xhttp.onreadystatechange = function() {
    if (this.readyState == 4 && this.status == 200) {
      document.getElementById("demo").innerHTML = '';
    }
  };
var wsUri = "ws://pulsar1:8080/ws/v2/producer/persistent/public/default/chat2";

websocket = new WebSocket(wsUri);

const pulsarObject = {
  userInfo: document.getElementById('user-id').value.substring(0,200),
  contactInfo: document.getElementById('contactinfo-id').value.substring(0,200),
  comment: document.getElementById('other-field-id').value.substring(0, 200)};
const jsonStr = JSON.stringify(pulsarObject);
var payloadStr = btoa(jsonStr);
const propertiesObject = {key: Date.now() }
var data = JSON.stringify({ "payload": payloadStr, "properties": propertiesObject, "context": "cs" });

websocket.onopen = function(evt) {
  if (websocket.readyState === WebSocket.OPEN) {
    websocket.send(data);
  }
};
websocket.onerror = function(evt) {console.log('ERR', evt)};
websocket.onmessage = function(evt) {}
websocket.onclose = function(evt) {
  if (evt.wasClean) { console.log(evt);
  } else { console.log('[close] Connection died');
  }
};
}
var form = document.getElementById('form-id');
form.onsubmit = function() {
  var formData = new FormData(form);  
  var action = form.getAttribute('action');
  loadDoc();
  return false;
}
</script>

In the above code, we’ll grab the value of the fields from the form, stop the form from reloading the page, and then send the data to Pulsar.

Now, let’s consume any messages sent to the result topic of our Sentiment Pulsar function.

In the below code we consume from a Pulsar topic: ws://pulsar1:8080/ws/v2/consumer/persistent/public/default/chatresult2/chatrreader?subscriptionType=Shared&receiverQueueSize=500.

In this URI, we can see this differs some from the producer URI. We have a receiverQueueSize, consumer tag and a subscription Type of Shared.

JavaScript:

$(document).ready(function() {
   var t = $('#example').DataTable();

var wsUri = "ws://pulsar1:8080/ws/v2/consumer/persistent/public/default/chatresult2/chatrreader?subscriptionType=Shared&receiverQueueSize=500";
websocket = new WebSocket(wsUri);
websocket.onopen = function(evt) {
console.log('open');
};
websocket.onerror = function(evt) {console.log('ERR', evt)};
websocket.onmessage = function(evt) {

   var dataPoints = JSON.parse(evt.data);
   if ( dataPoints === undefined || dataPoints == null || dataPoints.payload === undefined || dataPoints.payload == null ) {
      return;
   }
   if (IsJsonString(atob(dataPoints.payload))) {
      var pulsarMessage = JSON.parse(atob(dataPoints.payload));
      if ( pulsarMessage === undefined || pulsarMessage == null ) {
         return;
      }
      var sentiment = "";
      if ( !isEmpty(pulsarMessage.sentiment) ) {
         sentiment = pulsarMessage.sentiment;
      }
      var publishTime = "";
      if ( !isEmpty(dataPoints.publishTime) ) {
         publishTime = dataPoints.publishTime;
      }
      var comment = "";
      if ( !isEmpty(pulsarMessage.comment) ) {
         comment = pulsarMessage.comment;
      }
      var userInfo= "";
      if ( !isEmpty(pulsarMessage.userInfo) ) {
         userInfo = pulsarMessage.userInfo;
      }
      var contactInfo= "";
      if ( !isEmpty(pulsarMessage.contactInfo) ) {
         contactInfo = pulsarMessage.contactInfo;
      }
         t.row.add( [ sentiment, publishTime, comment, userInfo, contactInfo ] ).draw(true);
   }
};

} );

For messages consumed in JavaScript WebSockets, we have to Base64-decode the payload and parse the JSON into an object and then use the DataTable row.add method to add these new table rows to our results. This will happen whenever messages are received.

Conclusion

In this blog, we explained how to use Apache Pulsar to build simple, streaming applications regardless of the data source. We chose to add a ScyllaDB compatible sink to our Chat application; however, we could do this for any data store in Apache Pulsar.

You can find the source code in the Github repo Scylla FLiPS The Stream With Apache Pulsar.

If you’d like to see this process in action, view the original on-demand recording.

WATCH THE ON-DEMAND RECORDING

Resources & References

More on Pulsar

  1. Learn Pulsar Fundamentals: While this blog did not cover Pulsar fundamentals, there are great resources available to help you learn more. If you are new to Pulsar, we recommend you to take the on-demand self-paced Pulsar courses or test your Pulsar knowledge with the Fundamentals TestOut.
  2. Spin up a Pulsar Cluster in Minutes: If you want to try building microservices without having to set up a Pulsar cluster yourself, sign up for StreamNative Cloud today. StreamNative Cloud is the simple, fast, and cost-effective way to run Pulsar in the public cloud.
  3. Continued Learning: If you are interested in learning more about Pulsar functions and Pulsar, take a look at the following resources:

This article original appeared on the StreamNative Engineering blog here and is republished with permission.

 

About Tim Spann

Tim Spann is a Developer Advocate for StreamNative covering Apache Pulsar, Apache Flink and Enterprise Messaging and Streaming. He focuses on the US and leads, educates, develops demos and curriculum, speaks to dozens of Fortune 500 companies, speaks at conferences and is a Developer Advocate for all developers. He loves developing, building and giving workshops to clients, partners, employees and prospects. He has been running well-attended meetups for years and will continue enabling others to run them as well. He has a passion for running AI/ML/Deep Learning at the edge and personally purchased, developed, architected, built, demoed and broadcast solutions for devices such as NVidia Jetson Nano, NVidia Jetson Xavier, Raspberry Pi with Movidius 1 and 2, Google Coral, Garden Hat, many things from Adafruit.