Stateful Streaming Applications with Apache Spark | Scylla Summit 2017

Burak Yavuz, Databricks

28:41 | November 16, 2017

When working with streaming data, stateful operations are a common use case. If you want to de-duplicate data, calculate aggregations over event-time windows, or track user activity over sessions, you are performing a stateful operation.
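To make "stateful" concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use any Spark API. It only illustrates the kind of state that de-duplication and event-time window counts have to carry from one micro-batch to the next. The class name `StreamState`, the 60-second window width, and the `(event_id, event_time)` event shape are all assumptions made up for this illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical tumbling-window width, chosen for the example


class StreamState:
    """State that has to survive between micro-batches (illustrative only)."""

    def __init__(self):
        self.seen_ids = set()                  # remembered for de-duplication
        self.window_counts = defaultdict(int)  # window start (seconds) -> event count

    def process_batch(self, events):
        """Fold one micro-batch of (event_id, event_time_seconds) into the state."""
        for event_id, event_time in events:
            if event_id in self.seen_ids:
                continue  # duplicate delivery: this event was already counted
            self.seen_ids.add(event_id)
            # Assign the event to its window by event time, not arrival time.
            window_start = event_time - (event_time % WINDOW_SECONDS)
            self.window_counts[window_start] += 1
        return dict(self.window_counts)


state = StreamState()
state.process_batch([("a", 5), ("b", 42), ("a", 5)])  # "a" arrives twice
# "b" is replayed in a later batch, and "d" arrives late for the first window.
result = state.process_batch([("c", 61), ("b", 42), ("d", 59)])
print(result)  # {0: 3, 60: 1}
```

Neither result could be computed from the second batch alone: dropping the replayed "b" and crediting the late "d" to the first window both depend on what earlier batches left behind. In a real streaming engine that remembered state also has to survive failures and cannot grow without bound, which is why it is kept in a fault-tolerant state store rather than in local variables as it is here.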

Apache Spark provides a high-level, simple-to-use DataFrame/Dataset API for working with both batch and streaming data. The funny thing about batch workloads is that people tend to run them over and over again. Structured Streaming lets users run these same workloads, with the exact same business logic, in a streaming fashion, helping them answer questions at lower latency.

In this talk, we focus on stateful operations with Structured Streaming. Through live demos, we will show how NoSQL stores can be plugged in as a fault-tolerant state store for intermediate state, and how they can serve as a streaming sink where output data is stored indefinitely for downstream applications.



