When working with streaming data, stateful operations are a common use case. If you want to de-duplicate data, calculate aggregations over event-time windows, or track user activity across sessions, you are performing a stateful operation.
Apache Spark provides users with a high-level, simple-to-use DataFrame/Dataset API for working with both batch and streaming data. The funny thing about batch workloads is that people tend to run them over and over again. Structured Streaming allows users to run these same workloads, with the exact same business logic, in a streaming fashion, helping them answer questions at lower latencies.
In this talk, we focus on stateful operations with Structured Streaming and demonstrate, through live demos, how NoSQL stores can be plugged in as a fault-tolerant state store for intermediate state, and also used as a streaming sink where output data can be stored indefinitely for downstream applications.
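To make the idea concrete, here is a minimal plain-Python sketch of the pattern, not Spark code and not the implementation shown in the talk. It assumes a hypothetical `StateStore` key-value interface, uses a dict-backed `InMemoryStore` as a stand-in for a NoSQL table, and treats a plain dict as the sink. The event shape, key names, and the 60-second tumbling window are all illustrative assumptions.

```python
from typing import Dict, Iterable, Protocol, Tuple

# Assumed event shape for illustration: (event_id, user, event_time_in_seconds)
Event = Tuple[str, str, int]


class StateStore(Protocol):
    """Hypothetical key-value interface that a NoSQL table could implement."""

    def get(self, key: str, default: int = 0) -> int: ...

    def put(self, key: str, value: int) -> None: ...


class InMemoryStore:
    """Dict-backed stand-in for an external, fault-tolerant store."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}

    def get(self, key: str, default: int = 0) -> int:
        return self.data.get(key, default)

    def put(self, key: str, value: int) -> None:
        self.data[key] = value


def process_batch(
    events: Iterable[Event],
    state: StateStore,
    sink: Dict[Tuple[str, int], int],
    window_seconds: int = 60,
) -> None:
    """De-duplicate by event id, then count events per user per tumbling window."""
    for event_id, user, ts in events:
        seen_key = f"seen:{event_id}"
        if state.get(seen_key):
            continue  # duplicate: already recorded in state by an earlier event
        state.put(seen_key, 1)

        # Event-time tumbling window: round the timestamp down to the window start.
        window_start = ts - ts % window_seconds
        count_key = f"count:{user}:{window_start}"
        new_count = state.get(count_key) + 1
        state.put(count_key, new_count)

        # Upsert the latest aggregate into the sink for downstream readers.
        sink[(user, window_start)] = new_count


# Usage: state survives across micro-batches, so a replayed event is ignored.
state = InMemoryStore()
sink: Dict[Tuple[str, int], int] = {}
process_batch([("e1", "alice", 5), ("e2", "alice", 30), ("e1", "alice", 5)], state, sink)
process_batch([("e2", "alice", 30), ("e3", "alice", 65)], state, sink)
print(sink)  # {('alice', 0): 2, ('alice', 60): 1}
```

Because the intermediate state lives behind the store interface rather than in the processing function, swapping the in-memory dict for an external store would not change the business logic. That separation is the property the sketch is meant to illustrate.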
Ted Chang, Chin Huang, Software Engineers, IBM Graph
Avi Kivity, CTO, ScyllaDB
Miguel Martinez Pedreira, Computer Engineer, CERN
Holden Karau, Developer Advocate, Google
Alexys Jacob, CTO, Numberly
Eyal Gutkind, Head of Solution Architects, ScyllaDB