Spark Powered by Scylla – Your Questions Answered
Spark workloads are common to Scylla deployments. Spark helps users conduct analytic workloads on top of their transactional data. Utilizing the open source Spark-Cassandra connector, Scylla enables access of Spark to the data.
During our recent webinar, ‘Analytics Showtime: Spark Powered by Scylla’ (now available on-demand), there were several questions that we found worthy of additional discussion.
Do you recommend deploying Spark nodes on top of Scylla nodes?
In general, we recommend separating the two deployments. There are several reasons for this recommendation:
- Resource Utilization. Both Scylla and Spark are resource hungry applications. Co-deploying Scylla and Spark can cause resource depletion and contentions between Scylla and Spark.
- Dynamic allocations. Scylla in most cases is a static deployment — you deploy a number of nodes as planned for your capacity (throughput and/or latency SLA). Spark jobs, on the other hand, have ad-hoc characteristics. Users can benefit from deploying and decommissioning Spark nodes without accruing money or performance costs.
- Data locality impact in minimal. Since Scylla hashes the partition keys, the probability of a continuous placement of multiple Scylla partitions that are part of an RDD partition is slim. Without the collocation of data on the same node, users utilize the network to transfer data from the Scylla nodes to the Spark nodes, whether collocating or not collocating the data.
What tuning options would you recommend for very high write workloads to Scylla from Spark?
We recommend looking at the following settings:
- The number of connections opened between your Spark executors and Scylla. You can monitor the number of open connections using Scylla’s monitoring solution.
- Look at your data model. Can the Spark connector utilize its batch processing efficiently? Make sure that the Scylla partition key used in the batch is always the same. On the other hand, if you are about to create a huge partition, also consider the amount of time it will take Scylla to fetch that large partition
- In the case of changing the buffer behavior to not batch, you should tune: output.batch.grouping.key to none
- If your Spark nodes have enough power and network bandwidth available in your Spark and Scylla instances is 10gb or higher, increase the number of concurrent writes by changing “output.concurrent.writes” from its default of 5.
Does Scylla send the data compressed or uncompressed over the network to Spark?
We recommend compressing all communication through the Cassandra-spark connector.
Users can define the compression algorithm used in the configuration file of the connector.
Set “connection.compression” to LZ or Snappy to achieve the desired compression and reduction in network traffic.
What does the setting input.split.size_in_mb help with?
This is the default setting for the input split size and is set to 64MB by default. This is also the basic size of an RDD partition. This means that every fetch of information from Scylla will “fetch” 64MB. Such a setting is less efficient for Scylla’s architecture, as it means a single coordinator will have to deal with a fairly sizable read transaction while its counterparts are sitting idle. By reducing the split size to 1MB, we achieve several benefits:
- Coordination and data fetching is distributed among several coordinators
- Higher concurrency of requests, which translates to a higher number of connections between the Spark nodes and Scylla nodes
Does your demo use Spark standalone?
Yes, in our demo we are using Spark standalone. In most installations, in which Scylla is involved, we see Spark installed in standalone mode.
Miss the webinar or want to see it again? ‘Analytics Showtime: Spark Powered by Scylla’, is available for viewing on-demand.