Spark workloads are common to ScyllaDB deployments. Spark helps users conduct analytic workloads on top of their transactional data. Utilizing the open source Spark-Cassandra connector, ScyllaDB enables access of Spark to the data.
During our recent webinar, ‘Analytics Showtime: Spark Powered by ScyllaDB’ (now available on-demand), there were several questions that we found worthy of additional discussion.
Do you recommend deploying Spark nodes on top of ScyllaDB nodes?
In general, we recommend separating the two deployments. There are several reasons for this recommendation:
- Resource Utilization. Both ScyllaDB and Spark are resource hungry applications. Co-deploying ScyllaDB and Spark can cause resource depletion and contentions between ScyllaDB and Spark.
- Dynamic allocations. ScyllaDB in most cases is a static deployment — you deploy a number of nodes as planned for your capacity (throughput and/or latency SLA). Spark jobs, on the other hand, have ad-hoc characteristics. Users can benefit from deploying and decommissioning Spark nodes without accruing money or performance costs.
- Data locality impact in minimal. Since ScyllaDB hashes the partition keys, the probability of a continuous placement of multiple ScyllaDB partitions that are part of an RDD partition is slim. Without the collocation of data on the same node, users utilize the network to transfer data from the ScyllaDB nodes to the Spark nodes, whether collocating or not collocating the data.
What tuning options would you recommend for very high write workloads to ScyllaDB from Spark?
We recommend looking at the following settings:
- The number of connections opened between your Spark executors and ScyllaDB. You can monitor the number of open connections using ScyllaDB’s monitoring solution.
- Look at your data model. Can the Spark connector utilize its batch processing efficiently? Make sure that the ScyllaDB partition key used in the batch is always the same. On the other hand, if you are about to create a huge partition, also consider the amount of time it will take ScyllaDB to fetch that large partition
- In the case of changing the buffer behavior to not batch, you should tune: output.batch.grouping.key to none
- If your Spark nodes have enough power and network bandwidth available in your Spark and ScyllaDB instances is 10gb or higher, increase the number of concurrent writes by changing “output.concurrent.writes” from its default of 5.
Does ScyllaDB send the data compressed or uncompressed over the network to Spark?
We recommend compressing all communication through the Cassandra-spark connector.
Users can define the compression algorithm used in the configuration file of the connector.
Set “connection.compression” to LZ or Snappy to achieve the desired compression and reduction in network traffic.
What does the setting input.split.size_in_mb help with?
This is the default setting for the input split size and is set to 64MB by default. This is also the basic size of an RDD partition. This means that every fetch of information from ScyllaDB will “fetch” 64MB. Such a setting is less efficient for ScyllaDB’s architecture, as it means a single coordinator will have to deal with a fairly sizable read transaction while its counterparts are sitting idle. By reducing the split size to 1MB, we achieve several benefits:
- Coordination and data fetching is distributed among several coordinators
- Higher concurrency of requests, which translates to a higher number of connections between the Spark nodes and ScyllaDB nodes
Does your demo use Spark standalone?
Yes, in our demo we are using Spark standalone. In most installations, in which ScyllaDB is involved, we see Spark installed in standalone mode.
Miss the webinar or want to see it again? ‘Analytics Showtime: Spark Powered by ScyllaDB’, is available for viewing on-demand.