CASE STUDY

Scylla Shines in IBM’s Performance Tests for JanusGraph

About IBM Graph and JanusGraph

IBM Graph is an enterprise graph DBMS service based on open-source graph technology. It is built with Apache TinkerPop3 as a managed service, supporting high availability and elastic scalability for real-time analytics and recommendation engines. It is available on IBM Cloud. IBM is shifting its graph work from IBM Graph to Compose for JanusGraph, although both are currently available.

JanusGraph is a scalable open source graph database that’s optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. It stores graphs in adjacency list format, which means that a graph is stored as a collection of vertices with their edges and properties.
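As a rough illustration of the adjacency list idea, the sketch below keeps each vertex together with its properties and its outgoing edges. It is not JanusGraph's actual storage encoding; the class names, field names, and sample data are hypothetical and chosen only to show the shape of the structure.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    label: str
    target: int                                   # id of the vertex this edge points to
    properties: dict = field(default_factory=dict)


@dataclass
class Vertex:
    vertex_id: int
    properties: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)     # outgoing edges stored with the vertex


def build_graph():
    """Return a tiny example graph as a dict of vertex id -> Vertex."""
    graph = {
        1: Vertex(1, {"name": "alice", "age": 30}),
        2: Vertex(2, {"name": "bob", "age": 25}),
    }
    graph[1].edges.append(Edge("knows", 2, {"since": 2015}))
    return graph


def neighbors(graph, vertex_id):
    """Ids of vertices reachable over one outgoing edge."""
    return [edge.target for edge in graph[vertex_id].edges]
```

Because a vertex's edges live alongside the vertex itself, finding its neighbors is a single lookup followed by a scan of that one list, which is why the speed of the storage backend matters so much for graph traversals.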

In the context of graph databases, the performance of the storage backend is critical. IBM had previously used only Apache Cassandra and HBase as storage backends for JanusGraph. Having heard about the advantages of Scylla, IBM’s Open Tech and Performance teams conducted a series of tests to compare Scylla with HBase and Cassandra.

Convincing Results

The team conducted three performance tests: inserting vertices (write only), inserting edges (read and write), and running queries (mainly reads). Each database was tested on a 3-node cluster.

Server Specifications:

  • Physical servers: x3650 M5, 2 sockets x 14 cores, 384 GB (12 x 32 GB) memory
  • CPU: Intel Xeon Processor E5-2690 v4 14C 2.6GHz 35MB Cache 2400MHz
  • Network Interface: Emulex VFA5.2 ML2 Dual Port 10GbE SFP+ Adapter
  • Disk: 720 GB SSD, RAID 5
  • Operating System: Ubuntu 16.04.2 LTS

Test 1: Inserting Vertices
For this test, the IBM team used its own tool to generate 40,000,000 vertices, each of which had two properties. This data was saved into CSV files. The team then used their own driver to insert the vertices as quickly as possible into JanusGraph.
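IBM's generator and driver are not described in detail, so the following is only a minimal sketch of the same idea: write vertices with two properties to CSV, then group rows into batches that a loader could commit one transaction at a time. The column names, the property values, and the batching strategy are all assumptions.

```python
import csv


def write_vertices(out, count):
    """Write `count` vertices, each with two properties, as CSV rows."""
    writer = csv.writer(out)
    writer.writerow(["vertex_id", "name", "score"])   # hypothetical column names
    for i in range(count):
        writer.writerow([i, f"v{i}", i % 100])        # two made-up properties
    return count


def batches(rows, size):
    """Group rows into lists of at most `size`, e.g. one list per transaction."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:                                         # flush the final partial batch
        yield batch
```

Calling `write_vertices(open("vertices.csv", "w", newline=""), 40_000_000)` would produce a file of the size used in the test; a loader would then read it back with `csv.reader` and pass each batch to whatever graph client it uses.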

Scylla displayed nearly 35% higher throughput than HBase and almost 3X Cassandra’s throughput. As Ted Chang, Performance Engineer at IBM, explained, “Scylla performed very well out of the box, we really didn’t have to spend much time tuning it. We did spend quite a lot of time tuning the other two databases, but they still couldn’t come close to Scylla.”

Test 2: Inserting Edges
In the second test scenario, the team randomly picked 30,000,000 pairs of vertices and inserted one edge between each pair, for a total of 30,000,000 edges. Each edge had one property. Again, this data was saved to CSV files and the edges were imported as quickly as possible into JanusGraph.
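A sketch of how such random edge data might be generated is shown below. It assumes vertex ids are the integers 0 to `num_vertices - 1`, that the two endpoints of an edge are distinct, and that the single edge property is a random weight; none of these details come from IBM's tool.

```python
import random


def random_edges(num_vertices, num_edges, seed=0):
    """Yield (source, target, weight) tuples between randomly chosen vertices."""
    rng = random.Random(seed)                              # seeded for repeatable runs
    for _ in range(num_edges):
        src, dst = rng.sample(range(num_vertices), 2)      # two distinct vertex ids
        yield src, dst, rng.random()                       # one property per edge
```

Inserting an edge requires looking up both endpoint vertices before writing, which is why this test exercises reads as well as writes.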

Once again, Scylla far outpaced the other two databases. Scylla’s throughput was 160% better than HBase and more than 4X that of Cassandra.

Test 3: Query Performance
Here the purpose was to simulate a real-life application. The team wanted to see how many transactions per second each database could handle in its 3-node configuration when running a complex, realistic query.

And once again Scylla came out head and shoulders above HBase and Cassandra. At high volumes, Scylla performed 72% better than Cassandra and nearly 150% better than HBase.

The team also measured latency in order to see how well the databases could meet SLAs. At higher traffic volumes, Scylla’s latency was roughly half that of Cassandra and almost a third that of HBase. As Chin Huang, Software Engineer at IBM, put it, “Once you get to 400 concurrent users, it really looks like Scylla is the only choice.”

Lessons Learned

The IBM team learned quite a lot from their performance tests. In addition to its performance advantages, Scylla was also the easiest database to cluster, especially when adding multiple nodes to a cluster. They were also very pleased to see Scylla’s self-tuning capabilities, load balancing and its ability to fully utilize the available system resources. Lastly, their test environments confirmed that Scylla works with existing Cassandra utility clients.