
x86-less ScyllaDB: Exploring an All-ARM Cluster

Keith McKay and Mike Bennett, 14 minutes

In this session, we explore an all-ARM ScyllaDB cluster with Ampere ARM-powered servers and fast NVMe SSD storage from ScaleFlux. We describe the hardware setup and evaluate ScyllaDB performance and efficiency to uncover what's possible in a completely x86-less cluster.



Video Transcript

Welcome to our talk, "x86-less ScyllaDB: Exploring an All-ARM Cluster." My name is Mike Bennett, and I'm a solution architect here at Ampere Computing.

I've been working in this industry for the past 18 years, so I've seen a lot of different database technologies, and for the past nine years I've been working in solution development for OEMs, where I've helped find ways to run those kinds of applications faster. For that reason I enjoy servers with many cores, because they help things run faster. I recently moved to California from Texas, a reverse migrator going against the trend.

Hi, I'm Keith. I'm responsible for applications engineering at ScaleFlux. I've been in the memory and storage business for close to 20 years now, and I really enjoy it. I was born in Mountain View just a little bit before Google, so I've enjoyed growing up in Silicon Valley.

Our agenda today: we're going to talk a bit about the cluster that we built using 100% Arm cores (no x86 was harmed in the building of this cluster). We're going to introduce Ampere and ScaleFlux, our two companies, then we're going to look at the benchmarking setup and talk about some of the results we've achieved, which I think are quite impressive. And lastly, we'll wrap up with some ways to get in contact with us in case you want to learn more.

Going into our cluster configuration, you can see this is an all-Arm-powered cluster; not a single x86 processor was needed to build it. We started with three servers, Mount Collins platforms from Foxconn. Each of these systems has an Ampere Altra Max M128-30 processor, which is our way of saying it's 128 cores running at 3.0 GHz. We had 256 GB of RAM per node, but if you went to a dual-socket configuration you could have up to a terabyte of memory. Inside each of these servers we had four ScaleFlux SSDs, which have their own Arm Cortex cores inside them. And we used three client machines running the Ampere Altra Q80-30, which is an 80-core version of our 3.0 GHz processor, pushing the different test configurations that we'll go over later to our ScyllaDB instance.

For the operating system inside the cluster, we went with SUSE Linux Enterprise 15 SP4, in part because SUSE contributed to helping us get the hardware resources that we needed for this testing. We also used 100-gigabit Ethernet networking powered by Mellanox ConnectX-6 NICs. So in this entire cluster, we didn't need a single x86 instruction to help us out.
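As a quick illustration of what that topology looks like from a client's point of view, here is a small connectivity check using the Python ScyllaDB/Cassandra driver. This is only a sketch; the node addresses are placeholders, not the lab's actual ones.

```python
# Minimal connectivity/topology check against the three-node cluster.
# The node addresses below are placeholders, not the lab's real ones.
from cassandra.cluster import Cluster  # provided by scylla-driver or cassandra-driver

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect()

local = session.execute(
    "SELECT broadcast_address, release_version FROM system.local"
).one()
peers = list(session.execute("SELECT peer FROM system.peers"))

print(f"Connected to {local.broadcast_address}, version {local.release_version}")
print(f"Other nodes visible: {len(peers)}")  # expect 2 in a 3-node cluster

cluster.shutdown()
```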

The reason we think it's important to do this test and demonstrate an all-Arm-powered ScyllaDB cluster is that we believe it provides a lower-power solution than equivalent clusters available today, that you'll see a higher CPU density per rack and even per node, with up to 256 cores available in Ampere systems, and that all of this comes together to give you a more highly performant ScyllaDB instance and cluster.

Before we go into the results, we'll give you a little bit of an introduction to our two companies. First we have Ampere, who I work for. We developed the world's first cloud native processor, the Ampere Altra, which is available with up to 80 cores at 3.0 GHz and has eight memory controllers for DDR4 up to 3200. We followed that with our Altra Max line of CPUs, which use the same Neoverse N1 core as the Ampere Altra, but with up to 128 cores per socket, again at 3.0 GHz and again with eight memory controllers. Some of the things that differentiate our processors from the competition: we are single-threaded, so one core is one thread, which gives us the ability to run all of our cores at max frequency. There is less frequency scaling up and down based on TDP and other factors, as some systems do; with Ampere servers you can run all cores at 3.0 GHz without hitting your TDP.

And we have large, low-latency private L1 and L2 caches, as opposed to legacy architectures that use a large shared L3 cache to share information among the cores.

And with our maximum core count, you can get up to 256 cores in a 1U server, using the Altra Max M128-30.

Yeah, so at ScaleFlux we like to make what we call a better SSD. We take a data center class NVMe SSD in common form factors such as U.2 or E1.S, at 4 TB to 16 TB physical capacity points, and we marry those with compute engines for acceleration; sometimes this is called computational storage. Some of those compute capabilities can be transparent compression and decompression, data filtering, or security acceleration. So you combine the performance and all the goodness of a really fast NVMe SSD with these rocket engines that can really boost performance in certain applications.

So let's see what the benchmarking results look like. I think this is the highlight of the show here.

First things first, we have to load a database, and I want to highlight some of the choices that we made in loading data. First, we chose a replication factor of three. We didn't have to specify a compaction strategy, because by default we get incremental compaction with ScyllaDB Enterprise. We used the shard-aware driver from the cassandra-stress utility provided by ScyllaDB, and we used a variable column size rather than just a fixed column size. We also did some updates after the sequential population of the drives, so there's a little bit of overfill.

The first scenario we're going to look at is 100% read with a Gaussian distribution, so it touches memory and a little bit of disk. We chose to look at 100% read because we wanted to see what kind of latencies you can get out of the cluster without the writes getting in the way. In the second scenario we look at more of a realistic, real-life kind of workload: 75% read and 25% write, again with a Gaussian distribution, so we're touching memory and a little bit of disk. And in the last scenario we did something really interesting: 50/50 read/write, but with a really small data set that just fits in memory, to see if we can find any bottlenecks in our cluster.
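For readers who want to mirror these schema choices outside of cassandra-stress, here is a minimal sketch using the Python ScyllaDB/Cassandra driver. The keyspace and table names, and the node addresses, are made up for illustration, and the compaction clause is spelled out even though, as noted above, ScyllaDB Enterprise picks incremental compaction by default.

```python
# Sketch of the benchmark's schema choices: replication factor 3 and
# incremental compaction. Names and addresses are illustrative only.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])  # placeholder node addresses
session = cluster.connect()

# Replication factor 3, as used in the benchmark. SimpleStrategy matches what
# cassandra-stress creates by default; production clusters typically use
# NetworkTopologyStrategy instead.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS stress_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Incremental compaction is the ScyllaDB Enterprise default; stating it here
# just makes the choice visible.
session.execute("""
    CREATE TABLE IF NOT EXISTS stress_ks.kv (
        pkey blob PRIMARY KEY,
        value blob
    ) WITH compaction = {'class': 'IncrementalCompactionStrategy'}
""")

cluster.shutdown()
```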

First, looking at read: the first thing to highlight here is that we achieved 1.4 million operations per second, and the second thing is that we did that at P99 latencies of less than a millisecond. So this is a very, very high-performance cluster when it comes to reads. The last thing to note is that CPU utilization is quite low, because the read operations aren't as CPU-intensive as the writes we'll see later.

In our 75/25 quote-unquote real-world test scenario, everything's the same as in the read test, but we're adding in those writes. We're still using a Gaussian distribution and we're still looking at the whole data set with that distribution, but now we see a couple of things. We're down a bit, though still over a million operations per second, at 1.1 million, and part of the reason is that now we have that write activity happening and other things to manage within the database; we saw about 100 running compactions during this benchmark on average. And of course, with that write-to-read interference and all that activity going on, latencies did go up a little bit, but they're still fairly respectable, and we could have backed off on the performance a little if we wanted to. Remember, in these tests we were really just trying to put the pedal to the metal and push out as much as we can; if we backed off on the aggressiveness of the operations per second, we would see those latencies fall proportionally.

The last slide we'll look at is the 50/50 case. Here we kept the data set in memory and really tried to torture the system, and we did hit over 80% CPU load, but we got to 1.4 million operations per second again. In this case we don't see so many compactions, because the data size we're writing is so small; basically memtable flushing is the sole write activity going on on the disk. We do see that this slightly improved the read latency, but the results are actually very similar to the 75/25 case, which I think is quite interesting.

Thanks, Keith, for going over those results. Earlier, after the cluster setup, we gave some of the reasons why we were doing this experiment, and looking back at those now, we can see that we did have low power utilization, especially relative to the performance that we got, with 1.4 and 1.1 million operations per second depending on the test. That was using under four watts per CPU core; total system power draw was between 410 and 490 watts, including storage, network, and memory.
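As a quick sanity check on those power numbers (assuming the quoted 410-490 W range is the full-system draw of a single 128-core node, as described), the per-core figure works out like this:

```python
# Back-of-envelope check: per-core power for one 128-core node, assuming the
# quoted 410-490 W range is the full-system draw of a single node.
CORES_PER_NODE = 128
for total_watts in (410, 490):
    print(f"{total_watts} W / {CORES_PER_NODE} cores = "
          f"{total_watts / CORES_PER_NODE:.2f} W per core")
# Prints roughly 3.20 and 3.83 W per core, i.e. under 4 W even at the
# high end, with storage, network, and memory included.
```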

So when you think about your existing rack footprint and how many more of the next-generation processors you might be able to fit: it's common to see other systems coming out that advertise power draws of 8 to 10 watts per core. They are different cores than our cores, so performance isn't going to be exactly the same, but we draw half or less of the power that comparable CPU cores your ScyllaDB instance might be running on today use, and we do it with a higher CPU density. That means you need fewer servers to achieve this performance, and you lower your costs and the man-hours spent troubleshooting servers with issues, because we put 128 cores in each socket and can have up to two sockets per system. So you can have 256 cores per server, and easily over 10,000 cores per rack if you use 1U servers. All of this gives you a really highly performant system, and it does it in a smaller footprint than you're going to see with comparable x86 systems.
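The rack-density claim is straightforward arithmetic; a tiny sketch, assuming 42 usable 1U slots per rack (real racks give up a few U to switches and PDUs):

```python
# Rough rack-density math behind "over 10,000 cores per rack".
CORES_PER_1U_SERVER = 256   # dual-socket Altra Max M128-30
USABLE_RACK_UNITS = 42      # assumption; adjust for networking/power gear
print(CORES_PER_1U_SERVER * USABLE_RACK_UNITS)  # 10752 cores
```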

So if you want to give running ScyllaDB on Arm a shot, you can use our Ampere Developer Access Program to get access to some hardware. You can either use bare metal servers that we host in our lab or send to you, or alternatively we can give you access to a cloud environment hosted on Azure, OCI, or one of the many other public clouds, and run your workload there. My team also has a bunch of solution architects like myself available, so we'll provide resources to help you get up and running and overcome any problems you face along the way.

Thanks, Mike. Yeah, and here at ScaleFlux we also have a POC program. You can request samples at the email address here, info@scaleflux.com; be sure to mention you saw us at the ScyllaDB Summit. You can learn more about our company at our website, scaleflux.com. And lastly, feel free to reach out to me if you have any questions or just want to talk storage. I do love talking about storage, so be warned. Thank you, and thanks for coming to our talk. If you want to stay in touch, you can see all of our information below.

