Choosing EC2 instances for NoSQL
Amazon EC2 is a virtual computer store with all sizes and types of server on display. We researched the top choices to find the best-balanced, best-performing server for NoSQL.
The Amazon data center in Boardman, Oregon. Photo: Visitor7 for Wikimedia Commons
Amazon makes many storage options available for EC2. The two that are relevant for NoSQL are EBS and instance storage.
- Elastic Block Store (EBS) offers large volumes with plenty of space. A volume can persist after the instance it is attached to is stopped. One use case is keeping snapshots on EBS. You would need an external tool to copy snapshots onto an EBS volume. This is not automatic today, but it could be automated in the future, just as we now back up SSD volumes to spinning disk (much as spinning disks were backed up to tape, back in the day).
- Instance storage survives reboots, but does not survive stopping the instance. If you change instance type, or stop an instance in order to save on your bill, the instance storage also vanishes. On the positive side, instance storage uses SSD devices, and performance is good. For your active data directory, instance storage is a must.
Not all instance storage is the same
Classes of instances (such as “i2”) include multiple instance types (such as “i2.8xlarge”). Instance storage changes dramatically between classes, but is similar across the instance types within a class. For example, every instance type in the “c3” class has two similar SSDs. The c3 instance types differ only in the size of those SSDs.
The c3 SSDs are limited in performance and don’t support the TRIM command. Larger i2 instances have more SSDs available, and these can be combined with RAID for more parallelism. Throughput does not scale linearly with the number of devices, because the controller is the bottleneck, but you still get more total throughput from more disks. Best of all, the i2 SSDs are very good, and they support TRIM.
The difference is clear in throughput and latency.
In both graphs, the x-axis is the number of concurrent requests. The i2 instances deliver more than twice the throughput of the c3 instances.
On a c3 instance, with only 10 concurrent requests, latency starts to increase. On an i2 instance, the latency does not start to increase until around 40 concurrent requests. The 95th percentile latency on the c3 instance is much higher. This results in higher latency at the application level.
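To make the link between tail latency and application latency concrete, here is a minimal Python sketch. It computes a nearest-rank percentile from a list of latency samples, then estimates how often a request that issues several reads will hit at least one slow one. The sample values, the function names, and the assumption that reads are independent are all illustrative. None of them come from the measurements in the graphs.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def chance_of_slow_read(fan_out, pct=95):
    """Probability that at least one of `fan_out` reads is slower than the
    pct-th percentile latency, assuming the reads are independent."""
    return 1 - (pct / 100) ** fan_out


# Hypothetical latency samples in milliseconds (not taken from the graphs).
latencies_ms = [1.0] * 90 + [2.0] * 5 + [50.0] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms  p95={p95_ms} ms  p99={p99_ms} ms")
print(f"chance a 10-read request hits a slow read: {chance_of_slow_read(10):.0%}")
```

Under these assumptions, a request that fans out to 10 reads has roughly a 40% chance of waiting on at least one read from the slowest 5%, which is why a high 95th percentile at the storage level is so visible to the application.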
Specialized workloads: are c3 instances ever a good idea?
Some workloads can perform acceptably on c3 instances, provided they need a high number of operations but don’t push enough data to overload the disks. The individual CPUs on c3 instances are better, so if the workload is mostly in memory, you can get more operations per second using c3. However, this is a specialized case. Only choose c3 after profiling your workload.
Just because your workload is CPU-bound on Cassandra does not mean it will be CPU-bound on Scylla. Scylla is much more efficient in using the CPU, so differences in I/O between instance types are more obvious.
Fast and slow instances within the same type
Individual EC2 instances vary greatly in how well the storage works. You can get good and bad nodes within the same instance type. This is the Amazon “noisy neighbor” problem, which is mostly a storage problem. CPUs are easier to isolate between customers, but the storage controller is harder to isolate. If your instance shares a controller with a neighbor doing a lot of I/O, the best thing to do is stop the node and try for another one.
Contention for storage means that you have to keep monitoring your instances to know when to stop one and replace it. This is a fundamental problem with the technologies used in the cloud. There is nothing you can do at launch time to tell whether you’ll get a good instance. It’s a matter of luck. (It seems that you are only guaranteed to get a good machine when you try to get a bad one for testing!) Depending on configuration, read latency can be affected immediately. For writes the situation is more complicated: Scylla acknowledges writes before data is flushed to disk, so the slowdown shows up later, as higher latency on future requests and lower peak ops/s. The read case is easier to understand.
If you know what percentage of the instances you get are bad, you know how many extras to plan to start. It’s hard to automatically classify a machine as bad. None of the disks are very predictable, and any machine can have a period when it behaves badly. Use iostat and friends to check performance. Brendan Gregg at Netflix has some good advice: “Linux Performance Analysis in 60,000 Milliseconds”.
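The “how many extras” arithmetic can be sketched in a few lines of Python. This is only a rough planning aid. It assumes each launched instance is bad independently with the same probability, which real noisy-neighbor behavior may not respect. The function names and the 95% confidence default are choices made for this example.

```python
import math


def prob_enough_good(launched, needed, bad_fraction):
    """Probability that at least `needed` of `launched` instances are good,
    treating each launch as an independent draw (binomial model)."""
    good = 1 - bad_fraction
    return sum(
        math.comb(launched, k) * good**k * bad_fraction ** (launched - k)
        for k in range(needed, launched + 1)
    )


def instances_to_launch(needed, bad_fraction, confidence=0.95):
    """Smallest number of launches that yields `needed` good instances
    with at least the requested confidence."""
    if not 0 <= bad_fraction < 1:
        raise ValueError("bad_fraction must be in [0, 1)")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    launched = needed
    while prob_enough_good(launched, needed, bad_fraction) < confidence:
        launched += 1
    return launched


# Hypothetical inputs: a 6-node cluster where you have observed that
# about 1 in 5 instances turns out to have slow storage.
print(instances_to_launch(needed=6, bad_fraction=0.2))
```

The bad fraction has to come from your own monitoring history. Since any machine can go through a bad period, treat the result as a starting point rather than a guarantee.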
What instance type is the best bargain for NoSQL?
If you do the math on SSD storage per dollar and I/O throughput per dollar, the i2 pricing tends to be the best bargain. When choosing instance types for Scylla, go big. The i2.8xlarge is a great machine because it has lots of CPUs and good storage. It comes with a high price tag, but Scylla scales linearly.
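The per-dollar math is easy to repeat with current prices and your own measurements. The sketch below shows one way to lay it out in Python. The class name, field names, and every number in it are placeholders for illustration, not real AWS prices or benchmark results. Substitute the figures for your region and workload.

```python
from dataclasses import dataclass


@dataclass
class InstanceOption:
    name: str
    ssd_gb: float          # total instance-store SSD capacity
    read_mb_per_s: float   # throughput you measured yourself
    hourly_usd: float      # current on-demand price in your region

    def ssd_gb_per_dollar(self) -> float:
        return self.ssd_gb / self.hourly_usd

    def throughput_per_dollar(self) -> float:
        return self.read_mb_per_s / self.hourly_usd


# Placeholder numbers only: NOT real AWS prices or benchmark results.
options = [
    InstanceOption("option_a", ssd_gb=6400, read_mb_per_s=2000, hourly_usd=8.0),
    InstanceOption("option_b", ssd_gb=640, read_mb_per_s=400, hourly_usd=2.0),
]

for opt in options:
    print(f"{opt.name}: {opt.ssd_gb_per_dollar():.0f} GB per $/hour, "
          f"{opt.throughput_per_dollar():.0f} MB/s per $/hour")

best_storage = max(options, key=InstanceOption.ssd_gb_per_dollar)
best_throughput = max(options, key=InstanceOption.throughput_per_dollar)
print(f"best storage value: {best_storage.name}")
print(f"best throughput value: {best_throughput.name}")
```

Keeping storage per dollar and throughput per dollar as separate metrics makes it obvious when an instance type wins on one and loses on the other.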
In general, look for:
- Enhanced networking: instances with “enhanced networking” have lower, more consistent network latency.
- Instance storage: this is a must for high-performance storage. For NoSQL in production, avoid low-end instance types without it.
- No GPU: Scylla does not take advantage of the GPU, so GPU instances are not needed.
New instance types are released regularly, so be sure to check back for follow-up info.
Coming soon: How does Scylla need to be tuned to take advantage of available storage, and how much does it adjust automatically?