ScyllaDB V is Here

ScyllaDB V brings new performance, resilience, and ecosystem advantages that resolve longstanding challenges of legacy NoSQL databases.
Harshad S. Sane is a Principal Software Engineer at Intel Corporation.
Kshitij Doshi works at Intel Corporation in the Data Center and AI group, where he focuses on performance optimization of workloads and cloud instances. He obtained his undergraduate degree in Electrical Engineering from IIT Mumbai (1982) and his M.S. and Ph.D. degrees in Electrical and Computer Engineering from Rice University (1985 and 1989, respectively). His research interests span distributed systems, memory and storage architectures, and resource management.
Our talk today is about how your ScyllaDB-based solution can deliver the best possible responsiveness, so that the database itself is not a source of latency spikes. So, who am I? I work in Intel's Data Center and AI group.
Let's consider the next slide. Very often, when activity resumes from a sleep state, the processor that just came out of sleep can take a fair number of microseconds, maybe tens of microseconds, to resume execution. And after it resumes, it can be delayed even further by many cache misses, since it has just come out of sleep. A similar effect happens with transitions from low-frequency states to higher-frequency states: these transitions do not take place immediately, they go through a series of steppings, and as a result whatever threads are scheduled on those CPUs can take longer than normal to complete.

Let me illustrate these effects with a cartoon diagram. We have a hypothetical situation consisting of three different activities stationed on three different threads. Let's click once: thread A produces an event that triggers thread B; thread B executes for a while and triggers thread C; thread C in turn triggers thread A, and so on. In good conditions this is an ideal progression, with the baton passing between the threads as soon as the events are ready. Now consider, on the next slide, what happens if thread C got scheduled on a processor that just came out of sleep. If you click once, thread C runs slower, so it produces the event for thread A after a longer time. And if in the meantime thread A got bumped from its CPU because that CPU was put to sleep again, it runs slower as a result. These kinds of delays can cascade; eventually they fade out, but depending on the application that can take some time.

Ordinarily one would want to diagnose these things by collecting, on the next slide, various kinds of diagnostics. But with either lightweight sampling or counting of events it becomes very difficult to find these problems, because they happen at runtime due to various unforeseen combinations of events. One way to deal with that is to collect traces, because traces allow you to examine the precise flow of events as they happen, and therefore help you reason about them. However, there are certain challenges with traces. First, if you are collecting traces frequently and analyzing them in real time, security issues can come into the picture because of side-channel effects; you have to have the right privileges to collect these traces. Second, if you are collecting a lot, the pressure that this puts on CPUs, caches, and the bandwidth to the devices where you save the traces can become an issue. And analyzing traces, sifting through them to find what may be going on, is a bit like working a crime scene without knowing whether a crime has even been committed.
So one solution is to collect events that can be gathered at low overhead and used as harbingers of either of these two classes of problems. Those indicating power transitions are shown on the left: various tools collect, at very low overhead, indications of what may be going on in the scheduling and power-management systems, with turbostat, powertop, the cpufreq utilities, monitoring of run-queue lengths, and so on. On the right are a number of collections that can indicate cache-coherence issues. For example, one would suspect a false-sharing issue if there is a sharp instructions-per-cycle drop with increasing concurrency, because that means more CPUs are fighting each other to get to the same cache lines at the same time. One would also verify that there isn't some memory or disk data bottleneck causing the IPC drop, that processors are highly utilized, and that threads spend very little time on run queues; if you still see a drop in processor efficiency, those are good indications. As good as these clues are, they are nevertheless circumstantial; they can only suggest that a problem may be happening. Next, Harshad will walk us through some of the details of how you might collect these events and how you can zero in on the causes at runtime.

Thank you, Doshi. Now that Doshi has set us up with the two issues, to reiterate, false sharing and power-management transitions in the CPUs, let's look at the ways we can throw observability at these and make sense of when they are happening. As was mentioned, the evidence can be circumstantial, so let's go deep and figure out when these occur, how they occur, and what the possible solutions are to mitigate them.

Let us start with capturing information that tells us about the processor's power management. Power management here comes in two types: C-states and P-states. Let's talk about C-states first, how to visualize them, and how their impact and control can be formulated. C-states are essentially idle states of the processor, where the CPU turns things off in order to save power whenever it is not active; hence the name idle states. The higher the number of the C-state, from C0 up through C6, the deeper the state. The diagram on the left-hand side shows how different components within the CPU, its execution engine, its caches, are turned off as we go deeper into the C-states. The more components that are turned off, or the larger those components are, the more power is saved. But note that when we come back from these deeper C-states to C0, the running state of the processor, we have to turn all those components back on. So the deeper you go, the higher the latency you incur on the way out, and that is the latency we were talking about. On the right is one way to get visibility, using standard tooling such as Linux perf. If you follow the command given at the top right, you can see multiple events associated with C-states; some are at the package level, some at the core level, and they are indicated by different numbers.
Now we've got all these events, but how do I actually make sense of them, and how do I interlace them with what's happening in my application? perf also provides timecharting, where you take a perf record that includes scheduler events along with the application, and visualize it as shown in the diagram on this slide. We've got two sets of information here. The top chart shows four threads going into intermittent sleeps, and on the bottom you can see many sleeps occurring: red indicates when the CPU is not doing anything, and blue indicates when the application is running. At the top, a lot of memory allocations and deallocations are happening, which means the CPU has to go out to DRAM or to the caches. So there is some amount of time, shown in blue, where actual work happens; then the CPU goes to sleep, accesses that data again, and goes back to sleep because of these latencies. Those are the times when the CPU is inactive. If you go out to I/O, you are going much further out, so your latencies will be much higher; unless you have an immense amount of parallelism to cover for them, you will keep switching between these states, which is what the red versus the white and the blue indicates. With this view we can look at multi-threaded applications and figure out whether the red periods are adding delay to the application latencies, as in Doshi's example earlier.

The other way to look at this data, rather than visually, matters when there is an immense number of threads, as in today's processors: 128 hardware threads, or twice that on two-socket systems. It may be visually challenging to take in all that data, in which case you would want to look at tables, or at information collapsed into histograms, which is what is on the left-hand side. perf sched has added timing histograms in which the same recorded data can be shrunk so that you can look at it in table form over time and see which processes are running on which cores or CPUs, what their wait times are, their scheduling delay (how much time the scheduler itself is spending, since the issue could be with the scheduler as well), and their run time. You can sort these in descending order, pick off the lowest-hanging fruit, and so forth. On the right-hand side is a similar approach, rotated 90 degrees from the earlier visual. The good part about this one is that we can track a container, say container D, over its lifetime and see where it was switched in and switched out, and by which process, very easily by reading the flow vertically. So there are different ways of visualizing these, all getting to the same point: migrations happen in certain cases, voluntarily or involuntarily, latency chunks get inserted because of them, and we want to identify them quickly.
Now let's move from C-states to P-states. P-states apply when the processor is actually active, not idle; even then it can be in various stages of activity, and those stages are the P-states. P-states go through different levels, just as we saw with the C-states. P1 is the guaranteed frequency: when a processor is shipped to you, it comes with a guaranteed base frequency, known as its base or P1 frequency. The frequencies below that, as the number increases, are for energy efficiency: when you want to save power, you go down in frequency. The blue region shown on the right is the turbo range, the opportunistic frequency the processor will allow you to reach subject to many constraints, including thermals and its TDP, the power available. Now let's talk about control. P-states can be controlled from the BIOS or from the OS; there are also hardware-controlled P-states; and turbo can likewise be controlled from either the BIOS or the OS. By default most of these are on, and typically the vendor decides whether the BIOS or the OS controls the P-states. When the OS controls them, it does so through scaling drivers, which users can then manipulate. Tools to monitor P-states include turbostat and profiling tools such as perf and VTune, but to control P-states rather than just visualize them, you want the scaling drivers. So it is preferable to have OS control, so that one can drive the scaling drivers, for example through cpufreq, to choose an energy-efficiency or power-saving mode, an extreme power-saving mode, or a performance mode where you do not care about power but want the utmost frequency for the best latencies possible.
Considering all this: collecting and analyzing scheduler traces on a running basis can itself affect performance, so here is a possible strategy. First, at the bottom, we run C-state and P-state monitoring with the tools from the previous slides, over different windows of time, with knobs that define thresholds. Even if we miss one or two transitions, we can collect these, along with the run queues, over time and understand when the CPU is going to sleep and how active it is. We then correlate that with what is coming from the application. Say the application has a KPI in terms of its response time; that is what we monitor at the application level. This response time can be looked at over different intervals: over short ranges, and over an exponentially weighted moving average. The reason to look at both is that we want to find the anomalies, the long-tail latencies, the spikes, as opposed to an average high latency, which is much easier to detect. With these two views, long-range and short-range, you can apply a simple mathematical test to detect an upward heave. When such a spike occurs, we correlate it with the thresholds from the C- and P-state monitoring and snapshot the last N seconds of run-queue latencies, processor idle time, and the timecharts we showed first. In this automated way one can detect the latency spikes, the p99s, that tend to cause mischief in applications. This is just an illustration of a method; we can definitely make further amendments to it.

Now let's move on to the second problem, false sharing. Let's look at the cases when false sharing occurs, and then see how we can apply a strategy: first observe it, then mitigate it. When is false sharing likely? False sharing occurs because of coherency between caches, when the application creates variables that share cache lines; the coherency traffic causes a rise in latency. But when does it affect the application? When the application runs at much higher concurrency. As more and more threads are thrown at the same work, it will see higher latency, poor scaling characteristics, higher response times, and really low IPC; and even if cache-miss rates look low, because it is a coherency issue you will still see high latency, which is a good indicator. Along with that, it will be insensitive to run-queue lengths: even with a lot of work piled up, the coherency latency dominates over scheduling latency and oversubscription. There are hardware counters that let you look at the coherence misses happening at the L1 and L2 level; these are snoop events, and we will look at some of them through the hardware counters. Things can get worse still: imagine cores on different sockets having cache-coherency issues rather than cores on the same socket. Now you have to go from one processor across a QPI link to the other processor and back through the caches, so the latency effect multiplies significantly.
Now let's go past step one and figure out how to drill down and collect evidence. One can't have a tool that simply says "here is false sharing"; it is a complicated process, but we can get several indications of where it might be occurring. The processor's performance monitoring unit has an event called a HITM. A HITM essentially means that a read or write access from one thread happened to hit in the cache of another thread, and the line it hit was in the Modified state. Modified means that the other thread had last changed the value in that cache line, so for the sake of consistency the access gets flagged as a HITM. You can collect these events with perf: perf will program the PMU and collect them to tell you when they are occurring. Conditionally upon these showing up, we can collect with perf c2c, where c2c stands for cache-to-cache sharing. perf is available on any Linux system, and you can start collecting profiles on your application today. Let's look at what a perf c2c record followed by a perf c2c report looks like. The figure on the right-hand side shows what false sharing is: thread 0, in red, accesses the red variable on a cache line, while thread 1 accesses a second variable on the same line, say eight bytes apart. Because they are on the same cache line, every time changes are made, the line ping-pongs between the caches, for consistency's sake, as the threads access their variables. When perf collects information on such a case, on the left-hand side you can see the virtual addresses associated with those cache lines, the number of accesses, and the total number of HITM events; you see 27 percent at the top, and then for each record whether it was a local HITM or a remote HITM, remote indicating it went across sockets. The report sorts them in descending order, so you can quickly pick the top cache lines, which would be your main sources of sharing. As you go down the data, you can see exactly which cache line was involved, the local and remote HITM percentages for it, and, more importantly, the cycles spent, because while there is generally a lot of false sharing happening in many applications today, not all of it actually ends up causing a problem, so you have to weigh each line by the latency it causes. If you move further to the right (unfortunately the slide did not let me go beyond these records), you can see the source lines behind the cache lines, so you can go back to your source code; then we'll talk about the mitigations to avoid these issues. The line in yellow relates to a cache line on which 50 percent of the time was being spent, which was quite a bit for this application.

So let's get into the tools and the solution space. First, power management, since we started with C-states and P-states and collected all that information. The simplest thing is to decide, based on the needs of the application, whether we are more latency-sensitive or more power-sensitive. Multiple settings can be applied: we can be in a power-save mode or in energy-efficient modes, and in addition, with the help of the scaling drivers, we can fix frequencies, change turbo, or cap frequency at a certain level. So, based on the requirement, we can use the power and performance settings through the drivers to alleviate the latency effects we are seeing in the applications.

Other things we can do, beyond C-states and P-states, are scheduler tunings. Besides the hardware, the scheduler, the Linux CFS, say, or its Windows counterpart, tries to be very fair. But it is quite possible that in your application there are multiple processes that are not really critical and certain processes that are, and fair schedulers typically don't do a good job in that case. So you can change scheduler settings for quicker preemption and change time slicing; sysctl exposes a number of knobs around migrations and wakeups; and one can tune the scheduler in a way that is more appropriate and prioritizes your application better.
For false sharing, which provokes really high latency when coherence traffic occurs, the first and biggest thing we can do, after perf c2c has told us which cache lines and which source lines are involved, is fix the data-structure layout. Say we have two data structures declared side by side, one after the other, as any programmer typically would; they can end up sharing cache lines, because on a typical x86 a cache line is 64 bytes long, and if your data structures are each just a couple of ints, they will lie on the same line. Many programming languages, Java and others, allow you to add padding so that you can separate structures on the intended architecture; by padding to cache-line boundaries, the data structures end up on different cache lines, and you completely negate the effects of false sharing. Additionally, you can rework the affected data structure, or split it into several substructures, to mitigate the issue. There are other, computational strategies for the really difficult places where you can't change the data structures, or for true sharing, where threads genuinely must access data on the same cache line: rate-limiting the writers to the cache line, or the frequency of reading; limiting the number of threads; and co-locating data and threads on the same socket, since local latencies are much better than remote ones. A more complex strategy is to make the code bimodal: we use the perf observability to raise signals telling us that coherence traffic is going out of whack, and when those triggers fire, we switch to rate limiting or to better co-location. So these are some of the strategies we can apply in both cases.

In summary: latency instrumentation needs to be closer to the application, and it needs to be real-time, because if we don't catch the spikes, the p99s, we miss KPIs that are very important to the application; it is not just about average latency. Tracing, as we have shown, is a very good method, but it needs to be combined with something closer to the application: something that uses sampling, so that overheads are reduced, and that can use triggers from the application, so that we know when to trace and when not to trace, as opposed to tracing all the time and then looking for a needle in a haystack. We have outlined two issues: false sharing, a very difficult issue to detect in applications, and power-management transitions, which behind the scenes cause high-latency effects. We have also shown that even though they don't arise frequently, when they do, their effects can be severe, and these two causes create a lot of havoc when they occur. Their detection, and the solutions shown in this presentation, can help you find better ways of observing and improving performance. So thank you; I hope this talk gave you deeper insight into the places where your application can be tuned for better performance.
If you wish to contact us, our emails are listed on the slide, which should be available later on. Thank you.