Automating the Hunt for Non-Obvious Sources of Latency Spreads

Harshad S. Sane, Kshitij Doshi (35 minutes)

Modern multicore cloud servers are complex beasts. False sharing and power management can trigger wide latency spreads, but they are neither directly observable nor easily traced to causes. This talk describes how to diagnose CPU and low-level software problems quickly, and outlines several remedies.

Video Transcript

Our talk today is about how your ScyllaDB-based solution can deliver the best possible responsiveness, where the database itself is not a source of latency spikes. So, who am I? I work in Intel’s Data Center and AI group.

My responsibility is to maximize the performance of software. Often I deal with storage, memory, and distributed systems issues. My colleague Harshad is a deep expert in architecture, software, and the performance challenges that crop up, especially in complex software stacks.

Here is our agenda. Latency spreads are troublesome, so in the first two parts we will discuss why they are hard to diagnose and describe two sources that lie hidden from most developers. In the remaining sections we describe how to know whether a given scenario has either of these two causes, and then what to do to mitigate them.

Spikes in latency often arise from unforeseen combinations of conditions. Good software hygiene and software quality assurance processes may reduce their likelihood, but often these are latent vulnerabilities that break into existence under the right conditions. So what can we do to swerve around them by spotting them in time?

Why this topic at ScyllaDB Summit? ScyllaDB is purpose-built for high throughput and low latency. It does this by minimizing waiting, whether for I/O or for access to data, by mapping streamlined code to data, and by using a copy-free asynchronous framework. So it already uses the key features in each new hardware generation, maximizes the use of CPUs, and achieves high cache hit rates.

However, what the database layer delivers through its excellent design can be lost in other solution layers, such as in a three-tier architecture. Further, in a web of microservices, a latency issue can affect one component, and from there delays can cascade to others. In complex collections you have emergent behaviors, so production systems can still show effects that are not flagged in canary testing.

On the next slide we talk about how production systems can experience latency spikes and what makes them difficult to investigate. If we have an extremely well-tuned application with small ups and downs in response time, and it suddenly shows a few spikes as shown on the right, generally something has suddenly thrown things out of balance, and in doing so has distorted the normal streamlined execution of high-frequency fast paths.

Next, let’s take a cartoon illustration. Here we have a streamlined flow of execution where activities execute without wasting time, as soon as their precursor is complete. These hot paths execute repeatedly or iteratively in many applications, and each repetition takes a fairly predictable amount of time from the leftmost layer to the rightmost layer, shown here as the orange node. Now consider that this node in the middle is forced to wait a bit, say because one of the CPUs became temporarily unavailable for whatever reason, so it had to wait until a CPU was freed up. That hiccup cascades out in time and delays the final result, which is our orange node. If this was a hot loop with values or events flowing from one iteration to the next, then this disturbance travels across the loop, and not just across one or two iterations but across many. Even after whatever started the problem has gone away, it takes time for the waves of delay to damp out.

We want to describe two such issues that crop up in the field. The first issue is called false sharing, and it’s shown in a hypothetical setting here. You have two modules in an address space. The first module, on the left, has a thread that infrequently, 10 times a second, sets a random seed value, and multiple different threads consume that value. The second module in the same application has a producer-consumer flow, with the producer producing events and putting them in a queue and the consumer dequeuing those events.

Ordinarily this should not be a problem, because these are completely independent variables. The threads on the left are only reading, and the two activities on the right are arranged by a scheduler so that the producer and consumer do not get in each other’s way very frequently on the queue pointers. However, consider a situation where the queue pointers and the variable S happen to share a cache line, and once in a while you have a spurt of activity where either the producer produces a lot of items or the consumer consumes a bunch of items. In that case there can be a significant amount of writing to the queue pointers, which can delay the readers on the left-hand side by a significant amount, from a few cycles at a time to many tens of cycles. This issue should not ordinarily arise, but it is very difficult to catch in canary testing.

Let’s go to the next issue, on the next slide. Some background: modern CPUs try to save power whenever the amount of activity reduces. They do so primarily either by reducing frequencies or by going into a kind of suspended animation, where they are available but are relaxing and not consuming power. Operating system and hardware algorithms work to bring them out of suspension whenever necessary, or to make them step up the frequency ladder as necessary. These algorithms are fairly sophisticated, so most of the time one should not see a problem.

However, consider the next slide. Very often, when activity resumes from a sleep state, the processor that just came out of sleep can take a fair number of microseconds, maybe tens of microseconds, to resume execution. After it resumes, it can get delayed even further because of many cache misses, since it has just come out of sleep. A similar effect happens with transitions from low-frequency states to higher-frequency states: these transitions do not take place immediately. They go through a series of steps, and as a result whatever threads are scheduled on those CPUs can take longer than normal to complete.

Let me illustrate these effects with a cartoon diagram. Here we have a hypothetical situation consisting of three different activities stationed on three different threads. Thread A produces an event that triggers thread B; thread B executes for a while and triggers thread C; thread C in turn triggers thread A, and so on. In good conditions this is an ideal progression, with the baton passing between the threads as soon as the events are ready.

Now consider, on the next slide, what happens if, for example, thread C got scheduled on a processor that just came out of sleep. In that case thread C will run slower, so it will produce the event for thread A after a longer time. In the meantime, if thread A got bumped from its CPU because that CPU was put to sleep again, it can run slower as a result. These kinds of delays can cascade. Eventually they fade out, but depending on the application that can take some time.

Ordinarily one would want to diagnose these things by collecting various kinds of diagnostics. But through either lightweight sampling or counting of events, it becomes very difficult to find these events, because they happen at run time due to various unforeseen combinations of events. One way to deal with that is to collect traces, because traces allow you to examine the precise flow of events as they happen and therefore help you reason.

However, there are certain challenges with traces. First, if you’re collecting traces frequently and analyzing them in real time, security issues can come into the picture because of side-channel effects; you have to have the right privilege to collect these traces. Second, if you’re collecting a lot, the pressure it puts on CPUs, caches, and the bandwidth to the various devices where you save these traces can become an issue. And analyzing traces or sifting through them to find what may be going on is a little like looking for a problem without really knowing whether the problem even exists. It’s like trying to solve a crime without knowing that a crime has even been committed.

So one solution to this is to collect events that can be collected at low overhead and that can be used essentially as harbingers of either of these two sets of problems. Those indicating power transitions are shown on the left. Various tools collect, at very low overhead, indications of what may be going on in the scheduling system and in power management: turbostat, powertop, the cpufreq utility, monitoring of run queue lengths, and so on.

On the right are a number of collections that can indicate whether there are any cache coherence issues. For example, one would suspect a false sharing issue if there is a sharp drop in instructions per cycle with increasing concurrency, because that means more CPUs are fighting with each other to get to the same cache lines at the same time. One would also expect that there isn’t any memory or disk bottleneck causing this drop in instructions per cycle. If processors are highly utilized, threads are spending very little time on run queues, and yet you see a drop in processor efficiency, those would be good indications.

As good as these clues are, they are nevertheless circumstantial; they can only suggest that a problem may be happening. Next, Harshad will walk us through some of the details of how you might collect these events and how you can zero in on the causes at run time.

Thank you. Now that Doshi has set us up with the two issues, to reiterate, false sharing and power management in the CPUs, let’s look at the ways we can throw observability at them and make sense of when they are happening. As was mentioned before, the evidence can be circumstantial, so let’s go deeper and figure out when these occur, how they occur, and what the possible solutions are to mitigate them.

Let us start with capturing information from the processor about its power management. Power management was seen in two types: C-states and P-states. Let’s talk about C-states first: how to visualize them, and how their impact and control can be formulated.

C-states are essentially idle states in the processor, where the CPU turns things off to save power whenever it is not active; hence the name idle states. They are numbered from C0 and C1 up to C6, and the higher the number, the deeper the state. The diagram on the left shows the different components within the CPU, its execution engine and the caches, and how they are turned off as we go deeper into the C-states. The more components are turned off, or the larger the components being turned off, the more power is saved. But note that when we come back from these higher C-states into C0, which is the running state of the processor, we have to turn all these components back on. So the deeper you go, the higher the latency you incur coming out, and that’s the latency we were talking about.

On the right is a way to observe this. We have mostly used standard tooling such as Linux perf, through which this visibility can be obtained. If you follow the command given on the top right, you can see multiple events associated with C-states. Some of them are at the package level and some at the core level, and they are indicated by different numbers.
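As a complement to the perf events mentioned above (and not something shown in the talk), Linux also exposes each CPU's idle states through sysfs. The sketch below assumes the standard cpuidle layout, where a directory such as `/sys/devices/system/cpu/cpu0/cpuidle` contains `stateN` subdirectories with `name`, `latency` (exit latency in microseconds), `usage` (entry count), and `time` (residency in microseconds) files. The function names are hypothetical.

```python
from pathlib import Path


def read_idle_states(cpuidle_dir):
    """Return one dict per idle state found under a cpuidle directory.

    Assumes the Linux cpuidle sysfs layout: stateN/{name,latency,usage,time}.
    """
    state_dirs = sorted(
        Path(cpuidle_dir).glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    states = []
    for state_dir in state_dirs:
        def read(field, state_dir=state_dir):
            return (state_dir / field).read_text().strip()

        states.append({
            "name": read("name"),
            "exit_latency_us": int(read("latency")),  # cost of waking up
            "entries": int(read("usage")),            # times the state was entered
            "residency_us": int(read("time")),        # total time spent in it
        })
    return states


def deepest_first(states):
    """Order states by exit latency, most expensive wake-up first."""
    return sorted(states, key=lambda s: s["exit_latency_us"], reverse=True)
```

Calling `deepest_first(read_idle_states("/sys/devices/system/cpu/cpu0/cpuidle"))` on a Linux host and sampling it twice gives a rough view of how often a core entered its deeper states in between, which is the kind of low-overhead signal the talk describes.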

Now we’ve got all these events, but how do I actually make sense of them, and how do I interlace them with what’s happening in my application? perf also provides something known as timecharting, where you can collect a perf record that has scheduler events along with the application, and it can then be visualized as shown in the diagram on this slide.

We’ve got two sets of information here. The top chart has four threads that are going into intermittent sleeps, and on the bottom you can also see a lot of sleeps occurring, with red indicating when the CPU is not doing anything and blue indicating when the application is running. On the top, a lot of memory allocations and deallocations are happening, which means the CPU has to go to DRAM or access caches. So there is some amount of time, indicated by blue, where actual work happens, and then the CPU goes to sleep; it accesses data again and has to go to sleep again because of these latencies. These are the times when the CPU is actually inactive. If you go to I/O, you are going much further out, so your latencies are going to be much higher. Unless you have an immense amount of parallelism to cover for these, you are going to have to switch between these states, and that is what is indicated by the red versus the white and the blue. With the help of this, we are able to look at multi-threaded applications and figure out whether the reds are causing delays in the application latencies, as indicated by Doshi’s example earlier.

The other way to look at this data is in a non-visual manner. With the immense number of threads in processors today, 128 or twice that on two-socket systems, it may be visually challenging to observe all that data. In those cases you would want to look at tables, or information that is collapsed into things such as histograms, which is what’s seen on the left-hand side. perf sched has added timing histograms, in which the same recorded data can be condensed so that you can look at it in a table format over time. You can figure out which processes are running, on which cores or CPUs they are running, what their wait times are, the scheduling delay, meaning how much time the scheduler itself is spending (the issue could be with the scheduler as well), and the run time. You can sort them in descending order, figure out which is the lowest-hanging fruit, and so forth.

On the right-hand side is a similar approach, but rotated by 90 degrees from the earlier visual. The good part about this one is that we can track a container, let’s say container D, over time, and find over its lifetime where it has been switched in and switched out, and by which process, very easily by following the flow vertically. So there are different ways of visualizing these, but they all get to the same point: migrations happen in certain cases, voluntarily or involuntarily, you are going to see chunks of latency inserted because of them, and we want to identify them quickly.

Now let’s move from C-states to P-states, where the processor is actually active and not idle. Even when the processor is not idle, it can be in various stages of being active, and these are known as P-states. P-states go through different levels, just as we saw with the C-states. P1 is the guaranteed frequency: when a processor is shipped to you, it comes with a guaranteed frequency, known as its base or P1 frequency. As the number increases, the other frequencies favor energy efficiency; you want to save power, so you go down in frequency. The blue region shown on the right covers the turbo frequencies. That is the opportunistic frequency that a processor will allow you to reach, based on many constraints including thermals and its TDP, or the power available.

Now let’s talk about control. P-states can be controlled either from the BIOS or from the OS; there are also hardware-controlled P-states. Turbo can likewise be controlled either from the BIOS or from the OS. By default most of these are on. Typically vendors decide whether the BIOS or the OS controls the P-states, and when the OS controls them, it does so through scaling drivers, which users are able to manipulate. Tools used to monitor P-states include turbostat and profiling tools such as perf and VTune. But to control P-states, and not just visualize them, you would want to use scaling drivers. So it is preferable to have OS control, so that one can control the scaling drivers, for example using cpufreq, to decide whether you want to be in an energy-efficient state, a power-saving state, an extreme power-saving state, or a performance mode where you do not care about power but want the utmost frequency to get the best latencies possible.
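A minimal sketch of checking the OS-controlled scaling settings described above, assuming the Linux cpufreq sysfs layout in which each `cpuN/cpufreq` directory holds `scaling_governor`, `scaling_cur_freq`, `scaling_min_freq`, and `scaling_max_freq` (frequencies in kHz). The helper names are hypothetical, and which governors exist depends on the scaling driver in use.

```python
from pathlib import Path


def read_cpufreq(cpufreq_dir):
    """Read the governor and frequency limits from one cpuN/cpufreq directory."""
    d = Path(cpufreq_dir)

    def read(field):
        return (d / field).read_text().strip()

    return {
        "governor": read("scaling_governor"),
        "cur_khz": int(read("scaling_cur_freq")),
        "min_khz": int(read("scaling_min_freq")),
        "max_khz": int(read("scaling_max_freq")),
    }


def cpus_not_in_governor(cpu_root, wanted="performance"):
    """List CPU directory names whose governor differs from `wanted`."""
    cpufreq_dirs = sorted(
        Path(cpu_root).glob("cpu[0-9]*/cpufreq"),
        key=lambda p: int(p.parent.name[len("cpu"):]),
    )
    return [
        d.parent.name
        for d in cpufreq_dirs
        if read_cpufreq(d)["governor"] != wanted
    ]
```

On a Linux host, `cpus_not_in_governor("/sys/devices/system/cpu")` would list the cores that are not in performance mode, which is a quick check before attributing a latency spike to frequency ramp-up.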

Considering all this, collecting and analyzing scheduler traces on a running basis can affect performance, so here is a possible strategy. First, at the bottom, we can have C-state monitoring and P-state monitoring with the tools I mentioned in the previous slides. This can be done over different windows of time, with knobs that define its thresholds. Even if we miss one or two transitions, we can collect them along with the run queues over time and understand when the CPU is going to sleep, how much it is active, and so forth.

We then detect the overlap with what’s coming from the application. Let’s say the application has a KPI in terms of its response time. That is what we are looking for, so that’s what we monitor at the application level. This response time can be looked at over different intervals: over short ranges, and over an exponentially weighted moving average on a longer window. The reason to look at it in these two ways is that we want to find the anomalies, the long-tail latencies or spikes, as opposed to a high average latency, which is much easier to detect. With the help of the long-range and short-range views, you can apply mathematical formulas to detect an upward heave when it happens. When such a spike occurs, we can correlate it with the thresholds we monitored with C-state and P-state monitoring, and snapshot the last N seconds of the run queue latencies, the processor idle times, and the perf timecharts that we showed. In an automated way, one would then be able to detect the latency spikes, or the p99s, that tend to cause mischief in applications. This is just an illustration of an algorithm or process; we can certainly make further amendments to it.

Now let’s move on to the second problem, which was false sharing. Let’s look at the cases when false sharing occurs, and then see how we can apply a strategy to first observe it and then mitigate it.

When is false sharing likely to occur? False sharing arises from coherency between caches when an application creates variables that share cache lines; the coherency traffic causes a large increase in latency. But when does it affect the application? It affects the application when it has much higher concurrency. As more and more threads are thrown at the same work, it is going to see higher latency. It is going to show poor scaling characteristics, higher response times, and really low IPC. Even if you see fewer cache misses, because it’s a coherency issue you will still see high latency, so that’s a good indicator. Along with that, it is going to be insensitive to run queue lengths: even if you have a lot of work piled up, the coherency latency is going to dominate over scheduling latency or oversubscription.

There are a lot of hardware counters that allow you to look at coherence misses happening at the L1 and L2 levels. These are snoop events, and we look at some of them through the hardware counters. Things can get worse: compare cores on the same socket having cache coherency issues with cores on different sockets. Now you have to go from one processor through a QPI link to the other processor and back through the caches, so the latency effect multiplies significantly.
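Returning to the spike-detection strategy for response times described earlier in this part, here is a minimal sketch of one way to compare a short-range and a long-range view. It is an assumption-laden illustration, not the speakers' implementation: both views are exponentially weighted moving averages, and the class name, smoothing factors, ratio threshold, and warm-up length are all hypothetical.

```python
from collections import deque


class SpikeDetector:
    """Flag an upward heave when the short-range average of response times
    rises well above the long-range average."""

    def __init__(self, short_alpha=0.5, long_alpha=0.05, ratio=2.0,
                 warmup=10, window=100):
        self.short_alpha = short_alpha   # reacts quickly to recent samples
        self.long_alpha = long_alpha     # tracks the slow-moving baseline
        self.ratio = ratio               # how far above baseline counts as a spike
        self.warmup = warmup             # samples to observe before flagging
        self.recent = deque(maxlen=window)  # last samples, for a snapshot
        self.short = 0.0
        self.long = 0.0
        self.count = 0

    def observe(self, latency_ms):
        """Feed one response-time sample; return True if it looks like a spike."""
        self.recent.append(latency_ms)
        if self.count == 0:
            self.short = self.long = latency_ms
        else:
            self.short += self.short_alpha * (latency_ms - self.short)
            self.long += self.long_alpha * (latency_ms - self.long)
        self.count += 1
        return self.count > self.warmup and self.short > self.ratio * self.long
```

When `observe` returns True, a monitoring loop could save `list(detector.recent)` together with the most recent C-state, P-state, and run queue readings, so that tracing cost is paid only around the moments that matter.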

Now let’s go past step one and figure out how we can drill down to collect evidence. One can’t have a tool that just says “here’s false sharing”, because it is such a complicated process, but we can get several indications of where it might be occurring. We do have an event in the processor’s performance monitoring unit called the HITM event. A HITM essentially means that a read or write access from one thread happened to hit in the cache of another thread, and when it hit, the cache line was in a modified state. A modified state means that the other thread had last changed the value in that cache line. So, for consistency, this event is flagged as a HITM. You can collect all these events with the help of perf: perf will program the PMU and collect these events to let you know when they are occurring.

Conditional on this step occurring, we can collect perf c2c. C2C stands for cache-to-cache sharing. You can download perf, or have it on any Linux system, and start collecting profiles to produce this information on your application today. Let’s look at what a perf c2c record followed by a perf c2c report looks like.

The figure on the right-hand side shows what false sharing is. Thread 0, in red, is accessing the red variable, and thread 1 is accessing a second variable on the same cache line, let’s say eight bytes apart. Because they are on the same cache line, every time changes are made, the line keeps ping-ponging between these caches for consistency as these threads access these variables.

When perf collects information on such a use case, on the left-hand side you can see the virtual addresses associated with those cache lines, the number of accesses happening, and the total number of HITM events. You see 27 percent at the top, and then you see the numbers associated with these records: whether it was a local HITM or a remote HITM, the latter indicating it was going across sockets. It sorts them in descending order, so you can quickly pick the top cache lines, which would be your likely sources of sharing. As you go down this data, you can see exactly which cache line it was, what the remote and local HITM percentages are, and, more importantly, the cycles spent doing this. Generally there is a lot of false sharing happening in many applications today, but not all of it ends up causing an issue, so you have to look at the weights in terms of the latency it causes as well. If you move towards the right, and unfortunately I could not show beyond these records here, you can see the source location of the cache lines, so you can go back to your source code. We’ll then talk about the mitigations we can make to avoid these issues. The line in yellow relates to the cache line, and it indicates that 50 percent of the time was being spent just accessing these cache lines, which was quite a bit for this application.

So let’s get into the tools and the solution space for this. First let’s look at power management, since we started with C-states and P-states and collected all this information. The simplest thing to start with is the need of the application: whether we are more latency-sensitive or more power-sensitive. There are multiple thresholds that can be applied; we can be in a power-save mode, or we can be in energy-efficient modes. In addition, with the help of the scaling drivers you can fix frequencies, change turbo, or cap the frequency at a certain level. So based on the requirement, we can use the power and performance settings through the drivers to alleviate the latency effects we are seeing in the applications.

Other things we can do, beyond C-states and P-states, are scheduler tunings. Besides the hardware, the scheduler, CFS on Linux, say, or its counterpart on Windows, tries to be very fair. But it’s quite possible that in your application there are multiple processes that are not really critical and certain processes that are really critical, and fair schedulers typically don’t do a good job in that case. So you can change scheduler settings for quicker preemption or change time slicing, and sysctl allows you to change a number of knobs that affect migrations and wake-ups. One can tune the scheduler so that it is more appropriate for, and better prioritizes, your applications.
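To make the cache-line picture from the perf c2c discussion concrete, here is a small sketch of the layout arithmetic: two variables eight bytes apart fall on the same cache line, and inserting padding moves the second one onto its own line. It assumes a 64-byte line (typical on x86) and a structure that starts on a line boundary. The structure and field names are hypothetical, loosely echoing the seed and queue-pointer example from the first part of the talk, and the code only computes offsets; it does not measure contention.

```python
import ctypes

CACHE_LINE = 64  # bytes; assumed line size, typical on x86


class Unpadded(ctypes.Structure):
    # Two independently used variables declared back to back.
    _fields_ = [
        ("seed", ctypes.c_uint64),
        ("queue_head", ctypes.c_uint64),
    ]


class Padded(ctypes.Structure):
    # Padding fills the rest of the first line so queue_head starts a new one.
    _fields_ = [
        ("seed", ctypes.c_uint64),
        ("_pad", ctypes.c_uint8 * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
        ("queue_head", ctypes.c_uint64),
    ]


def cache_line_of(struct_type, field):
    """Index of the cache line holding `field`, relative to the struct start
    (assumes the struct itself begins on a cache-line boundary)."""
    return getattr(struct_type, field).offset // CACHE_LINE


def share_line(struct_type, a, b):
    """True when fields `a` and `b` land on the same cache line."""
    return cache_line_of(struct_type, a) == cache_line_of(struct_type, b)
```

With these definitions, `share_line(Unpadded, "seed", "queue_head")` is True and `share_line(Padded, "seed", "queue_head")` is False. The cost is memory: the padded structure grows from 16 to 72 bytes, which is why padding is usually reserved for the lines that perf c2c shows to be hot.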

For false sharing, which provokes really high latencies when coherence traffic occurs, the first and bigger thing we can do, after we have figured out through perf c2c that we have these cache lines in this source and want to mitigate the effects of false sharing, is data structure layout. Let’s say we have two data structures that are put side by side, declared one after the other as any programmer typically would. They end up sharing cache lines because, on a typical x86, a cache line is 64 bytes long, and if your data structures just have a couple of ints, they are going to lie on the same cache line. Many programming languages, such as Java, allow you to add padding so that you can separate them on the intended architecture. With padding, the data structures end up on different cache lines, and you then completely negate the effects of false sharing. Additionally, you can change the affected data structure or split it into several substructures to mitigate false sharing.

There are other strategies for really difficult places where you can’t change the data structures, or for true sharing, where you have to be on the same cache line and still access the data. We can rate-limit the writers to the cache line or the frequency of reading, limit the number of threads, or co-locate data on the same socket, since local latencies are much better than remote latencies. A more complex strategy is to make code bimodal. By that we mean that we use perf observability to trigger signals that tell us that coherence is going out of whack and we want to bring it back. We can have triggers, through observability, that say now we want to do rate limiting, or now we want to do better co-location. These are some of the strategies we can apply in both cases.

In summary, latency instrumentation needs to be closer to the application, and it needs to be real-time, because otherwise we don’t catch those spikes, those p99s. They are very important KPIs for the application; it’s not just average latency. Tracing, as we have shown, is a very good method, but it needs to be combined with something closer to the application, something that uses sampling so that overheads are reduced, and that can use triggers from the application so that we know when to trace and when not to, as opposed to tracing all the time and then looking for a needle in a haystack.

We have outlined two issues: false sharing, a very difficult issue to detect in applications, and power management transitions, which cause high latency effects behind the scenes. We have also shown that even though they don’t arise frequently, when they do occur they can have really large effects and cause a lot of havoc. So their detection and the solutions shown in this presentation can hopefully help you find better ways of observability and of finding performance. Thank you. I hope this talk gave you deeper insight into places where your application can be tuned for better performance.

If you wish to contact us, we’ve got our emails listed in the slides, which should be available later on. Thank you.

[Applause]