Maximum Uptime Cluster Orchestration with Ansible

Ryan Ross · 21 minutes · February 14, 2023

Ansible is a flexible orchestration tool for ScyllaDB clusters. Learn tried and tested patterns with Ansible to maximize the uptime of your ScyllaDB clusters when making changes.

This talk will go over tangible code snippets to help operators and developers safely make changes to their ScyllaDB clusters. These are tips learned the hard way in production so you don't have to.

Come along and learn how to orchestrate your ScyllaDB clusters with confidence and not have to take a maintenance window in the middle of the night.



Video Transcript

Hello, my name is Ryan Ross, and today I will be covering maximum uptime cluster orchestration with Ansible. I work as a site reliability engineer at dbt Labs, but I'm not here to talk about dbt or dbt Labs; I'm here to share my experience as an engineer working with Ansible and ScyllaDB at a previous company that was a marketing DSP.

My talk objective is to provide actionable tips and patterns to build confidence in orchestrating ScyllaDB in production with Ansible. Here's our agenda, just to keep me a little organized: we'll talk about the use case for ScyllaDB at the DSP, I'd like to define cluster uptime in this context, and then what you came here for: the tips for using Ansible against ScyllaDB, and the demo. Let's get after it.

ScyllaDB use cases. As I mentioned, this was at a demand-side platform, a marketing DSP. Wikipedia defines this as a system that allows buyers of digital advertising inventory to manage multiple ad exchange and data exchange accounts through one interface. Sound a little confusing? It is. At one end we have internet users just trying to consume content online, and at the other we have agencies and brands trying to market, advertise, and sell them goods and services. So my definition is: the DSP connected advertisers and agencies with audiences, enabling real-time bidding and display of online advertising.

With those use cases, we needed a back end that could scale to billions of requests per day; we're talking about the open web here. We had roughly equal read and write access patterns. We needed to be up all the time: people are on the internet all the time, so we need to be up all the time. And we needed millisecond response times. Remember that crazy long graph? Everything on it, a whole auction for advertising space on the media buy, happens in less time than it takes for a user's browser to load a web page. So we needed to be monstrously fast, as they say, with huge data sets: a lot of data, but also fast access to it, to make decisions. Of course ScyllaDB was a perfect fit, and we love Scylla.

So let's define cluster uptime in the context of an operator orchestrating a cluster, or multiple clusters, with Ansible. For the rest of this talk I'd like to define it as providing a cluster that is available for dependent applications. ScyllaDB by itself doesn't make our business money; it enables the applications that solve customers' problems and make the business money. So we need ScyllaDB to be up so that our applications are up and can do their job and solve real problems for our real users.

It looks like this: we have a cluster, but then maybe this node has a problem, and at the same time that node has a problem. We fix those, then another one pops up with a problem, then maybe we lose that node entirely, and then we have to grow the cluster just to scale to demand. Through that entire process of nodes coming up and down, at least some subset of the cluster was up and available, and if we structure our playbooks and the different parts of Ansible correctly, that is what lets us provide maximum uptime.

So here's what you're here for: the tips for using Ansible to orchestrate ScyllaDB for maximum cluster uptime. First: Ansible has the concept, if you're not familiar, of an inventory, which is just a way to structure the target machines we want to communicate with. We can use DNS names, like scylla7.company.com, or we can use IP addresses; anything working on layer 3 will work for communication, and we at least get service discovery there as well. Structuring our inventories in a certain way really helped us at the DSP increase our uptime.

What I ended up doing was one inventory per cluster. We had multiple clusters with multiple data centers in them, and having multiple Ansible inventories, basically YAML lists of servers, helped us differentiate which node belonged to which cluster, so we could apply different configurations to different clusters. Within those clusters we had inventory groups, which you can see right here as the little indentation in the YAML, and with those we can carve up our clusters even more. We might have a us-east and a us-west data center, and we can apply different configurations to different data centers within a cluster. Once we define the declarative execution we want Ansible to perform, we get polymorphic behavior based on the inventory and the inventory group. And here's what multiple inventories could look like: we have our inventories directory, and within it we can define any number of inventories that we want.
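A static inventory along those lines might look like the following sketch; the hostnames, IPs, and group names are illustrative, not from the talk's slides:

```yaml
# inventories/cluster-a/hosts.yml -- one inventory per cluster
# (hostnames, addresses, and group names are placeholders)
all:
  children:
    scylla:
      children:
        us_east:
          hosts:
            scylla7.company.com:
            scylla8.company.com:
        us_west:
          hosts:
            10.0.2.11:
            10.0.2.12:
```

A play can then target `hosts: scylla` for the whole cluster or `hosts: us_east` for just one data center, which is the polymorphic behavior described above.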

Now you might be thinking: I don't want to manually manage these files, or our environment is so dynamic, with nodes coming up and down all the time, that a human can't keep up; as soon as we commit that version to git it's going to be out of date. I think an underused superpower of your cloud provider is using it as an asset management database. Your cloud provider has all the information that you give it, and more, about your instances, and Ansible can tie into that API and use that metadata. These are called dynamic inventory scripts.

Let's take a look at an example. Here, again, we're defining things in YAML, but now we're configuring the dynamic inventory script: what project to connect to (in this case we're using Google Cloud as the example), the service account for authentication, and what zones to look in. What I really want to key in on is the keyed_groups piece. What this dynamic inventory script does is look for any labels you've put on your instances and build inventory groups based on those labels. So here we have Terraform, part of what would be a Terraform configuration, defining some nodes for us, and we apply a whole bunch of labels to those nodes. Ansible can build the inventory for us based on those labels, and we can apply different parts of execution against, say, the us-east data center. We don't need to know whether there are three nodes or seven nodes in that data center; we just need to know that when a node was created it was given this label, and therefore Ansible makes it part of this data center.

We also get even more polymorphic behavior with this, because in our earlier example we didn't define an all-ScyllaDB group, or maybe an all-production or an all-service-owner group. Dynamic inventory scripts give us that extra flexibility: what if we want to apply changes to all nodes carrying the ScyllaDB component label, or all nodes owned by the US production operations team? We can do that with dynamic inventory scripts.

The next thing I'd like to talk about moves from the inventory piece to the actual configuration and execution piece, so we'll take a look at some tasks here in just a second.
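A dynamic inventory configuration in this spirit might look like the sketch below, using the `google.cloud.gcp_compute` inventory plugin; the project ID, service-account path, zones, and label keys are all assumptions standing in for the values on the slide:

```yaml
# inventories/production/inventory.gcp.yml -- gcp_compute dynamic inventory
# (project, credentials path, zones, and label keys are placeholders)
plugin: google.cloud.gcp_compute
projects:
  - my-gcp-project
auth_kind: serviceaccount
service_account_file: /etc/ansible/gcp-sa.json
zones:
  - us-east1-b
  - us-west1-a
# Build inventory groups from instance labels, so a node labeled
# component=scylladb and datacenter=us-east lands in groups
# "scylladb" and "us-east" automatically.
keyed_groups:
  - key: labels.component
    separator: ""
  - key: labels.datacenter
    separator: ""
hostnames:
  - private_ip
```

With groups derived from labels, a play targeting `hosts: scylladb` follows the fleet as instances are created and destroyed, with no inventory file edits.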

Ansible has the concept of tasks, and we see a couple of module names here: lineinfile, copy, and template; you'll see examples of these. What you'll see in a lot of documentation and a lot of examples is the template module, where you take a Jinja file and template it out with variable interpolation. I took just the first few lines of the scylla.yaml config file and converted it into a Jinja template; the variables are those handlebars, the curly braces, down at the bottom. When I use the template module, I'm basically saying: take this file, send it to the destination path, /etc/scylla/scylla.yaml, give it some permissions, and, by the way, if you find any curly braces in there, do variable interpolation and fill in what I want filled in, assuming I've defined the ScyllaDB cluster name somewhere else.

Most of the time this works great. It becomes problematic when we template out files that are owned by the project. When you apt install ScyllaDB, you get that scylla.yaml file with it, but if you create a copy of it, put in template variables, and template it out to your machines, you are now the owner of that file. If the file changes in the upstream project, say ScyllaDB adds new configuration options or moves something around, and you're running on a file you pulled down from 4.3 while trying to upgrade to 5.0, you could run into issues there. lineinfile helps with this: it ensures we keep the most recent, valid file from the upstream project and we change only what we need, preserving project defaults. Here's what that could look like. We have the lineinfile module within a task; think of modules as your standard library, code I didn't have to write that I can pull in and pass arguments to. In this case I'm passing in the argument to search for a string, so I'm looking for lines in the file that match it, and the line argument replaces those lines with the line I want. Before, the template with the curly braces would render the whole file; here we replace only that one line, and anything below it that doesn't change from the project defaults, we don't have to touch.
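The two approaches side by side might look like this sketch; the paths, permissions, variable name, and regexp are assumptions, not copied from the slides:

```yaml
# Approach 1: template the whole file -- you now own every line of it
# (src template, ownership, and mode are illustrative)
- name: Template out scylla.yaml
  ansible.builtin.template:
    src: scylla.yaml.j2
    dest: /etc/scylla/scylla.yaml
    owner: scylla
    mode: "0644"

# Approach 2: lineinfile -- change only the one line you care about,
# keeping the rest of the upstream project's file untouched
- name: Set cluster name in scylla.yaml
  ansible.builtin.lineinfile:
    path: /etc/scylla/scylla.yaml
    regexp: "^cluster_name:"
    line: "cluster_name: {{ scylla_cluster_name }}"
```

With the second approach, an upstream package upgrade can bring new defaults into scylla.yaml without you having to reconcile them into a hand-maintained template.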

My next tip is about execution: run playbooks serially. This is something we did a lot at the DSP. Remember, we have all of our nodes running, and we need to bring them up and down in a particular order. By default, Ansible will try to run a task on all nodes in your inventory before it moves on to the next task. Here's what that looks like: an execution of some arbitrary tasks on three nodes. We see task one, task two, and task three, and Ansible runs task one on all nodes, then task two on all nodes, then task three on all nodes. For most things in most Linux deployments this is probably fine, but take the scylla.yaml example: for changes to that file to take effect, we need to restart ScyllaDB. We can't just restart Scylla across the cluster without notifying anyone, because that could definitely cause some issues; we need to drain the node first. So the task order is: drain the node, then update the scylla.yaml file, then finally restart the service. What we can't do, with default execution, is drain all the nodes in the cluster at once; that would be a catastrophic outage. We need to drain one node, update the file, and restart the service before moving on.

Using serial within our play allows exactly that: it runs on one host at a time. At the top level of the play we define serial as any valid integer, in this case 1; depending on how big your clusters are you might want to do more than one at a time (actually, don't do that). Here we have the same tasks, one, two, and three, but now it runs tasks one, two, and three on node one, then tasks one, two, and three on node two, and, though I cut off the screenshot, it also did the same on node three. So we take node one, drain it, update the file, restart it, confirm everything's good, and move on to the next one. We get a rolling restart of the cluster that is safe.

Before we do anything that changes cluster membership, and when we drain nodes and restart them that does change cluster membership, they're taken out of Raft, we need to make sure the nodes are in an up and normal (UN) state. I wrote a role within our playbooks that we could reuse in multiple places; the details are outside the scope of this talk, but it essentially did this, and we'll see it in the demo. I'm using another module, the command module: I can pass it any valid Linux command and it will run that command as-is on the system, in this case nodetool status; we'll see this later in the demo. I register the output, taking the return value and saving it to a variable, and then I look at its stdout: if I find any node whose status is not UN, up/normal, I fail the task, because I can't make changes to the cluster if any one of those nodes is down. If a node is down I have a bigger problem; hopefully my monitoring, or something else in place, is telling me those nodes are down. I'll go work on those nodes, bring them up, and then get back to my regular maintenance tasks.
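Putting the pieces together, a rolling-restart play in this style might be sketched as follows; the group name, the nodetool status parsing, the config line, and the service name are assumptions, not the talk's actual playbook:

```yaml
# Illustrative rolling restart: serial: 1 runs the whole task list on one
# node before moving to the next. Group, regexp, and service names are
# placeholders.
- hosts: scylla
  serial: 1
  become: true
  tasks:
    - name: Verify every node is up/normal (UN) before changing anything
      ansible.builtin.command: nodetool status
      register: nodetool_out
      changed_when: false
      # Fail if any status line (e.g. "DN 10.0.0.3 ...") is not "UN ..."
      failed_when: >-
        nodetool_out.stdout_lines
        | select('match', '^[A-Z][A-Z] ')
        | reject('match', '^UN ')
        | list | length > 0

    - name: Drain the node so it stops serving traffic
      ansible.builtin.command: nodetool drain

    - name: Update only the line we own in scylla.yaml
      ansible.builtin.lineinfile:
        path: /etc/scylla/scylla.yaml
        regexp: "^cluster_name:"
        line: "cluster_name: {{ scylla_cluster_name }}"

    - name: Restart ScyllaDB for the change to take effect
      ansible.builtin.service:
        name: scylla-server
        state: restarted
```

Because the up/normal check runs at the start of each host's pass, a node that fails to rejoin after a restart stops the play before the next node is drained.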

Ansible is a great ecosystem; there's tons of automation and all kinds of things you can do with it, and that's my next tip. When you're developing playbooks initially, or just starting out with Ansible, running the open source CLI is fine, and I'm about to demo the CLI. But for production workloads especially, where you have a lot of machines and need to scale, or you may have compliance concerns, finding some way to run Ansible under automation is really going to help you out. You can run playbooks on a schedule and get auditable output, and this is something I think is really important for environment observability. If I run a playbook locally, maybe making changes to the cluster, the rest of my team has no visibility: they don't know when I ran it, what I ran, or what the output was. If we use a tool where anyone can go look and say "oh, this job ran here, and here's the output," that really helps the entire team have observability into the health of the system.

There are lots of options out there. You can build your own with your current CI tool; Jenkins and GitHub Actions are great examples. When I first started using Ansible years ago, we just had a whole new stage in our Jenkinsfile that called a whole bunch of Ansible playbooks to get work done. The Ansible Automation Platform is Red Hat's paid version of Ansible; you can get subscriptions for it, with different levels of support and so on, and AWX is the free and open source version of that. Think of the Automation Platform as Red Hat Enterprise Linux and AWX as Fedora. I used Automation Platform at a previous employer and it was awesome, very easy to use. AWX is what we actually had at the DSP, and there's a Kubernetes operator for it, so with a whole bunch of clusters lying around, carving out a new namespace and installing the operator made AWX a really great way to run Ansible and the automation around it.

Then there's a whole Ansible ecosystem I haven't even had a chance to dig into in this talk: there's the collections index and there's Ansible Galaxy, so be sure to pull in other code that you can use. This is just the top of the list of the Ansible collections index: we see AWS, we see Windows, there are Azure components, there are Cisco routers. If something talks SSH or WinRM or has an API, Ansible can probably talk to it. And on Ansible Galaxy we can pull down other packages from the community.

Demo time.

all right

So here we are in VS Code, and I'm using a task runner called go-task. We have a bunch of containers running: a three-node Scylla cluster; my actual Go gRPC service, which is my demo app; and the robo client, which is exactly what it sounds like, simulating user traffic by calling the server's RPC methods and hitting it as fast as it can. The server exposes two RPC methods that essentially come down to a read request and a write request against the database. Finally, this controller up here is where I have Ansible installed and where I'll run it, so I can take advantage of Docker networking and service discovery.

The repo is pretty simple. I have a three-node cluster, we can see the nodes here, and I'm using local cluster DNS names to access them. Here's our playbook, and I'll be demoing exactly what we've talked about: a play with serial set to 1, running against all hosts in my inventory one at a time. The first thing we do, just like I said, is verify that all the nodes are up and available, since we can't make changes otherwise, and then debug the output so we can see what that looks like. If we've made it this far without failing, we drain the node, make sure the node is actually drained, then update the config file. In this case we're assuming there's a line that begins with foo and replacing it with foo=bar; if that config item isn't there, we just create it. Then we restart the server, make sure it's up, and make sure everything's good before moving on to the next one.

All right, let's take a look at the output of the server up here. You can see the log output moving very quickly; as fast as that client can run, that's how fast we're responding. We might see some errors in here, we might not, so we'll just watch for errors flowing through.

so I’m going to call ansible we need ansible

Playbook Dash I so the ansible dash Playbook is our intric band Dash I and we have inventories production and then we’re going to call that update Playbook

so it’s going to run this on the first node it’s going to check the cluster membership all the nodes are up we see three up in normal so it’s going to drain the node at this point I’m looking to see if we see any errors in the log output this is when they would happen update the config file restarted the node check that it’s up

Checking cluster membership again.

And it looks like it's up, so we move on to node two: check cluster membership, make sure all the nodes are up; they are.

just going to drain the node

check the node status

There we go: we updated the config and we're going to restart the service. Not seeing any errors in the log output.

And then finally we run this on node three. So what we did: we checked that all the nodes were up, brought one node down, updated the config, brought the node back up, made sure everything was healthy, and moved on with the rest of our work.

Thank you very much for attending this talk. Please reach out to me at these places, and that QR code will take you to the code repo where you can see all of this. Thank you very much for your time, and I hope you enjoy the summit. [Applause]
