Worldwide Local Latency With ScyllaDB

Carly Christensen | 20 minutes

When processing sports betting data, latency matters. Content must be processed in near real-time, constantly, and in a region local to where the customer and the data are. ZeroFlucs uses ScyllaDB to provide optimized data storage local to the customer. In this session, Director of Software Engineering Carly Christensen takes you through how we distribute data, and how we use our recently open-sourced package, Charybdis, to facilitate this.



Video Transcript

Thanks for joining me. My name is Carly Christensen, and today I’ll be telling you about ZeroFlucs, a company that provides same-game pricing technology to the wagering industry, and how we use ScyllaDB to ensure that our customers have low latency no matter where they are in the world.

So if we look at our agenda for the day: first I’ll be covering off who ZeroFlucs are and the processes that ZeroFlucs follows, then the challenges that we faced and how we use ScyllaDB to overcome those challenges. Finally, I’ll be covering Charybdis, an open-source Golang package that we’ve developed to help us maintain our keyspaces.

I’ve been in the IT industry for about 20 years now, primarily as a software developer. I also spent a few years as a SQL Server consultant at WARDY IT Solutions, and prior to ZeroFlucs I was the head of trading solutions at Entain Australia, which is a team that processes all of their incoming data feeds.

So what is ZeroFlucs, and what do we do?

ZeroFlucs primarily provides same-game pricing technology to the wagering industry. We allow customers to price bets on correlated outcomes within the one match, and this allows them to explore their theory of the game. It’s much more exciting than just placing team-to-win bets.

So if we were to look at a same-game example, we have three markets. The first is the match winner: will Team A or Team B win? The second is which player will score the first touchdown. The third is the total score market: will the combined scores of Teams A and B be over or under 45.5 points?

If I wanted to place a bet on Team A to win and a particular player to score the first touchdown, but I thought it would be a low-scoring game, you could have a look at these outcomes and multiply the prices together and get a price of around $28. But in this case the correct price is actually more like $14.50.

And this is because these are correlated outcomes, so we need to use a simulation-based approach to more effectively model the relationships between those outcomes. If Team A wins, it’s much more likely that they will score the first touchdown, or any other touchdown in that match. So we run simulations, and each simulation models a game end-to-end, play-by-play, and we run tens of thousands of these simulations to ensure that we cover as much of the probability space as possible.
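
To make the correlation effect concrete, here is a minimal, self-contained sketch of the idea; it is not ZeroFlucs’ actual model, and the leg probabilities and their dependence on the match result are invented for illustration. It shows how a simulation-based price for a same-game combination differs from naively multiplying the individual leg prices.

```go
// Toy Monte Carlo pricer: the probabilities below are illustrative
// assumptions, not real models. It prices a three-leg same-game bet two
// ways: by multiplying individual leg prices, and from the joint
// probability observed across simulations.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const sims = 100_000

	var teamAWins, playerFirstTD, underTotal, allThree int
	for i := 0; i < sims; i++ {
		aWins := rand.Float64() < 0.50 // toy match-winner model

		// The player's first-touchdown chance is correlated with the
		// match result: higher when their team wins.
		pFirstTD := 0.10
		if aWins {
			pFirstTD = 0.20
		}
		firstTD := rand.Float64() < pFirstTD

		// Toy total-points model, weakly correlated with the first leg.
		pUnder := 0.50
		if firstTD {
			pUnder = 0.55
		}
		under := rand.Float64() < pUnder

		if aWins {
			teamAWins++
		}
		if firstTD {
			playerFirstTD++
		}
		if under {
			underTotal++
		}
		if aWins && firstTD && under {
			allThree++
		}
	}

	price := func(hits int) float64 { return float64(sims) / float64(hits) }

	naive := price(teamAWins) * price(playerFirstTD) * price(underTotal)
	correlated := price(allThree)
	fmt.Printf("naive multiplied price:  $%.2f\n", naive)
	fmt.Printf("simulation-based price:  $%.2f\n", correlated)
}
```

Because the legs are positively correlated, the joint probability is higher than the product of the individual probabilities, so the fair combined price comes out lower than the multiplied one, which is the $28 versus $14.50 effect described above.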

So let’s have a look at the ZeroFlucs process and architecture.

Our platform has been designed to be cloud-native from the ground up. Our software stack runs on Kubernetes and is hosted in Oracle Cloud Infrastructure. At the moment we have over 130 microservices, but that’s growing every week, and a lot of our environment can be managed through custom resource definitions and operators. For example, if we want to add a new sport, we just define a new instance of that resource type and deploy that YAML file out to all of our clusters.

Our services are primarily Golang, we use Python for our modeling and simulation services, and we use gRPC for communications between all of those services.

We utilize Apache Kafka to ensure at-least-once delivery of all of our incoming and outgoing updates, and we use GraphQL for our external-facing APIs.

Our data flow process starts with receiving content feeds from multiple third-party sources. Those content items can then be combined into booked events, and whenever we receive an update to a content item, we’ll trigger a reprocess of that booked event. Once the update is applied, we will start running our simulations again, and once they’re complete we can use the results of those simulations to generate hundreds to thousands of new markets that are stored back on the original event.

Our customers can interact directly with that booked event, or they can use our API to request prices for custom combinations of outcomes via our query engine, and we use the stored results from our simulations to answer those questions.

So what have been our main challenges?

Our ultimate goal is to be able to process and simulate events fast enough to provide same-game prices for live, in-play events. We need to be able to answer questions like: will this corner result in a goal? Will this team score next? Will this play result in a touchdown, and which player will score the next touchdown? And we need to be able to do this fast enough to provide prices before that play is finished.

To do this, we need to optimize for time above any other factor.

So we have two main challenges. The first is that we have incredibly high throughput and concurrency requirements. Events can update dozens of times per minute, and each one of those updates triggers tens of thousands of new simulations. Simulation data can be hundreds of megabytes, and at the moment we’re processing about 250,000 in-game events per second.

Our second challenge is that customers can be located anywhere in the world, so we need to be able to place our services and our data with them. With each request passing through many microservices, even a small increase in latency between those services and the database can have a big impact on the total end-to-end processing time.

So we looked at a lot of options. We looked at MongoDB, which several of the team had used before, so we were fairly familiar with it, but we’d found issues where, when there were a high number of concurrent queries, some of those small queries could take seconds to be returned.

We tried Cassandra, and even though it supported the network-aware replication strategies that we needed, it just didn’t have the performance and resource usage that we needed.

We also looked at Cosmos DB, and it addressed all of our performance and regional distribution requirements, but as a startup we just couldn’t justify the high cost, and since it was Azure-only it would limit our future portability options.

The winner was ScyllaDB.

We had trialed ScyllaDB in a previous project, and even though it didn’t work for that situation, we knew that it would be perfect for ZeroFlucs’ requirements. It supported the distributed architecture that we needed, so we could locate our data replicas near our services and our customers to ensure that they always had low latency. It also supported the high throughput and concurrency that we required; we haven’t found a situation so far where we can’t just scale through it. It was also easy to adopt: by using Scylla Operator, we didn’t need to have a lot of domain knowledge to get started.

So let’s look at our architecture with ScyllaDB. We use ScyllaDB Open Source, and we’re hosted on Oracle Cloud using their Flex 4 VMs. These VMs give us the option to change the CPU and memory allocation of those nodes if needed.

It’s currently performing very well, but with every customer that we add our throughput does increase, so we’re aware that there may come a time in the future where we need to scale up and move ScyllaDB to bare metal instead.

In development we use Scylla Operator, but for production we’ve moved to self-managed, because Scylla Operator only supports a single Kubernetes cluster, and that made it a bit difficult to scale out geographically.

We’re also still reviewing our strategies around Scylla Monitoring and Scylla Manager.

To make the most of ScyllaDB, we have divided our data up into three main categories. The first is global data: slow-changing data that is used by all of our customers, so we replicate it to every one of our regions. The second type is regional data: data that is used by multiple customers in a single region, for example a sports feed. If we find that a customer in another region requires that data, we will separately replicate it into their region.

The third type is customer data, and this is data that is specific to that customer, like their booked events or their simulation results.

Each customer has a home region where we’ll keep multiple replicas of their data, and we can also keep additional copies of their data in other regions that we can use for DR purposes.

Just to illustrate that idea, let’s say we have a customer in London. We will place a copy of our services, which we call a cell, into that region, and all of that customer’s interactions will be contained in that region, ensuring that they always have low latency. We’ll place multiple replicas of their data in that region, and we’ll also place additional replicas of their data in other regions, and this will become important later.

Let’s say we similarly have a customer in the Newport region. We would place a cell of our services in that region, and all of that customer’s interactions would be contained within the Newport region, so they also have low latency.

If the London data center becomes unavailable, we can redirect that customer’s requests to the Newport region. Although they would have increased latency on the first hop of those requests, the rest of the processing is still contained within one data center, so it would still be low latency, and that would also prevent a complete outage for that customer. We would then increase the number of replicas of their data in that region to ensure data resiliency for them again.

So our data is segregated between services and keyspaces. Every service uses at least one keyspace, and some of them have quite a lot. Our global data has one keyspace that’s replicated to all of the regions, our regional data has a keyspace per region, and our customer data has a keyspace per customer. Some services may actually have more than one type of data, and in that case they may have both a global keyspace as well as customer keyspaces.
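
As a rough sketch of how those three data scopes translate into per-datacenter replication settings, the snippet below shows one way it could be expressed. The scope names, datacenter handling, and replica counts are illustrative assumptions, not our actual configuration.

```go
// Sketch of mapping data scopes to NetworkTopologyStrategy replica counts.
// All names and numbers are illustrative assumptions.
package replication

// DataScope identifies how widely a keyspace's data is shared.
type DataScope int

const (
	ScopeGlobal   DataScope = iota // slow-changing data used by all customers
	ScopeRegional                  // e.g. a sports feed shared within one region
	ScopeCustomer                  // booked events, simulation results, etc.
)

// ReplicationFor returns the datacenter -> replica-count map that would feed
// a NetworkTopologyStrategy definition for the corresponding keyspace.
func ReplicationFor(scope DataScope, homeDC string, drDCs, allDCs []string) map[string]int {
	out := make(map[string]int)
	switch scope {
	case ScopeGlobal:
		// Replicated to every region.
		for _, dc := range allDCs {
			out[dc] = 3
		}
	case ScopeRegional:
		// Kept in the region that uses it; replicated elsewhere on demand.
		out[homeDC] = 3
	case ScopeCustomer:
		// Multiple replicas in the customer's home region, plus additional
		// copies in other regions for DR.
		out[homeDC] = 3
		for _, dc := range drDCs {
			out[dc] = 2
		}
	}
	return out
}
```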

So that brings me to Charybdis, our open-source Golang package, which we named after the other sea monster that often features in Greek mythology alongside Scylla.

Our main drivers for developing Charybdis were that we needed a simple way of orchestrating the management of our keyspaces between all of our services, and we also found ourselves repeating table operations that were very similar between many of our services.

So we created Charybdis, and it provides a table manager which can be used to create keyspaces, create tables, and add columns and indexes as required for that service.

It also offers simplified functions for CRUD-style operations, and it supports TTLs and lightweight transactions.

So if we look at a simple topology scenario, our service has its tables defined in Golang structs that include data types, column names, partitioning, and indexes.

Here’s an example of a simple setup function that can be used to initialize and manage the keyspaces for that service. Under the covers, that’s actually being converted into DDL statements that are applied to the database.
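
The slides aren’t reproduced in this transcript, so here is a rough sketch of what a table definition and setup function of that shape might look like. The import path, struct tags, generics, and function signatures are assumptions based on the description in this talk, not the verified Charybdis API.

```go
// Illustrative sketch only: the import path, struct tags, and function
// signatures below are assumed from the talk, not the verified Charybdis API.
package bookings

import (
	"context"

	"github.com/zeroflucs-given/charybdis/tables" // assumed package layout
)

// BookedEvent is an example table definition: column names, types,
// partitioning, and clustering are declared on the struct.
type BookedEvent struct {
	CustomerID string `cql:"customer_id" partition_key:"true"`
	EventID    string `cql:"event_id" clustering_key:"true"`
	Sport      string `cql:"sport"`
	Payload    []byte `cql:"payload"`
}

// Setup initializes the keyspace and table for this service and returns a
// table manager used for all subsequent operations. Under the covers this
// becomes DDL (CREATE KEYSPACE / CREATE TABLE / ALTER TABLE) applied to ScyllaDB.
func Setup(ctx context.Context, keyspace string) (tables.TableManager[BookedEvent], error) {
	return tables.NewTableManager[BookedEvent](ctx, // assumed constructor
		tables.WithKeyspace(keyspace),         // assumed option name
		tables.WithTableName("booked_events"), // assumed option name
	)
}
```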

That table manager can then also be used to call the simplified functions for table operations. So here’s an example of an insert: all we require is a definition of the data that we want to insert, and we can call manager.Insert and it will all be done for you. In this example we have also applied a TTL to that record with a very simple additional option. To retrieve that record, we call manager.GetByPrimaryKey and just include the values of all of the primary keys, in order, in that request. And finally, to update, we edit the original object and call manager.Update, and again here we’ve applied an updated TTL for that record.
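
Continuing the sketch above (same assumed API, plus a "time" import), those simplified operations might look roughly like this; again, the method names and the TTL option are taken from the talk’s description rather than the verified package.

```go
// Continues the previous sketch; method names and the TTL option are assumed
// from the talk's description, not the verified Charybdis API.
func Example(ctx context.Context, manager tables.TableManager[BookedEvent]) error {
	ev := &BookedEvent{CustomerID: "customer-a", EventID: "event-123", Sport: "nfl"}

	// Insert the record, applying a TTL via a simple additional option.
	if err := manager.Insert(ctx, ev, tables.WithTTL(10*time.Minute)); err != nil {
		return err
	}

	// Retrieve it by passing the primary-key values in order.
	got, err := manager.GetByPrimaryKey(ctx, "customer-a", "event-123")
	if err != nil {
		return err
	}

	// Edit the original object and update, here with a refreshed TTL.
	got.Sport = "nfl-preseason"
	return manager.Update(ctx, got, tables.WithTTL(30*time.Minute))
}
```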

In our environment we actually use a network-aware topology, so alongside every one of our services we also have a topology controller service, and this service is responsible for managing the replication settings and keyspace information related to every service. On startup, the service will call the topology controller and retrieve its replication settings. It will then combine that with its table definitions and use that to maintain its keyspaces in ScyllaDB. On the slide you can see an example of some DDL statements generated by Charybdis that include a network topology strategy.
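
The DDL from that slide isn’t included in the transcript, but keyspace DDL using NetworkTopologyStrategy generally has the following shape. The keyspace name, datacenter names, and replica counts below are invented for the example, shown here as a Go constant of the kind a service might hold.

```go
// Example of the kind of statement a tool like Charybdis would emit once the
// topology controller's replication settings are combined with the table
// definitions. Names and replica counts are illustrative only.
const exampleKeyspaceDDL = `
CREATE KEYSPACE IF NOT EXISTS customer_a_bookings
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'london': 3,
    'newport': 2
};`
```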

So we still have a lot to learn, and we’re really early in our journey. For example, our initial attempt at dynamic keyspace creation caused some timeouts between our services, especially if it was the first request for that instance of the service. And there are still many ScyllaDB settings that we have yet to explore, so I’m sure that we’ll be able to increase our performance and get even more out of ScyllaDB in the future.

Thanks very much for your time. If you do have any further questions, you can contact me using the information on this slide, and there’s also a link there to our Charybdis Golang package. Thank you. [Applause]
