How Level Infinite Implemented CQRS and Event Sourcing on Top of Apache Pulsar and ScyllaDB

Lei Shi, Zhiwei Peng, Zhihao Chen · 18 minutes · February 15, 2023

A use-case study of why Level Infinite, a division of Tencent, uses ScyllaDB as the state store of our Proxima Beta gaming platform's service architecture: how we leverage Time Window Compaction Strategy (TWCS) to power a distributed queue-like event store specialized to our scenario, and how we use multi-datacenter replication for global data distribution and governance.


Video Transcript

I’m a software engineer at Proxima Beta. Today I will be presenting on behalf of my team, sharing our practice of building an event-driven system using Apache Pulsar and ScyllaDB.

This system helps us monitor and respond to a broad range of risks that occur in our games. Before diving in, we would like to thank the event organizers and sponsors for giving us the opportunity to share our work.

Before we move on to the agenda, please allow me to provide some background information about Level Infinite and Proxima Beta. Proxima Beta is part of Tencent Games. We are based in Singapore, with other regional offices around the world. Our business is operating live video games globally, except for the mainland China market; Level Infinite is our brand for global publishing. You may have heard of the titles we are currently running, such as PUBG Mobile, Arena of Valor, and Tower of Fantasy. Our team at Proxima Beta is responsible for monitoring and responding to a broad range of risks that may occur in our games, such as cheating activities and harmful content. Let’s move on to the agenda.

Today’s session has three parts, and it is also a follow-up to what we presented at the recent Pulsar Summit Asia. In the previous session we discussed why we built our system on the CQRS and event sourcing patterns. If you haven’t had a chance to go through that session, no worries: I will give you a quick recap of our service architecture to help you understand how our system works before we dive into the main topic. You can also find more information about the previous session by following this link.

The first topic is how we use ScyllaDB as a distributed event store that functions like a queue and is able to dispatch events to a large number of gameplay sessions at the same time. The second topic is how we leverage ScyllaDB to improve our data governance, specifically by making complex global data distribution problems easier to configure and audit.

Now let’s begin with the overview of our service architecture, which is based on the CQRS and event sourcing patterns. If these terms are new to you, don’t worry; by the end of this overview you should have a solid understanding of the concepts.

The first concept I would like to introduce is event sourcing. The core idea behind event sourcing is that every change to a system’s state is captured in an event object, and these event objects are stored in the order in which they were applied to the system state. In other words, instead of just storing the current state, we use an append-only store to record the entire series of actions taken on that state. The concept is simple but powerful: since the events represent every action ever taken, any possible model describing the system can be built from them.

The next concept is CQRS, which stands for Command Query Responsibility Segregation. CQRS was coined by Greg Young over a decade ago and originated from the command-query separation principle. The fundamental idea is to create separate data models for reads and writes, rather than using the same model for both purposes. By following the CQRS pattern, every API should either be a command that performs an action or a query that returns data to the caller, but not both. This naturally divides the system into two parts, the write side and the read side. The separation offers several benefits; for example, we can scale read and write capacity independently to optimize cost efficiency, and from a teamwork perspective, different teams can create different views of the same data with fewer conflicts.

The high-level workflow of the write side can be summarized as follows. Events that occur in numerous gameplay sessions are fanned in to a limited number of event processors. The implementation is also straightforward, typically involving a message bus such as Pulsar or Kafka, or a simpler queue system, that acts as the event store: events from clients are persisted in the event store by topic, and event processors consume events by subscribing to topics. If you are interested in why we chose Apache Pulsar over other systems, you can find more information in the link provided earlier.

One thing worth mentioning is that queue-like systems are usually efficient at handling traffic that flows in one direction, for example fan-in, but they may not be as effective at handling traffic that flows in the opposite direction, for example fan-out. In our scenario the number of gameplay sessions is large, and a typical queue system doesn’t fit, since we cannot afford to create a dedicated queue for every gameplay session. We needed a practical way to distribute findings and matches to individual gameplay sessions through query APIs. That’s why we use ScyllaDB to build another queue-like event store, one optimized for fan-out; we will discuss this further in the next section.

Before we move on, let me summarize our service architecture, starting from the write side. Game servers keep sending events to our system through command endpoints
and each event represents a certain kind of activity that occurred in a gameplay session. Event processors produce findings or matches by checking against the event streams of each gameplay session, and they act as a bridge between the two sides. On the read side, game servers or other clients keep polling matches and findings through query endpoints and take further action if abnormal activities have been observed.

Now let’s move on to the next section to see how we use ScyllaDB to solve the problem of dispatching events to numerous gameplay sessions. By the way, if you search for Cassandra and queues using Google, you may come across an article from over a decade ago stating that using Cassandra as a queue is an anti-pattern. While this may have been true at the time, I would argue that it is only partially true today.

To support the dispatch of events to each gameplay session, we use the session ID as the partition key, so that each gameplay session has its own partition and events belonging to a particular gameplay session can be located efficiently by session ID. Each event also has a unique event ID, a timeuuid, which we use as the clustering key. Because records within the same partition are sorted by the clustering key, the event ID can serve as the position ID in the queue. Finally, database clients can efficiently retrieve newly arrived events by checking against the event ID of the most recent event they have received.

There is one caveat to keep in mind with this approach: the consistency problem. Retrieving new events by checking the most recent event ID relies on the assumption that no event with a smaller ID will be committed in the future. That assumption may not always hold; for example, if two nodes generate two event IDs at the same time, the event with the smaller ID may be inserted later than the event with the larger ID. This problem, which I refer to as a phantom read, is similar to a phenomenon in the SQL world where repeating the same query can yield different results due to uncommitted changes made by another transaction. However, the root cause in our case is different: it occurs when events are committed to ScyllaDB out of the order indicated by their event IDs.

There are several ways to address this issue. One solution is to maintain a cluster-wide state, which I call pseudo-now, based on the smallest value of the moving timestamps among all event processors; each event processor must also ensure that all of its future events have an event ID greater than its current timestamp. Another important consideration is enabling Time Window Compaction Strategy, which eliminates the negative performance impact caused by tombstone accumulation. Tombstone accumulation was a major issue that prevented the use of Cassandra as a queue before Time Window Compaction Strategy became available.

That’s all for this section. Now let’s move on to the next one to see what other advantages you can get from ScyllaDB beyond using it as a dispatching queue.
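To make the ideas above concrete, here is a minimal, self-contained Python sketch. It is an in-memory stand-in, not our production code: the CQL schema, the table and column names, the integer event IDs (standing in for timeuuids), and the watermark helper are all illustrative assumptions.

```python
import itertools
from collections import defaultdict

# Illustrative CQL schema (assumed, not the production DDL): one partition
# per gameplay session, events ordered by a timeuuid clustering key, with
# Time Window Compaction Strategy to keep tombstone accumulation in check.
SCHEMA_CQL = """
CREATE TABLE IF NOT EXISTS events (
    session_id text,
    event_id   timeuuid,
    payload    text,
    PRIMARY KEY (session_id, event_id)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy'};
"""

class InMemoryEventStore:
    """In-memory stand-in for the queue-like event store."""

    def __init__(self):
        self._partitions = defaultdict(list)  # session_id -> [(event_id, payload)]
        self._next_id = itertools.count(1)    # stand-in for timeuuid generation

    def append(self, session_id, payload):
        """Write side: persist an event in its session's partition."""
        event_id = next(self._next_id)
        self._partitions[session_id].append((event_id, payload))
        return event_id

    def poll(self, session_id, after_event_id=0):
        """Read side: fetch events newer than the last one the caller saw,
        i.e. `SELECT ... WHERE session_id = ? AND event_id > ?` in CQL."""
        return [e for e in self._partitions[session_id] if e[0] > after_event_id]

def pseudo_now(processor_timestamps):
    """Cluster-wide watermark: the smallest moving timestamp among all event
    processors. Every processor promises that its future event IDs will
    exceed its own timestamp, so no event at or below the watermark can
    still be committed later -- ruling out the phantom read described above."""
    return min(processor_timestamps)

def rebuild_state(events, apply_event, initial):
    """Event sourcing: any model of the system can be rebuilt by folding
    the ordered event log into a state object."""
    state = initial
    for _, payload in events:
        state = apply_event(state, payload)
    return state
```

A consumer would remember the largest event ID it has seen, ask each poll only for events above it, and read no further than the pseudo-now watermark.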

In this final section, we will discuss how we use keyspaces and data replication to simplify our global data management.

As we are building a multi-tenancy system to serve customers around the world, it is essential to ensure that customer configurations are consistent across clusters in different regions. Believe me, keeping data in a distributed system consistent is not a trivial task if you plan to do it all by yourself. We solve this problem by simply enabling data replication on a keyspace across all datacenters. This means any change made in one datacenter will eventually propagate to the others; ScyllaDB, like DynamoDB and Cassandra, does the heavy lifting that makes this challenging problem seem trivial.

You may be thinking that any typical RDBMS could achieve the same result, since most databases also support data replication. This is true if there is only one instance of the control plane running in a given region: in a typical primary-replica architecture, only the primary node supports writes, while replica nodes are read-only. However, when you need to run multiple instances of the control plane across different regions, for example when every tenant has a control plane running in its home region, or even when every region has a control plane running for local teams, it becomes much more difficult to implement this with a typical primary-replica architecture. If you have used AWS DynamoDB, you may be familiar with the feature called global tables, which allows applications to read and write locally while the data is replicated globally. Enabling replication on a keyspace provides a similar capability, but without vendor lock-in: you can easily extend global tables across a multi-cloud environment.

The next use case we will discuss is how we use keyspaces as data containers to improve the transparency of global data distribution.
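As an illustration, this kind of replication is a keyspace-level setting. The sketch below builds the CQL DDL for a keyspace replicated across several datacenters; the keyspace name, datacenter names, and replication factors are invented for the example, and executing the DDL against a real cluster would require a driver session.

```python
def replicated_keyspace_cql(keyspace, dc_replication):
    """Build the CQL that creates a keyspace replicated to every listed
    datacenter via NetworkTopologyStrategy. Any change written in one
    datacenter will eventually propagate to all the others."""
    opts = ", ".join(f"'{dc}': {rf}" for dc, rf in sorted(dc_replication.items()))
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', {opts}}};"
    )

# Hypothetical tenant-configuration keyspace replicated to three regions.
ddl = replicated_keyspace_cql(
    "tenant_config", {"dc_singapore": 3, "dc_frankfurt": 3, "dc_virginia": 3}
)
```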

Let’s take a look at the diagram on the right-hand side of this slide. It shows a solution to a typical data distribution problem imposed by data protection laws. For example, suppose region A allows certain types of data to be processed outside of its borders, as long as an original copy is kept inside the region. As a product owner, how can you ensure that all your applications comply with this regulation?

A potential solution is to perform end-to-end tests to ensure that applications send the correct data to the correct regions as expected. This approach requires application developers to take full responsibility for implementing data distribution correctly. However, as the number of applications grows, it becomes impractical for each application to handle this problem individually, and end-to-end tests also become increasingly expensive in terms of both time and money.

Let’s think twice about this problem. By enabling data replication on keyspaces, we can divide the responsibility for correctly distributing data into two tasks: first, identifying the data type and declaring its destinations; second, copying or moving the data to the expected locations. By separating these two duties, we can abstract complex configurations and regulations away from applications. This matters because the process of transferring data to another region is often the most complicated part to deal with, involving things like passing through network boundaries correctly, encrypting traffic, and handling interruptions. After the separation, applications are only required to perform the first step correctly, which is much easier to verify through testing at an earlier stage of the development cycle. Additionally, the correctness of the data distribution configuration becomes much easier to verify and audit: you can simply check the settings of a keyspace to see where its data is going.
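A sketch of the first task, identifying a record's data class and choosing its container, might look like the following. The data class names and keyspace names are invented for illustration, and each keyspace is assumed to be configured out of band to replicate only to its permitted regions.

```python
# Map each data class to the keyspace ("data container") whose replication
# settings implement the legal requirements for that class. Auditing where
# data goes then reduces to inspecting each keyspace's replication settings.
KEYSPACE_BY_DATA_CLASS = {
    "original_copy": "ks_region_a_only",  # must never leave region A
    "derived": "ks_global",               # may be processed anywhere
}

def container_for(data_class):
    """Task 1 from the talk: pick the destination container for a record.
    This is the only step applications must get right, and it is easy to
    unit-test early in the development cycle; task 2 (actually moving the
    data) is delegated to keyspace replication."""
    try:
        return KEYSPACE_BY_DATA_CLASS[data_class]
    except KeyError:
        raise ValueError(f"unclassified data class: {data_class!r}")
```

The design choice here is that an application never names a region directly; it names a data class, and the keyspace's replication settings are the single auditable source of truth for where that class of data may travel.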

In summary, the key takeaways from today’s session are: when using ScyllaDB to handle time-series data, such as using it as an event dispatching queue, remember to use Time Window Compaction Strategy; and consider using keyspaces as data containers to separate the responsibilities of data distribution, which can make complex data distribution problems much easier to manage. If you have any questions, please do not hesitate to reach us through the following links. Thank you all for your attention. [Applause]
