Nguyen Cao is a Staff Data Engineer at SecurityScorecard, where he builds pipelines that process billions of data points to produce cybersecurity scores for millions of companies.
Thank you for coming to my presentation today. I'm very happy to share with you our recent work to build a system that is scalable and resilient based on ScyllaDB. So let's get started.
So now you're probably going to ask how our security ratings are calculated. In fact, we compute the scores with a very complicated data pipeline, but what you can see here is an abstraction of its different modules. It starts with signal collection, in which our global network of sensors, deployed across over 50 countries, scans IPs, domains, DNS, and various external data sources to instantly spot threats. The results are then processed by the attribution engine and cyber analytics modules, which try to associate IPs and domains with whatever vulnerabilities they might have. Finally, the scoring engine computes the rating score for over 12 million scorecards (a scorecard is an organization, in our terms). My team is responsible for the scoring engine, in which we compute the scores and provide an efficient way for other front-end services to access them, and this is where we face some challenges that we need to address. We're going to go over them now.
Here is our scoring architecture before the middle of 2022. On the left you can see the platform API, which receives requests from end users and then makes further requests to other internal services. One of those services, the Measurements Service, queries different data stores such as Redis, Aurora, and Presto on top of HDFS to get the information it needs. On the right side you see the scoring workflow: it is an instance of Apache Airflow, and it is responsible for running the jobs that generate the scores, measurements, and findings for over 12 million scorecards, down to each domain and IP.
Why do we have this architecture? That may be the question you have at first. It was developed quite a few years ago and it served its purpose to a certain level, and it is a good architecture as well, because as you can see we use the right data stores for different purposes: Redis, a key-value store, is used as an in-memory cache for fast lookups; Aurora is a relational database cluster whose distributed storage allows us to store a huge amount of data across different nodes and run low-latency queries on top of it; and Presto, a general-purpose SQL engine running on top of a big data store like HDFS, helps us run the complicated SQL queries that extract information about historical scores and findings over long periods of time.

However, this architecture has several issues that we need to handle, especially as we get more data and more requests over time. The first issue is that, at certain times and for certain requests, there are high latencies in our platform. As I said on the previous slide, at a certain time during the day the scoring workflow generates the data and runs batch inserts into our data stores. The Aurora cluster, as you see, consists of one primary master for write requests and several replicas that provide high availability for read requests. During this window the replicas are busy keeping up with the primary to make sure the data is replicated consistently, and since there were more than 4 billion rows to be inserted within four to five hours, the replicas were basically saturated with this replication work, which caused read requests to suffer high latency during that time. Another issue we have is with the Presto queries.
Shown here is a query sent to Presto: it looks for all measurement details for a list of scorecards within a period of time. To answer it, the Presto workers scan all the block files in HDFS for that time range and filter the records that meet the requirements. This worked well under a low workload, but as the request volume grew, and as we collected more data over time (more signals, more data to associate with each scorecard), we saw the Presto workers behave quite badly and even stop working at certain times. Under high workload the latency climbed to 10 seconds or more, sometimes up to minutes, not to mention the impact on the whole platform's ability to serve other requests. So that is the high-latency issue.

The next issue we have is scalability. In particular, we reached the vertical scaling limit of Redis, which for us is AWS ElastiCache: the largest instance we could get provided around 400 GB of in-memory cache, enough to hold up to about 12 million scorecards, and we could not go further than that. At that time we tried a different solution, Redis Cluster, setting up multiple instances that work together to provide a larger memory capacity. However, there was an issue: the Redis clients (the ones sending Redis queries from our services) need to know exactly which Redis instance holds which data in order to route each query, and at the time we found that only redis-py-cluster, a Python driver, was able to do so. Since we have services running in Java and Node.js as well, making them work with Redis Cluster would have required implementing custom drivers on top of it, which is also a lot of work.

Here is another issue, related to a business requirement: we want organizations to be able to remediate findings so that their security rating scores improve as quickly as possible. This kind of update was not possible, because the HDFS data store running behind the Presto queries is write-once, read-many on block files, so all data ingestion into Presto had to go through the scoring workflow. That caused delays of up to three days before a score was updated, which made our customers unhappy and drove us to seek a better solution.

And this is the last issue we encountered, related to software maintenance. We have about 50 internal services, implemented in different tech stacks such as Java, Node.js, Python, Go, you name it. These services accessed the data stores directly on their own, so they had to handle the different types of queries (SQL queries, Redis queries) themselves, and had to do so efficiently and effectively. If we changed anything in the database schema, or upgraded a database, all of these services also needed to be upgraded, which was causing a lot of technical debt in our system.
So those are the issues we found, and that's why we had to go further and seek better solutions. We found that migrating our data to ScyllaDB turned out to help us solve all of the issues we had, so let me show you why.
This is the new architecture we have had since August 24, 2022. As you can see, this new architecture is much simpler: we use a ScyllaDB Cloud cluster to replace Redis and Aurora, and instead of HDFS we use AWS S3 to store all historical details. We also introduce a service that works as a data gateway, the Scoring API, to handle requests to the different data stores. So now you might ask why this architecture addresses the issues we mentioned; here come the answers.

First, about latency: this architecture helps us deliver low latency and high throughput to the other services. Consider the same query as before, in which we get the scorecard details for a list of scorecards within a period of time. Based on that query, we designed an optimal ScyllaDB schema that allows us to quickly access the data we need through the primary key. Imagine a request coming in from the Scoring API: a ScyllaDB coordinator node knows exactly which replica nodes contain that data, so it just needs to send the request to those replicas, and the replicas simply fetch it and send it back.
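As a concrete illustration of this query-first design, here is a minimal sketch of what such a schema could look like. All keyspace, table, and column names (and the TTL) are illustrative assumptions, not the actual SecurityScorecard schema; the point is that the partition key matches the dominant query.

```python
# Hypothetical CQL schema sketch (names and TTL are assumptions, not the real
# SecurityScorecard schema). The partition key matches the dominant query, so
# the coordinator routes each lookup straight to the replicas that own
# that scorecard's partition.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS scoring.scorecard_details (
    scorecard_uuid uuid,   -- partition key: one partition per scorecard
    day            date,   -- clustering key: ordered scans over a time window
    factor         text,
    score          int,
    PRIMARY KEY ((scorecard_uuid), day)
) WITH CLUSTERING ORDER BY (day DESC)
  AND default_time_to_live = 7776000;  -- ~90 days, an assumed retention
"""

# The dominant read then becomes a single-partition range query:
SELECT_CQL = (
    "SELECT day, factor, score FROM scoring.scorecard_details "
    "WHERE scorecard_uuid = ? AND day >= ? AND day <= ?"
)
```

With this layout, one scorecard's history lives in one partition, ordered by day, so the time-range filter never touches other partitions.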
One more thing: we also implemented our Scoring API to support highly parallel processing. For example, in the query you are looking at there is a WHERE condition matching company UUIDs against a list, and that list can be really big: thousands, up to 100,000 scorecards. It is a limitation of ScyllaDB, in particular CQL, to handle such a long list, so we provide parallel processing by dividing the list into smaller sublists, sending the requests in parallel, and aggregating the results before returning them to the calling service. Another noticeable improvement is that even at the peak of the write workload into ScyllaDB, read throughput stays very good, with low-latency access as well. This is because of the tunable, eventual consistency provided by ScyllaDB, which lets reads be served at their own consistency level independently of the write consistency level we set on the requests.
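The fan-out just described can be sketched roughly like this. `fetch_chunk` is a stand-in for the real ScyllaDB call, and the chunk size and field names are assumptions:

```python
# Minimal sketch of the parallel fan-out: split a huge IN-list into bounded
# chunks, query them concurrently, and merge the results. fetch_chunk stands
# in for the real ScyllaDB query; CHUNK_SIZE is an assumed bound.
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

CHUNK_SIZE = 100  # keep each CQL IN (...) list small

def chunked(ids, size=CHUNK_SIZE):
    it = iter(ids)
    while batch := list(islice(it, size)):
        yield batch

def fetch_chunk(ids):
    # Real code would run: SELECT ... WHERE scorecard_uuid IN (<ids>)
    # against ScyllaDB; here we fabricate one row per id for illustration.
    return [{"scorecard_uuid": i, "score": 90} for i in ids]

def fetch_scores(ids, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        chunk_results = list(pool.map(fetch_chunk, chunked(ids)))
    return [row for chunk in chunk_results for row in chunk]
```

Because `Executor.map` preserves input order, the merged result comes back in the same order as the requested IDs, which keeps aggregation trivial.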
The second improvement is scalability: we can add more nodes to our ScyllaDB cluster depending on our needs, and by doing that we overcome the in-memory storage limits we had with Redis. We also get scalability in the Scoring API, because it is deployed as an ECS service and we can add as many instances as we need to serve requests faster.
And this is the last improvement I can share: now, instead of accessing the data stores directly, the other services just send a REST API query to the Scoring API, regardless of their tech stack. The Scoring API works as a gateway for data access, directing each request to the right place depending on the use case: if the request needs a low-latency response, it goes to ScyllaDB; if it can tolerate high latency, as with historical data, it goes to Presto on S3. It can also work as a fault-tolerance fallback, because if there is any error in ScyllaDB, the Scoring API will send the request to Presto/S3 to get the data it needs. And now customers can quickly remediate their security findings, because we just need to send an upsert request to ScyllaDB, which also satisfies our business requirement.

Here are the results. On August 24, 2022 we switched over from the old architecture to the new one, and as you can see, latency immediately decreased significantly, by up to 90% on almost all of our endpoints. We have had about 80% fewer incidents compared to the time when we ran Presto and Aurora. We have also saved over $1 million per year in infrastructure costs, because we no longer have to maintain several clusters of AWS EMR and Aurora, which cost a lot. We have also reduced scoring time by up to 30%, because writing to ScyllaDB is much faster than writing the data to Presto and Aurora. And last but not least, it has improved the customer experience a lot: everyone is surprised how fast it is, and happy customers make us happy as well.

So here are the lessons we learned. First, when working with ScyllaDB, designing the schema is one of the most important things, and it needs to be done with care. In particular, from what we have done, we examined the access patterns of our queries and then wrote the schema based on them, so we know what should be in the primary key: specifically, what should be the partition key, what should be the clustering key, and also the time-to-live we expect for the data in ScyllaDB.

The second lesson is that we should leave requests that tolerate high latency and access historical data to OLAP engines such as Presto or Athena, because that is what they are designed for, not for user-facing requests. On top of that we can do asynchronous processing to generate the reports and send notifications to the platform.

Finally, we should have a data controller, which in our architecture is the Scoring API. It runs parallel tasks and tackles cases that CQL cannot handle, such as JOINs and large SELECT ... IN queries that are supported in normal SQL but not in CQL. This component is important for redirecting requests to different data stores, playing the role of a data gateway, controlling the rate limit to the data stores, and working as a fault-tolerance layer. So it's really important to have that data controller in your architecture as well.

That's all for my presentation. Thank you for listening. [Applause]
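The routing and fallback behavior of such a data controller can be sketched as below. Both store clients are placeholders (assumptions), not real driver code:

```python
# Sketch of the data-gateway routing and fallback described above. Both fetch
# functions are placeholders standing in for real ScyllaDB / Presto-on-S3
# clients; here the ScyllaDB call always fails to demonstrate the fallback.
def fetch_from_scylla(query):
    raise TimeoutError("simulated ScyllaDB failure")  # force the fallback path

def fetch_from_presto_s3(query):
    return {"source": "presto_s3", "rows": []}

def gateway_fetch(query, low_latency=True):
    """Low-latency reads try ScyllaDB first; on error (or for historical,
    latency-tolerant queries) the request falls through to Presto over S3."""
    if low_latency:
        try:
            return fetch_from_scylla(query)
        except Exception:
            pass  # fault-tolerance fallback
    return fetch_from_presto_s3(query)
```

A production version would add rate limiting and retries here too, which is exactly why centralizing this logic in one gateway pays off.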