Hi, everyone. My name is Hili Shtein. I'm chief architect for the Amdocs Data and AI Platform, and in today's presentation, we'll talk about how Amdocs manages complex data pipelines for the modern telecom and how ScyllaDB helps us do these tasks in the most efficient way. A little bit about myself: I've been leading architecture for the Amdocs Data and AI Platform for the past five years.
I've led architecture of Big Data and event processing for the last 12 years, with over 30 years of experience at globally operating telecoms and telecom solution providers. So let's start the presentation. The problem statement, or what we are going to talk about here: this is the problem statement for our data management and AI platform, concentrating on the data management part. It's all about building and maintaining an ODS or EDW, the data warehouse, so it can provide the required diagnostic, predictive and actionable prescriptive analytics in the quickest and cheapest manner possible. We're dealing with many sources of data, and when we say many sources, that can mean a multitude of operational systems in many different technologies and protocols. We need to get the data from all the sources and align it, and we'll explain a little during the presentation why and how we try to overcome that. We need to ensure quality, both from the technical and the business perspective. We need to ensure data freshness: especially when you talk about operational data systems, we try to provide the data as close to real time as possible, and for that you need to actually do all your transformations in near real time, as opposed to regular ETL pipelines. This requires some more handling, contextual scenarios, et cetera. We're supplying data to different departments for different use cases, and it's hard to organize the data to be effective for reports and insights. Actually, this is one of the most important tasks here: to make sure that the model of the data, once you are providing it, is such that access to the data is as efficient and as informative as possible. Of course, we need to scale with the growing amount of data and still remain cost-effective while scaling.
The mere ability to scale is extremely important, and when you are on the cloud especially, it's very important to provide dynamic scaling and to be able to both scale out and sometimes scale up if needed. Changes in the sources occur all the time, and we need to respond to them as quickly as possible, and technology and business requirements are changing all the time as well. So that, more or less, is the problem statement. At a high level, the following diagram represents more or less the way that we look at it and what we do here. On the left, you can see a multitude of sources; these are sample sources that are very relevant for telcos, and the third-party apps can represent any other application that provides data, et cetera. At the collection level, we have different ways of interfacing with those sources. Some of them will stream the data over Kafka; for those on Kafka, of course, this is simply how they provide information. Then a lot of information systems, especially the more legacy, less modern ones, will provide the data out of their RDBMS. Usually it will be Oracle or PostgreSQL, or actually any other RDBMS, and we will need to collect the data from there. For that we'll use either actual SQL to collect it or, for streaming, CDC technology. That stands for change data capture: the technique where you access the redo logs of the database and stream the transactions as they are committed, and this way you get all the database updates as soon as they occur. Then there are application APIs, files, Couchbase, and we have more, but this more or less represents the multitude of technologies. Essentially, all of these are collected into a Kafka stream that we manage.
It will be segregated as a topic per source, and what's important to understand is that all the messages will be in our JSON format, but they will actually carry the source schema of the database or of the sources. So each source still provides the data with its own schema, but already in our format, and this can be either on premises or on the cloud; sometimes it will go directly to the cloud, or it will be replicated from an already existing stream on premises. It doesn't matter. What does matter is that we're going to read from these Kafka topics that contain the source schema, and essentially, when we come out of the transformation phase (in a minute we'll go deeper into what that means), we have a stream of the model schema, and this model schema is what we were after. This is not the end of the pipeline; this is actually where it starts to get interesting. But getting here is a big challenge, and most of this presentation will concentrate on this area of transformation: what we actually need in order to be able, in near real time, to get all the data from these sources and stream it onto these model schema Kafka topics so that it is available for the rest of the clients. Some of these clients are ingesting the data into an operational data store or an analytical data store, running all sorts of aggregations, whether real-time aggregations performed directly on the Kafka stream or rich aggregations that read whatever was already ingested, plus all sorts of analytical enrichments, profiling, AI insights, alerts, et cetera. All of these can provide a lot of insight and information via APIs directly to apps, or can stream the information back to our model schema, which will, in turn, also go to our data store, to the database, where it can be accessed via SQL.
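To make the topic-per-source messages concrete, here is a minimal sketch of wrapping a CDC change event in a per-source JSON message. All field names here are invented for illustration; they are not Amdocs' actual message format:

```python
import json

def to_envelope(source, table, cdc_event):
    """Wrap a raw CDC change event in a per-source JSON message.

    The payload keeps the source's own schema; only the envelope
    fields are ours. Field names are illustrative only.
    """
    return json.dumps({
        "source": source,              # e.g. "billing" -- maps to one Kafka topic
        "table": table,                # source table the change came from
        "op": cdc_event["op"],         # insert / update / delete
        "tx_id": cdc_event["tx_id"],   # transaction id from the redo log
        "data": cdc_event["data"],     # changed columns only (the CDC delta)
    })

msg = to_envelope("billing", "SUBSCRIBER",
                  {"op": "update", "tx_id": 42,
                   "data": {"SUB_ID": 7, "STATUS": "SUSPENDED"}})
```

The point of the envelope is that downstream consumers get one uniform wire format while each source's own schema travels untouched inside `data`.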
From this point on in the presentation, we'll concentrate on the part that I haven't gone into yet, which is the transformation: what it needs to provide, how we do it, what the challenge was, and how ScyllaDB helped us with it. First, in order to understand the way we address data transformation: as we said, we have multiple data sources. In this example, we have three: one is billing, one is CRM and one is OMS. This is actually a real-life example, a little bit simplified, of what we need to collect from three different sources in order to assemble one type of subscriber record, and this stands for a record of a subscriber in our shared model, the target model that we saw we are streaming to. In order to do this in real time, what we need to have is all these types of entities with the relations that we can see here. I will not go into all of it, but what we need to understand is what happens when we get updates from the sources. For example, maybe I had an update in a contact; you can see we have a contact table here. So maybe I got a contact update, and now this contact update should generate a case where we need to reassemble the subscriber record that we are interested in and republish it. It might be that just the contact came; it might be that we had a transaction that provided us an agreement, which is something much bigger; and it might be that the record that came belonged to more than one subscriber. We refer to this whole construct as a context, and the billing subscriber entity would be the leading entity of this shared-model subscriber context.
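A context like this can be captured declaratively. The sketch below shows one way an ETL engineer might describe the subscriber context, with the billing subscriber as the leading entity and the related source tables whose changes trigger reassembly. The structure and all names here are invented for illustration, not the platform's actual configuration format:

```python
# Illustrative context definition; source and table names are invented.
subscriber_context = {
    "name": "shared_model_subscriber",
    # The leading entity: one output record is published per instance of it.
    "leading_entity": ("billing", "subscriber"),
    # Related entities whose updates trigger reassembly of the leading entity.
    "related_entities": [
        ("crm", "contact"),
        ("crm", "agreement"),
        ("oms", "order"),
    ],
}
```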
This is important in order to understand how we actually perform the transformation. What we need to do, when we get a record, is first maybe to complete it. Why would we need to complete it? Because it came from CDC, and all that CDC reports is the delta: if a record was updated, let's say a subscriber status was updated, we'll get just the status, some control fields and maybe the transaction. So first we need to make sure that we have the full subscriber record, and we need to persist it somehow, and this goes for all the records in the transactions that we get. Then we need to reassemble the record. Let's go back to the contact example: we got the contact, we persisted it, and now we need to find all the subscribers that contain this contact (and, by the way, maybe some other types of entities related to the contact as well, but in this example let's stay with the subscriber), and for each of these subscribers we need to republish its information, because it changed. There is a short animation that tries to show what is being done: we see the information coming from the source system via collection, it is persisted into a cache database, and then this transformer service reads all the records that belong to the context and generates the shared-model subscriber record that we need, so we can persist it to the data store. I hope this clarifies the task at hand a little, and I think we start to understand the complexity of doing that. Essentially, this is real-time transformation as we do it, and it handles record completion, like I said.
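The two steps just described, completing a CDC delta into a full record and finding which subscribers must be republished, can be sketched roughly like this. It is a simplified illustration with invented field names; the real service also has to handle deletes, ordering and persistence:

```python
def complete_record(cached, delta):
    """CDC delivers only the changed columns; merge the delta into the
    cached full record so the transformation sees a complete row."""
    full = dict(cached)
    full.update(delta)
    return full

def subscribers_for_contact(contact_id, contact_index):
    """Reverse lookup: all subscribers whose context contains this
    contact and therefore need to be reassembled and republished.
    `contact_index` is an illustrative contact_id -> subscriber_ids map."""
    return contact_index.get(contact_id, [])

cached = {"sub_id": 7, "status": "ACTIVE", "plan": "GOLD"}
delta = {"status": "SUSPENDED"}            # the CDC update: status only
full = complete_record(cached, delta)
# full == {"sub_id": 7, "status": "SUSPENDED", "plan": "GOLD"}

index = {10: [7, 8]}                       # contact 10 appears in two subscribers
to_republish = subscribers_for_contact(10, index)
# to_republish == [7, 8]
```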
It knows how to filter and change records and manage the storage, and then, to do the transformation itself, we perform it in memory with an engine called SQLite; these are complex queries that operate on the context as I showed it. We want to leave a minimal amount of work for the ETL engineer: all they need to do now is define the context and the query, and the rest should be done by the system. And we need to autoscale the system according to the lag that accumulates on the Kafka topics, so that we don't lose any freshness of the data. Just to show how this looks: here we have a cache database that needs to provide the low latency and high throughput required to perform all these actions. From the collector, we get streams of master data, partitioned one topic per source. We retrieve all the changed entities, update them in the cache database, determine which contexts we now need to republish, and publish to Kafka a list of all the leading entities, all the contexts, that now need to go through transformation. In the case of what we call transient data, meaning data that we don't want to persist because we are not updating it, it can go directly to this Kafka topic with the instruction, and we don't need to go through the persistence stage. After we persist the data, the transformer needs to load from Kafka all the instructions for transformation and fetch all the required contexts, and of course it does this in batches. It retrieves them from the cache database, performs the transformation, and the result set from the transformation goes to the model schema Kafka that we saw in the data flow slide.
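The in-memory SQLite step can be illustrated roughly as follows: load the context's rows into an in-memory database, run the ETL engineer's query over them, and emit the result rows as shared-model records. Table and column names in the example are invented for this sketch:

```python
import sqlite3

def transform_context(context_rows, query):
    """Load a context's rows into in-memory SQLite and run the
    transformation query over them. context_rows maps table name ->
    list of row dicts (all rows of a table share the same keys)."""
    con = sqlite3.connect(":memory:")
    for table, rows in context_rows.items():
        cols = list(rows[0].keys())
        con.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
        con.executemany(
            f"INSERT INTO {table} VALUES ({', '.join('?' for _ in cols)})",
            [tuple(row[c] for c in cols) for row in rows],
        )
    cur = con.execute(query)
    names = [d[0] for d in cur.description]
    return [dict(zip(names, r)) for r in cur.fetchall()]

# Invented example context: one billing subscriber joined to its CRM contact.
context = {
    "subscriber": [{"sub_id": 7, "status": "ACTIVE", "contact_id": 10}],
    "contact": [{"contact_id": 10, "email": "jane@example.com"}],
}
records = transform_context(context, """
    SELECT s.sub_id, s.status, c.email
    FROM subscriber s JOIN contact c ON s.contact_id = c.contact_id
""")
# records == [{"sub_id": 7, "status": "ACTIVE", "email": "jane@example.com"}]
```

The appeal of this design is that the per-context query is plain SQL, so the ETL engineer's whole job reduces to defining the context and writing that query.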
So the issue at hand, when we wanted to take all this and migrate it from our previous technology over Hadoop to cloud-native technology, was that we needed a cache database that, on the one hand, is aligned with other tasks we are already doing and, on the other hand, provides us with low-latency, high-throughput, consistent response times. We looked at a few databases, and to tell you the truth, we already had Cassandra for some of our enrichment services, so Cassandra was an obvious candidate. But we also wanted to explore engines that are more memory-based and that might provide us with more user-friendly access. So we looked at a couple of them: one is a very well-known and highly used memory grid, and the other is another memory grid that also provides SQL access, et cetera. We evaluated against the criteria we see here: the tech alignment that I discussed; TCO, meaning when we actually deploy it, what's the BOM, what footprint we'll need, how much the license will cost, and, of course, what the bottom line is; non-functional and technical requirements; the refactoring effort it would take us; the install base of the engine; scale, meaning how far we can scale it; vendor lock-in; availability of querying tools; availability of a Kubernetes operator and the ability to get managed cloud services; and most important, we had to see satisfactory proof-of-concept execution results. So these were our candidates: Cassandra, those two in-memory databases, and ScyllaDB, which we added because, having Cassandra already, we didn't need to do any special programming for ScyllaDB. So we figured, "Let's get it into the POC." Here are the evaluation criteria we just discussed, and we can see that most of the results weren't leaning towards one engine or another.
If all the engines had delivered, the choice would probably have depended a lot on commercials and on where we put the most weight. But unfortunately for most of the engines, and fortunately for us, just one engine actually managed to pass the POC execution; it was the only one that delivered. Both of the other in-memory contenders, with the sizing allocated, which was really the most expensive and largest footprint we could afford for this while keeping it commercially viable, could not even finish the process. Cassandra did manage to finish, but it took three times longer than was considered reasonable, and the same code on ScyllaDB actually gave us better performance than we had on our previous Hadoop platform, with less hardware. So at this stage, it was a no-brainer for us to choose ScyllaDB for the task, because it really was the only one that delivered. We have been using it since and enjoying working with the ScyllaDB team. And this is the end of my presentation. I want to thank you for listening; you can always contact me at the email shown here, and thank you very much.