
Strategies For Migrating From SQL to NoSQL — The Apache Kafka Way

28 minutes

In This NoSQL Presentation

Today, enterprise technology is entering a watershed moment: businesses are moving to end-to-end automation, which requires integrating data from different sources and destinations in real time. Every industry, from Internet to retail to services, is leveraging NoSQL database technology for more agile development, reduced operational costs, and scalable operations. This creates a need to model relational data as documents, define ways to access them within applications, and identify ways to migrate data from a relational database. This is where streaming data pipelines come into play. Over the years, as the cloud's on-demand resource availability and full-service, API-driven, pay-per-use model became popular and competitive, cloud infrastructure consolidation began, requiring automated infrastructure deployment to be simplified and scalable. This session details one of the easiest ways to deploy an end-to-end streaming data pipeline that facilitates real-time, low-latency data transfer from an on-premises relational datastore like Oracle to a NoSQL database, all deployed on Kubernetes clusters provided by Google Cloud (GKE). Apache Kafka® is leveraged using Confluent Cloud on AWS, depicting a true multi-cloud deployment.
ScyllaDB Summit 2023 Speaker – Geetha Anne, Confluent, Sr Solutions Engineer

Geetha Anne, Enterprise Solutions Engineer, Confluent

Geetha is a solutions engineer in the big data management space with experience in executing solutions for business problems on cloud and on-premises. She fell in love with distributed computing during her undergrad days and has followed that interest ever since. She provides technical guidance, design advice, and thought leadership to key Confluent customers and partners, helping translate their enterprise business needs into the right technical solutions, ensuring the highest level of customer success and maturity, while keeping it all simple.

Video Transcript

Welcome to this session. My name is Geetha, and today we're going to discuss some strategies for moving away from a SQL-based relational data store to the NoSQL database or data store of your choice, and how you do it with the help of Apache Kafka. Thank you so much for joining me today.

Before we go on, a little bit about myself. I am Geetha, as I mentioned; I'm a Senior Solutions Engineer at Confluent. I joined Confluent two years ago, and before that I worked at Cloudera, ServiceNow, Hawaiian Airlines, and a few other companies. I was a software engineer earlier, and I've spent almost 10 years in this space, going from software development to automation to where I am right now, pre-sales solution architecture, all of it in Silicon Valley. I have two daughters and I live with them in the Bay Area. I enjoy cooking, singing, hiking, and various other physical pursuits. That's a little bit about myself, so let's get started.

As far as the agenda for this session goes, I'll introduce you to the problem enterprises are facing today: why you even have to modernize or migrate away from the relational world to a NoSQL store, how complex a process that is, and what different elements you need to consider while you're moving. And why Confluent? Primarily because we are built on a foundational technology, the streaming "database" that is Kafka; Confluent is synonymous with Kafka, it works with several databases out there, and it helps with data modernization techniques, so you will see why Confluent is pioneering this category of data movement and the data-in-motion paradigm. I'll introduce the solution and what different enterprises in this space are already doing to solve this problem. I will propose an architecture and several action items you can take in your own organization to remediate the problems you're facing with NoSQL stores, along with other gaps you might identify along the way, and show how you solve those problems with the help of Kafka. You will leave with some takeaways and a few next steps to execute yourself. I will not be showing a demo, but I will give you all the resources at the end of this talk so you can deploy the environment and test the proposed architecture I'm going to talk about today.

So, when it comes to the problem, let's discuss why you have to modernize your database using the cloud if you're on-prem. Legacy relational databases have been around for decades, and many companies are happy with them; they have provided the latency, the throughput, and all the modeling capabilities a large enterprise needs to thrive, along with reporting and its other data needs. So why are many companies seeing the need to migrate away from relational databases to NoSQL? To start with, migrating from SQL to NoSQL can feel like a puzzle game. When you have the pieces of a puzzle in front of you, there are multiple ways to start solving it, and you have to reason through several things as you go. If you're given the complete picture, solving the puzzle is easy; but imagine you don't know what the complete picture looks like, and you're trying to put the pieces together to arrive at some solution that makes sense. The goal of a solution like Kafka is to help you get to that full picture. In an ideal world, your apps on relational databases never run into issues and there's no need to switch over to NoSQL, but if you run into scalability issues, or if you ever need to go global, then you'll have some great reasons to migrate. Some of the NoSQL database vendors have partitioned, distributed architectures that handle petabytes of data with linear scalability, and they replicate data across multiple data centers around the world on the cloud, keeping up with your data needs.

But imagine how easy it would be if you could just take the old application, rewrite parts of it, and migrate your schema and your data to arrive at your new application. That's not how it works in reality. It's not that easy to solve all those puzzle pieces right away, because a relational database contains hundreds of tables, with different applications using them across different business functions. Migrating the whole application can take a long time, during which your application and data requirements might change; the applications these relational databases feed will evolve, and your migration strategy also has to evolve to pick up that evolution, becoming way too complex to manage. A more realistic approach is to take a small piece of the larger application, say a feature or business function that uses only 20 to 50 tables in the relational database, and start migrating that into NoSQL in a reasonable amount of time. That's how you usually start: break the large, complex application down into smaller pieces, approach them one at a time, solve those smaller puzzles, and then put them together in order to migrate completely to the NoSQL database. I'm going to discuss one approach, with the help of Kafka, for how you can decompose that whole effort into smaller, more digestible problems and solve them one at a time.

The cloud is exploding everywhere: you've seen cloud-native applications coming up, and there are self-managed databases and cloud databases. We know that MySQL, PostgreSQL, SQL Server and the like can be slow and rigid; they provide immense value, but they're still rigid, which is expensive in terms of both upfront and ongoing costs, and as a result it limits the speed at which businesses can scale, experiment, and drive innovation. I have a favorite quote that I always share at these conferences: the only thing worse than developing software that you could buy is running software that someone else could run for you. Organizations today are suddenly realizing the need to modernize their databases, and their strategies include streaming data pipelines and integration to enrich operations in order to increase agility, elasticity, and cost efficiency: all the good things you see in a cloud-based application. That's because of the streaming element being introduced into real-time and near-real-time applications; even batch is moving towards real time, and most of these companies are seeing that influx of data. Traditionally, self-managed databases are difficult to manage and scale: you need to consider compute resources, storage capacity, patching, and so on, and all of that gives rise to the need to migrate to a cloud-hosted system, where you can decrease your TCO by lowering operational costs and increasing agility.

This slide reiterates how time- and resource-intensive it is for large, mature organizations to connect numerous legacy systems and make sense of all the silos of data. First, it's highly complex and resource-intensive. Second, there are latency issues that are inevitable and persist even after the cloud migration, due to batch-based processing, which is unsuitable for time-sensitive and mission-critical use cases. And third, the current migration solutions are prohibitively expensive, and they either deal in data silos or carry licensing costs that further lock you into the vendors you have picked.

Because of that, today I'm proposing a solution that will help you easily modernize your database and simplify your migration towards a NoSQL solution, be it on-prem or in the cloud. With Confluent, you can accelerate all of this at the click of a button, and because one of the offerings is fully managed, built on the latest and greatest in the Kubernetes world, the transition will be almost seamless. Now that we've discussed the problem, let's dive into how Confluent can help you overcome these challenges and serve as a ubiquitous solution for moving data from SQL to NoSQL. Fundamentally, Confluent is the creator of Kafka: the founders of Confluent are the original creators of Kafka at LinkedIn, and they identified the new paradigm that has come to disrupt the data space and has now become a phenomenon, because data infrastructure really needed to be reimagined to seamlessly harness the flow of data across all your apps, databases, SaaS layers, and cloud systems. They introduced an entirely new paradigm that is purpose-built for digital economies, where the expectation is that data is real time or near real time. Most companies right now, the Fortune 1000s and 500s, are rethinking their data architecture to shift to this data-in-motion paradigm, like Citi Bank does. They have said, quote unquote, "we need to shift our thinking from everything at rest to everything in motion." If a bank is saying that, you can imagine the social media giants and everywhere else real-time data lives.

Wherever companies have legacy databases storing petabytes and zettabytes of data and want to move to a solution like NoSQL, they are turning to Kafka. Now, Kafka is a distributed system written in Java, and it's highly complex; if you have operationalized Kafka before, you know how complex it is, because it's a developer-focused product that takes an immense number of hours of Kafka expertise. With open source Kafka you still have to know lots of ins and outs: how you operationalize it, how ZooKeeper works, how governance is managed, because governance doesn't come out of the box with open source Kafka; you have to build something of your own. This is where Confluent's differentiation kicks in: we offer a fully managed service that handles all the operational burdens that come with open source Kafka and solves real-time connectivity problems for you. Confluent is made cloud native (it can also be deployed on-prem, don't get me wrong), but we have completely re-engineered the workings of Kafka from the ground up so that it is cloud native. It is elastic, meaning at the click of a button you can increase the number of instances that your workloads, the data from your relational stores, will need. It is global, meaning you can deploy it everywhere you can think of: on-prem, on cloud, bare metal, VMware, wherever. And the infinite storage piece gives you the capability to expand your workloads to as many terabytes or petabytes of data as you want; Confluent supports that too, with the help of object storage in the cloud and other mechanisms for on-prem solutions. Customers can store data as they see fit, and they're only billed for what they use.

As I mentioned, Confluent Cloud and Confluent Platform are the two options for deploying Kafka. Confluent Cloud is a fully managed offering: we deploy everything for you, you're hands-free, and it's on all the leading public clouds; Confluent can be deployed on all CSPs. Confluent Platform, on the other hand, is for when you have the Kafka expertise to deploy it yourself and want to resort to us for support. We have a world-class support organization that completely knows what it's doing; these are Kafka committers who live, breathe, and do only Kafka, so that's the value-add you get.

Coming to the solution: as we've discussed, the landscape is that we're moving data from your on-prem or cloud relational store to a NoSQL store. But how do you do that? The solution we're proposing is the Confluent Cloud ecosystem, which is cloud native, complete, and everywhere. It gives you the ability to stream events from your relational databases with the help of connectors. It supports stream processing, meaning all the data being streamed into Confluent or Kafka can be processed natively inside Kafka with the help of something called ksqlDB; and if you prefer it to be done with something like Flink, that can also be done, as we just announced our Flink integration and will now offer stream processing led by Flink. For stream governance we have a lot of features, and wherever your relational database is, Confluent Cloud can be deployed. How do you do it? It's a three-phased approach. First, you migrate your workloads with the help of managed Confluent source connectors. Then, you optimize: the data coming from your relational database may not be optimal to push directly into your NoSQL data store, so you might want to massage or wrangle some of it, aggregate it, or perform certain transformations; you can use ksqlDB, or Flink for that matter, which is part of Confluent Cloud, to optimize the solution. And then you modernize: you move the data into a NoSQL solution in the third phase with the help of our sink connectors. This way you can continue to migrate your workloads into the cloud or on-prem destination of your choice with that three-phased approach.

To start with the migrate phase, we have something called source connectors. Connectors are a fairly familiar topic for a lot of you in the Kafka world: Kafka Connect is an API built into Apache Kafka that allows you to source events from any system you see on the left, like Hadoop, Redshift, you name it (I'm sure we're missing a lot of them on this deck). It helps you move data from relational databases, mainframes, and so on into Kafka topics in Confluent through what we call source connectors. After the data lands in Kafka, the Connect framework, because it's fault-tolerant and scalable, works internally to store the data temporarily in Kafka's internal storage; once Kafka identifies that all the source data from your relational database, be it Oracle or MySQL, has been snapshotted and moved into Kafka, it can then leverage something called sink connectors to push data into the NoSQL destination of your choice that you see on the screen. And if you look at it, open source Kafka originally had only so many connectors; what Confluent has done is expand on those open source Kafka connectors to further increase the number of available data sources and sinks, and these connectors are fully supported by Confluent and our partners as well, so you save the time and effort of building them yourself. Large corporations used to invest time in building connectors, which can take anywhere from six weeks to almost six months on average to develop. With Confluent Cloud we're taking away all that operational risk; there's also the risk that if you choose to run an open source connector, you're risking it in production. You can bypass all of that with a Confluent Cloud source connector that connects to your on-prem Oracle or MySQL or whatever relational store you pick.

So the overall architecture is going to look like this: there are relational sources, you use a source connector to pull the data into Confluent Kafka, massage and transform it with the help of ksqlDB, and then use the sink connector, which is the modernize part; you can also do on-the-fly transformations with Kafka's connectors, and then push the data out to whichever NoSQL DB you prefer: ScyllaDB, MongoDB, Elastic, Cassandra, what not. That's, at a high level, how this is going to work. It looks so simple, right? But imagine doing it on your own: writing your own application to pull the data while also making sure that application is fault-tolerant, scalable, and gives you millisecond latency. That's a lot of work, and Confluent is going to save all that time for you.

The optimize piece of the three-phase plan focuses on massaging, transforming, and aggregating the data you're moving into Kafka. If you prefer to do it natively in the Kafka space, you can use something called Kafka Streams, which allows you to build real-time data-in-motion applications with the familiarity of Java or Scala. If you look at the screen, in the first column we see a traditional application using a Kafka client to essentially count the number of records; in the second column you have code doing the same with Kafka Streams, written in Java, which we built to expand the accessibility of stream processing; and in the third column you see ksqlDB. Confluent has built an abstraction layer on top of Kafka Streams that lets you build the same data-in-motion applications using a SQL-like interface, and that's exactly what ksqlDB is. ksqlDB supports aggregations, joins, window-based queries, out-of-order handling, and even exactly-once semantics, meaning if you want the data you're moving from the relational store to be delivered exactly once to your NoSQL database, ksqlDB makes sure of that. All of this occurs on compute and storage resources operating separately from your Confluent clusters; they are different components, but ksqlDB relies on Kafka for its high availability and fault tolerance, and that is by design: Kafka is the backbone of ksqlDB, used first and foremost as its storage layer, and ksqlDB also reuses Kafka as the foundation for its elasticity, fault tolerance, and scalability. It's available as a fully managed option in Confluent Cloud on all cloud providers. Then come the sink connectors, which take events from Confluent and send them to data targets like your NoSQL stores. Like the source connectors, they support single message transforms, or SMTs, which you can employ to further customize your data as it egresses your cluster, so you can take one aggregate topic and sink it to multiple external systems, each with a unique single message transformation.
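To make the optimize phase concrete, here is a sketch of the kind of ksqlDB statements it implies. The topic, stream, and column names below are invented for illustration, not taken from the talk; check the ksqlDB documentation for the exact syntax supported by your version.

```sql
-- Register the CDC topic from the relational source as a ksqlDB stream.
-- Topic and column names here are illustrative assumptions.
CREATE STREAM orders_raw (
  order_id    VARCHAR KEY,
  customer_id VARCHAR,
  amount      DOUBLE,
  card_number VARCHAR
) WITH (KAFKA_TOPIC='oracle.inventory.orders', VALUE_FORMAT='AVRO');

-- Optimize: mask sensitive fields and keep only what the NoSQL model needs.
CREATE STREAM orders_clean WITH (KAFKA_TOPIC='orders_clean') AS
  SELECT order_id, customer_id, amount, MASK(card_number) AS card_number
  FROM orders_raw
  EMIT CHANGES;

-- Windowed aggregation: spend per customer per five-minute tumbling window.
CREATE TABLE customer_spend AS
  SELECT customer_id, SUM(amount) AS total_spend
  FROM orders_clean
  WINDOW TUMBLING (SIZE 5 MINUTES)
  GROUP BY customer_id
  EMIT CHANGES;
```

The resulting `orders_clean` topic is what a sink connector would then push to the NoSQL store.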

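The migrate and modernize phases boil down to a pair of Kafka Connect configurations. The sketch below shows a source and a sink config back to back (two separate JSON payloads, not one file); the connector class names, hosts, field names, and credentials are illustrative placeholders to verify against the Confluent Hub documentation for the specific connectors you choose, not values from the talk.

```json
{
  "name": "oracle-cdc-source",
  "config": {
    "connector.class": "io.confluent.connect.oracle.cdc.OracleCdcSourceConnector",
    "oracle.server": "oracle.example.internal",
    "oracle.port": "1521",
    "oracle.username": "cdc_user",
    "oracle.password": "********",
    "table.inclusion.regex": "ORCL[.]INVENTORY[.]ORDERS"
  }
}

{
  "name": "mongodb-sink",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "topics": "orders_clean",
    "connection.uri": "mongodb://mongo.example.internal:27017",
    "database": "orders",
    "collection": "orders_clean",
    "transforms": "maskCard",
    "transforms.maskCard.type": "org.apache.kafka.connect.transforms.MaskField$Value",
    "transforms.maskCard.fields": "card_number"
  }
}
```

The `transforms.*` entries on the sink show a single message transform (here Apache Kafka's built-in `MaskField`) applied as the data egresses the cluster, as described above.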
So, at a high level, Confluent connects all of your organization's systems and applications together via a single pane of glass: a platform that enables data to flow freely and securely across all parts of global enterprises in order to power real-time applications. That is really, at a high level, what Confluent does, and NoSQL solutions are definitely part of our modernize piece; we have managed connectors available in the cloud, further easing the operational burden for you.

This is the high-level architecture, a real-world scenario we tried to simulate. I'm not able to show you a demo today, but I've given you links at the end of this talk so you can deploy it and see how the entire flow works. The goal is to modernize an Oracle database with real-time streams from Oracle and RabbitMQ to Amazon Aurora; that is one piece. Another piece is the same flow, but you also land the data in a NoSQL store in the cloud; you can pick anything of your choice, MongoDB or, I don't want to take names, any of the many solutions Confluent natively supports integration with. You extract data from Oracle using the Oracle CDC connector, transform sensitive data using single message transforms, extract the real-time transaction data from RabbitMQ using the fully managed RabbitMQ connector, then merge these two data sources and build fraud detection using ksqlDB, with aggregating and windowing in ksqlDB stream processing. We load the fraud detection result into MongoDB or ScyllaDB or whichever is your preferred destination; if NoSQL is your choice, go look in our Confluent Hub, where you'll find the relevant connector for your NoSQL data store and be able to push the data down to the NoSQL database.

Just to reiterate, this is the three-phased approach Confluent takes in modernizing your databases from on-prem or cloud-based relational data stores to NoSQL. First, migrate: identify the workloads that can be moved from relational to NoSQL. Then, optimize them with the help of ksqlDB, or Flink for that matter, which we are only now starting to support. Then, modernize with the help of sink connectors; this is where the NoSQL piece comes in and the data moves to NoSQL. When it comes to modeling, and how you want the data replicated into your end destination, the NoSQL DB, there are several nuances in how you map tables to collections or documents or what not; there is thorough documentation online at docs.confluent.io, and if you have any questions, please shoot me a message.

To summarize: Confluent is much more than Kafka. Confluent is a cloud-native, complete platform for data in motion and event streaming, and it's available everywhere, on AWS or GCP or Azure or bare metal. With a fully managed solution like that, you're getting an almost zero-operations service. You can also run Confluent in your own data center, and we'll support you beyond the core functionality of Kafka: Confluent also includes, as I mentioned, ksqlDB and Kafka Streams, which are the processing and analytics part of our solution, and you can use connectors to move data from disparate sources into cloud services and third-party systems like databases, with enterprise security, governance, infinite storage for your data streams, and much more supported by Confluent. And that brings us almost to the end of the session. That's the GitHub link you can go check out; there's source code that shows you exactly how to move your on-prem or cloud-based relational data store to a NoSQL database. I've given clear-cut instructions on how to do that: how to configure the connectors and what fundamental security knowledge you need. I also wrote a blog about it a few years ago doing the same thing; there I chose a certain NoSQL provider, but you can really do this with any provider of your choice. Confluent has done a great job of documenting how to modernize, how to move from relational to NoSQL, how to model, and, if it's a fully managed NoSQL solution, the different nuances involved, which Confluent has documented really well.
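The fraud-detection step described above, merging streams and applying windowed aggregation in ksqlDB, could look roughly like this. The stream, topic, and column names, and the "more than three transactions per minute" rule, are invented here for illustration:

```sql
-- Register the RabbitMQ transaction topic as a stream.
-- Names and the fraud rule below are illustrative assumptions.
CREATE STREAM txns (
  account_id VARCHAR KEY,
  amount     DOUBLE
) WITH (KAFKA_TOPIC='rabbitmq_transactions', VALUE_FORMAT='JSON');

-- Flag accounts with more than three transactions in a one-minute window.
CREATE TABLE possible_fraud AS
  SELECT account_id, COUNT(*) AS txn_count
  FROM txns
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY account_id
  HAVING COUNT(*) > 3
  EMIT CHANGES;
```

A sink connector subscribed to the `possible_fraud` changelog topic would then deliver the results to the chosen NoSQL store.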

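The table-to-document modeling nuance mentioned above can be illustrated with a small sketch, independent of any particular connector: denormalizing a one-to-many relational pair into one document per parent row, which is the typical access-pattern-driven NoSQL shape. All table and field names here are invented for illustration.

```python
# Sketch: denormalize relational rows (customers + orders) into one
# document per customer, as you might before loading into a NoSQL store.
# Table and field names are invented, not from the talk.

customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
orders = [
    {"id": 10, "customer_id": 1, "amount": 25.0},
    {"id": 11, "customer_id": 1, "amount": 40.0},
    {"id": 12, "customer_id": 2, "amount": 15.0},
]

def to_documents(customers, orders):
    """Embed each customer's orders inside the customer document."""
    by_customer = {}
    for o in orders:
        by_customer.setdefault(o["customer_id"], []).append(
            {"order_id": o["id"], "amount": o["amount"]}
        )
    return [
        {"_id": c["id"], "name": c["name"], "orders": by_customer.get(c["id"], [])}
        for c in customers
    ]

docs = to_documents(customers, orders)
# docs[0] is Ada's document with her two orders embedded.
```

Whether to embed child rows like this or keep them as separate documents depends on the application's read patterns, which is exactly the kind of nuance the documentation covers.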
And with that, we are almost at the end of the session. The goal of this session was to give you an idea of how you can modernize your legacy databases or data stores. If you identify gaps and want to move to a NoSQL solution, there is a plethora of migration tools out there, each with their own drawbacks, but Kafka is a solution that is ubiquitous and generally accepted by most Fortune 500 companies, and Confluent is the pioneering vendor in that space. Please check out docs.confluent.io for all the awesome documentation we have on our products, and if you want to connect with me, my email, Twitter, and GitHub are listed here. I'd also like to connect with you on LinkedIn if you want to follow what I do and what we're doing at Confluent, and I'm open to connecting with anyone who has questions for me or thinks I can be of any use. Thank you so much, and I'll see you later. Bye. [Applause]
