Everything in its Place: Putting Code and Data Where They Belong

Brian Sletten · 22 minutes · February 15, 2023

There is an old saying: a place for everything and everything in its place. It evokes a natural order that makes our days run smoothly. If we don’t have to work hard to locate the tools and things we need to accomplish life’s tasks, everything is easier. The idea that code runs on computers and data is stored in databases is increasingly only part of the story. I will highlight the trends changing this notion and the issues with getting it wrong.


Video Transcript

Good morning, good afternoon, or good evening, depending on where you’re joining from. Welcome to ScyllaDB Summit 2023. My name is Brian Sletten, and I am very happy to be here.

I’m here today to talk about one of the trends in our industry that is influencing organizations like ScyllaDB to change some of the dynamics of how our code interacts with our data. The idea you have heard elsewhere in the conference, of being able to run code directly in the database through user-defined functions and aggregates, and to a great extent prepared statements, is part of this trend, but the trend goes back much further, and I want to highlight that context in this discussion.

I wrote a book for O’Reilly on WebAssembly, and in it I highlighted all the different uses it has. One of the things most people think about is that it is a technology for bringing languages other than JavaScript into the browser. While that is certainly part of the story, as you will hear throughout the conference, there is a much bigger picture for WebAssembly: it is about making code safe, fast, and portable, so we can position code where we want to, write it in whatever languages we want, and have the safety of a sandboxed environment whether we are running inside or outside of the browser.

I want to take a step back in time, perhaps before some of you were born and certainly before some of you were in the professional world. Back in 1999 I was the first engineer at a company called Parabon Computation in Reston, Virginia. We were an Internet-distributed computing platform doing computational intermediation. This was obviously pre-cloud, but what we were trying to do was bring elastic computing to organizations that needed it. The central thesis was that people don’t buy as much computational power as they need; they only use what they have, when and how they have it. If they bought more computing power than they needed, the accountants would get upset about resources sitting around unused. So we had a series of organizations with elastic needs that weren’t being met at the time by things like Beowulf clusters, because, again, this was pre-cloud, and there was no widespread infrastructure for saying “I need to use 10,000 computers for the next couple of hours.”

We had pharmaceutical researchers trying to apply new drugs to existing diseases who couldn’t explain why a treatment would sometimes work and sometimes wouldn’t. It turned out they were not able to look at enough of their data, and with the elastic power of additional resources for a period of time, they were able to find the features in their data set that did explain why it sometimes worked and sometimes didn’t. We had people at Duke University studying why the body metabolizes medicine differently, even between two people of roughly the same physical size, with no clear explanation. By elastically acquiring additional computational power, they were able to look at more of their data and discover that it was the presence or absence of certain enzymes in the pancreas that caused the difference in metabolism.

Another example: we had a NASA researcher who would run a simulation on his own computers, and it would take nine months to complete. Because, at the time, the only real option for portable, safe execution that protected the resources we were borrowing was Java, we required him to port his application, or at least the computationally intensive portion of it, from Fortran to Java. I can assure you that 23 or 24 years ago Fortran was significantly faster than Java for numerical processing, but because we could give him thousands of computers on demand, he was able to launch his job from Maryland on the east coast of the United States, fly to San Francisco on the west coast, get to his hotel, and have his results waiting. That is the power of parallelism: certain problems lend themselves well to this kind of treatment, and we were able to go from nine months down to roughly six hours.

There were other things we tried. We had a Monte Carlo-based digital rendering system in that period, and we could produce very nice results, but those workloads were not purely compute-bound; there was a data element to them. We had to move the data out to the computers providing the resources, and at the time, in addition to being slower, single-processor machines with less disk space and less memory, they were generally on dial-up connections rather than broadband. The cost of moving the data to the device basically offset any additional power the network gained by bringing that node into our computational collection. So data-intensive workloads like 3D graphics were not a good match for the system, but embarrassingly parallel problems very much were, and we had several successes of that kind.
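To make the “embarrassingly parallel” idea concrete, here is a minimal sketch, not Parabon’s actual system (which predates these tools), of splitting an independent per-item computation across all local cores using Rust and the rayon crate. The `score_candidate` function and the drug-candidate framing are purely illustrative stand-ins.

```rust
// Cargo.toml: rayon = "1"
use rayon::prelude::*;

/// Illustrative stand-in for an expensive, independent computation,
/// e.g. scoring one drug candidate against a data set. No shared state, no I/O.
fn score_candidate(candidate: u64) -> f64 {
    (0..10_000u64)
        .map(|i| ((candidate * 31 + i) as f64).sqrt().sin())
        .sum()
}

fn main() {
    let candidates: Vec<u64> = (0..100_000).collect();

    // Because each item is independent, the work is "embarrassingly parallel":
    // rayon simply splits the collection across the available cores.
    let scores: Vec<f64> = candidates
        .par_iter()                   // parallel iterator instead of .iter()
        .map(|&c| score_candidate(c))
        .collect();

    println!("scored {} candidates", scores.len());
}
```

The same independence property is what let Parabon-style platforms fan work out across thousands of machines: the only scaling limit is how cheaply the inputs can be moved to where the cores are.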
This is really just a small example of the larger problem I want to contextualize. These days our world is even more complex, because it is not just about the computers we run on premises or in our own data centers. We also have cloud computing available as an elastic resource: if we need more, we can ask for more. The value proposition is the ability to buy computation on the margin, so if we need to spin up resources for some period of time, we can do so. But not everything belongs in the cloud. If we have a large amount of data we are going to chew on for a while, it makes sense to put it all up there, run the computation we need, and then spin it back down. But if we leave our data, our machine learning models, and the services that clients interact with in the cloud, there is a different kind of cost: a latency cost for interacting with those back-end systems. Sometimes we want those applications or services to run on our devices, whether those are desktop computers, enterprise resources, data centers, mobile devices, tablets, or even Internet of Things devices and sensors. And just as we saw 23 years ago, when people were concerned about running their code on computers they didn’t control, some organizations aren’t going to want to run their machine learning models or other intellectual property on device or on premises.

So what we are seeing is the emergence of a new interstitial computational area between the cloud and the device.
In the diagram here it is referred to as fog computing, but the more common term is edge computing. The idea is that we can move computation to the border of the cloud, into the cloud, onto the device, or to the border of our premises. We could move it to municipal downtown areas, for example if your Wi-Fi provider had the resources to run computing elements near cell towers. We can move code to where it makes sense, so that we remove the latency cost, because you clearly do not want the brake system in your car issuing calls into the cloud to decide whether or not to stop the car.

So there is a range of locations where we can put our code and our data, and increasingly we are going to have more complex ways of doing that. It is not just going to be “in the cloud” or “replicated to some cloud region.” We are going to have to think quite clearly about where it makes sense to put our data and where it makes sense to put our computation. I believe the primary goal for people in the IT space in the 21st century is going to be controlling the costs of computation: what does it cost to answer a question, what does it cost to service a particular user request?

Those costs are not just financial, although there is clearly a financial element when you have to buy computers, set them up, network them, power them, and air-condition them, or when you spin up resources in the cloud and buy capacity on the margin. There is also a temporal cost: how long does it take for the code to execute? Faster hardware and faster chips make it run faster, but there is a financial premium for that. There is the latency cost I mentioned: if the code has to interact with a remote client or some other peer, that latency, while perhaps not out of control, still adds up, and chatty APIs and protocols pay that latency cost over and over again. One of the reasons the REST architectural style of the web has been so successful as a global information system is its capacity to cache results on the client. And if deploying code into this complex infrastructure is difficult, time consuming, and requires skills your organization doesn’t have, such as architecting complex networking to reduce latency, then there is an opportunity cost of not doing whatever it is you are trying to do. So we want to make it easy, possible, and affordable. At the end of the day there may still be, based on the nature of the data, some cost for regulatory compliance: if you have private information, GDPR constraints, HIPAA medical privacy concerns, or PCI concerns, then putting the code where the data is, and never taking the data off the device, may be a way of reducing some of those regulatory compliance costs as well.

My main point is that the world is evolving into a computational fabric, and where the code runs is going to cost more or less along one or more of these dimensions.
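As a back-of-the-envelope illustration of why chatty protocols pay the latency cost over and over, here is a tiny sketch. The round-trip times and call count are made-up placeholder numbers, not measurements; the point is only that per-call latency multiplies while batching or caching pays it once.

```rust
fn main() {
    // Placeholder figures, purely illustrative: a wide-area round trip to a
    // distant cloud region vs. a round trip to a nearby edge location.
    let cloud_rtt_ms = 80.0;
    let edge_rtt_ms = 8.0;

    // A "chatty" interaction makes 50 small request/response exchanges and
    // pays the round trip 50 times; a batched or cached interaction pays once.
    let calls = 50.0;

    println!("chatty, cloud : {:>6.0} ms", calls * cloud_rtt_ms); // 4000 ms
    println!("chatty, edge  : {:>6.0} ms", calls * edge_rtt_ms);  //  400 ms
    println!("batched, cloud: {:>6.0} ms", 1.0 * cloud_rtt_ms);   //   80 ms
}
```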
The scenario we are dealing with is not that different from what happens on GPUs, which is just a smaller version of what happened at Parabon many years ago. If you want the benefit of parallelization, you have to copy the data and the code to the device: you send it from the CPU to the GPUs and then wait for the response. Your choice is either a synchronous, blocking wait, in which case the CPU is not doing anything else, or a more complex, modern concurrent design that allows asynchronous communication between the CPU and the GPUs it is using, so the CPU can continue doing other work while you get the benefit of the GPUs’ parallelism. That is just another example of what we were facing years ago.

If we go even smaller and look not at a desktop computer but at the modern equivalent of a motherboard on a chip, a system-on-chip or SoC, we have this problem as well, because one of the main obstacles to performance, even in the top supercomputers of the world, is memory bandwidth: what does it take to get the data to where it needs to be so we can work on it? Consider the Fujitsu A64FX, the system-on-chip that powered the Fugaku supercomputer, which was the top supercomputer in the world a couple of years ago at the RIKEN Center in Japan. Each of these SoCs has 32 gigabytes of memory on package, so we can avoid copying data around more often than necessary; we can push a lot of data down there and let each of the 48 cores access some portion of it. Again, this is very similar to what we faced at Parabon and what we see in the GPU space, and really a larger example of what Herb Sutter famously described in his article “Welcome to the Jungle.” That was our computational ecosystem imagined ten years or so ago, and it still applies. On one axis we have a range of processors, from complex general-purpose CPUs, which are computationally flexible but not optimized for anything in particular, through GPUs and FPGAs, all the way to ASICs (application-specific integrated circuits), which do less and less general work and more and more specific work, but faster and with lower power consumption, which is another cost we have to manage if we want to reduce the carbon footprint of computation. On the other axis we have the memory models in play: if we are talking about multiple CPUs talking to each other, we need co-located data in some kind of unified memory, but if we are talking about cloud computing, there is again the cost of moving things around and sending messages. We have to deploy code into all of this, so we need technologies, platforms, and tools that help us deal with it.

That is exactly what WebAssembly and technologies like LLVM are doing for us. They allow us to write code once and move it anywhere, so we remove the obstacle of platform-specific libraries without giving up performance in the process. We can run this code in sandboxed environments, inside the browser, in Node or Deno, or in newer environments such as the WASI runtimes from the Bytecode Alliance, so that the code is safe. The ability to spin up multiple instances of these modules gives us resilience and reliability in the systems we build, and the cost of moving code to where it needs to be can be minimized because WebAssembly modules can be cached just like any other asset.
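As one concrete flavor of running code in a sandbox outside the browser, here is a minimal sketch of embedding a WebAssembly module with the Bytecode Alliance’s wasmtime crate. The tiny module text and its exported `add` function are assumptions made up for illustration, and API details can vary between wasmtime versions.

```rust
// Cargo.toml (assumed versions): wasmtime = "*", anyhow = "1"
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    // A tiny module in WebAssembly text format that exports `add`.
    let wat = r#"
        (module
          (func (export "add") (param i32 i32) (result i32)
            local.get 0
            local.get 1
            i32.add))
    "#;

    let engine = Engine::default();
    let module = Module::new(&engine, wat)?;      // compile once
    let mut store = Store::new(&engine, ());      // isolated per-instance state
    let instance = Instance::new(&mut store, &module, &[])?;

    // The sandbox only exposes what the module exports; the host decides what,
    // if anything, the module is allowed to import.
    let add = instance.get_typed_func::<(i32, i32), i32>(&mut store, "add")?;
    println!("2 + 3 = {}", add.call(&mut store, (2, 3))?);
    Ok(())
}
```

The same compiled module can run unchanged in a browser, in Node or Deno, or in a server-side host like this one, which is what makes it practical to move the code instead of the data.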
One of the biggest things this helps us overcome is debugging, because it is hard to debug things in a distributed system. If we can write the code locally, test it in our own tooling, and then deploy it into this continuum of locations, there is less need for remote debugging. I am not saying there is no need, but we can absorb some of the costs of supporting distribution that way. These are the additional complexities of deploying code into a continuum of heterogeneous computing elements, but because we have tools that lower the cost, we can avoid some of the opportunity costs that might otherwise prevent us from doing it successfully.

With that in mind, let’s look again at the kind of conventional installation you might have with a high-performance database like ScyllaDB. Somebody buys a license or installs it somewhere on premises and then writes local code to interact with the database. This is a fairly conventional software installation, and the obvious cost is that it is hard to scale up; it is hard to add instances without having the physical devices to install the software on. That is where cloud computing comes in. But before we get there, the work ScyllaDB is doing with user-defined functions and aggregates is another example of not moving data to where it does not need to be. If our query language could not be extended, we would have to pull larger chunks of data out of the database, analyze them in our own computational environment, in this case on whatever CPU our code happens to be running, and then go back and perhaps select more. But when we can push code into the database safely, written in arbitrary languages and debugged independently, as a way of extending the query power of the database’s language, that reduces the cost of pulling the data out, filtering it, and pushing results back. We solve a lot of those costs by extending a powerful platform like ScyllaDB with user code written safely, securely, and conveniently in whatever language we want. The scenario of asking the NASA researcher to rewrite his code from Fortran to Java is no longer necessary: if you need expanded computational power, or you want to reduce the latency cost, or address some other pressure, these tools now help you reduce those costs and let you locate the code and the data where they need to be, specifically when interacting with a ScyllaDB system.

Beyond that, we can push these things into the cloud. We can have cloud-based instances and installations of our data, push the code into the cloud, do complex analysis up there, and then make the results available for querying from any of these other locations; that is a very common approach. But we can also imagine leveraging architecture in the form of serverless functions or microservices and pulling those out to the edge somewhere, not directly on premises (and therefore not giving up our intellectual property), but closer to where the customer is, so that we reduce the latency cost and still get the benefit of elastic cloud computing and geographic speedups through edge computing environments.
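To give a flavor of what “pushing code into the database” can look like, here is a minimal sketch of a pure Rust function compiled to a WebAssembly target. The function name, signature, and policy are illustrative assumptions of mine, and the exact way a given ScyllaDB release registers and invokes WebAssembly UDFs (helper crates, CQL syntax, supported type mappings) is not shown here; the real wiring belongs to the ScyllaDB documentation.

```rust
// Illustrative sketch only. Built as a library crate (crate-type = ["cdylib"])
// for a wasm32 target, e.g. `cargo build --target wasm32-unknown-unknown`.
// How a specific ScyllaDB version maps CQL types onto these parameters and
// loads the resulting module is an assumption left to its documentation.

/// Clamp a sensor reading into a valid range before it is aggregated,
/// so the raw rows never have to leave the database to be cleaned up.
#[no_mangle]
pub extern "C" fn clamp_reading(value: f64, lo: f64, hi: f64) -> f64 {
    if value.is_nan() {
        return lo; // illustrative policy: treat missing/NaN readings as the floor
    }
    value.max(lo).min(hi)
}
```

The appeal is that the same function can be unit-tested locally as ordinary Rust, then shipped as a small sandboxed module to run next to the data instead of pulling rows out to filter them.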
Finally, one thing we are also starting to see, and it has not happened with ScyllaDB yet, but it is just a matter of time, is taking a database and pushing it into the browser itself. We have seen that with smaller databases like SQLite, and with others like Postgres and DuckDB, and eventually something as powerful and high-performance as ScyllaDB will make its way there as well. If it makes sense to run the code and the data in the browser at the same time, that will be an option too.

Really, all I am trying to get you to think about is that the future is going to be very interesting. Our ability to reduce the cost of computation by co-locating data and the code that works on it, wherever that makes sense, is going to give us the freedom to reduce those costs and address these problems in a much easier way than we have been able to, even in the past 10 or 15 years of cloud computing. Thank you very much for your time. I hope you enjoy the rest of the conference, and please do reach out if you have any questions. Enjoy the future; it is getting very exciting, and we are able to do much cooler things than we ever could before. [Applause]
