CI/CD for Data – Building Data Development Environment with lakeFS

Vinodhini S Duraisamy · 29 minutes

A property of data pipelines one might observe is that they rarely stay still. Instead, there are near-constant updates to some aspect of the infrastructure they run on, or in the logic they use to transform data, to give two examples.

Applying the necessary changes to a pipeline efficiently requires running it in parallel to production to test their effect. Most data engineers would agree that the best way to do this is far from a solved problem.

Most attempts at doing this fall at one of two extremes: either the pipeline is executed with overly simplified, hardcoded sample data that lets through errors that only appear with production data, or it is executed in a maintenance-intensive dev environment that requires duplicating the production data, which massively increases the risk of a breach or data privacy violation.

The open source project lakeFS lets one find the much-needed middle ground for testing data pipelines by making it possible to instantly clone a data environment through a zero-copy cloning operation. This enables a safe and automated development environment for data pipelines that avoids the pitfalls of copying or mocking datasets and of using production pipelines to test.

In this session, you will learn how to use lakeFS to quickly set up a development environment and use it to develop/test data products without risking production data.



Video Transcript

I'm so excited to be speaking at ScyllaDB Summit today. We're going to talk about CI/CD for data, meaning how you can test your pipelines the right way by building development and test data environments with lakeFS.

I am Vino, the developer advocate at lakeFS. As you can see here, I started off as a software engineer, then moved on to data and ML engineering, and currently I work as a developer advocate for lakeFS. lakeFS is an open source data versioning engine that offers git-like interfaces for your data lake. Let's jump right in.

Now, as data engineers, this is exactly how our life looks: you expect your pipelines to run in a predictable way, the way you programmed them, but that Spark job is not going to run exactly the way you coded it, and this is what we end up with. One small schema change and there goes your on-call weekend, trying to fix something in production that is on fire. What can go wrong with our data pipelines on a day-to-day basis? As a data engineer I've been guilty of doing some of these things, and I'm sure you have done a few of them, all of them, or even more than what is on this list. Spark or EMR upgrades: any infrastructure upgrade that touches our data pipelines is bound to go wrong. Remember the tiny schema change that the business wanted you to make, or that the upstream team forgot to mention because there was no data contract in place. Or the KPI definition that got updated by the business, so now you have to rerun or backfill all the data by running the pipelines all over again. Troubleshooting failed Spark jobs and non-idempotent pipelines, so you can't just rerun them any time you hit a failure in your Airflow DAGs. And my absolute favorite of all: the legacy DAG that nobody on your team wanted to touch, but now you need to make a minor upgrade to some logic in it, and you never know what catches fire in which part of a DAG with hundreds and hundreds of tasks. The list of things that can go wrong with our data pipelines goes on and on.

But why is it so complicated? Why is dealing with data pipelines that hard? It is because the data life cycle is a bit different from the software application development life cycle. When you look at the data life cycle, you have multiple data sources writing to a central repository, a data lake or a data lakehouse, whatever it may be in your environment. Then you have multi-step, complex ETL pipelines running on this raw data to turn it into a more consumable, curated format, ready for consumption downstream. Downstream, you have multiple consumers, sometimes even concurrently, trying to read the data off your data lake. And it doesn't end there: in a typical software development life cycle you are done once you ship, but in the data life cycle, thanks to privacy regulations and so on, you also need to take care of everything up to the point where data deletion takes place. Monitoring, debugging, all the way to data deletion: it is this complex data life cycle that causes so many of the challenges we hit while working with data pipelines on a daily basis.

Because of all these changes, we need an effective testing strategy for our data pipelines. But do all of us actually test our data pipelines? Not so much. Here you can see a poll I posted on the Reddit data engineering community, answered by actual data engineers working in different companies, and it is fairly understandable: as data engineers we are running against hard deadlines, and sometimes that means we need to deliver or serve the data by the end of the week rather than focusing on building best practices to ensure data quality or an effective testing strategy for our data pipelines. We need to do unit testing, integration tests, and end-to-end tests. End-to-end tests, because data pipelines are not isolated: they are connected to multiple data sources and multiple data consumers on both ends, so end-to-end testing is absolutely needed to ensure data reliability and quality.

All right, so you agree that we need end-to-end tests. Now, how are you going to run them? Some of us copy production data outside of the production environment into a test environment. Some of us try to create mock data, because it's easy to take a template schema and generate some data to test your pipeline before you deploy it to production. And some of us try to sample the production data, optimizing for variety, volume, and velocity, and replicate it into a test environment. But none of these strategies is effective: mock data will not represent real-life production data well enough to test your pipelines, and when you copy production data outside, you run into data privacy violations, thanks to HIPAA, CCPA, GDPR, and so on. So how exactly are we going to build an effective testing strategy? That is exactly why we need CI/CD for data, meaning we apply engineering best practices to data engineering and build CI/CD for a data product instead of a software product. Let's go ahead and see how we can implement this for your data lake.

What are the CI/CD best practices in the application development world? There are three phases: build, test, and deploy. Let's take these phases and see what the CI/CD-for-data best practices look like in each of them. In the build phase, we place all the data assets under version control; instead of source code version control, we have data version control here. In the test phase, we want continuous integration tests to run automatically before promoting the product to the next stage, and to run these tests we also need isolated data environments spun up in an automated fashion. In the software world we would spin up a Docker container or a temporary sandbox environment to run our tests; how are we going to do that in the data world? We will see that later in the presentation. In the deploy phase, too, we want to run a suite of tests, and only if those tests are successful do we deploy the product, be it a software or a data product. And even after deployment, after all these checks and tests, you will still end up with some data errors in production, because the data is constantly changing at a pace you could not test for. For such scenarios you also need automatic rollbacks in place, just like a git revert, if I can call it that.

Now, how are we going to implement these best practices in these three phases in the data world? Let's take the first one: version control your data in the build phase. This is exactly where lakeFS comes in. lakeFS offers git-like interface capabilities for your data lake, and it offers version control. As you can see here, what I have in my data lake is a Delta table, which has a Delta log and a couple of parquet files partitioned by date. We already have table-level abstractions, so we have two tables, table one and table two, in our data lake. What lakeFS can do here is offer branch-level abstractions on top of the tables, meaning that, just like in git, you can create data branches, say main, v1, v2, and so on, and the data in each of these branches is isolated from the others. It gives you git-like operations like branch, commit, and so on to work with this data. For example, here you can see that this particular Delta table can sit on top of any of these object stores, be it S3, Azure Blob, or GCS. Let's dive a bit deeper: suppose your data lake is on S3 today, how can lakeFS work with it?
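To make the branch isolation concrete before getting into the architecture, here is a minimal sketch of what reading the same table from two lakeFS branches can look like from Spark. It assumes a hypothetical repository named example-repo exposed through the lakeFS S3 gateway, with Spark's s3a filesystem pointed at a local lakeFS endpoint; the repository, branch, and table paths are placeholders, not part of the talk.

```python
# Minimal sketch: reading one table from two isolated lakeFS branches via the
# S3 gateway. Assumes the hadoop-aws/s3a jars are on the classpath and that a
# lakeFS server is reachable at localhost:8000; all names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:8000")   # lakeFS S3 gateway
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key-id>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-access-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# The repository is addressed like a bucket and the branch is the first path
# element, so the same table path can be read from main or from a test branch.
prod_df = spark.read.parquet("s3a://example-repo/main/tables/table1/")
test_df = spark.read.parquet("s3a://example-repo/test-branch/tables/table1/")
```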

As you can see, lakeFS sits on top of these object stores; again, it could be S3, Azure Blob, or even MinIO as an on-prem object store. lakeFS sits on top of them and offers a git-like interface: the lakectl command-line utility, the lakeFS UI, or one of several language API clients. They work with Python, they work with Java and Spark; lakeFS has API clients for all these languages, so you can easily work with it. And your existing data consumers and applications, Kafka, Spark, Airflow, Athena, Hadoop, Trino, and so on, the whole host of applications you have today, can either access the object store directly or access it through lakeFS with data versioning enabled. With lakeFS, data versioning is not limited to the table level: it can be enabled at the repository level, meaning an entire S3 bucket can be versioned and kept track of.

As a data engineer, you also want to make sure that the introduction of a new tool does not disrupt your existing code base. If you look at the right side here, with lakeFS as part of your storage layer, all you need to do is add the name of the branch, in this case main, to the S3 prefix, and you're good to go. With a minimal level of intervention in your existing code base, you get data versioning on your data lake with the help of lakeFS.

But how does lakeFS work? Is it going to create a copy of the data every time I create a new branch? Not at all. lakeFS does zero-copy cloning, meaning it only copies pointers to the objects underneath. As you can see here, we have two different commits that reference the same pointers, meaning they share the objects underneath. Any time a new object is created, lakeFS does copy-on-write: it writes the new data, creates pointers for that new data, and adds them to that commit. Because of this, creating a branch is a few-millisecond operation. Imagine your production data is a few terabytes, or even a few petabytes or hundreds of petabytes, and you want to create a new data environment to run your tests: copying hundreds of petabytes into a new test environment would take a lot of time, and storage is not free. With lakeFS you can do it in a few milliseconds without increasing your storage costs at all: zero-copy cloning of your existing production data into a test environment, so you have all the data that is in production for testing your data pipelines.

This is interesting, right? Now for the next phase: we only talked about how to enable versioning for your data lake; the next question is how to run integration tests on an automated, on-demand test data environment. lakeFS also offers a feature called lakeFS hooks, and using it you can have CI tests run automatically. As you can see here, hooks can be used in several different ways. One way is on the ingestion side of things: before you ingest any data into production, you always want to run a suite of tests, and only if those tests are successful do you ingest the newly arriving data into production. You can leverage lakeFS hooks for this. On the experimentation side, let's say there is an error in production and you want to troubleshoot what is going wrong: instead of working directly with production, you can simply create a branch out of production and look at what's happening. If you have a hotfix you want to test against production data, you can do that as well, and if the fix is verified and approved, you can merge it back to production in the end, overwriting what was in production, meaning overwriting the error with the hotfix you tested in your test branch. So there are several different ways in which you can leverage lakeFS hooks to implement the CI part of the pipeline for your data lake.

Once all these tests have run, suppose that despite all the tests, checks, and quality validations there is still an error in production. The first thing you want is to be able to revert to a previous state that was consistent and error-free, because at any point in time your consumers are still consuming data from production. So the first thing you do is revert to a consistent state of data that is available for your consumers, and you can do that as well with lakeFS revert.

Now let's go over a quick demo of how you can use lakeFS to build a development or test environment, run data exploration and data cleaning in that test environment, and, once the data cleaning is done, write the clean data back to the main, or production, branch. I have lakeFS running locally in a Docker container: as you can see here, lakeFS is on port 8000, MinIO is running locally at 9001, and the Jupyter notebook is at 8888. This is a sample Jupyter notebook that the lakeFS team has put together for you to explore and play around with creating development or test data environments with lakeFS.

First things first, let's explore the lakeFS UI. You see an example repository here; just like a git repository, we have a lakeFS data repository. Once you go in, you see there is no data currently, and it has only one branch, which is main: no uncommitted changes, the only commit is the one that created the repository, no tags, nothing to compare, and so on. Now let's create a new repository, add new data, and see what it looks like. I'm going to do that from my Jupyter notebook using the Python lakeFS API, so let's install these first. This is just the installation and configuration step to connect lakeFS to the Jupyter notebook, so we can run everything programmatically.
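That installation and configuration step looks roughly like the following. This is a minimal sketch assuming the lakefs_client Python package and a local lakeFS server with placeholder credentials; package and class names can differ between lakeFS SDK versions, so treat it as an illustration rather than the notebook's exact code.

```python
# Minimal sketch: connect a notebook to a local lakeFS server.
# Assumes `pip install lakefs-client`; endpoint and credentials are placeholders.
import lakefs_client
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration()
configuration.host = "http://localhost:8000"            # the local lakeFS server
configuration.username = "<lakefs-access-key-id>"       # lakeFS access key
configuration.password = "<lakefs-secret-access-key>"   # lakeFS secret key

client = LakeFSClient(configuration)

# Quick sanity check: list the repositories in this installation.
print(client.repositories.list_repositories())
```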

What I'm going to do here is create a new repository for the Netflix movies data.

Before that, I also need a storage layer to sit underneath, so I am going to create a new MinIO bucket for the Netflix movie repository.
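Creating that bucket programmatically could look like this. It is a sketch assuming a local MinIO instance (the S3 API is typically served on port 9000, while the console shown in the demo runs on 9001) and placeholder credentials; the bucket name is a guess at the one used in the demo.

```python
# Minimal sketch: create the backing bucket on a local MinIO instance with boto3.
# Endpoint, credentials, and the bucket name are placeholders/assumptions.
import boto3

minio = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",         # MinIO S3 API (console is on 9001)
    aws_access_key_id="<minio-access-key>",
    aws_secret_access_key="<minio-secret-key>",
)
minio.create_bucket(Bucket="netflix-movie-repo")  # hypothetical bucket name
```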

Awesome, so we have this bucket. Now let's take it and create a new lakeFS repository on top of it: create a blank repository, call it the Netflix movie repo, and the S3 bucket, the MinIO bucket I'm going to use here, is the one we just created. I'm using MinIO, and I'm using s3 as the prefix because MinIO supports an S3-like API, and lakeFS can work with any object store that supports an S3-compatible API. Let's keep the default branch as main and create the repository.

So we have the Netflix movie repo. Let's go ahead and create branches as well. I'm creating two branches here: one is the ingress landing area branch, which will contain the data coming in from our data sources, and the second is the staging area branch, where we will do all the preprocessing and cleaning of the data before actually pushing it into production. If we list the branches now, it should only show the main branch, because we haven't created any new branches yet. Now let's create the ingest branch, and I'll also create the staging branch. Once both are created, let's see: awesome, I have the ingress landing area, main, and staging branches. Let's check that in the UI as well: if I go to branches, I see three branches, and one interesting thing I want you to notice is that all the branches have the same commit ID. That's because I created these two branches from main, and since nothing has changed yet, they all have the same pointers: they don't copy the data, they just reference the same objects underneath in the object store.

Now, I have movies.csv locally, and I'm going to upload it to my ingress landing area branch. Let's do that and see what changes happened: we've just added a new file, movies.csv, under this path. Let's go to the ingress landing area: awesome, the data has landed. The next step is to commit this data on the ingress branch. Once you commit, you can see who the committer is, the creation date, and all the commit metadata that goes in. Just like a git commit message, you can add a commit message, and you also have the option to add metadata: if a specific DAG committed this data, you can record that DAG here, or any other metadata you want attached to each commit, which helps you track the data very easily. Let's look at the commit again.

Awesome, so we created this repository and put the Netflix data into the landing area. Next, let's copy this data from the ingress area to the cleaning or staging area, do some simple preprocessing, see what the data looks like, a bit of data exploration, and so on.
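Put together, the repository, branch, upload, and commit steps from this part of the demo look roughly like this with the Python client. It is a sketch that reuses the client configured earlier; the repository, bucket, branch, and object path names are taken from the demo where they are clear and are otherwise hypothetical, and the generated SDK method names may vary by version.

```python
# Minimal sketch of the repository/branch/upload/commit steps from the demo,
# assuming the lakefs_client package and the `client` configured earlier.
from lakefs_client import models

# Create a repository on top of the MinIO bucket (MinIO speaks the S3 API,
# so the storage namespace still uses the s3:// prefix).
client.repositories.create_repository(
    models.RepositoryCreation(
        name="netflix-movie-repo",
        storage_namespace="s3://netflix-movie-repo",
        default_branch="main",
    )
)

# Two branches off main: one for landing raw data, one for cleaning/staging.
for branch in ("ingress-landing-area", "staging-area"):
    client.branches.create_branch(
        repository="netflix-movie-repo",
        branch_creation=models.BranchCreation(name=branch, source="main"),
    )

# Upload the local CSV to the ingress branch and commit it with some metadata.
with open("movies.csv", "rb") as f:
    client.objects.upload_object(
        repository="netflix-movie-repo",
        branch="ingress-landing-area",
        path="raw/movies.csv",
        content=f,
    )

client.commits.commit(
    repository="netflix-movie-repo",
    branch="ingress-landing-area",
    commit_creation=models.CommitCreation(
        message="Load raw Netflix movies data",
        metadata={"source": "local-upload"},  # e.g. the DAG or job that produced it
    ),
)
```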

So we have copied the data to the staging area, and if we go to staging we should be able to see our data in a bit. Okay, reading the data file with Spark is taking a bit of time. Awesome, the command is successful and we can see the data here now. So we have the data in the staging area, all good. Let's do some exploration and see what the data looks like. It's just the movies data; it has a few columns and around 8,000 records, and you get a glimpse of what the data looks like. It looks good. Let's see if it has any null values or anything else we need to clean up, and go directly to the null checks. Yes, it does have a couple of null values in some of these columns, so let's simply drop them. You might have a different strategy for dealing with null values at work, but here, for the sake of the demo, I'm just going to drop them all. Now the data looks clean: no null values, everything sorted. What I'm going to do next is take that data, partition it by a different column, the country column, and save it as parquet files in my staging environment. So let's do that. When I go to the staging area, you see the raw data which we copied, and we also now have an analytics path where we are saving the data in parquet format, partitioned as well. The Spark job is running; it's taking a while, we don't have success here yet. There are different countries we're partitioning by, so it's going to run for a bit.
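The cleaning steps described here map to a few lines of PySpark. The following is a sketch under the same assumptions as before (the Spark session from the earlier snippet, with s3a pointed at the lakeFS gateway); the paths, branch name, and the country column are taken from, or inferred from, the demo.

```python
# Read the raw CSV from the staging branch through the lakeFS S3 gateway,
# check for nulls, drop incomplete rows, and write the result back to the
# staging branch as parquet partitioned by country. Paths are illustrative.
from pyspark.sql import functions as F

staging = "s3a://netflix-movie-repo/staging-area"

movies = spark.read.option("header", True).csv(f"{staging}/raw/movies.csv")
movies.printSchema()
print(movies.count())  # roughly 8,000 records in the demo dataset

# Count nulls per column, then use the demo's simple strategy: drop those rows.
movies.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in movies.columns]
).show()
clean = movies.dropna()

# Save as parquet, partitioned by country, under an analytics/ prefix on staging.
(clean.write
      .mode("overwrite")
      .partitionBy("country")
      .parquet(f"{staging}/analytics/movies-by-country/"))
```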

Is it done? Okay, this is successful, so we have that as well. And we also need to commit this, because any time you make changes like these, you can go here and see the uncommitted changes, just like in git. The uncommitted changes are the analytics files, the movies partitioned by country, and you can also look at the change log, the diff, of each of these files, which lets you see what the changes are before you actually commit them. Now let's go ahead and commit these as well.
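Inspecting the uncommitted changes and committing them from the notebook could look like this. It is again a sketch that assumes the lakefs_client SDK and the client from earlier, and the exact diff method name may differ by SDK version.

```python
# Inspect the uncommitted changes on the staging branch (the parquet files we
# just wrote), then commit them, analogous to `git status` followed by `git commit`.
from lakefs_client import models

diff = client.branches.diff_branch(
    repository="netflix-movie-repo",
    branch="staging-area",
)
for change in diff.results:
    print(change.type, change.path)   # e.g. added analytics/movies-by-country/...

client.commits.commit(
    repository="netflix-movie-repo",
    branch="staging-area",
    commit_creation=models.CommitCreation(
        message="Partition clean movies data by country (parquet)",
        metadata={"format": "parquet", "partitioned_by": "country"},
    ),
)
```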

Awesome, so this commit is successful, and when you look at the commits, we had copied the data, and now we have loaded the partitioned files in parquet format to the staging area. All good: I wanted my files in parquet format, I wanted them partitioned by country, and there are no null values, all clean data. Now I want to push this data into production. All I'm doing is merging my staging branch into the production branch, so the source is the staging branch and the destination is my prod branch. Before we merge, I want to show you what is in prod: prod is essentially the main branch, and currently it has nothing, because we kept it clean; we only wanted the clean, correctly partitioned parquet files in main. So let's merge and see what happens. Awesome, no conflicts, everything is good. We do have the files here from the staging area, and if you go to the commits, you can see the whole commit history, the changes that happened, the data lineage of the repository: the repository being created, the data copied from ingest to the staging area, all the changes, and then eventually deploying them to production.

At the end of it, the staging branch was just a temporary sandbox environment we spun up for testing our changes and cleaning up the data. It's up to you: you can delete the branch because you're done with the sandbox environment and don't need it anymore, or you can keep it around for running other tests on the data. Here I'll delete it and move forward. And that is it; we're done with the demo, and you now know you can create these development or test data environments with lakeFS to test your pipelines and to clean up your data. Let's go back to the presentation and see what else lakeFS can do for you.

All right, so we're done with the demo, and we saw how you can use lakeFS to build your development or test data environments. Now on to the next phase, the deploy phase. We said that in the build phase you use lakeFS for data version control, and in the continuous integration test phase you use lakeFS to run a suite of tests before deploying your data into production. In the deploy phase, I want to highlight how lakeFS can help you with data quality checks and with automatic rollbacks of production errors.

Like I said, lakeFS has a feature called lakeFS hooks, which are conceptually very similar to git hooks, meaning you can have a hook trigger on a specific action, for example creating a branch, committing, or merging. For any of these actions you can configure a lakeFS hook, say a pre-commit hook, so that before a commit it runs the list of checks you configure, and only if that set of checks is successful will lakeFS let the commit go through. For example, in our case, when we merged our data into production, we could have had a suite of checks run for us, or you could have Great Expectations or Soda run a suite of tests, and only if those tests are successful would the data be deployed into production. And even with all of these best practices in place, there may still be instances where you end up with production errors, and for those lakeFS offers the revert option as well: in one command, you can revert from the error state of production to a consistent, correct state of production data, so your consumers can keep consuming the data while you troubleshoot what went wrong in production. You can also have a post-merge or post-commit event configured as a lakeFS hook to do this programmatically.

So when you have lakeFS in your data lake, here is what you can do in each of these phases. In the build phase, you can build a data lake with versioning enabled, meaning you can time travel between different versions of your entire data lake: not just at the table level, not just at the object level, but at the entire repository level. When it comes to testing, you can test a pipeline on an isolated testing branch without risking your production data at all. And in the deploy phase, you can revert from potential production errors with a single command using lakeFS.

Here we have a bunch of companies who are using lakeFS, who are contributors to lakeFS, and who are lakeFS champions as well. Like I said, lakeFS is an open source project that gives you data versioning capabilities, so if you want to try out a POC, or just think about how lakeFS can help your data team with different use cases, feel free to reach out to us at lakefs.io for the documentation or for any specific questions about use cases. We also have a thriving and active community of lakeFS users on Slack who discuss data architecture and best practices in data engineering, so feel free to join us there if you have any questions as well. Thank you so much, that's all I have for today. [Applause]
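The merge into production, the branch cleanup, and the one-command rollback described above map to roughly the following calls. This is a sketch that reuses the client and names from the earlier snippets and assumes the lakefs_client SDK; the revert arguments in particular vary between API versions, so check the client you have installed.

```python
# Promote the cleaned data: merge the staging branch into main (production).
# Assumes the lakefs_client SDK and the demo's repository/branch names.
from lakefs_client import models

client.refs.merge_into_branch(
    repository="netflix-movie-repo",
    source_ref="staging-area",
    destination_branch="main",
)

# Done with the sandbox? The branch can simply be deleted; because branches are
# zero-copy, this removes pointers, not the shared underlying objects.
client.branches.delete_branch(repository="netflix-movie-repo", branch="staging-area")

# If a bad merge ever lands in production, roll main back to a known-good state
# by reverting the latest commit on main (parent_number selects which parent of
# a merge commit to revert relative to; argument shape may differ by version).
client.branches.revert_branch(
    repository="netflix-movie-repo",
    branch="main",
    revert_creation=models.RevertCreation(ref="main", parent_number=1),
)
```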
