ScyllaDB V brings new performance, resilience, and ecosystem advantages that resolve longstanding challenges of legacy NoSQL databases.ScyllaDB V is Here
Vinodhini is a developer advocate at Treeverse, the company behind lakeFS. She is a data practitioner with 6+ years of experience working for companies like Apple, Nike and NetApp, and has a blend of data engineering, machine learning and business analytics expertise. Vinodhini is an avid public speaker and an ardent toastmasters member.
I’m so excited to be speaking at the ScyllaDB Summit today. We are today going to talk about CI/CD for data meaning how can you test your pipelines the right way by building development and test data environments with lakefs.
as you can see sits on top of these object stores again it could be S3 Azure blob or even like minio on-prem Object Store for example lakefest sits on top of these object stores and offers git like interface it could be the lake CTL command line utility or the lake FS UI or several language API clients they work with python they work with Java spark lakefest has gotten API clients for all these languages so you can easily work with lakefest and suppose currently your data consumers or the data applications like kafka’s Park airflow Athena Hadoop Chino and whatnot the existing host of applications that you have today can either access the object store directly or can access them through lakefest with data versioning enabled and with lakefs the data versioning is not limited to just table level it can be enabled at the repository level meaning an entire S3 bucket can be versioned and kept track of and now as a data engineer you also want to make sure that the introduction of a new tool or a software does not in you know intervene so much or it does not disrupt your existing code base right if you look at the right side here just by installing and you know having lakefs as part of your storage layer all you need to do is just simply add the name of the branch here the name of the Branch’s main so just add the main branch to the S3 prefix and you’re good to go with minimum level of intervention into your existing code base you will be able to achieve data versioning on your you know data lake with the help of lakefs but then how does lakefest work is it going to create a copy of the data every time I’m going to create a new Branch not at all what lakefest does is it does zero copy cloning meaning it only copies pointers that are pointing to the objects underneath and as you can see here we have two different commands that point to the same pointers meaning they share the objects that are underneath and anytime a new object is created it is gonna write it is going to do copy on write meaning it will write the new data and then create pointers for that new data and you know add it to that comment just by doing this creating a branch is a few millisecond operation imagine you have your production data that has I don’t know like a few terabytes or even a few petabytes or hundreds of petabytes of data and you can create a new data environment to run your tests so copy all of the hundreds of petabytes of data into a new test in my environment it is going to take a lot of time and also storage is not free just by using lakefest you can do that in a few milliseconds and you’re not utilizing your storage costs at all you’re not increasing your storage costs at all zero copy cloning of your existing production data into a test environment so you have all the data that is in your production for testing your data Pipelines this is interesting right now for the next phase we only talked about how do we enable versioning for your data like the next is though is about how do we run integration tests on an automated you know on-demand test data environment now Lake FS also offers this feature called lakefest Hooks and using that you can have the CI test run and as you can see here the hooks can be used in several different ways one ways is that one way is that on the ingestion side of things before you ingest any data into production you always want to run a suit of tests and only if those tests are successful you want to ingest the newly arriving data into production you can leverage lakefest hook for this and now on the experimentation side let’s see there is an error in your production and you want to troubleshoot what is going wrong instead of directly working with production you could simply create a branch out of production and then look at what’s going on and suppose if you have a hot fix that you want to test on production you can do that as well and if the fix is verified and approved then in the end you can merge it back to production overwriting whatever was in the production meaning overwriting the error with the hotfix that you have tested in your test Branch so there are several different ways in which you can leverage lakefest hooks to implement the CI part of the pipeline for your data Lake now once we have all these tests run right suppose despite running all these tests and checks and the quality validations what happens if there is an error in production the first thing that you want to do is you want to be able to revert to a previous state that was consistent that was error-free because at any point in time your consumers are still consuming data from your production so the first thing you want to do is revert to a consistent state of data that is available for your consumers you can do that as well with lakefest report now let’s go over a quick demo of how you can use lakefs and build a development or test environment run data exploration and data cleaning in the test environment and once the data cleaning is done how to write the clean data back to the main or back to the production branch I have my leak if it’s running locally on my Docker container as you can see here I have lakefs on the port 8000 and I also have minio locally running at 9001 and the jupyter notebook at 888 and this is a sample jupyter notebook that the lakefest team has put together for you to explore and play around with the creation of development or test data environments with clickfs now first things first let’s go over and explore the like FS UI you see an example repository here is just like a git repository we have a lakefest data repository and once you go in there you see there is no data currently and it has only one branch which is main no uncommitted changes the only commit was that the report was created as we saw the main branch no tags nothing to compare and so on now let’s go to the repository and let’s go ahead and create a new repository and add new data and see what it looks like and I’m gonna be doing that from my Jupiter notebook the API using the python lakefs API and let’s go ahead and install these first all right this is just the installation and the configuration step just to connect lakefest to the Jupiter notebook so we can run everything programmatically
thank you and what I’m gonna do here is I’m going to create a new repository called Netflix movie data right
before that I also need a storage layer to sit underneath so I am going to create a new minario bucket and I’m going to call this Netflix movie triple bucket
awesome so we have this bucket and now let’s take this bucket and let us create a new Lake FS repository on top of that go create a repository blank Repository and let us call it Netflix movie Repo and the SD bucket the minio bucket that I’m gonna use here is Netflix movie rapper which we just created I’m using minayo here and I am using S3 as a prefix in here because main iOS suppose S3 like API and Lake FS can work with any object store that supports a S3 like API to work with so let’s just keep the default branches Main and let’s create a repository so we have the Netflix movie Repo so let’s go ahead and create these branches as well and I’m creating two branches here one is the Ingress landing area branch which will contain the data that is coming in from our data sources and the second is the staging area Branch where you know we will be doing all the preprocessing cleaning and you know of the data before actually pushing it into production right so currently let us try to list the branches and it should only have the main branch because we only had main branch here as well we didn’t create any new branches now let’s go ahead and create the new branch which is the ingest branch and I also created the staging branch and once I have both the branches created let’s see awesome so I have in Just landing area Main and staging branches here for me let’s go ahead and check that in the UI and if I go to branches see I have three branches and one interesting thing that I want you to notice here is that branches have the same committee because I created these two branches from Main and since all of them have the same data underneath since they’ve not changed anything yet so far they all have the same pointers to point to meaning they don’t copy the data they just reference the same objects under underneath in the object store now I have my movies.csv locally and I’m going to upload that to my ingest or Ingress landing area branch and let’s do that and see what changes have happened here we’ve just added a new file which is movies.csv under this path right so let’s go here in the Ingress landing area awesome so we have the data that has landed now the next step is let’s commit this data in the Ingress and once you come at you can see here who is the committer the creation date and all the commit metadata that goes in and just like the git commit message you can also add any comment message and you also have an option to add the metadata as well so you do you have the specific diagram that has committed this data you have that option to add the diagram here as well or any other metadata that you want to add it to each commit you will be able to do which will help you with you know tracking the data very easily and let’s go ahead and see the commit again awesome so we created this repository and then we put the Netflix data into the landing area and what next is let us copy this data from our interest area and to the cleaning or the staging area and we’ll just do a simple pre-processing see what the data looks like a bit of data exploration and so on
so we have copied the data the staging area and if you go to staging we should be able to see our data in a bit so this comment is okay so reading the movie the data file with spark is taking a bit of time awesome so this comment is successful and we should be able to see this here now okay so we have the data now in the staging area for us all good now let’s go ahead and do some exploration and see what the data looks like right and it’s just the movies data and it is going to have a few columns like just 8 000 records for us and you also get a glimpse of what the data looks like and so on here it looks good it’s fine let’s go ahead and see if it has any null values or anything that we need to clean up right and let’s go ahead directly to the null checks what is going on yeah so it does have a couple of null values in some of these columns so let’s just simply drop these you might have a different you know way or a strategy to deal with these null values at work but here for the sake of the demo I’m just gonna drop them all and now the data looks clean nonal values everything sorted and what I’m going to do here is take that data partition it by a different column by a country column here and I’m going to save them as parquet files in my staging environment so let’s go ahead and do that what I’m doing here is when I go to the staging area you see the raw data which we copied and again in the staging area we also have analytics we are saving the datas in the parquet format by a different partition as well and it’s this part jobs running it’s taking a while we don’t have success here yet so there are different countries we’re partitioning them by just it’s gonna run for a second
is it done okay so this is successful we have that as well and we have also we need to commit this as well right because at any time you do these changes and go here and see you should be able to see the uncommitted changes just so I can get so the uncommitted changes are the analytics files movies you know by different partition and you can also have a look at the you know the change log of each of these files that should let you see what the changes are before you actually commit them now let’s go ahead and come at these as well
awesome so this comment is successful and when you go here and see the comments we had copied the data and now we have loaded the partitioned files in the parquet format to the staging area all good I think I wanted my files in the parquet format I wanted my files to be partitioned by country also no null values all clean data now I want to push this data into my production right all I’m doing is merging my staging Branch into the production Branch so the source would be the staging Branch destination is my broad branch currently I want to show you before we merge what is in my prod the prod is essentially the main branch currently we have nothing because we kept it clean we only wanted the clean rightly partitioned parquet files in May so let’s go ahead and merge this and see what happens awesome so no conflicts everything is good let’s go and see what happens awesome so we do have the files here from the staging area and if you go to the comments you should be able to see all the commit history the changes that happened or the data lineage of the repository being created copying the data from ingest to the staging area doing all the changes and then eventually deploying them in production right and there it goes so at the end of it testing like the staging Branch was just a temporary sandbox environment we spun up for you know testing our changes locally or cleaning up the data locally so the end of it it’s up to you you can either go and delete a specific Branch because you think you’re done with the sandbox environment and you don’t need it anymore or you can go ahead and keep it for running on other you know tests as such with the data but here I would like to delete it and just move forward and that is it so we’re done with the demo now that you know you can create these development or test data environments with lakefest to test your pipelines and to clean up your data let’s go back to our presentation and see what else can like if it’s do for you all right so now we’re done with the demo and we saw how you can use lakefest to build your development or test data environments now off to the next phase which is the deploy phase right it is fair that for the build phase you use Lake FS for building data Version Control and in the continuous integration test phrase you use lakefest to run a suit of tests before deploying your data into production and in the deploy phase though you want to highlight how lakefest can help you achieve the data quality checks and also do automatic rollbacks of production issues or production errors now like I said lakefest has this feature called lakefest hooks which are conceptually very similar to git hooks meaning you can have these hooks enabled on a specific I would say action git action for example if you’re creating a git Branch or committing it or merging it or for any of these actions you can have a lakefest hook say a pre-commit hook so anytime before you commit it it is going to check or run a list of checks which you can configure and only if those set of checks are successful will like FS let you do the commit for example in our case right when we merged our data into production we can have a set of Suits run for us or you could even have a great expectations or so die or run a suit of tests and only if those tests are successful we would want to deploy our data into production and now once you have that like I said there may be instances despite all of these best practices in players you will still end up in production errors where lakefest offers the revert option as well so in one command you will be able to revert from your error state of production to a very consistent the right state of production data so your consumers can consume the data while you’re troubleshooting what went wrong in production and you can also have like a post merge event or post commit event configured for like FS hook to do this programmatically as well right and so today when you have lakefest in your data Lake what can you do with each of these faces is first of all in the build in the build phase you will be able to you know build the data lakes with different versions enabled for you meaning you can time travel between different versions of your entire data Lake not just you know at the table level not just at the object level but at the entire repository level you can do time travel and when it comes to testing you can test the pipeline on an isolator testing Branch without risking your production data at all and in the deploy phase you can revert back from potential production errors again in just one line command by using like FS and here we have a bunch of companies who are using Lake Affairs and who are contributors to lakefest and who are lakefest champions as well and if you are like like I said like FS is an open source project that gives you data versioning capabilities so if you want to try out a POC or just think about you know how lakefest can help your data team with different use cases feel free feel free to reach out to us at Lake fsio for the documentation or for any specific questions regarding use cases and whatnot and we also have a thriving and active community of slack users lakefest users on slack who discuss the best you know data architecture and the best practices in data engineering and whatnot so feel free to join us if you have any questions as well thank you so much that’s all I have for today [Applause] foreign