Building Zeotap’s Privacy Compliant Customer Data Platform (CDP) with ScyllaDB

20 minutes

Register for access to all 30+ on demand sessions.

Enter your email to watch this video and access the slide deck from the ScyllaDB Summit 2022 livestream. You’ll also get access to all available recordings and slides.

In This NoSQL Presentation

Customer Data Platforms, commonly called CDPs, form an integral part of the marketing stack powering Zeotap's Adtech and Martech use-cases. The company offers a privacy-compliant CDP platform, and ScyllaDB is an integral part. Zeotap's CDP demands a mix of OLTP, OLAP, and real-time data ingestion, requiring a highly-performant store.

In this presentation, we share how ScyllaDB is powering our solution and why it's a great fit. We begin by describing our business use case and the challenges we were facing before moving to ScyllaDB. We then cover our technical use-cases and requirements for real-time and batch data ingestions. Then we delve into our data access patterns and describe our data model supporting all use cases simultaneously for ingress/egress. We explain how we are using ScyllaDB Migrator for our migration needs, then describe our multiregional, multi-tenant production setup for onboarding more than 130+ partners. Finally, we finish by sharing some of our learnings, performance benchmarks, and future plans.

Safal Pandita, Senior Software Engineer, Zeotap

Safal is a Senior Software Engineer at Zeotap. He has 4+ years of experience in software engineering and building highly scalable systems, having worked with giants like Flipkart and IBM in the past. He is currently leading ScyllaDB integrations with Zeotap.

Shubham Patil, Lead Software Engineer, Zeotap

Shubham is currently working as a Lead Software Engineer at Zeotap, managing some of its core consumer products and leading engineering for its CDP platform. His core areas of expertise include the design and development of large-scale distributed backend systems. He comes with 6 years of experience in Software Development.

Video Transcript

Hello, everyone. Welcome to ScyllaDB Summit 2022. My name is Shubham Patil, and I’m currently working with Zeotap as a lead software engineer. I also have my colleague, Safal Pandita, with me today. And in this session, we’ll walk you through Zeotap’s journey to build its first-party Customer Data Platform with ScyllaDB for last 1 year. So who is Zeotap?

We started in 2014 as a third-party data aggregator platform and slowly evolved into a next-generation CDP for European marketers We enable brands to better understand their customers, adhering to all privacy requirements. Our mission is to give a 360-degree view to brands’ data and enable them to make better marketing decisions. Our CDP is built completely on GCP. Now, what is Customer Data Platform and why do we need it? Because today’s customer journeys are more complex than ever, the majority of softwares follow this exact path to a wealth of touch points, both online and offline. They might stumble across a complete social visit .. . social media post, which that extend to a blog. Finally, we will find them landing on the e-commerce site and visiting the business and so on. With such unpredictable journeys, how can business keep track of their customers, and how can they deliver personalized experiences across ever-expanding set of touch points? The answer is Customer Data Platform. CDPs are software that aggregate customer data collected from a variety of sources structured into central customer profiles and then allow that data to be accessed by other pieces of softwares. CDPs built these customer profiles by combining from .. . the data from a variety of source across different data types, including first, second and third-party sources. The source could be your CRM or DMP data links or warehouse websites or mobile apps and IoT devices in future. With the golden records created, you can create audience segments, sometimes even based on machine learning, and activate them across other channels such as paid media, SMS marketing, customer service tools and even website personalization. The end result is the ability to not only manage data in a compliant manner but also to be able to efficiently deliver targeted, personalized experiences at scale across the whole customer journey. This is how high-level architecture of CDP looks like. So we collect data from multiple sources, ingest it, process it, store it and distribute it to activate it across multiple channels or sinks. So when we started building our CDP, these were our business goals. We wanted to process both batch data, which were basically a flat file drops from customers, and streaming data, which were real-time website events in a privacy-compliant manner. So batch onboarding of data not only was about ingestion of the data. It was also about exporting data to clients. Real-time, which is also called event orchestration, was basically tracking user journey in real time when a user lands on website, and then provide a customized or personalized experience to the user. And privacy and compliance was basically about providing facilities for user opt-out, consent management and consent mastering. So with these business goals, we finalized our technical goals to be these. We wanted our system to be multiregional, multi-tenant so that we can sort each client’s data into different spaces, which should not so .. . which was the first requirement of any privacy or GDPR compliant system. We wanted sub-second writes and sub-second reads. We wanted point lookups for user profiles. We wanted our system to work for data at every scale. And there were other such requirements here, as mentioned in the table. So we built our first version of CDP with very simplified goals. We wanted to first start onboarding clients’ data and let them activate it across channels. So this is how our architecture looked like for version one implementation. We used BigQuery as our store, where we ingested all the client’s data, and we were able to distribute it across the channels, but there were some problems with this. With the changing business requirements, we also just started getting the requirements of on-the-fly user unification, which was basically .. . On-the-fly user unification for the context is the process of identifying emerging user’s data across proof .. . across different data sources. Also, we wanted to read and delete user profiles with sub-second latency. At the end of building v1, these were our achievements. But as I said with frequently changing business requirements and on-the-fly user unification coming into the picture, we needed to do better. Just .. . So in this version on implementation, our sub-second read rates were not possible. Point lookups were not possible. And we also did not have a very good monitoring stack and also not have a complete control on sizing of the cluster. So we started exploring wide-column databases, and our next POC was with JanusGraph using both [Indistinct] ScyllaDB. With this new tech, we were able to solve most of our problems, but then we also started facing some other issues, specifically issues where we did not have complete control over the data model. Our scans [Indistinct] and expensive. There was no enterprise support for any of the issues that we were facing and no transparency to the transaction failure and back-end errors. So after releasing version two with JanusGraph, this is our tech matrix looked like. So some of the problems, like sub-second reads and deletes were solved, but the other problems like point lookups and “works for data at every scale” and mature monitoring stack was still not there. And then again, there were some other problems that came into the picture, like we did not have a simple SQL-like querying interface, and again, we did not have complete control on the data model. All right? So finally, we decided to remove JanusGraph layer and directly use ScyllaDB as our storage, and with this change, almost all of our tech requirements were met. So what kind of data model we used to gain this performance boost? So finally, we decided to remove the JanusGraph layer and directly use ScyllaDB as our storage. And with this change, almost all of our tech requirements were met. After releasing version three, this is how our tech matrix looked like. All the problems that we faced in version one and version two implementations were solved. We were able to do sub-second reads, writes and deletes. We were able to do sub-second point lookups. It worked for data at every scale, from a few megabytes to a few petabytes. Our every latencies, which were 600 MS in JanusGraph, were reduced to 30 MS in ScyllaDB. We had already mature and transparent monitoring stat in form of ScyllaDB monitoring. It had a Spark integration as well. We had complete control on sizing of cluster with enterprise support, a simple SQL-like querying capabilities and complete control on underlying data models also. So what kind of data model we used to gain this performance boost and what were the issues we faced during design? Handing over to Safal to explain that.Thank you, Shubham. I am Safal Pandita. I’m working as a senior software engineer here at Zeotap. And I’ll be taking you through the rest of this presentation. So let’s talk about what the requirements were from the user store. The first and foremost was ID resolution. Now ID resolution is basically a on-the-fly unification between two or more users. So this operation is both read and write-intensive, so we were looking for a fast lookup store which could give us low latencies for both read and write. We were also looking for a data store which was flexible enough to give us complex data types so that we could use it as a profile/consent/ID store. We also didn’t want to worry basically about the degradation of performance of our queries. The next requirement was a slightly tricky one. We were planning to use ScyllaDB as a linkage store as well. Now prior to this, we were using JanusGraph as our linkage store, and the native capability there to have nodes in edges was pretty convenient for us since we could model our problem pretty easily. But since we were looking to improve our latencies, we decided to migrate that use case away from JanusGraph and on to ScyllaDB. The next requirement was TTL. We were planning to use it in a few different ways. The first was in our profile stores, we were looking for TTLs on different attributes in a map. We were also looking at individual elements of a collection and the more standard use case of having TTL on a row/column level. The diagram on the right explains what I’m talking about. The profile table that you can see there has three attributes: age, city and gender. And we were looking to have TTL on each individual item and having a different one for them. So a different detail for age and a different one for city, et cetera. Similarly, the table on the top, e-mail, the items that you can see there are individual elements of a collection, and we were looking to have a TTL on the entire group, the one that is basically highlighted in the circle. Now, once we have the requirements, it’s said that to have any good data model, you need to first know your queries. So what we have done here is handpicked a few queries patterns from our ID resolution use case and we’ll try to explain them to you to better explain our ID resolution logic. We have left out a few query patterns like updation and deletions, and used the more .. . the ones that will basically explain the crux of our logic. So the first two patterns are basically going to be read patterns, and the next are going to be write patterns. So let’s start with the first one. The first query that we have is a straightforward read from the ID store. A select start from ID store where you provide ID types. For example, Google cookies, ad IDs or em_md5 and ID values, which are basically going to be hash values. Now, we have put an OR clause there just to emphasize that we were looking to provide more than one ID type in values. So this query is going to return to us a list of UCIDs, and we were planning to use this list of UCIDs as the input for the second query pattern. So the second query pattern is basically finding user profiles with UCID, which is basically select start from user profile table with UCID in, and we provide the list of UCIDs that we got from the first query pattern. Now, once we have all the users that we needed, we perform an in-memory operation where we do all the logic that we needed, which is basically the ID resolution logic, and we update all the user profiles in memory. Now after we have that data, what we wanted to do was to dump that data back into ScyllaDB. So the next two queries are all about us sliding back to ScyllaDB. So the third query is basically an insert into the ID store table, providing the ID types, values and UCID values, and the fourth query pattern is basically writing to the user store in ScyllaDB and giving the information like profiles, e-mails, UCIDs, and the query goes insert into user profile table the values. And UCID would be the primary key, and multiple IDs like e-mails, MAID and also linkages, preferences, consents, et cetera. Now that we have our queries, let’s look at the tables. So we went with two tables, ID store and user store. The ID type and value were the primary key and clustering key for the first table. This table was basically a lookup based on ID type and value so that we could fetch UCIDs. And that UCID was a primary key for the next table where we stored the different ID types, consents, profiles and all linkages. Now, what did this data model achieve for us? The first and foremost was that it gave us isolation for each client through keyspaces. We had a different keyspace for each client basically. The next was we wanted to honor privacy and consent policies like GDPR, so we had separate clusters for each region, one for, like, each, Europe, U.S., India and U.K. Also, one use case that we had was that each keyspace could have a different schema based on what the catalog our client provides us. So different clients could have different identifiers, and we wanted to ensure that our schema could handle that. So if you can see in the table in the user store table, we have ID type one, ID type two, ID type N, and these can vary for different clients. Next was we went with the more standard replication factor of three, an incremental compaction strategy, a consistency level of quorum for both read and write. And the last one, which was that now we had a single table that could act as our profile, consent and linkage store. So let’s talk about some of the problems that we faced. Now everything was fine and dandy for a while, but soon we started facing some issues. The first one was batch sizes. Now, we were using batch transactions in ScyllaDB to ensure that we could have atomicity across partitions. Now, everything was fine for a while, but as soon as the data increased, the size of our queries became larger, and as the size of the queries became larger, the batch became bigger, and it started crossing the recommended limit of 100K per batch. The next problem was collection sizes. We had initially anticipated that our collection sizes would basically be around a few kilobytes. But as the data and scale increased, we realized that we were going beyond the recommended size of 1 megabyte per collection. The third problem was basically a consequence of the first two, but the one that we cared about the most. Our latencies started worsening, and in a lot of cases, they started timing out as well. So what are the solutions that we aimed for? We started splitting our queries into multiple batches whenever we thought that our batch was becoming greater than 100K. We decided that a smaller batch would probably make our queries more efficient, and we also put multiple retries with each batch so they succeeded. We started using prepared statements because ScyllaDB is very good at handling them because it caches it, so that improved our performance. We also decided to use TTL on our linkages to keep the total volume under check so that our latency performance could improve a bit. Now, even though we solved those problems and had solutions for them, one problem remained, which was hot rows. Now we were storing linkages in a collection, and initially, we had anticipated that our collections would remain under control and they would go up to a few kilobytes or something, but soon we realized that they were going past the 1-megabyte collection limit that we had, beyond which ScyllaDB does not guarantee its SLAs. Now, what happens if you do that? Bad things only, and our latency started worsening. Now, why does that happen? It’s because collections go through a serialization/deserialization step to save space. Now, even though it saves space for ScyllaDB, it makes it slightly less performing compared to other data types. Usually, that is okay if you are there to the 1MB limit, but there was no way that we could adhere to it. So what did we do? We had to basically go for a slight redesign of our data model, and that’s what we did. We took the linked UCID column right outside of user store, which was basically a collection, and created a completely new table called linked UCIDs. We used the same primary key as before, that is UCID, and created another column called linked UCID and used as a clustering key. So what did that achieve for us? Now, the first and foremost thing was that our queries that were timing out earlier, that is they were going beyond the 10-second limit, started succeeding within our SLAs of 30 milliseconds. Now having a separate linkage store also made our queries .. . Like, it simplified our queries because now we could write TTLs on row level instead of having individual elements of a collection, which was a lot more messier. And the most important thing that we would achieve now was that there was no arbitrary limit on the number of linkages that we had. Now, we know that there is a limit on the size of a partition, but that is much more manageable than having the limit on the size of the collection. So let’s talk about some of the production issues that we had. The first one is primary key migration. Now there is no easy way to migrate your primary key. So if you are in a testing environment, maybe you can just delete your tables, create a new one with the required schema and primary keys and be done with it, but once you’re live in production and have data flowing, there are no easy ways to do this. Thankfully, the wonderful folks at ScyllaDB have a tool called ScyllaDB Migrator, which is a simple Spark job to take your data from your existing schemas and migrate it to a temporary table. Now, more or less, you can be done with it, but we wanted to reuse our original table names, and there is no way to do that in ScyllaDB as well. So we had to resort to SSTable copying from the temporary tables that we had created back to the new schema with the primary key that we wanted. The bottom line here was that you should choose your primary key very wisely. Our next production issue was schema corruption. Now, usually SSTable copying is a very safe operation, but as we found out, that under load, your cluster can sometimes take more than a minute for schema settlement. And during this interval, if you start doing your SSTable copying and the data starts replicating, it can also crash your cluster. Now, unfortunately, we had to find that out the hard way. So what did we do? The first thing that you should do probably is to check that the schema is correctly replicated on all nodes before attempting anything, something like SSTable copying. Now how would you do that? What we recommend is to write your own service around cqlsh, which can basically do that for you. The more .. . If you think that’s overkill, maybe you can just manually SSH and check on all nodes if your .. . like, if your schema is currently implemented or not. For us, the ScyllaDB team resolved our issue by restoring our snapshots that we take periodically, and we needed the migration for the affected schemas. The lesson here was something that we all know, but probably it’s good for a reminder, which is we should always have our data backed up at all times. Now let’s talk about what our production setup looks like. We have four clusters, each one in Europe, U.K., India and the U.S. Each cluster has six nodes. The instances are n2-highmem-64 GB nodes. We are using ScyllaDB enterprise version 2021.1.5. We should be upgrading that. We manage 130-plus keyspaces. We have a peak QPS of around 60K. We have a P99 for 30 ms for read and P99 of 10 ms for write. We have ingested around 5 to 6 terabytes of data as of now, and average keyspace is around 50 to 100 gigabytes, but probably our peak keyspace is around 110, 120 gigabytes. So what are our future plans? We’re exploring the idea of having a microservice to handle our schema corruption and update issues so that we don’t replicate or basically reproduce the production issues that we already had. We’re also exploring lightweight transactions for consistency guarantees. Now we also have something called a raw person identification information that are store in encrypted form in our ScyllaDB database. We are looking to have data rotation for them without blocking our real-time writes. So we are at the end, and thank you very much. These are our social media handles and personal e-mail. You can reach out to us there for any inquiries. And have a great ScyllaDB Summit and have a nice day.

Read More