Scylla Summit speaker interview: Henrik Johansson
We have a stellar roster of NoSQL experts lined up for Scylla Summit. Among the speakers is Henrik Johansson, who will cover “Using ScyllaDB for a Microservice Based Pipeline in Go.” After a background in physics, Henrik has worked as a software developer on everything from shuffling financial data to language processing. He is currently a Senior Software Developer at Eniro, working on the backend systems for data refinement, enrichment, and analysis. The technologies involved range from the Spring stack to Hadoop and Apache Flink, as well as newer things such as Go and Docker.
Can you tell us a little about the data model for your application using Scylla? What’s your primary key?
The primary key here is a globally unique identifier that we have across our entities in the entire company, basically. It’s supposed to be unique, and it’s been proven to be fairly unique, because all the systems use them. Sometimes a system does something wrong, but ideally it’s supposed to be globally unique. That’s the main partition key that we use in Scylla.
In Redis we used it in combination with another, app-specific, id. We just concatenated them. The problem with Redis was that over time, when the globally unique ids were connected with an app-specific id that could be shared, iterating and listing took so long using the SCAN function. That was not feasible. Other than that, Redis was really fast. I don’t think the instigators of this app knew there would be so many IDs.
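As a rough illustration of why that listing was slow: Redis SCAN walks the whole keyspace and applies its MATCH filter only after fetching each key, so the work grows with the total number of keys rather than with the number of matches. The sketch below imitates that behaviour in plain Go with no Redis client involved. The ":" separator, the id formats, and the names compositeKey and scanByAppID are assumptions made for the example, not details from the interview.

```go
package main

import (
	"fmt"
	"strings"
)

// compositeKey joins the global id and the app-specific id into one key.
// The ":" separator is an assumption for this sketch.
func compositeKey(globalID, appID string) string {
	return globalID + ":" + appID
}

// scanByAppID imitates a SCAN-style walk: every key is visited and the
// filter is applied afterwards. It returns the matching keys and the
// number of keys that had to be visited to find them.
func scanByAppID(keys []string, appID string) (matches []string, visited int) {
	suffix := ":" + appID
	for _, k := range keys {
		visited++
		if strings.HasSuffix(k, suffix) {
			matches = append(matches, k)
		}
	}
	return matches, visited
}

func main() {
	var keys []string
	for i := 0; i < 1000; i++ {
		app := "app-a"
		if i%100 == 0 {
			app = "app-b" // only every hundredth key belongs to app-b
		}
		keys = append(keys, compositeKey(fmt.Sprintf("id-%04d", i), app))
	}

	matches, visited := scanByAppID(keys, "app-b")
	fmt.Printf("matched %d keys after visiting %d\n", len(matches), visited)
}
```

Finding the ten keys for one app id still means touching all one thousand keys, which is the cost pattern described above as the number of ids grows.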
Then we started making heavy use of MongoDB, and we didn’t have good experiences with that. For just storing stuff it works, but for throughput we didn’t want to go down that track again, and fill our apps with caches to get what we need in terms of speed.
We could have done the same thing in Postgres that we did with Scylla now, but we would hit the same thing where we need to cache to speed things up. When Scylla 1.0 came out, I said, “try it, see how it works.” Because it’s almost SQL, every one of us knows SQL, and we could quickly get up to speed on that. After some initial problems installing, everything just ran really fast. Both the mapping things and the scan functionality worked really well.
When you rewrote your application to use Scylla, did you find that you were able to remove anything?
We had a number of small caches. No in-memory caches are there now. All the problems concerning keeping these caches in sync, maintaining that kind of thing in the app, is really messy. Scylla developers spend so much time implementing caches and Gossip. It’s so much work and we would really like to avoid it.
We had a lot of custom serialization code that we removed. We didn’t want to model all the entities in Go. What we did is store everything as bytes, with timestamps and a little metadata, in Scylla. It works fine. It’s very easy to store as bytes, with a column for what type this is.
So your Scylla table ends up being a unique id, an object type, and a blob, three columns?
More or less. A couple of timestamps as well. It’s very simple. If you come from a traditional relational environment this may seem brutal because you don’t have your model really in the database. But we have models with hundreds of attributes, and there is no way we could keep that in sync in a structured way in a table. So what we did is store it as a blob, make it as fast as possible, and let the application worry about serialization. Adding and removing fields is pretty easy.
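A minimal sketch of what such a layout might look like, assuming a table named entities, the column names shown, and JSON as the serialization format. None of these are confirmed by the interview, and the Company type is purely hypothetical. The code only builds and decodes the record in memory; it does not talk to a database.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// createTable is a hypothetical CQL schema with the shape described in the
// interview: a unique id, an object type, a blob, and a couple of timestamps.
const createTable = `CREATE TABLE IF NOT EXISTS entities (
    id          text PRIMARY KEY,
    object_type text,
    payload     blob,
    created     timestamp,
    updated     timestamp
)`

// Record mirrors one row of the hypothetical table.
type Record struct {
	ID         string
	ObjectType string
	Payload    []byte
	Created    time.Time
	Updated    time.Time
}

// Company is a made-up entity; real models may have hundreds of attributes.
type Company struct {
	Name string `json:"name"`
	City string `json:"city,omitempty"`
}

// Encode serializes v into the payload column. JSON is an assumption here;
// any byte-oriented format would fit the same table.
func Encode(id, objectType string, v interface{}) (Record, error) {
	b, err := json.Marshal(v)
	if err != nil {
		return Record{}, err
	}
	now := time.Now().UTC()
	return Record{ID: id, ObjectType: objectType, Payload: b, Created: now, Updated: now}, nil
}

// Decode uses the object type column to pick the Go type for the payload.
func Decode(r Record) (interface{}, error) {
	switch r.ObjectType {
	case "company":
		var c Company
		if err := json.Unmarshal(r.Payload, &c); err != nil {
			return nil, err
		}
		return c, nil
	default:
		return nil, fmt.Errorf("unknown object type %q", r.ObjectType)
	}
}

func main() {
	fmt.Println(createTable)

	rec, err := Encode("id-1", "company", Company{Name: "Example AB"})
	if err != nil {
		panic(err)
	}
	v, err := Decode(rec)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %+v\n", rec.ObjectType, v)
}
```

With this shape, adding or removing a field is a change to the Go struct only. The table itself stays the same, which matches the point above about letting the application worry about serialization.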
When you ended up removing in-application caching functionality, can you say how many lines of code that was?
The cache was pretty well done, with a wrapper object that delegated to the actual storage. Maybe two hundred lines of code, not bad. But what really makes me sleep better is the fact that we don’t need the cache. We have other high-intensity systems that depend on MongoDB, for example, and they have minimum 20 gigabyte heaps, and the boring thing about such systems is you’re just caching because your database is too slow. There is no business value whatsoever. There are only headaches, because these things have to be carefully prodded and cared for. You have to continually do heap analysis and profiling to see: Are we evicting things too fast? Is the cache too small?
The number of lines of code isn’t that much, but the headaches at run time and the ops aspects are just so nice not to bother with.
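For readers who have not written one, a cache wrapper of the kind described in this answer might look roughly like the sketch below: an object that implements the same interface as the storage and delegates to it on a miss. The Store interface, mapStore, and CachedStore are hypothetical names, and the in-memory backend stands in for a real database. The sketch deliberately leaves out eviction and cross-process invalidation, which is where the operational headaches mentioned above tend to come from.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical storage interface shared by the backend and the cache.
type Store interface {
	Get(id string) ([]byte, error)
	Put(id string, value []byte) error
}

// mapStore stands in for the real database and counts how often it is read.
type mapStore struct {
	data  map[string][]byte
	reads int
}

func (m *mapStore) Get(id string) ([]byte, error) {
	m.reads++
	v, ok := m.data[id]
	if !ok {
		return nil, fmt.Errorf("not found: %s", id)
	}
	return v, nil
}

func (m *mapStore) Put(id string, value []byte) error {
	m.data[id] = value
	return nil
}

// CachedStore wraps a Store and serves repeated reads from memory.
// It has no eviction and no invalidation across processes.
type CachedStore struct {
	mu      sync.Mutex
	cache   map[string][]byte
	backend Store
}

func NewCachedStore(backend Store) *CachedStore {
	return &CachedStore{cache: make(map[string][]byte), backend: backend}
}

// Get returns the cached value if present, otherwise delegates to the backend.
func (c *CachedStore) Get(id string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.cache[id]; ok {
		return v, nil
	}
	v, err := c.backend.Get(id)
	if err != nil {
		return nil, err
	}
	c.cache[id] = v
	return v, nil
}

// Put writes through to the backend and then updates the local copy.
func (c *CachedStore) Put(id string, value []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err := c.backend.Put(id, value); err != nil {
		return err
	}
	c.cache[id] = value
	return nil
}

func main() {
	backend := &mapStore{data: map[string][]byte{"id-1": []byte("payload")}}
	store := NewCachedStore(backend)

	for i := 0; i < 2; i++ {
		if _, err := store.Get("id-1"); err != nil {
			panic(err)
		}
	}
	fmt.Println("backend reads:", backend.reads) // prints "backend reads: 1"
}
```

Because CachedStore satisfies the same interface as the backend, removing the cache amounts to passing the backend directly to whatever code expects a Store.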
It seems like it’s hard computer science to handle caches and then the ops problem is difficult, too?
Yes, cache invalidation is one of the classic problems. I don’t mind the effort if it brings some sort of value but if it’s just there because something else isn’t up to the task, it’s frustrating.
The low maintenance aspect is what drove me most because having a sharded system where you just point nodes at each other and it just works is super nice. The Scylla nodes, we have put so much stuff in there but there is almost no disk used. I was hoping it wasn’t just kept in memory, but I checked, it’s persisted. The ops thing, to know that you have a sufficiently fast system without much effort, that’s good stuff.
See you at Scylla Summit
Check out the whole agenda on our web site to learn about the rest of the talks—including technical talks from the Scylla team, the Scylla road map, and a hands-on workshop where you’ll learn how to get the most out of your Scylla cluster.
Going to Cassandra Summit? Immerse yourself in another day of NoSQL. Scylla Summit takes place the day before Cassandra Summit begins at the Hilton San Jose, adjacent to the San Jose Convention Center. Lunch and refreshments are provided.