Aug 28

Scylla Summit Preview: Numberly’s Alexys Jacob


As we prepare for Scylla Summit 2019, this is the first in a series of blogs highlighting this year’s featured presenters. If you’re not yet registered for Scylla Summit, please take the time to register now!

Alexys Jacob, known to the developer community across social media, GitHub and Slack as @ultrabug, is the CTO of Numberly. A self-avowed Pythonista and staunch open source proponent, Alexys has long explored and expanded the frontiers of Big Data architecture and production systems. He’s also well-known to the Scylla community, and was recognized with the award for “Most Valuable Contributions to Scylla Open Source” at last year’s Scylla Summit.

Readers may already be familiar with Alexys’ seven lessons learned evaluating Scylla, the case study on Numberly’s move from MongoDB to Scylla, or, most recently, his talk from last year’s Scylla Summit (and the follow-on webinar), where he spoke further about the move to Scylla and optimizing data analytics through Apache Spark, Parquet and Dask. Alexys was our highest-rated speaker at last year’s summit, and we’re glad he will join us again this year.

This year your talk is entitled, “MongoDB vs Scylla: production experience from both Dev & Ops standpoint.” Are you still running MongoDB in your shop, or have you moved all your loads to Scylla?

MongoDB is still running in our infrastructure and is still an important part of it. After all, it allowed us to transition from our startup garage to our scaleup buildings all around the world!

MongoDB still does a great and valued job for most of our web backends where schema agility and document oriented queries are important. In other words, MongoDB is used where flexibility is more important than scale.

Scylla entered our infrastructure on latency and scale-sensitive use cases. That’s what it does very well in the first place, and it has proven to be successful. But as more and more people get their hands on Scylla and understand its capabilities, they naturally challenge their usual database of choice, and we now have web backends using Scylla as well!

Some people out there don’t even see the need to run any NoSQL in their tech stack. Can you describe the reasons Numberly went with NoSQL over SQL in the first place?

First of all, let’s acknowledge that most of them may indeed not need to run a NoSQL database!

Adding a new stack and embracing the new paradigms that come with it requires real effort and changes that could be a waste of energy in a lot of contexts.

Let me explain why I often compare the motivations of the SQL-to-NoSQL movement with the server-to-cluster one:

People move from single-server vertical scaling to clusters and horizontal scaling to overcome hardware failures and physical limitations, surely not to simplify their lives!

Similar motivations also apply to data when you need to deal with massive data volumes, computation speed, latency preservation and schema flexibility while guaranteeing its availability in the face of the numerous applications (and customers) depending on it!

Data-driven companies like ourselves hit those constraints and limits quickly, so it’s been a while since we felt the need to transition. But contrary to the server-to-cluster analogy, which we fully applied, we still run relational SQL engines alongside our dominant NoSQL and distributed computing platforms.

At Numberly, SQL is mainly used on CRM data (such as customer and catalog data) because this data is mostly fed and mirrored from our clients’ own SQL storage engines and denormalization would add too much overhead to usual and established relational and transactional workloads.

NoSQL is used where data is more volatile in its form or more massive in its volume. A typical example is behavioral navigation and tracking data, which are treated as events that need to be correlated in real time.

Last but not least, distributed computation and pipeline ecosystems such as Hadoop and Kafka also help us express complex data queries mixing structured SQL, unstructured data and fast NoSQL stores.

You recently attended and spoke at EuroPython 2019. What advances are you seeing with Python and big data?

I think that the most visible change lies in how many people who studied mathematics and science feel natural about attending a programming language conference such as EuroPython. It’s no wonder the PyData track has been filled so quickly and is dominated by data science topics.

This trend is not new and I gave a talk on this very topic last year if you’re interested in how Python gained this position.

In the big data landscape, the movement is not as spectacular as with data science, but it’s still great to see that MongoDB was the principal sponsor of the conference this year. Their Python client has always been their most advanced driver and influenced all the others. I still consider it the best, most intuitive and resilient database client library.

I wish it were true for Scylla too as it would make a lot of sense to anticipate the growing influence of data science in data engineering!

You’ve said that the three pillars at Numberly are “web + devops + data.” In many organizations, those teams are often at odds with each other. How do you harmonize your strategy across all three?

By working on making sure they understand each other and their common goals so that they can share the same problems and work on fostering common solutions.

One of the main success factors here lies in the fact that all those teams speak the same tongue thanks to Python. This helps bind teams and, most importantly, people, as you can see in our EuroPython 2019 review!

What major changes has Numberly made to its tech stack over the past year?

We built our own bare-metal Kubernetes cluster and are transitioning from our previous multi-server based workflow to a new one leveraging all the great capabilities that our Kubernetes cluster offers.

We are also embracing more and more GraphQL for our Python APIs which is proving to bring a lot of freedom to our developers and greatly ease the interactions between teams. Most of those GraphQL APIs are backed with Scylla!

Since we like to share our experience I explained and detailed those changes in the talk I gave this year at EuroPython.

For people who have been following the narrative between Numberly and Scylla so far, what new insights will you bring to Scylla Summit 2019?

I will try my best to bring tangible facts and field experience to the table that speak to both infrastructure and development oriented attendees.

Outside of my talk, I’d be happy to discuss how we handled change management to bring Scylla forward, foster interest among our techs, train them and even recognize their Scylla knowledge thanks to our own certification!

Thanks for taking the time to speak with me today. We’re all looking forward to your talk!

REGISTER NOW FOR SCYLLA SUMMIT!

