Another week, another Spark and Scylla post! This time, we’re back again with the Scylla Spark Migrator; we’ll take a short tour through its innards to see how it is implemented. Read why we implemented the Scylla Spark Migrator in this blog. Overview When developing the Migrator, we had several design goals in mind. First, the Migrator should be highly efficient in terms of resource usage. Resource efficiency in the land of Spark applications usually translates to avoiding data shuffles between nodes. Data shuffles are destructive to Spark’s performance, as they incur more I/O costs. Moreover, shuffles usually get slower […]
We covered the basics of Elasticsearch and how Scylla is a perfect complement for it in part one of this blog. Today we want to give you specific how-tos on connecting Scylla and Elasticsearch, including use cases and sample code. Use Case #1 If combining a persistent, highly available datastore with full text search engine is a market requirement, then implementing a single, integrated solution is an ultimate goal that requires time and resources. To answer this challenge we describe below a way for users to use best-of-breed solutions that support full text search workloads. We chose Elasticsearch open source together with […]
In the run-up to Scylla Summit 2018, we’ll be featuring our speakers and providing sneak peeks at their presentations. This interview in our ongoing series is with Hojjat Jafarpour of Confluent. His presentation at Scylla Summit is entitled Scalable Stream Processing with KSQL, Kafka and Scylla. Hojjat, before we get into your talk, tell us a little about yourself. What do you like to do for fun? I’m a Software Engineer at Confluent and the creator of KSQL, the Streaming SQL engine for Apache Kafka. In addition to challenging problems in scalable data management, I like traveling and outdoors. We are so […]
The Scylla team is pleased to announce the release of Scylla 2.3, a production-ready Scylla Open Source minor release. The Scylla 2.3 release includes CQL enhancements, new troubleshooting tools, performance improvements and more. Experimental features include Materialized Views, Secondary Indexes, and Hinted Handoff (details below). Starting from Scylla 2.3, packages are also available for Ubuntu 18.04 and Debian 9. Scylla is an open source, Apache Cassandra-compatible NoSQL database, with superior performance and consistently low latency. Find the Scylla 2.3 repository for your Linux distribution here. Our open source policy is to support only the current active release and its predecessor. […]
Large partitions, although supported by Scylla, are also well known for causing performance issues. Fortunately, release 2.3 comes with a helping hand for discovering and investigating large partitions present in a cluster — system.large_partitions table. Large partitions CQL, as a data modeling language, aims towards very good readability and hiding unneeded implementation details from users. As a result, sometimes it’s not clear why a very simple data model suffers from unexpected performance problems. One of the potential suspects might be large partitions. Our blog entry on large partitions contains a detailed explanation on why coping with large partitions is important. […]
What’s the deal with prepared statements? A query itself is just a string of text. For example: INSERT INTO tb (key,val) VALUES (“key”, “value”) In this simple example, we inserted two strings in a two-column table. Before that can happen, the CQL statement string (INSERT INTO…) needs to be sent to Scylla, parsed, and assuming no errors in the query, executed. It’s the parsing part that we are concerned with here. Parsing a CQL query is a compute-intensive operation that consumes resources just like anything else you would have a computer do. What if we could do the parsing part […]
At ScyllaDB, our development team is all about performance with improved latency and throughput. Our speakers at our recent Scylla Summit provided many tips and tricks to make Scylla’s superior latency and performance even better. ScyllaDB’s VP of R&D, Schlomi Livne, added to the growing repertoire of these tips with his talk Planning your queries for maximum performance. In it, he outlined some of the how and why of Scylla performance, and concluded with seven rules to optimize your queries.
When it comes to monitoring a distributed platform, there are many ways to go about it. There are many tools and programs used to accomplish different monitoring tasks ranging from the operating system or disk statistics to database performance metrics. What we found is that users need an all-in-one document that covers these topics so they can reference it later when needed without having to dig through multiple documents.
Gocqlx is an extension to the Go Scylla / Apache Cassandra driver Gocql. It aims to boost developer productivity while not sacrificing query performance. It’s inspired by Sqlx, a tool for working with SQL databases, but it goes beyond what Sqlx provides. For this blog post, we will pretend we’re a microblogging service and use the following schema: Gocql is a very popular Cassandra driver for the Go programming language. Usually working with it looks more or less like this (source: Gocql README): At first glance, it looks ok but there are some problems: Gocql does not provide you with […]
The combination of a database and full-text search analytics becomes unavoidable these days. In this blog post, I will demonstrate a simple way to analyze data from a database with analytics software by using Scylla and Elasticsearch together to perform a simple data mining exercise that gathers data from Twitter. This demonstration will use a series of Docker containers that will run a Scylla and Elasticsearch cluster and a Node.js app that will feed data from Twitter into both platforms. This demo can be run on a laptop or production Docker server. To get started, let’s go over the prerequisites […]