The popularity of Apache Cassandra and the applicability of its development model have seen it clearly emerge as the leading NoSQL technology for scale, performance and availability. One only needs to survey the ever-increasing range of Cassandra-compatible options now available on the market to find further proof of its popularity.
As we get started with 2018, the range of Cassandra-compatible offerings available on the market includes Datastax Enterprise, Scylla, Yugabyte and Azure Cosmos DB.
We all know that the database is a key foundational technology for any application. You need to ensure you choose a product that meets the functional requirements of your use case, is robust and scalable, makes efficient use of compute resources and will be usable by your dev team and supportable by your ops team now and into the future. Selection of the database technology for a new application therefore deserves rigorous consideration of your specific requirements.
This blog post surveys the current state of these offerings and the key considerations for people evaluating them, and finishes with an overview of some of the in-progress development for Apache Cassandra that should ensure it remains the default, and best, choice for the majority of use cases.
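This post provides some high-level considerations that should help you to narrow down contenders for evaluation. For each technology we consider breadth of production deployment, licensing model, strength of community, functionality, and scalability and performance.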
Datastax Enterprise (DSE) is a closed source product derived from Apache Cassandra. For core Cassandra features, it is driver-level compatible with Apache Cassandra. Online migration from DSE to Apache Cassandra can be achieved with minimal effort where DSE proprietary features have not been used. However, DSE contains a number of extensions that are not included in Apache Cassandra, such as bundling Spark and Solr into the same application and providing custom security and compaction providers.
Breadth of production deployment: DSE has been used in production by many organisations over several years.
Licensing model: DSE is a closed-source, proprietary product derived from open source products. Use of DSE requires payment of a licensing fee to Datastax.
Strength of community: As a proprietary product, support and enhancement of DSE is entirely reliant on Datastax. However, DSE does build on contributions from the communities for the underlying open source products.
Functionality: Functional enhancements in DSE vs the open source products are generally enterprise-specific features (such as LDAP authentication integration), relatively simple integration of the other included products (Spark, Solr) and the entirely proprietary DSE Graph graph database functionality.
Scalability and Performance: In general, DSE performance will be very similar to that of the underlying open source products. However, Datastax does claim some proprietary performance improvements.
Scylla is effectively a re-implementation of Apache Cassandra in C++ with an aim of providing highly optimised performance. From a functional point of view, it provides most, but not all, functions of Cassandra and generally doesn’t aim to provide additional functions to Cassandra. It is driver-level compatible with Apache Cassandra but migration to/from Scylla requires an application-level migration strategy such as dual-writes.
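As a rough illustration of the dual-write strategy mentioned above, the sketch below writes every change to both the old and the new cluster, backfills historical rows, and only then switches reads. It is a minimal sketch under stated assumptions: `InMemoryStore`, `DualWriter` and `backfill` are hypothetical names invented for this example, and the in-memory dictionaries stand in for real clusters that an application would reach through a Cassandra driver session.

```python
from typing import Dict, Iterable, Optional


class InMemoryStore:
    """Hypothetical stand-in for a cluster (assumption: a real
    implementation would wrap a driver session instead of a dict)."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self._rows[key] = value

    def read(self, key: str) -> Optional[str]:
        return self._rows.get(key)

    def keys(self) -> Iterable[str]:
        return list(self._rows)


class DualWriter:
    """Sends every write to both stores; reads come from one of them."""

    def __init__(self, old: InMemoryStore, new: InMemoryStore) -> None:
        self.old = old
        self.new = new
        self.read_from_new = False  # flip once the backfill is verified

    def write(self, key: str, value: str) -> None:
        # The old store stays the source of truth until cut-over.
        self.old.write(key, value)
        self.new.write(key, value)

    def read(self, key: str) -> Optional[str]:
        source = self.new if self.read_from_new else self.old
        return source.read(key)


def backfill(old: InMemoryStore, new: InMemoryStore) -> int:
    """Copy rows that predate dual-writing, without clobbering rows the
    dual-writer has already placed in the new store. Returns rows copied."""
    copied = 0
    for key in old.keys():
        if new.read(key) is None:
            new.write(key, old.read(key))
            copied += 1
    return copied


if __name__ == "__main__":
    old, new = InMemoryStore(), InMemoryStore()
    old.write("user:1", "alice")       # data written before the migration
    writer = DualWriter(old, new)
    writer.write("user:2", "bob")      # lands in both stores
    backfill(old, new)                 # brings user:1 across
    writer.read_from_new = True        # cut reads over to the new store
    print(writer.read("user:1"), writer.read("user:2"))
```

A real migration would also need to handle failed writes to either side and preserve write timestamps during the backfill; those concerns are deliberately left out of this sketch.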
Breadth of production deployment: Scylla 1.0 was released in March 2016. Several organisations are reported as running it in production, although the level of production deployment would be a small fraction of that of Apache Cassandra.
Licensing Model: Scylla is open source but licensed under the AGPL (GNU Affero General Public License). This license requires that any organisation making a modified version of the product (even for internal use) must publish those modifications. As a result of this requirement, many organisations (particularly large tech orgs that tend to adopt and drive enhancements to open source projects) will not adopt software using the AGPL.
Strength of Community: Scylla is largely dependent on a single company (ScyllaDB) for all development and support.
Functionality: Scylla generally aims to be functionally compatible with Cassandra although not all features are currently available (lightweight transactions being one notable exception).
Scalability and Performance: Improved performance is Scylla’s main objective. Scylla has published many benchmarks demonstrating substantial performance improvements. However, the most significant gains are seen when running on large machines with high-performance IO, and the gains in the more typical cloud deployments chosen for manageability are often smaller than these benchmarks suggest.
Yugabyte is a new database aiming to offer both SQL and NoSQL functionality in a single database. It is driver compatible with Cassandra (although there is also a Yugabyte-specific fork of the Cassandra driver) and with Redis, and has announced plans for PostgreSQL compatibility.
Breadth of production deployment: Yugabyte is currently in Beta with production release planned for 2018.
Licensing Model: Yugabyte is Apache 2.0 Licensed open source software. A closed source “enterprise” edition is also offered with additional manageability and other features.
Strength of Community: Yugabyte is a new product developed by a single company (Yugabyte) and all development and support of the product is dependent on this company.
Functionality: The core Yugabyte engine supports full ACID transactions and a different replication model to Apache Cassandra. Presumably additional features will also be required for PostgreSQL compatibility. While Yugabyte claims compatibility with core Cassandra features it seems likely that, given the differences in underlying engine models, there will be semantic differences that are not readily apparent (for example, Yugabyte already claims differences in consistency semantics).
Scalability and Performance: Yugabyte have published benchmarks claiming improved performance for some scenarios. However, tuning of the Apache Cassandra configuration for their comparison benchmarks was extremely poor and, in any event, the very different architecture of Yugabyte is likely to lead to quite different performance characteristics versus Apache Cassandra depending on the use case.
Cosmos DB is a Microsoft Azure offering designed to provide a globally distributed database with NoSQL functionality. It supports multiple APIs including SQL, JavaScript, Gremlin (graph), MongoDB and Cassandra.
Breadth of production deployment: Cosmos DB was released in May 2017 although it builds on Azure DocumentDB which was released in 2014. The Cassandra API was released into preview in November 2017.
Licensing model: Cosmos DB is a proprietary, closed source, technology offered only as an Azure service.
Strength of Community: Cosmos DB is developed and supported by Microsoft.
Functionality: Cosmos DB claims Cassandra compatibility but without providing a detailed breakdown of supported and unsupported Cassandra features, and it seems unlikely there would be complete feature compatibility (at a minimum, the approach to consistency levels is quite different). The documented strategy to import data from Cassandra into Cosmos DB is via the cqlsh COPY FROM / COPY TO commands, which export data via CSV and generally aren’t suitable for production-size datasets.
Scalability and Performance: Cosmos DB provides latency SLAs for the 99th percentile which are comparable to other latency-focused offerings such as Scylla. Cost effectiveness at scale is hard to gauge and is dependent on Azure pricing.
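To make concrete what a 99th percentile latency figure measures, here is a minimal sketch that computes a percentile from a set of latency samples using the nearest-rank method. The function name `percentile`, the sample data and the 10 ms threshold are all assumptions made for illustration; they are not taken from any vendor's SLA documentation.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Illustrative data: 98 fast requests, one slower, one outlier.
    latencies_ms = [2.0] * 98 + [9.0, 50.0]
    p99 = percentile(latencies_ms, 99)
    hypothetical_sla_ms = 10.0  # assumed threshold, for illustration only
    print(f"p99 = {p99} ms, within SLA: {p99 <= hypothetical_sla_ms}")
```

Note that a p99 target says nothing about the slowest 1% of requests: in the sample data above the 50 ms outlier does not affect the result.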
Apache Cassandra is the inspiration and genesis for all of these offerings. Since its 1.0 release in October 2011, Apache Cassandra has progressed to version 3.11, with version 4.0 in development. It aims to provide virtually unlimited scalability and the ability to run with the highest levels of availability and full global distribution. Many household name internet services (eg Apple, Uber, Spotify, Instagram) rely on Apache Cassandra as a core component of their architecture.
Breadth of production deployment: Production deployments of Apache Cassandra are likely an order of magnitude greater than those of any of the other products mentioned above.
Licensing model: Apache Cassandra is Apache 2.0 Licensed open source software.
Strength of Community: Apache Cassandra development is governed by the Apache Software Foundation, the same organisation and governance rules behind some of the most successful open source projects such as Hadoop, Spark, Tomcat and the original Apache web server. Apache Cassandra committers are employed by close to 10 different companies with regular contributions from a wide range of companies. Apple, as a key user, is one of the most active contributors to the project.
Functionality: While some of the other products above are aiming to extend the functionality of Cassandra, Cassandra defines the core feature set that the others are aiming to emulate.
Scalability and Performance: There can be little question as to the scalability of Cassandra, with production clusters in the thousands of nodes holding petabytes of data and servicing millions of operations per second. The Cassandra community is always working to improve the performance of Cassandra, with several major performance initiatives currently underway in the project (eg CASSANDRA-13476).