Sujith Jay Nair Thinking Aloud

Dynamo vs Cassandra : Systems Design of NoSQL Databases

State-of-the-art distributed databases represent a distillation of years of research in distributed systems. The concepts underlying any distributed system can thus be overwhelming to comprehend. This is especially true of databases that do not offer strong consistency guarantees. Such databases come in a range of flavours, but they are usually grouped under a single category: NoSQL databases.

NoSQL databases do not represent a single kind of data model, nor do they offer uniform guarantees regarding consistency and availability. However, they are built on very similar principles and ideas.

From a historical perspective, the advent of NoSQL databases was precipitated by the publication of Dynamo by Amazon [1] and Bigtable by Google, and by the emergence of a number of open-source distributed data stores that were (improved?) clones of either (or both) of these systems. Bigtable-inspired NoSQL stores are referred to as column-stores (e.g. HyperTable, HBase), whereas Dynamo influenced most of the key/value stores. We will loosely term the latter Dynamo-family databases; they include Riak, Aerospike, Project Voldemort, and Cassandra.

In this article, I would like to focus on the systems design ideas behind Dynamo-family NoSQL databases, with particular attention to Cassandra. The approach is to compare and contrast Cassandra with Dynamo and, in the process, touch upon the underlying ideas. Expect a lot of homework and further reading; I include copious references throughout the article.


A Simple Dichotomy for Modeling Data-Intensive Systems

Cut to the chase

Large-scale data processing serves multiple purposes. From a 30,000-foot view, every purpose falls into one of two broad categories:

  • Maintaining Materialized Views
  • Processing Events

This categorization is a very high-level one that I use to reason about data system design, and its utility fades fast as we delve into the nitty-gritty of a system. Silos appear within and around each of these buckets as we descend into implementation, but it is still a useful way to reason about data-intensive applications.

The basis of this categorization is captured in the following statement:

Every data system has two variables: data & query. The defining feature of the system is in the temporal nature of these variables. In every data system, either data or query is transient and the other is persistent.

In a data system maintaining materialized views, data (or more precisely, the view of data) is persistent, and query is a transient entity flowing into & out of the system.

In a data system processing events, query is persistent and transient data flows through the system.
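The two cases above can be sketched in a few lines of Python. This is only an illustration of the dichotomy, not an implementation of any real system; the class names `MaterializedView` and `EventProcessor`, and the `amount` field on events, are hypothetical names chosen for this sketch.

```python
from typing import Callable, Dict, Iterable, Iterator


class MaterializedView:
    """Persistent data, transient queries (hypothetical illustration)."""

    def __init__(self) -> None:
        # The view outlives every individual query made against it.
        self._totals: Dict[str, int] = {}

    def ingest(self, key: str, amount: int) -> None:
        # Keep the persistent view up to date as new data arrives.
        self._totals[key] = self._totals.get(key, 0) + amount

    def query(self, key: str) -> int:
        # A query flows in, reads the view, and flows out again.
        return self._totals.get(key, 0)


class EventProcessor:
    """Persistent query, transient data (hypothetical illustration)."""

    def __init__(self, predicate: Callable[[Dict[str, int]], bool]) -> None:
        # The query is registered once and outlives every event.
        self._predicate = predicate

    def process(self, events: Iterable[Dict[str, int]]) -> Iterator[Dict[str, int]]:
        # Events flow through; nothing is retained after they pass.
        for event in events:
            if self._predicate(event):
                yield event


view = MaterializedView()
view.ingest("orders", 3)
view.ingest("orders", 4)
print(view.query("orders"))  # 7

processor = EventProcessor(lambda event: event["amount"] > 3)
print(list(processor.process([{"amount": 3}, {"amount": 4}])))  # [{'amount': 4}]
```

In the first class the state lives in the system and queries come and go; in the second the only thing the system holds on to is the query itself.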


Understanding Apache Spark on YARN


Apache Spark is a lot to digest; running it on YARN even more so. This article is an introductory reference for understanding Apache Spark on YARN. Since our data platform at Logistimo runs on this infrastructure, it is imperative that you (my fellow engineer) understand it before you contribute to it. This article assumes basic familiarity with Apache Spark concepts and will not linger on them.


Shuffle Hash and Sort Merge Joins in Apache Spark


This post is the second in my series on Joins in Apache Spark SQL. The first part explored Broadcast Hash Join; this post will focus on Shuffle Hash Join & Sort Merge Join.


How Conversations on StackOverflow Teach You


Note: This post touches on some concepts from Scala collections. Do not worry if you have little interest in Scala; the point I am trying to convey has significance beyond my choice of language. This is an exhortation to the engineering community at large to share more of what we learn.
