Sujith Jay Nair Thinking Aloud

A Simple Dichotomy for Modeling Data-Intensive Systems

Cut to the chase

Large-scale data processing serves multiple purposes. From a 30,000-foot view, each of these purposes falls into one of two broad categories:

  • Maintaining Materialized Views
  • Processing Events

This categorization is a very high-level one that I use to reason about data system design, and its utility fades fast as we delve into the details of a particular system. Silos appear within & around each of these buckets as we descend into implementation, but the categorization remains useful for reasoning about data-intensive applications.

The basis of this categorization is captured in the following statement:

Every data system has two variables: data & query. The defining feature of the system lies in the temporal nature of these variables. In every data system, either data or query is transient and the other is persistent.

In a data system maintaining materialized views, data (or more precisely, the view of data) is persistent, and query is a transient entity flowing into & out of the system.

In a data system processing events, query is persistent and transient data flows through the system.
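The two cases can be sketched in a few lines of Python. This is only an illustration of the dichotomy, not code from any real system: `MaterializedView`, `process_events`, and the sample records are hypothetical names and data invented for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator


class MaterializedView:
    """Persistent data, transient queries: the view outlives every query."""

    def __init__(self) -> None:
        self._totals: Dict[str, int] = defaultdict(int)

    def apply(self, key: str, amount: int) -> None:
        # Keep the stored view up to date as the source data changes.
        self._totals[key] += amount

    def query(self, key: str) -> int:
        # A query arrives, reads the stored view, and leaves.
        return self._totals.get(key, 0)


def process_events(
    events: Iterable[dict], standing_query: Callable[[dict], bool]
) -> Iterator[dict]:
    """Persistent query, transient data: each event passes through once."""
    for event in events:
        if standing_query(event):
            yield event


# Materialized view: data persists, the query is a one-off.
view = MaterializedView()
view.apply("user-1", 5)
view.apply("user-1", 7)
view.apply("user-2", 3)
view_answer = view.query("user-1")

# Event processing: the query persists, the data flows past it.
stream = [
    {"user": "user-1", "amount": 5},
    {"user": "user-2", "amount": 30},
]
large_orders = list(process_events(stream, lambda e: e["amount"] > 10))
```

In the first half, nothing about the query is stored; in the second, nothing about the events is stored unless the standing query selects them.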


Understanding Apache Spark on YARN

Introduction

Apache Spark is a lot to digest; running it on YARN even more so. This article is an introductory reference for understanding Apache Spark on YARN. Since our data platform at Logistimo runs on this infrastructure, it is imperative that you (my fellow engineer) understand it before you contribute to it. This article assumes basic familiarity with Apache Spark concepts, and will not linger on them.


Shuffle Hash and Sort Merge Joins in Apache Spark

Introduction

This post is the second in my series on Joins in Apache Spark SQL. The first part explored Broadcast Hash Join; this post will focus on Shuffle Hash Join & Sort Merge Join.
