18 Aug 2018 •
DATA-SYSTEMS
Cut to the chase
Large-scale data processing serves multiple purposes. At a 30,000-feet view, every purpose can be bucketed into two broad categories:
- Maintaining Materialized Views
- Processing Events
This categorization is a high, high level one I use to reason about data system design, and its utility fades fast as we delve deeper into system nitty-gritty. Silos appear within & around each of these buckets as we descend into implementation of systems, but it is still a useful one to reason about data-intensive applications.
The basis of this categorization is captured in the following statement:
Every data system has two variables: data & query. The defining feature of the system is in the temporal nature of these variables. In every data system, either data or query is transient and the other is persistent.
In a data system maintaining materialized views, data (or more precisely, the view of data) is persistent, and query is a transient entity flowing into & out of the system.
In a data system processing events, query is persistent and transient data flows through the system.
.. Read More
24 Jul 2018 •
APACHE-SPARK
YARN
Introduction
Apache Spark is a lot to digest; running it on YARN even more so. This article is an introductory reference to understanding Apache Spark on YARN. Since our data platform at Logistimo runs on this infrastructure, it is imperative you (my fellow engineer) have an understanding about it before you can contribute to it. This article assumes basic familiarity with Apache Spark concepts, and will not linger on discussing them.
.. Read More
28 Jun 2018 •
APACHE-SPARK
SQL
JOINS
Introduction
This post is the second in my series on Joins in Apache Spark SQL. The first part explored Broadcast Hash Join; this post will focus on Shuffle Hash Join & Sort Merge Join.
.. Read More
24 Jun 2018 •
CONVERSATIONS
LEARNING
CULTURE
Note: This post has some concepts on Scala collections. Do not worry if you have little interest in Scala; the point I am trying to convey has significance beyond my choice of language. This is an exhortation to the engineering community at large to share our learnings more.
.. Read More
20 Jun 2018 •
CONCURRENCY
PARALLELISM
TL; DR
This post explores the notion that the definition of concurrency & parallelism itself is not language-agnostic. Depending on the language & paradigms we subscribe to, the definitions change.
.. Read More