Sujith Jay Nair Thinking Aloud

Concurrency and Parallelism

TL; DR This post explores the notion that the definition of concurrency & parallelism itself is not language-agnostic. Depending on the language & paradigms we subscribe to, the definitions change.

.. Read More

The Assumption of Normality in Time Series

The notion of normality is oft-encountered in statistics as an underlying assumption to many proofs and results; it is normal to assume normality (pun strongly intended; always wanted to use this one). In much statistical works, the assumption of normality, even if inaccurate, is amortized and ameliorated by the existence of Central Limit Theorem. Time-series analysis, sadly, does not enjoy this privilege. The assumption of independence, so core to the CLT and other Limit theorems, is poignantly absent in time-series.

This post tries to explain the use of Limit theorems in time-series analysis. As my intended audience comprises computer scientists/engineers, and not statisticians, this post has a long preface on the premise of the problem.

.. Read More

Broadcast Hash Joins in Apache Spark

image-title-here

Introduction

This post is part of my series on Joins in Apache Spark SQL. Joins are amongst the most computationally expensive operations in Spark SQL. As a distributed SQL engine, Spark SQL implements a host of strategies to tackle the common use-cases around joins.

In this post, we will delve deep and acquaint ourselves better with the most performant of the join strategies, Broadcast Hash Join.

.. Read More