Sujith Jay Nair Thinking Aloud

Apache Spark Articles

    Understanding Apache Spark on YARN

    Introduction

    Apache Spark is a lot to digest; running it on YARN even more so. This article is an introductory reference for understanding Apache Spark on YARN. Since our data platform at Logistimo runs on this infrastructure, it is imperative that you (my fellow engineer) understand it before you contribute to it. This article assumes basic familiarity with Apache Spark concepts and will not linger on them.

    Shuffle Hash and Sort Merge Joins in Apache Spark

    Introduction

    This post is the second in my series on joins in Apache Spark SQL. The first part explored Broadcast Hash Join; this post focuses on Shuffle Hash Join and Sort Merge Join.

    Broadcast Hash Joins in Apache Spark

    Introduction

    This post is part of my series on joins in Apache Spark SQL. Joins are among the most computationally expensive operations in Spark SQL. As a distributed SQL engine, Spark SQL implements a host of strategies to handle the common use cases around joins.

    In this post, we take a closer look at the most performant of these join strategies: Broadcast Hash Join.
