Filling Missing Data

25 Mar 2020 • APACHE-SPARK

A recent exercise I undertook of upgrading Apache Spark for some workloads from v2.4.3 to v2.4.5 surfaced a number of run-time errors of the form:

org.apache.spark.sql.AnalysisException: Cannot resolve column name "name" among (id, place);
  at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:223)
  at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:223)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:222)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1274)
  at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$toAttributes$2.apply(DataFrameNaFunctions.scala:475)

A little poking-around showed this error occurred for transformations with a similar general shape. The following is a minimal example to recreate it:

val df = Seq(
  ("1", "Berlin"),
  ("2", "Bombay")
  ).toDF("id", "place")

df.na.fill("empty",Seq("id", "place", "name"))

This looks wrong, but apparently works fine in v2.4.3 😲. A transformation which attempts to fill in a missing value for a column which does not exist should raise an error: v2.4.5 does that.

Prefer Unions over Or in Spark Joins

11 Oct 2019 • APACHE-SPARK SQL JOINS

A common anti-pattern in Spark workloads is the use of an or operator as part of ajoin. An example of this goes as follows:

val resultDF = dataframe
 .join(anotherDF, $"cID" === $"customerID" || $"cID" === $"contactID",
   "left")

This looks straight-forward. The use of an or within the join makes its semantics easy to understand. However, we should be aware of the pitfalls of such an approach.

The declarative SQL above is resolved within Spark into a physical plan which determines how this particular query gets executed. To view the query plan for the computation, we could do:

resultDF.explain()

/* pass true if you are interested in the logical plan of the query as well */
resultDF.explain(true)

Understanding Apache Spark on YARN

24 Jul 2018 • APACHE-SPARK YARN

Introduction

Apache Spark is a lot to digest; running it on YARN even more so. This article is an introductory reference to understanding Apache Spark on YARN. Since our data platform at Logistimo runs on this infrastructure, it is imperative you (my fellow engineer) have an understanding about it before you can contribute to it. This article assumes basic familiarity with Apache Spark concepts, and will not linger on discussing them.

Shuffle Hash and Sort Merge Joins in Apache Spark

28 Jun 2018 • APACHE-SPARK SQL JOINS

Introduction

This post is the second in my series on Joins in Apache Spark SQL. The first part explored Broadcast Hash Join; this post will focus on Shuffle Hash Join & Sort Merge Join.

Broadcast Hash Joins in Apache Spark

17 Feb 2018 • APACHE-SPARK SQL JOINS

image-title-here

Introduction

This post is part of my series on Joins in Apache Spark SQL. Joins are amongst the most computationally expensive operations in Spark SQL. As a distributed SQL engine, Spark SQL implements a host of strategies to tackle the common use-cases around joins.

In this post, we will delve deep and acquaint ourselves better with the most performant of the join strategies, Broadcast Hash Join.

Sujith Jay Nair Thinking Aloud

Apache Spark Articles

Filling Missing Data

Prefer Unions over Or in Spark Joins

Understanding Apache Spark on YARN

Introduction

Shuffle Hash and Sort Merge Joins in Apache Spark

Introduction

Broadcast Hash Joins in Apache Spark

Introduction