21 Feb 2020 •
I was recently on the Software Engineering Daily podcast to talk about Data Engineering at Nubank.
It turned out to be a great conversation on functional data engineering, the importance of testability & reproducibility in data engineering (and our approach to achieving it at scale at Nubank), thinking of dataset quality in terms of dataset-as-a-service, and my take on the history of data engineering as a rediscovery of the table abstraction. Check it out here.
24 Nov 2019 •
Mesos is a framework I have recently become acquainted with. We use it to manage resources for our Spark workloads. The other resource management framework for Spark I have prior experience with is Hadoop YARN. In this article, I revisit the concept of cluster resource-management in general, and explain higher-level Mesos abstractions & concepts. To this end, I borrow heavily from the classification of cluster resource-management systems in the Omega paper.
The Omega system is considered one of the precursors to Kubernetes. There is a fine article in ACM Queue describing this history. Also, Brian Grant has some rare insights into the evolution of cluster managers at Google from Omega to Kubernetes in multiple tweet-storms, such as this and this.
.. Read More
16 Nov 2019 •
Do not attribute to malice that which can be explained by the less criminal motives of ignorance and lethargy.
An aphorism of utmost utility in my life is Hanlon’s Razor. I find it a liberating rule of thumb for weighing a lot of unavoidably unpleasant experiences in daily life. In a less formal & more terse form that I prefer, it reads:
Stupid people abound; Malicious people, less so.
There is a neat Wikipedia article on it which focuses on its origin, and which also introduced me to an earlier form of the aphorism by Goethe.
Misunderstandings and lethargy perhaps produce more wrong in the world than deceit and malice do. At least the latter two are certainly rarer.
Johann Wolfgang von Goethe, in The Sorrows of Young Werther
.. Read More
11 Oct 2019 •
A common anti-pattern in Spark workloads is the use of an or operator as part of a join. An example of this goes as follows:
val resultDF = dataframe
.join(anotherDF, $"cID" === $"customerID" || $"cID" === $"contactID",
This looks straightforward. The use of an or within the join makes its semantics easy to understand. However, we should be aware of the pitfalls of such an approach.
The declarative query above is resolved within Spark into a physical plan, which determines how this particular query gets executed. To view the query plan for the computation, we could do:
/* pass true if you are interested in the logical plan of the query as well */
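The excerpt cuts off before the call itself, so what follows is only a minimal sketch of how the plan could be inspected, not the original snippet. It assumes the comment above refers to the boolean argument of Spark's Dataset.explain. The object name OrJoinPlan, the sample rows, and the payload column are hypothetical, and the join type is left at Spark's default (inner) because the original join call is truncated.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object; names and sample data are illustrative only.
object OrJoinPlan {

  // Builds a join whose condition combines two equalities with an `or`.
  def orJoin(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val dataframe = Seq((1, "a"), (2, "b")).toDF("cID", "payload")
    val anotherDF = Seq((1, 10), (3, 2)).toDF("customerID", "contactID")

    // Join type omitted (assumption): Spark then performs an inner join.
    dataframe.join(
      anotherDF,
      $"cID" === $"customerID" || $"cID" === $"contactID"
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("or-join-plan")
      .getOrCreate()

    val resultDF = orJoin(spark)

    // Prints the physical plan only.
    resultDF.explain()

    // Prints the parsed, analyzed and optimized logical plans as well.
    resultDF.explain(true)

    spark.stop()
  }
}
```

Because the join condition is not a plain equality, I would expect the printed plan to show a nested-loop style join rather than a hash or sort-merge join, but the exact operator depends on the Spark version and the data sizes, so it is worth checking the output on your own cluster.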