Sujith Jay Nair Thinking Aloud

Hanlon's Razor: Some Comments

Do not attribute to malice that which can be explained by the less criminal motives of ignorance and lethargy.

An aphorism of utmost utility in my life is Hanlon’s Razor. I find it a liberating rule of thumb for weighing the many unavoidably unpleasant experiences of daily life. In the less formal and terser form that I prefer, it reads:

Stupid people abound; malicious people, less so.

There is a neat Wikipedia article on it that focuses on its origin; it also introduced me to an earlier form of the aphorism by Goethe.

Misunderstandings and lethargy perhaps produce more wrong in the world than deceit and malice do. At least the latter two are certainly rarer.
Johann Wolfgang von Goethe, in The Sorrows of Young Werther

Prefer Unions over Or in Spark Joins

A common anti-pattern in Spark workloads is the use of an or operator as part of a join condition. An example looks like this:

// Assumes `import spark.implicits._` is in scope for the $"..." column syntax
val resultDF = dataframe
  .join(anotherDF, $"cID" === $"customerID" || $"cID" === $"contactID", "left")

This looks straightforward. Using an or within the join condition makes the semantics easy to understand. However, we should be aware of the pitfalls of this approach.

Spark resolves the declarative query above into a physical plan, which determines how the query is executed. To view the query plan for the computation, we can call:

resultDF.explain()

/* pass true if you are interested in the logical plan of the query as well */
resultDF.explain(true)
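As a rough illustration of the rewrite the title points to, here is a minimal sketch that replaces the or condition with two equi-joins combined by a union. The object and function names are hypothetical, and the sketch assumes Spark 3.1 or later (for unionByName with allowMissingColumns), that the two DataFrames share no column names, and that neither input contains fully duplicated rows (distinct would collapse them). The motivation is that a join on an or condition is not an equi-join, so Spark typically cannot use a hash or sort-merge join for it and falls back to a nested loop join.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object UnionInsteadOfOr {
  // Hypothetical helper, not taken from the original post.
  // Assumes `left` has a column cID and `right` has customerID and contactID.
  def leftJoinViaUnion(left: DataFrame, right: DataFrame): DataFrame = {
    val onCustomer = col("cID") === col("customerID")
    val onContact  = col("cID") === col("contactID")

    // Two equi-joins, each of which can be planned as a hash or sort-merge join.
    val matched = left.join(right, onCustomer, "inner")
      .union(left.join(right, onContact, "inner"))
      .distinct() // a right row that matches on both keys must appear only once

    // Left rows that match on neither key.
    val unmatched = left
      .join(right, onCustomer, "left_anti")
      .join(right, onContact, "left_anti")

    // Null-pad the unmatched rows so the result has the shape of a left join.
    matched.unionByName(unmatched, allowMissingColumns = true)
  }
}
```

Whether this is actually faster depends on the data: the union version adds a distinct and scans each input more than once, so it is worth comparing both plans with explain() before committing to either.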

Integration of Large-Scale Data Processing Systems and Traditional Parallel Database Technology

Integration of Large-Scale Data Processing Systems and Traditional Parallel Database Technology. Abouzied, A., Abadi, D. J., Bajda-Pawlikowski, K., Silberschatz, A. (2019, August). Proceedings of the VLDB Endowment, Vol. 12 (12).

HadoopDB was a prototype built in 2009 as a hybrid SQL system combining features of the Hadoop MapReduce framework with those of parallel database management systems (Greenplum, Vertica, etc.). This paper revisits the design choices behind HadoopDB and investigates its legacy in existing data systems. I found it to be a great review paper on the state of modern data analysis systems.

MapReduce is the most famous example of a class of systems that partition large amounts of data over a multitude of machines and provide a straightforward language in which to express complex transformations and analyses. The key feature of these systems is that they abstract fault tolerance and partitioning away from the user.

MapReduce, along with other large-scale data processing systems such as Microsoft’s Dryad/LINQ project, was originally designed for processing unstructured data.

The success of these systems in processing unstructured data led to a natural desire to also use them for processing structured data. However, the final result was a major step backward relative to the decades of research in parallel database systems that provide similar capabilities of parallel query processing over structured data. 1

The MapReduce model of Map -> Shuffle -> Reduce/Aggregate -> Materialize is inefficient for parallel structured query processing.
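To make the stages concrete, here is a toy, single-machine sketch of a grouped aggregation expressed as Map -> Shuffle -> Reduce using plain Scala collections. The Order type and the function name are assumptions made for illustration. In a real MapReduce job each stage runs across many machines and intermediate results are written out between stages, which is where the overhead for structured queries comes from.

```scala
object MapReduceSketch {
  // Hypothetical record type, used only for this illustration.
  final case class Order(customerId: Int, amount: Double)

  // Roughly equivalent to: SELECT customerId, SUM(amount) ... GROUP BY customerId
  def totalPerCustomer(orders: Seq[Order]): Map[Int, Double] = {
    // Map: emit a (key, value) pair for every input record.
    val mapped: Seq[(Int, Double)] = orders.map(o => (o.customerId, o.amount))

    // Shuffle: bring all values with the same key together.
    val shuffled: Map[Int, Seq[Double]] =
      mapped.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2) }

    // Reduce/Aggregate: collapse each group into a single value.
    shuffled.map { case (key, values) => key -> values.sum }
  }
}
```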


Datomic with Rich Hickey

This talk is an introduction to Datomic, by its creator Rich Hickey. My notes on this talk are linked below:

Open Core : How Did We Get Here?

Open source is considered an exemplar of the ‘private-collective’ model of innovation,1 a compound model with elements from both the private investment & the collective action models.

This model was an attempt to rationalise and reason about the existence of the open source software industry, and answer the question: “why would thousands of top-notch programmers contribute, without apparent material incentives, to the provision of a public good?”.2

This essay revisits the assumptions of the private-collective model in the cloud-computing era, to understand the emerging phenomenon of the open core revenue model in the commercial open source software industry. This is of particular significance given the perceived siege of the open source model by cloud vendors.3
