<p><em>Sujith Jay Nair · sujith@sujithjay.com · <a href="https://sujithjay.com">sujithjay.com</a></em></p>
<h2>Large Language Models: Code vs. Text</h2>
<p><em>2023-04-10 · <a href="https://sujithjay.com/LLM-Code-and-Text">sujithjay.com/LLM-Code-and-Text</a></em></p>
<p>Every technology hype-cycle is a Dickensian tale of two extremes.</p>
<blockquote>
<p> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair.</p>
<cite> Charles Dickens, A Tale of Two Cities </cite>
</blockquote>
<p>Large Language Models (LLMs) are all the rage now, and we can see those extremes play out in the reception of products based on LLMs. For instance, GitHub Copilot has been a massive success within the programming community, whereas Meta’s Galactica faced intense criticism and was shut down within three days of launch.</p>
<p>The general perception (as of early 2023) is that (auto-regressive) LLMs are better at generating code, but have produced mixed results in use-cases involving the generation of general text. Why is that?</p>
<p>Yann LeCun provides a possible explanation of this divergence:</p>
<blockquote class="twitter-tweet" data-conversation="none" data-dnt="true"><p lang="en" dir="ltr">13. Why do LLMs appear much better at generating code than generating general text?<br />Because, unlike the real world, the universe that a program manipulates (the state of the variables) is limited, discrete, deterministic, and fully observable.<br />The real world is none of that.</p>— Yann LeCun (@ylecun) <a href="https://twitter.com/ylecun/status/1625127902890151943?ref_src=twsrc%5Etfw">February 13, 2023</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>I do NOT agree with this explanation. General text, like code, does not contend with the entire universe; it is limited to the context of the text. An essay on hydrogen atoms and their chemical kinetics does not care about Nietzsche’s thoughts on the Antichrist. And if the contention that LLMs deal with a finite, deterministic universe were true, code copilots would not hallucinate, would they?</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Copilot is great until it starts hallucinating methods that don't exist in the libraries being used, which is very often. Every time I autocomplete using Copilot, I need to check if the method exists and it makes sense. I am not sure how much of a time-saving this is.</p>— Delip Rao 🥭 (@deliprao) <a href="https://twitter.com/deliprao/status/1551725078005571587?ref_src=twsrc%5Etfw">July 26, 2022</a></blockquote>
<p>My take is that the success of LLMs with code can partly be attributed to what happens after the text is generated. With code copilots, the compiler and tests verify the correctness of generated code. No equivalent mechanism exists for scientific writing, or for any other form of writing. The faster feedback loop with code buttresses the user experience of code copilots despite the high chance of hallucination. Tightening the feedback loop is the way to improve the usability of text-generation tools for science and other use-cases.</p>
<h3 id="notes">Notes</h3>
<ol>
<li>
<p>Unlike general text, code admits only a very limited number of correct completions. General text has a much wider universe of completions to offer, and thus more shots at offering a satisfactory completion. How does this affect my argument?</p>
</li>
<li>
<p>The notion that LLMs are better at generating code than generating general text can itself be debated. Can it be partly explained by the attitudes and expectations of those using these LLMs as aids (programmers versus scientists, for instance)?</p>
</li>
</ol>
<h2>Defining a Platform is Hard</h2>
<p><em>2021-06-07 · <a href="https://sujithjay.com/Defining-Platforms">sujithjay.com/Defining-Platforms</a></em></p>
<p>Engineering platforms are a vague concept. Software organisations across the board agree on the need to ‘platformise’ layers of their stack, but struggle to define the term. The question ‘what is a platform?’ is met with the response ‘something similar to AWS, but at a higher layer of the company software stack’. <a href="/not-aws">I have previously argued why this is a false analogy</a>.</p>
<p>I think we can agree that there is a dichotomy in engineering platforms: public platforms and internal (or private) platforms. AWS S3, Snowflake, and others are examples of public platforms, while internal platforms are engineering platforms built within a software organisation to serve internal users.</p>
<p>My approach here is to start with a reasonable definition of platforms in general and arrive at a reasonable definition for internal platforms. This was not as straightforward as it sounds. The minimum we can gain from such an exercise is the ability to identify internal platforms in the wild (<em>“Is X an internal platform?”</em> ). <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<p>So, what is a platform?</p>
<p>Bill Gates’ definition of a platform goes like this:</p>
<blockquote>
<p> A platform is when the economic value to everybody that uses it exceeds the value of the company that creates the platform.</p>
</blockquote>
<p>This is a succinct description but the trouble with it is that it is a post-hoc definition: it can only identify a platform after it has started accruing value for the creators and users. Like a lagging indicator, this is useful in a limited sense. But as you might have noticed, the question we look to answer is also post-hoc (<em>“Is X an internal platform?”</em> ). Thus the Gates’ definition can be our initial template for a definition of internal platforms.</p>
<p>How can the above be paraphrased for internal platforms? Assuming that a company can be seen as a set of coordinating units or teams, we might say, <em>‘an internal platform is when the economic value to every team that uses it, exceeds the cost of the platform’</em>.</p>
<p>So, by this definition, in a simple case where a single ‘software platform’ is used by \(m\) product teams, each of which has \(n\) customers with a uniform revenue of \(R\) per customer, the cumulative value
\(n \cdot m \cdot R\) should (greatly) exceed \(C\), the cost-to-company of the platform.</p>
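<p>To make the arithmetic concrete, here is a toy check with purely made-up numbers (none of these figures come from a real platform):</p>

```python
# Hypothetical figures: 10 product teams, 1,000 customers each,
# $50 revenue per customer, $200,000 cost-to-company for the platform.
m, n, R = 10, 1_000, 50
C = 200_000

cumulative_value = n * m * R        # value accrued across all teams
is_platform = cumulative_value > C  # the bar set by the paraphrased definition
```

<p>With these numbers the value ($500,000) clears the cost, so the definition is satisfied; as the next paragraph argues, that bar is so low that almost anything clears it.</p>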
<p>This definition is almost useless. Unlike the original Gates’ quote, it does not help us identify platforms from non-platforms. Everything within a company, from a set of APIs to a team of accountants in the back-office <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>, could pass off as a platform under this definition. Let us refine this further using what we know about internal platforms, and their distinct attributes.</p>
<h3 id="cost">Cost</h3>
<p>Before we turn to the attributes of an internal platform, a brief note on cost. As we have already seen in the above attempted definitions, the terms ‘cost’ and ‘value’ are central to the formulations. I would like to expand a bit on the term ‘cost’ in the context of internal platforms.</p>
<p>The term ‘cost’ includes the dollar cost of operating the platform as well as a measure of the effort users must expend to use it. How do we measure that effort? I have previously talked about how <a href="/not-aws#1-the-middle-ground">abstractions provided by internal platforms have to cater to an entire spectrum of users within the company, and not just the median users of the system</a>. A measurement of user effort is essentially a weighted average of the platform’s usability index across every user persona it caters to.</p>
<h3 id="attributes-of-an-internal-platform">Attributes of an Internal Platform</h3>
<p>So what desirable properties do internal platforms have?</p>
<h4 id="scalable">Scalable</h4>
<p>Platforms inherently have to be scalable.</p>
<p>A typical engineering definition of scalability would be along the dimensions of reliability and fault-tolerance; the system should be reliable and fault-tolerant as usage of the system increases. But for a platform, we need to consider cost scalability as well: the marginal cost of the platform should diminish as usage increases.</p>
<p>An illustration of the property of cost-scalability is as follows:
Consider a platform with \(n\) users and an operating cost of \(C\). Assume that when the user-count increases to \(n+1\), the cost increases to \(C + c_{1}\), and when the user-count increases to \(n+2\), the cost increases to \(C + c_{1} + c_{2}\). For a cost-scalable platform, \(c_{2} \leq c_{1}\). <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
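<p>The condition generalises to: marginal costs are non-increasing as usage grows. A minimal sketch, assuming we can observe total operating cost at successive user-counts (the cost series below are invented for illustration):</p>

```python
def is_cost_scalable(total_costs):
    """total_costs[i] is the total operating cost at user-count n + i.
    Cost-scalable means each marginal cost c_i is no larger than the previous one."""
    marginals = [b - a for a, b in zip(total_costs, total_costs[1:])]
    return all(c2 <= c1 for c1, c2 in zip(marginals, marginals[1:]))

# Diminishing marginal costs (+10, +8, +5): scalable.
scalable = is_cost_scalable([100, 110, 118, 123])
# Growing marginal costs (+10, +15): not scalable.
not_scalable = is_cost_scalable([100, 110, 125])
```

<p>Footnote 3's caveat applies here too: user-count is only a proxy for usage, and the same check could run over API-call counts instead.</p>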
<h4 id="lasting">Lasting</h4>
<p>A ‘lasting’ platform ensures that the incremental cost to the customer grows more slowly than the incremental value to the customer.</p>
<p>In the case of public platforms, not every platform has to be lasting. <a href="https://diff.substack.com/p/who-amazon-grows-with/">Byrne Hobart</a> calls public platforms that follow this incremental-value dictum second-derivative platforms (a first-derivative platform being one that follows the Bill Gates definition above).</p>
<p>Internal platforms always have to be lasting (a.k.a. second-derivative).</p>
<h4 id="serendipitous">Serendipitous</h4>
<p>Every platform has users who use it for use-cases it was not designed for. <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> This also implies that a platform caters to multiple sets of users (target audiences and otherwise). I call this being serendipitous.</p>
<p>Contrary to the other attributes, being serendipitous is an attribute of internal platforms which can be leveraged to predict where to build an internal platform. I have previously talked about how <a href="https://sujithjay.com/not-aws#2-overloaded-use-cases">overloaded use-cases within a platform are a good guide to learn about the unmet needs of the users</a>. This is true even in cases where a platform has not yet been built. APIs with overloaded use-cases are excellent indicators that a platform with more general abstractions should probably take its place.</p>
<p>There is another explanation of why no true platform <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup> caters to only a single user persona. Internal platforms are essentially monopolies within a company for a certain value-producing activity, and <a href="https://www.gwern.net/Complement">monopolies tend to commodify the adjacent layers in their value chain</a>. If you consider the product teams and other internal users of the internal platform as the layer adjacent to it in the company’s value chain, it follows naturally that an internal platform should cater to a wide spectrum of users (or use-cases), which it commodifies.</p>
<h3 id="final-take">Final Take?</h3>
<p>A definition for internal platforms in the light of these attributes could be stated as:</p>
<blockquote>
<p>An internal platform is when a scalable, commodifying, coherent set of APIs ensures that the incremental cost to the customer grows more slowly than the incremental value to the customer.</p>
</blockquote>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>The best-case scenario would be if we are able to leverage the definition to identify opportunities to build internal platforms (<em>“Does Y require a platform?”</em> ). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Although I mention the example of a back-office of accountants here (for effect), eliminating it from any definition of a platform is easy. We can differentiate it the same way we differentiate any automation from a manual performance of the same task: by introducing the constraint of consistent repeatability. Humans are error-prone in performing repetitive tasks; machines are less so. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>I use number of users here as a proxy for usage. Depending on the exact service provided by the platform, usage might not be a function of number of users alone (for example, number of API calls). <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>AWS S3 was famously never designed for Data Lakes, but that is one of the major use-cases for S3 nowadays. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p><a href="https://en.wikipedia.org/wiki/No_true_Scotsman">No true Scotsman</a>. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h2>Ariel</h2>
<p><em>2020-10-26 · <a href="https://sujithjay.com/Ariel">sujithjay.com/Ariel</a></em></p>
<p><a href="https://www.goodreads.com/book/show/11625.Ariel" style="float: left; padding-right: 20px"><img border="0" alt="Ariel: The Restored Edition" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1410138956l/11625._SX98_.jpg" /></a><a href="https://www.goodreads.com/book/show/11625.Ariel">Ariel: The Restored Edition</a> by <a href="https://www.goodreads.com/author/show/4379.Sylvia_Plath">Sylvia Plath</a><br />
My rating: <a href="https://www.goodreads.com/review/show/3582705616">3 of 5 stars</a><br /><br />
Reading Sylvia Plath is an experience; ‘turbulent’ is an understatement. Her poems show her wild, repressed thrashing against her circumstances. On one occasion I was moved to the extent of having a nightmare - I cannot remember the last time I had a nightmare - and on many other occasions, to pen my own (bad) poetry.<br /><br /> I am terrified by this dark thing <br />That sleeps in me;<br />All day I feel its soft, feathery turnings, its malignity.<br /><br />I knew nothing of her - her life, her work, her suicide at an age I find myself at now - and I took a disproportionate interest in learning more about her. Each subsequent poem was another step downwards, at night, into a deep step-well; a well with some promise of water at its furthest. There is water at the bottom, yes; murky and unstill and with unrepentant poison.<br /><br />The poison first came into view with ‘The Jailer’:<br /><br /> He has been burning me with cigarettes,<br />Pretending I am a negress with pink paws. <br /><br />Revulsed as I was by the word, I was more disgusted by the dehumanising sense in which it is used - almost to the extent of denying that very excruciating physical pain she felt to another human, a person of colour.<br /><br />This was matched by many other lines, some as hard to digest as the one I quote above. Here is what I ask myself: how do I feel for a person who, I know, would not do the same for me if given the chance? In this case, because of a certain property of my skin.<br /><br />There came a point where my mind switched from flailing with her - a fellow human - in her pains to being an apathetic, almost sadistic, observer. I suppose she is as much a ‘product of her times’ - the defense raised by her ardent admirers - as I am of mine.
<br /><br />
<a href="https://www.goodreads.com/review/list/12260093-sujith">View all my reviews</a></p>
<h2>AWS Is NOT Your Ideal</h2>
<p><em>2020-10-01 · <a href="https://sujithjay.com/Platform-Product-Management">sujithjay.com/Platform-Product-Management</a></em></p>
<p>Let me start with an assertion: every platform engineering team <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> in every organisation aspires to be like AWS <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
<p>Every platform team wants to be like AWS, because like AWS, they provide infrastructure abstractions to users. AWS provides infrastructure via the abstractions of VMs and disks and write-capacity-units, while platform teams provide infrastructure using higher abstractions which solve service definitions, database or message queue provisioning, and service right-sizing <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>
<p>This similarity prompts leaders of platform engineering teams to model their teams as agnostic providers of universal, non-leaky (within SLO bounds), self-served abstractions for their engineering organisation. Platform teams structured as such detached units struggle to define cohesive roadmaps that provide increasing value to the business. So how does your platform differ from AWS?</p>
<h2 id="your-platform-vs-the-platform">Your Platform vs. The Platform</h2>
<h3 id="1-the-middle-ground">1. The Middle Ground</h3>
<p>As an agnostic service provider, AWS can afford to cater to median use-cases. Platform engineering teams exist to bridge the gap between PaaS abstractions, which work for the median use-case, and your business’ specific use-cases. AWS can afford to target the median (economies of scale, etc.); you cannot.</p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/notaws/Median.jpeg" alt="AWS can afford to stay within a single σ. You cannot." /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> AWS can afford to stay within a single σ. You cannot.</span>
</div>
<p>Agnostic platform engineering teams that emulate AWS try to shirk this responsibility by proposing abstractions that target the median use-case. A tell-tale sign of this is when the lack of wide usability of internal abstractions is compensated for by extensive onboarding &amp; repeated training. This is also a side-effect of the relative valuation of engineering time vs. the time of another function <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</p>
<h3 id="2-follow-the-money">2. Follow the Money</h3>
<p>The dictum ‘follow the money’ works beautifully for customer-front products. When faced with a choice between two competing features to prioritise, a common tactical play is to make something which leads to more (immediate & long-term) revenue. The proxy for increased revenue could be increased acquisition conversion, better retention or improved user experience – metrics which ensure increased revenue for the company over time. In short, revenue growth is the north star <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>.</p>
<p>Not so much in platform engineering. There is no revenue, since your customers are internal, captive ones. Captive audiences are forced to use a solution by force of dictum and lack of choice. The metrics used for platform products are proxies for usability and user satisfaction – but there are no foolproof ways to measure these for captive audiences. For captive audiences, solutions cannot compete and better solutions cannot win. Like a command economy, platform products are designed rather than evolved; design takes priority over market forces. So why is design bad?</p>
<h2 id="bad-design">Bad Design</h2>
<p>For design to work, there has to be an objective function to design against. A specification is an objective function against which engineering teams design a solution. Since we do not have reliable metrics <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup> to rely on in platform engineering, how do we come up with specifications? Without rigorous specifications, new features built by the platform run a high risk of not solving worthwhile problems for users. The currently accepted methodology among platform engineering leaders for addressing this paucity of specifications is user-interviews. This is, as mentioned before, an unreliable source, since captive users do not have the best view of the ideal state of tooling and abstractions that could be available to them.</p>
<p>The only way to flip this situation is to let go of command-economy-style designed abstractions, and to let your platform self-organise along the principle of markets. How does that look in practice?</p>
<h3 id="1-market-ftw">1. Market, FTW</h3>
<p>Camille Fournier mentions in <a href="https://medium.com/@skamille/product-for-internal-platforms-9205c3a08142">Product for Internal Platforms</a> how her team partners with customer teams to develop prototypes for specific problems. These specific solutions are later honed and iterated on to become general solutions provided by your team. I would go a step further on this route, where possible. Partner to prototype with multiple teams facing related problems to develop multiple specific solutions. These specific solutions can be seen as competing candidates to solve a general problem. Bring in user-interviews at this point to gauge pain-points, and iterate individually on these specific solutions. This switches the economy of your team to a self-organised market. Once considerable thought and iteration has gone into each solution, it is time to assimilate. Assimilate the best solution(s) while migrating the rest to the chosen solution. As emphasised in <a href="https://medium.com/@skamille/product-for-internal-platforms-9205c3a08142">Product for Internal Platforms</a>, an early investment of time into migration strategies is essential for such a scheme to sustain.</p>
<p>In platforms designed with experimentation, you will find that innovation continues to thrive at the edges of the platform’s domain while the stable core of the platform is subject to periodic rework or maintenance. The set of use-cases a platform supports grows in a controlled manner to address an ever-growing percentage of consumers, and does not stagnate after addressing just the median users.</p>
<h3 id="2-overloaded-use-cases">2. Overloaded Use-cases</h3>
<p>Although agnostic platform engineering teams might cater only to very specific median use-cases, customer teams with specific needs cannot afford to be blocked and cannot stop shipping their deliverables. These teams sometimes create their own solutions, and in such cases the above strategy of assimilation works wonders: you get a prototype for free, which the team can then iterate on. This scenario is rarer, however, where building such solutions requires specific skills, as in data platforms. One common pattern in such knowledge-constrained situations is that users find ways to overload the existing solutions with minor tweaks to fit their use-case. Look out for such overloaded use-cases within your platform, for they are excellent guides to the unmet needs of your users. You can leverage them to advocate for new features that explicitly support those use-cases.</p>
<h3 id="3-listen-to-them-only-at-the-start">3. Listen To Them (Only At The Start!)</h3>
<p>As a parting note, I will take a jab at user-interviews again. The above tactics work when you are trying to scale your platform from 1 to N. When taking a platform from 0 to 1, the only solution to creating specifications is to listen to the users. Give them exactly what they want. Listen to their exact demands. A propensity of platform product managers is to rely on this excessively at a much later stage in the product’s lifecycle. User-interviews have their place in evolving products, but the over-reliance on the methodology is a bane to platform product management.</p>
<p><strong>P.S.</strong> As I read back the above essay, the heavy influence of <a href="https://medium.com/@skamille/product-for-internal-platforms-9205c3a08142">Product for Internal Platforms</a> is clear. I would like to say that was the intention: to reassert the ideas in it which resounded with me, while stating a few of my own.</p>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>To define a platform for the purposes of this post, I use Camille Fournier’s words: A platform is the software side of infrastructure. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>I use AWS throughout this article as a stand-in for a generic cloud provider. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Although I do not mention data abstractions provided by data platform teams specifically here, the arguments in this article hold just as true. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>The other function, in many cases, turns out to be another engineering team. eg. a team of backend engineers reliant on the tooling provided by the infrastructure team for the provisioning of servers. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>As anyone aware of product management for consumer products knows, this is gross reductionism; but let us take it at face value for the sake of the narrative I want to focus on. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Metrics have an overloaded meaning here. To clarify, your team might have good observability metrics to understand the need to design better solutions to scale existing features, but no metrics to make the case for new features. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h2>Mute Buttons Are The Latest Discourse Markers</h2>
<p><em>2020-06-08 · <a href="https://sujithjay.com/Mute-Buttons-Are-The-Latest-Discourse-Markers">sujithjay.com/Mute-Buttons-Are-The-Latest-Discourse-Markers</a></em></p>
<p>Everyone, and their mom, is on a video call at least once a day now. There is a tiny second-order effect brewing in these video calls. It’s to do with the mute buttons.</p>
<p>Let me describe a common-place scenario in conference calls. In a call with a fair number<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> of participants, we tend to keep our microphones muted. It’s common etiquette, and the reason to do so is pretty plain. You do not want to burden the speaker & other participants with your background noise; and this helps to keep the conversations as distraction-free as possible. So, what happens when you want to speak? You unmute the microphone. Simple! And once you have spoken, you “concede the conversation” by muting yourself back.</p>
<p>In short, mute buttons are functioning as <a href="https://en.wikipedia.org/wiki/Discourse_marker">discourse markers</a>, and are our latest language innovation. This also reminds me a bit of <a href="https://www.hyrumslaw.com/">Hyrum’s Law</a>:</p>
<blockquote>
<p>… all observable behaviors of your system will be depended on by somebody.</p>
</blockquote>
<!--break-->
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>IMO, more than 3 people in a call necessitates the use of the mute button. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h2>The Blue Flower : A Review</h2>
<p><em>2020-05-04 · <a href="https://sujithjay.com/The-Blue-Flower">sujithjay.com/The-Blue-Flower</a></em></p>
<p><a href="https://www.goodreads.com/book/show/24356917-the-blue-flower" style="float: left; padding-right: 20px"><img border="0" alt="The Blue Flower" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1420661808l/24356917._SX98_.jpg" /></a><a href="https://www.goodreads.com/book/show/24356917-the-blue-flower">The Blue Flower</a> by <a href="https://www.goodreads.com/author/show/3222.Penelope_Fitzgerald">Penelope Fitzgerald</a><br />
My rating: <a href="https://www.goodreads.com/review/show/3238797380">5 of 5 stars</a><br /><br />
Historical fiction can work at such disparate levels: an era as a backdrop for the narrative, familiar textbook history unravelling as background score to the symphony of the lead characters’ lives, the idiosyncrasies of a bygone era pictured in contrast to the era of the writer. The Blue Flower uses every one of these devices to perfection, but it is so much more.<br /><br />It is a purported biography of the early life of Novalis, a Romantic poet &amp; philosopher from 18th/19th-century Saxony. It is an unusual love story. I do not use ‘unusual’ as moral judgement of love across an uncomfortable age-divide, but to mean the stark contrast between the lovers in (for want of better words) their levels of intellect &amp; emotional range. To highlight my point, let me present my favorite exchange between Novalis and his ladylove Sophie:<br /><br /><i>‘Should you like to be born again?’, asks Novalis, expecting a conversation on the philosophy of transmigration.<br />Sophie considered a little. ‘Yes, if I could have fair hair.’ </i><br /><br />Such an unbridgeable divide, but Fitzgerald convinces us of the irrational sway of love (love of the truly, madly, deeply variety).<br /><br />In addition, the book is an account of the lives of Lower German nobility: a comical sketch of the reaction of this landed gentry to the contemporaneous French Revolution, the epochal ideas of liberty &amp; egalitarianism it espoused, and the subsequent march of Napoleon.<br /><br />Lastly, but foremost for me, the book’s thin underlying veneer of (Fichtean) philosophy makes you want more &amp; know more of it.<br /><br /><i> Why should poetry, reason and religion not be higher forms of Mathematics? All that is needed is a grammar of their common language. </i>
<br /><br /></p>
<h2>S3 and HDFS</h2>
<p><em>2020-05-04 · <a href="https://sujithjay.com/S3-and-HDFS">sujithjay.com/S3-and-HDFS</a></em></p>
<p>Cluster storage systems have, over the past decade, moved their gold standard from directory-oriented file-systems such as <em>HDFS</em> to object-stores such as <em>AWS S3</em>. The two storage models have been dissected &amp; compared over &amp; over from multiple perspectives <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. And depending on your use-case, you might be more interested in a certain cross-section of the differences between <em>S3</em> &amp; <em>HDFS</em> than in others. I am not trying to repeat those analyses here.</p>
<p>I wrote this short, bullet-style compilation as a quick refresher for myself on ways <em>S3</em> differs from <em>HDFS</em>; it is focused on the APIs & interactions <em>Hadoop</em>-like data-processing systems (such as <em>Hadoop</em>, <em>Spark</em>, or <em>Flink</em> ) might have with storage systems.</p>
<!--break-->
<h4 id="consistency-model">Consistency Model</h4>
<p>The <em>S3</em> consistency model promises read-after-write consistency <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. The relaxed constraints of this model include:</p>
<ul>
<li>File delete and update operations may not immediately propagate. Old copies of the file may exist for an indeterminate time period.</li>
<li>Directory operations: <em>delete()</em> and <em>rename()</em> are implemented by recursive file-by-file operations. They take time at least proportional to the number of files, during which time partial updates may be visible. If the operations are interrupted, the filesystem is left in an intermediate state.</li>
</ul>
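<p>The first of these constraints can be made concrete with a toy model. The sketch below is purely illustrative (the class and its delay mechanics are made up for this post; real S3 is implemented nothing like this):</p>

```python
# Toy model of delayed delete propagation: a delete is acknowledged
# immediately but applied only after a delay, so readers may observe
# the old copy in the meantime. Illustrative only; not S3's design.

class EventuallyConsistentStore:
    def __init__(self):
        self.objects = {}
        self.pending_deletes = []  # (tick at which delete becomes visible, key)
        self.tick = 0

    def delete(self, key, propagation_delay=2):
        # Deletion is acknowledged now, applied later.
        self.pending_deletes.append((self.tick + propagation_delay, key))

    def advance_time(self):
        self.tick += 1
        still_pending = []
        for visible_at, key in self.pending_deletes:
            if visible_at <= self.tick:
                self.objects.pop(key, None)
            else:
                still_pending.append((visible_at, key))
        self.pending_deletes = still_pending

    def read(self, key):
        return self.objects.get(key)

store = EventuallyConsistentStore()
store.objects["report.csv"] = b"v1"
store.delete("report.csv")
print(store.read("report.csv"))  # → b'v1': the stale copy is still visible
store.advance_time(); store.advance_time()
print(store.read("report.csv"))  # → None
```

<p>The indeterminate window between acknowledgement and propagation is exactly why "old copies of the file may exist for an indeterminate time period".</p>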
<h4 id="directory-structure">Directory Structure</h4>
<p>As an object store, <em>S3</em> has no directory structure. <em>Hadoop-S3</em> clients mimic one by:</p>
<ul>
<li>Creating a stub entry after a <em>mkdirs</em> call, deleting it when a file is added anywhere underneath</li>
<li>When listing a directory, searching for all objects whose path starts with the directory path, and returning them as the listing.</li>
<li>When renaming a directory, taking such a listing and asking <em>S3</em> to copy the individual objects to new objects with the destination filenames, then deleting the originals.</li>
<li>When deleting a directory, taking such a listing and deleting the entries in batches.</li>
</ul>
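<p>A rough sketch of this mimicry, over a plain in-memory map standing in for the flat object-key space (the class and method names are illustrative, not the actual Hadoop-S3 client code):</p>

```python
# Minimal sketch of directory mimicry over a flat key space. The store
# is just a map from full key to bytes; "directories" exist only as
# shared key prefixes. Illustrative, not the real Hadoop-S3 client.

class FlatObjectStore:
    def __init__(self):
        self.objects = {}  # key -> bytes, e.g. "logs/2020/a.txt"

    def list_dir(self, path):
        # "Listing a directory" = scanning for every key with the prefix.
        prefix = path.rstrip("/") + "/"
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_dir(self, src, dst):
        # Rename = copy every object to its new key, then delete the
        # originals. O(number of files), and not atomic: a crash between
        # the two loops leaves the "directory" half-renamed.
        keys = self.list_dir(src)
        src_prefix_len = len(src.rstrip("/")) + 1
        for key in keys:
            new_key = dst.rstrip("/") + "/" + key[src_prefix_len:]
            self.objects[new_key] = self.objects[key]
        for key in keys:
            del self.objects[key]

store = FlatObjectStore()
store.objects = {"logs/a.txt": b"1", "logs/b.txt": b"2", "data/x.txt": b"3"}
store.rename_dir("logs", "archive")
print(sorted(store.objects))  # → ['archive/a.txt', 'archive/b.txt', 'data/x.txt']
```

<p>The per-object copy-then-delete loop is also why the cumulative effects listed below (slow listings, non-atomic renames) fall out directly from the storage model.</p>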
<h4 id="posix-compliance">POSIX-Compliance</h4>
<p>The above discussion might give the impression that <em>HDFS</em> is POSIX-compliant. In fact, neither <em>HDFS</em> nor <em>S3</em> is POSIX-compliant.</p>
<p>For <em>HDFS</em>, append-only semantics are the best known exception, but there are many others. For example, it also seems to lack support for extended attributes, and does not honour POSIX durability semantics (for instance, it buffers writes at the client when it should not).</p>
<h4 id="cumulative-effects">Cumulative Effects</h4>
<p>The consequences of the above-listed differences include:</p>
<ul>
<li>Directory listing can be slow. <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></li>
<li>The time to rename a directory is proportional to the number of files underneath it (directly or indirectly) and the size of the files. <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></li>
<li>Directory renames are not atomic: they can fail partway through, and callers cannot safely rely on atomic renames as part of a commit algorithm.</li>
<li>Directory deletion is not atomic and can fail partway through.</li>
</ul>
<h4 id="footnotes">Footnotes</h4>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p><a href="https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips">HDFS vs. Cloud Storage: Pros, Cons and Migration Tips</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p><a href="https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html">Top 5 Reasons for Choosing S3 over HDFS</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p><a href="https://shlomoswidler.com/2009/12/read-after-write-consistency-in-amazon.html">Read-After-Write Consistency in Amazon S3</a>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Use <em>listFiles(path, recursive)</em> for high performance recursive listings, whenever possible. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>The copy is executed inside the S3 storage, so the time is independent of the bandwidth from client to S3. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Converse Conway's Law2020-05-02T11:11:00+00:00https://sujithjay.com/Does-Conways-Law-Work-Backwards<p>Melvin Conway in his 1968 paper <a href="http://www.melconway.com/Home/Committees_Paper.html">How Do Committees Invent?</a> postulated the now-famous Conway’s Law.</p>
<blockquote>
<blockquoted> Organisations which design systems are constrained to produce designs which are copies of the communication structures of these organisations.
</blockquoted>
</blockquote>
<p>This homomorphism between organisational communication structures and the systems designed by them has become an adage in software management. It implies a one-way effect, though. But does it work in the other direction?</p>
<p>Given a mature (say, software) system, can we infer organisational communication structures? Particularly, informal communication structures? <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Do informal communication structures affect system design in the first place?</p>
<h5 id="footnotes">Footnotes</h5>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Formal communication structures are defined by organisational reporting structures. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Providing Streaming Joins as a Service at Facebook2020-05-01T11:11:00+00:00https://sujithjay.com/Providing-Streaming-Joins-as-a-Service-at-Facebook<p><a href="http://www.vldb.org/pvldb/vol11/p1809-jacques-silva.pdf">Providing Streaming Joins as a Service at Facebook</a>.
<em>Jacques-Silva, G., Lei, R., Cheng, L., et al. (2018). Proceedings of the VLDB Endowment, 11(12), 1809-1821.</em></p>
<p>Stream-stream joins are a hard problem to solve at scale. <em>“Providing Streaming Joins as a Service at Facebook”</em> provides an overview of the systems within Facebook that support stream-stream joins.</p>
<p>The key contributions of the paper are:</p>
<ol>
<li>
<p>a stream synchronization scheme based on event-time to pace the parsing of new data and reduce memory consumption,</p>
</li>
<li>
<p>a query planner which produces streaming join plans that support application updates, and</p>
</li>
<li>
<p>a stream time estimation scheme that handles the variations on the distribution of event-times observed in real-world streams and achieves high join accuracy.</p>
</li>
</ol>
<h3 id="trade-offs-in-stream-stream-joins">Trade-offs in Stream-Stream Joins</h3>
<p>Stream-stream joins involve a three-way trade-off between output latency, join accuracy, and memory footprint. One extreme of this trade-off is to provide best-effort (in terms of join accuracy) processing-time joins. The other extreme is to persist metadata associated with every joinable event on a replicated distributed store to ensure that all joinable events get matched. This approach provides excellent guarantees on output latency & join accuracy, but its memory footprint sky-rockets for large, time-skewed streams.</p>
<p>The approach of the paper is in the middle: it is best-effort with a facility to improve join accuracy by pacing the consumption of the input streams based on dynamically estimated watermarks on event-time.</p>
<!--break-->
<h3 id="systems-overview">Systems Overview</h3>
<p>The streaming-join service is built on top of three in-house systems within Facebook: Scribe, Puma, & Stylus. A larger overview of these systems, along with other streaming systems in use within Facebook, is provided in <a href="https://research.facebook.com/publications/realtime-data-processing-at-facebook/">Realtime Data Processing at Facebook</a>.</p>
<ul>
<li>
<p><em><strong>Scribe</strong></em> is a persistent distributed messaging system that organises data in categories (like Kafka topics). Categories can be partitioned into multiple buckets, and a bucket is the unit of workload assignment.</p>
</li>
<li>
<p><em><strong>Puma</strong></em> allows developers to write analytic jobs in a SQL-like DSL with Java UDFs called <em>Puma Query Language</em> (PQL).</p>
</li>
<li>
<p><em><strong>Stylus</strong></em> is a C++ framework for building stateless, stateful and monoid stream processing operators. Stylus also provides operators the ability to replay a Scribe stream for an earlier point in time and persist in-memory state to local or remote storage.</p>
</li>
</ul>
<p>The use of Scribe as the data transfer mechanism between operators in Stylus means that these operators can be easily plugged into a Puma query-execution DAG (in which operators are linked via Scribe as well).</p>
<h3 id="join-semantics">Join Semantics</h3>
<p>The system supports only inner-join and left-outer join. The join output can be all matching events within a window (1-to-n) or a single event within a window (1-to-1).</p>
<p>To maintain backward compatibility, the system limits the changes a user can make to existing streaming joins. If the updated streaming join is significantly different, users have the option of creating a view with a new name and deleting the old one. Two examples of rules an update must follow are:</p>
<ol>
<li>
<p>preservation of the join equality expression, as its modification can cause resharding of the Scribe categories</p>
</li>
<li>
<p>projection of new attributes must be specified at the end of the select list, as adding an attribute in the middle of the select list would cause the join operator to consume old attribute values as the value of a different attribute.</p>
</li>
</ol>
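<p>The second rule is easy to picture with a small illustration of positional decoding (the names here are hypothetical; the real operators do not use Python dictionaries):</p>

```python
# Why new attributes must go at the end of the select list: intermediate
# tuples are consumed positionally, so inserting a column mid-list
# silently shifts every later value. Hypothetical names; illustrative only.

old_schema = ["eventtime", "key", "metric"]

# A producer updated to a NEW schema that inserts "dim_new" mid-list:
row = ["12:00:00", "k1", "oops", 42]

# A join operator still running the OLD schema decodes by position:
decoded = dict(zip(old_schema, row))
print(decoded["metric"])  # → 'oops': a dimension consumed as the metric
```

<p>Appending the new attribute at the end instead leaves every old position, and every old consumer, intact.</p>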
<h3 id="query-language">Query Language</h3>
<p>For streaming joins, users express their joins via <em>PQL</em> in Puma, while the join is implemented under the hood in Stylus. A user-defined join is defined as an application:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00: CREATE APPLICATION sample_app;
01:
02: CREATE INPUT TABLE left (
03: eventtime, key, dim_one, metric
04: ) FROM SCRIBE("left");
05:
06: CREATE INPUT TABLE right (
07: eventtime, key, dim_two, dim_three
08: ) FROM SCRIBE("right");
09:
10: CREATE VIEW joined_streams AS
11: SELECT
12: l.eventtime AS eventtime, l.key AS key,
13: l.dim_one AS dim_one, r.dim_two AS dim_two,
14: COALESCE(r.dim_three, "example") AS dim_three,
15: ABS(l.metric) AS metric
16: FROM left AS l
17: LEFT OUTER JOIN right AS r
18: ON (l.key = r.key) AND
19: (r.eventtime BETWEEN
20: l.eventtime - INTERVAL '3 minutes' AND
21: l.eventtime + INTERVAL '3 minutes');
22:
23: CREATE TABLE result AS
24: SELECT
25: eventtime, key, dim_one, dim_two,
26: dim_three, metric
27: FROM joined_streams
28: STORAGE SCRIBE (category = "result");
</code></pre></div></div>
<p>The join view specification above has an equality expression (line 18), and a window expressed with the BETWEEN function using intervals on the timestamp attributes (lines 19-21).</p>
<p>Given a <em>PQL</em> query as above, it is compiled into an execution plan composed of operators. For joins, the operators involved are:</p>
<ol>
<li>
<p><em>Slicer</em> : a Puma operator, similar to a mapper in MapReduce, which can ingest data from Scribe, evaluate expressions, do tuple-filtering, project columns, shard streams, and write data to Scribe, Hive, or other storage sinks.</p>
</li>
<li>
<p><em>Join</em> : a Stylus operator which can ingest data from two Scribe streams, maintain the join windows, execute the join logic, and output the result into another Scribe stream.</p>
</li>
</ol>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/streamingjoins/plan.jpg" alt="Figure 1. Logical Plan for Join query" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Figure 1. Logical Plan for Join query</span>
</div>
<p>As part of the pre-join transformations on both the left (probe-side) & right side (build-side), projections are resolved, join equality & timestamp expressions are computed, and the streams are sharded as per the join-equality attribute & written into intermediary Scribe categories.</p>
<p>Since only inner and left-outer joins are supported by the system, the expressions on the left side are evaluated before the join, while those on the right side are evaluated after it.</p>
<h3 id="the-join-operator">The Join Operator</h3>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/streamingjoins/join.png" alt="Figure 2. The Join Operator" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Figure 2. The Join Operator</span>
</div>
<p>As shown in Figure 2, the join operator has 3 components: (i) a stateful engine, used to process the left stream, (ii) a stateless engine, processing the right stream, and (iii) a coordinator, to bridge the two engines together.</p>
<p><strong><em>Left stateful engine</em></strong> : As events in the left stream are processed, the stateful engine looks up matching events in the right join window. In case of a successful lookup, join results are generated. If there are no matches, the event is retained in the buffer (till the window closes) to be retried later.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<p><strong><em>Right stateless engine</em></strong> : The stateless engine ingests the right stream and maintains a window of events that matches the specified join window for the incoming left stream events. The window is trimmed at regular intervals to expel events outside the join window. This happens when the <em>dynamically estimated processing time</em> for the stream moves forward.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<p><strong><em>Coordinator</em></strong> : The coordinator brings both engines together by providing APIs for the left engine to look up matching events in the right engine, and for both engines to query the other stream’s <em>dynamically estimated processing time</em>.</p>
<h3 id="dynamically-estimated-processing-time">Dynamically Estimated Processing Time</h3>
<p>A stream property called <em>processing time</em> is used in the synchronisation of the two streams. Stream synchronisation is essential to limit the memory required to buffer events needed for matching. This section is a brief overview of <em>processing time</em>; the next section describes its use in stream synchronisation.</p>
<p><em>Processing Time</em> (PT) indicates the estimated time of a stream: a time for which we estimate that there will be no new events whose event-time is smaller than it.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> To compute a stream’s <em>PT</em>, the stream is divided into micro-batches of configurable size. The <em>PT</em> for each micro-batch is then computed. Events in future micro-batches are expected to have <em>event-time > PT</em>.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>
<p>The sizing of micro-batches is crucial. Larger micro-batches provide better estimates of <em>PT</em> <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>, but increase latency in the system.</p>
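<p>As a sketch of the idea (the exact statistic and batching are Stylus internals; the percentile approach is from the paper's MillWheel reference, and the numbers below are made up):</p>

```python
# Sketch: estimate a micro-batch's processing time (PT) as a low
# percentile of its event-times, so that (almost) all future events
# are expected to arrive with event-time > PT. Statistic is illustrative.

def estimate_pt(event_times, percentile=1.0):
    """Return the event-time below which ~`percentile`% of events fall."""
    ordered = sorted(event_times)
    idx = int(len(ordered) * percentile / 100.0)
    return ordered[min(idx, len(ordered) - 1)]

micro_batch = [100, 103, 99, 101, 98, 104, 100, 97]  # event-times (seconds)
print(estimate_pt(micro_batch))  # → 97
```

<p>A larger micro-batch gives the percentile estimate more samples to work with, which is exactly the latency-vs-accuracy tension described above.</p>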
<h3 id="synchronising-streams">Synchronising Streams</h3>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/streamingjoins/synchronisation.png" alt="Figure 3. The Synchronisation Algorithm" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Figure 3. The Synchronisation Algorithm</span>
</div>
<p>Each stream computes its <em>PT</em> independently. Stream synchronisation is performed by pausing the stream whose <em>PT</em> is too far ahead of the other. Synchronisation uses the following formula:</p>
\[\mathsf{ PT_{left} + Window_{upper} = PT_{right}}\]
<p>where \(\mathsf{ PT_{left}}\) represents the processing time estimated for the left stream, \(\mathsf{ PT_{right}}\) is the processing time for the right stream, and \(\mathsf{Window_{upper}}\) is the upper boundary of the window.</p>
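<p>The relation can be read as a pacing check: the left stream pauses while its <em>PT</em> has run more than the window's upper bound ahead of the right stream's <em>PT</em>. A sketch, not the Stylus implementation:</p>

```python
# Pacing sketch for PT_left + Window_upper = PT_right: consuming the
# left stream further would need right-side events that may not have
# arrived yet, so the left stream is paused. Illustrative only.

def should_pause_left(pt_left, pt_right, window_upper):
    return pt_left + window_upper > pt_right

WINDOW_UPPER = 180  # the "+3 minutes" upper bound from the PQL example

print(should_pause_left(1000, 1100, WINDOW_UPPER))  # → True: right stream lags
print(should_pause_left(1000, 1200, WINDOW_UPPER))  # → False: safe to proceed
```

<p>Pausing the faster stream this way bounds how much of the right window must be buffered, which is the memory-limiting purpose of synchronisation described above.</p>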
<h3 id="performance-evaluation">Performance Evaluation</h3>
<h4 id="join-accuracy">Join Accuracy</h4>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/streamingjoins/accuracy.png" alt="Figure 4. Join accuracy is close to accuracy observed in Batch joins with similar join windows." /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Figure 4. Join accuracy is close to
accuracy observed in Batch joins with similar join windows.</span>
</div>
<h4 id="join-window-size-vs-memory-consumption">Join Window Size vs. Memory Consumption</h4>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/streamingjoins/memory.png" alt="Figure 5. Memory consumption is proportional to the window size." /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Figure 5. Memory consumption is proportional to the window size.</span>
</div>
<h4 id="join-window-size-vs-join-success-rate">Join Window Size vs. Join Success Rate</h4>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/streamingjoins/success.png" alt="Figure 6. Improvement in success rate is not large when comparing a 1-hour window to a 6-hours window." /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Figure 6. Improvement in success rate is not large when comparing a 1-hour window to a 6-hours window.</span>
</div>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>The stateful engine persists state into a local RocksDB instance and replicates it asynchronously to remote HDFS clusters, and hence, called stateful. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Although it maintains an in-memory state, the engine is stateless from a system perspective. It does not checkpoint to local or remote storage. It relies on replaying data from Scribe categories for fault-tolerance. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>In Stylus, the processing time is implemented as a percentile of the processed event times, similar to Millwheel. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>If the statistic for <em>PT</em> is an <em>x percentile</em> statistic, the assumption is that any future micro-batch will have at most <em>x%</em> of events with an <em>event-time < PT</em>. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>A better estimate of <em>PT</em> is that which fulfills the low watermark assumption that at most <em>x%</em> of events processed after a given <em>PT</em> will have an <em>event-time</em> smaller than it. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Natural Languages are Interfaceless2020-04-20T11:11:00+00:00https://sujithjay.com/Natural-Languages-are-Interfaceless<p>In <a href="https://www.goodreads.com/book/show/840.The_Design_of_Everyday_Things">The Design of Everyday Things</a>, Donald Norman talks about the temperature knobs on his refrigerator:</p>
<blockquote>
<blockquoted> I used to own an ordinary, two-compartment refrigerator - nothing very fancy about it. The problem was that I couldn’t set the temperature properly. There were only two things to do: adjust the temperature of the freezer compartment and adjust the temperature of the fresh food compartment. And there were two controls, one labeled “freezer”, the other “refrigerator”. What’s the problem?
Oh, perhaps I’d better warn you. The two controls are not independent. The freezer control also affects the fresh food temperature, and the fresh food control also affects the freezer.</blockquoted>
</blockquote>
<blockquote>
<blockquoted>In fact, there is only one thermostat and only one cooling mechanism. One control adjusts the thermostat setting, the other the relative proportion of cold air sent to each of the two compartments of the refrigerator.
It’s not hard to imagine why this would be a good design for a cheap fridge: it requires only one cooling mechanism and only one thermostat. Resources are saved by not duplicating components - at the cost of confused customers.</blockquoted>
</blockquote>
<p>Norman is talking about the lack of a (good) interface here: a layer to translate (and hide) the structure of the underlying mechanism to the users of the mechanism. <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> The need to translate for the user arises in two scenarios:</p>
<ol>
<li>There is a divide between what the user wants and how the mechanism is structured. I like to call it the what-how divide. <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></li>
<li>Although the mechanism & the user’s want are aligned, the mechanism is too convoluted for the user to use in a direct way. A facilitator is needed.</li>
</ol>
<p>In both cases, a translation is needed, and the translator is termed an <em>interface</em>.</p>
<h3 id="languages-are-interfaceless">Languages are Interfaceless</h3>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/interfaceless/no-face.gif" alt="(Inter)Faceless a.k.a No-Face" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> (Inter)Faceless a.k.a No-Face</span>
</div>
<p>(Natural) Languages are the quintessential human way of communication. Our advanced languages are arguably the lone differentiators of our species from our cousins in the primate family, and the larger animal kingdom. <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
<p>We have been inventing, honing, assimilating, and discarding languages since the start of our existence as a species. But we do not develop languages with the intent for them to be translated; languages are not meant by their inventors to be translated. Every language is developed as if it were the only language in existence, and everyone else understands it.</p>
<!--break-->
<p>This is in spite of the fact that translations of literature and texts are a crucial medium of cross-pollinating ideas, technology, values, and ethics.</p>
<p>The very leap western civilisation made into modernity via the High Middle Ages & the Renaissance is attributable (along with other major correlated factors) to the exchange of ideas between classical Greek, Latin, and Arabic. Let me state a few stellar examples. The philosophical commentaries of <a href="https://en.wikipedia.org/wiki/Averroes">Abū al-Walīd ibn Rushd (Averroes)</a> (written in Arabic) were based on the works of Aristotle (written in classical Greek), and in turn <a href="https://en.wikipedia.org/wiki/Michael_Scot">Michael Scot’s</a> Latin translation of ibn Rushd’s works reintroduced medieval Europe to Aristotle. <a href="https://en.wikipedia.org/wiki/Muhammad_ibn_Musa_al-Khwarizmi">Muḥammad ibn Mūsā al-Khwārizmī</a> codified Indian numeral systems in Arabic, and Latin translations of his textbooks introduced the decimal positional number system to Europe.</p>
<p>The number of examples, and their sheer impact, are so overwhelmingly in favour of translations (with no arguable downsides) that it makes one wonder why we do not have languages which are conducive to translation. Why do natural languages not have interfaces?</p>
<h3 id="languages-with-interfaces">Languages with Interfaces?</h3>
<p>The only reason for the non-existence of interfaces for languages is that we do not know or understand what that means. It is also not how languages evolve. The organic evolution of languages does not have a fitness function which incorporates interfaceability. Even <a href="https://en.wikipedia.org/wiki/Constructed_language">constructed</a>, <a href="https://en.wikipedia.org/wiki/International_auxiliary_language">auxiliary</a> languages do not design for it.</p>
<p>Constructed auxiliary languages, however, implicitly have a notion of interfaceability. I will use <a href="https://en.wikipedia.org/wiki/Esperanto">Esperanto</a> as an illustrative example (partly because it is the most widely spoken constructed auxiliary language)<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>. First, constructed auxiliaries are designed not to be the primary language of a person, but an auxiliary language to help communication with a speaker of another language<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>. Second, as a consequence, they provide systems for derivational word formation<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>. Taken together, these mean: a constructed auxiliary is the closest we have to a language interface. One might go a step further and say that in constructed auxiliaries, the interface is the language<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>.</p>
<p>In general, for every auxiliary language, there exists an interface within the mind of the speaker which translates between their primary tongue & the auxiliary tongue.</p>
<h3 id="interfaces-are-dynamic">Interfaces are Dynamic</h3>
<p>Almost every major language now possesses translation guides to almost every other major language. In cases where this is not true, we could use an auxiliary, third language to mediate the translation. Do such translation guides and dictionaries count as interfaces? This can be answered with a counter-question: Why do translations of literature between languages always require a medium (a human scholar or a machine-aided translation)? The shorter, crisper answer is no; translation guides are not interfaces.</p>
<p>Interfaces are dynamic. They hide the evolution of source languages as long as they conform to the interface. Users do not need to understand the evolution of the source language. On the other hand, translation guides of every form are a static snapshot of a subset of the source language captured into the target language; and this snapshot stays true for a given point in time, and may not hold true for any point in time before or after.</p>
<p>The above discussion on dynamicity in interfaces uncovers an important necessary (but not sufficient?) feature of interfaces: every interface has an associated Intermediate Representation (IR).</p>
<p>In the previous section, when I said that in constructed auxiliary languages the interface is the language, I implied that a constructed auxiliary language acts as an intermediate representation.</p>
<h3 id="ongoing-discussion">Ongoing Discussion</h3>
<p>This is an incomplete article, as it reflects my nebulous thoughts on the subject. Open questions remain:</p>
<ul>
<li>How does an intermediate representation of a natural language look? Do they need to be a first-class natural language in their own right?</li>
<li>Can we have interfaces live outside of a human mind? State-of-the-art Machine Language Translation does not, yet, fill those large shoes. How can we push it to be an interface?</li>
</ul>
<h3 id="notes">Notes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p><a href="https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/">The Law of Leaky Abstractions.</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Declarative programming languages are based on this philosophy of separation of the what & the how. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>I say “arguably”. <a href="https://www.ruf.rice.edu/~kemmer/Evol/opposablethumb.html">Opposable thumbs</a> are tight contenders as well. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>English is, by far, the most widely used auxiliary language, constructed or non-constructed. As a native of India, and a non-native speaker of Hindi, I believe Hindi is the second most-widely used auxiliary language. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p><a href="https://paw.princeton.edu/article/language-idealists">A Language for Idealists.</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Derivational word formation or <a href="https://en.wikipedia.org/wiki/Morphological_derivation">Morphological derivation</a> in languages allows speakers to derive hundreds of other words by learning one word root. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>This is not true for all constructed auxiliary languages, and particularly for Esperanto. The language evolves independently, and loses conformity to its interface. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Innovation Loops2020-04-18T11:11:00+00:00https://sujithjay.com/Innovation-Loops<p>The purpose of an engineering organization (at the risk of sounding frivolously reductionist) is to build business value. You can grow an organization’s delivered business value over time by: <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<ul>
<li>training members: investing in people,</li>
<li>improving process: investing in shaping behaviour and communication,</li>
<li>staking technical leverage: investing in technology.</li>
</ul>
<p>A cumulative side-effect of these approaches is to <strong>strengthen innovation loops</strong>.</p>
<h3 id="innovation-loops">Innovation Loops</h3>
<p>Innovation loops are informal, intrapreneurial feedback loops in engineering teams which build products & features to address user demand & pain. It is innovation which circumvents the software development cycle involving product & market research teams. In mature teams, innovation loops complement & reinforce the existing, evolutionary product-development feedback cycle. I call product development evolutionary in contrast to the more revolutionary (or reactive) trait of innovation loops.</p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/innovationloops/InnovationLoop.jpg" alt="Regular product development as green arrows; Innovation loops as red squiggles." /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Regular product development as green arrows; Innovation loops as red squiggles.</span>
</div>
<p>Innovation loops are more prevalent in infrastructure teams than in product-focused teams. This could be partly explained by the direct communication channels to users which infrastructure teams possess and product-focused teams do not.</p>
<!--break-->
<h3 id="notes">Notes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>From Will Larson’s post on <a href="http://lethain.com/building-technical-leverage/">Building Technical Leverage</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Filling Missing Data2020-03-25T11:11:00+00:00https://sujithjay.com/spark/Filling-Missing-Data-in-Spark-2-4-5<p>A recent exercise I undertook of upgrading Apache Spark for some workloads from <em>v2.4.3</em> to <em>v2.4.5</em> surfaced a number of run-time errors of the form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>org.apache.spark.sql.AnalysisException: Cannot resolve column name "name" among (id, place);
at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:223)
at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:223)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:222)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1274)
at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$toAttributes$2.apply(DataFrameNaFunctions.scala:475)
</code></pre></div></div>
<p>A little poking-around showed this error occurred for transformations with a similar general shape. The following is a <a href="https://stackoverflow.com/help/minimal-reproducible-example">minimal example</a> to recreate it:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="nv">df</span> <span class="k">=</span> <span class="nc">Seq</span><span class="o">(</span>
<span class="o">(</span><span class="s">"1"</span><span class="o">,</span> <span class="s">"Berlin"</span><span class="o">),</span>
<span class="o">(</span><span class="s">"2"</span><span class="o">,</span> <span class="s">"Bombay"</span><span class="o">)</span>
<span class="o">).</span><span class="py">toDF</span><span class="o">(</span><span class="s">"id"</span><span class="o">,</span> <span class="s">"place"</span><span class="o">)</span>
<span class="nv">df</span><span class="o">.</span><span class="py">na</span><span class="o">.</span><span class="py">fill</span><span class="o">(</span><span class="s">"empty"</span><span class="o">,</span><span class="nc">Seq</span><span class="o">(</span><span class="s">"id"</span><span class="o">,</span> <span class="s">"place"</span><span class="o">,</span> <span class="s">"name"</span><span class="o">))</span>
</code></pre></div></div>
<p>This looks wrong, but apparently works fine in <em>v2.4.3</em> 😲. A transformation which attempts to fill in a missing value for a column which does not exist should raise an error: <em>v2.4.5</em> does that.</p>
<!--break-->
<h3 id="deep-dive">Deep Dive</h3>
<p>So, what changed? A review of the <a href="https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12346042">changelog for <em>v2.4.5</em></a> shows a number of changes touching the functionality for working with missing data in DataFrames. The relevant change here is <a href="https://issues.apache.org/jira/browse/SPARK-29890">SPARK-29890</a>.</p>
<p><a href="https://issues.apache.org/jira/browse/SPARK-29890">SPARK-29890</a> addresses the issue of <em>DataFrameNaFunctions.fill</em> not handling duplicate columns when column names are not specified. But it addresses our issue as well, as a side-effect.</p>
<p>A part of the associated pull-request is presented below: the first gist is the <a href="https://github.com/apache/spark/blob/c3e32bf06c35ba2580d46150923abfa795b4446a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala#L472-L508"><em>v2.4.3</em></a> version of a private method called <em>fillValue</em>, and the next gist, the <a href="https://github.com/apache/spark/blob/cee4ecbb16917fa85f02c635925e2687400aa56b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala#L501-L536"><em>v2.4.5</em></a> version. <em>fillValue</em> is the underlying method for every overloaded version of <em>DataFrameNaFunctions.fill</em>.</p>
<script src="https://gist.github.com/17cf930ae9b9b135dc95457f5b8807a7.js?file=2.4.3.scala"> </script>
<script src="https://gist.github.com/17cf930ae9b9b135dc95457f5b8807a7.js?file=2.4.5.scala"> </script>
<p>The crux of the change relevant to us is in the signature of <em>fillValue</em>.</p>
<table>
<thead>
<tr>
<th><em>v2.4.3</em></th>
<th><em>v2.4.5</em></th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="language-plaintext highlighter-rouge">def fillValue[T](value: T, cols: Seq[String]): DataFrame</code></td>
<td><code class="language-plaintext highlighter-rouge">def fillValue[T](value: T, cols: Seq[Attribute]): DataFrame</code></td>
</tr>
</tbody>
</table>
<p>To solve the original issue the PR addresses (i.e., handling duplicate columns when column names are not specified), the comparison of <em>columns</em> was switched from the earlier</p>
<p><code class="language-plaintext highlighter-rouge">cols.exists(col => columnEquals(f.name, col))</code> to <code class="language-plaintext highlighter-rouge">cols.exists(_.semanticEquals(col))</code></p>
<p>This necessitated a change in the signature of <em>fillValue</em>. However, to convert <em>cols</em> required by <em>fillValue</em> from <em>Seq[String]</em> to <em>Seq[Attribute]</em>, it is passed to a method <a href="https://github.com/apache/spark/blob/cee4ecbb16917fa85f02c635925e2687400aa56b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala#L474-L478"><em>toAttributes</em></a>:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">private</span> <span class="k">def</span> <span class="nf">toAttributes</span><span class="o">(</span><span class="n">cols</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">String</span><span class="o">])</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Attribute</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span>
<span class="nv">cols</span><span class="o">.</span><span class="py">map</span><span class="o">(</span><span class="n">name</span> <span class="k">=></span> <span class="nv">df</span><span class="o">.</span><span class="py">col</span><span class="o">(</span><span class="n">name</span><span class="o">).</span><span class="py">expr</span><span class="o">).</span><span class="py">collect</span> <span class="o">{</span>
<span class="k">case</span> <span class="n">a</span><span class="k">:</span> <span class="kt">Attribute</span> <span class="o">=></span> <span class="n">a</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>This method, as a side-effect, ensures that the columns passed to <em>DataFrameNaFunctions.fill</em> exist in the DataFrame: <em>df.col</em> raises an <em>AnalysisException</em> for a column name it cannot resolve.</p>
<p>In short, this change in behaviour in <em>DataFrameNaFunctions.fill</em> brings a few tiny pains & improved correctness to <em>fill</em> transformations.</p>
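<p>For workloads that depended on the lenient <em>v2.4.3</em> behaviour, one workaround is to intersect the requested columns with the DataFrame’s actual columns before calling <em>fill</em>. The sketch below is pure Scala: <code>dfColumns</code> stands in for Spark’s <em>df.columns</em>, and the column names are illustrative.</p>

```scala
// Columns actually present in the DataFrame (df.columns in Spark).
val dfColumns = Seq("id", "place")

// Columns we would like to fill; "name" does not exist.
val requested = Seq("id", "place", "name")

// Keep only the columns that exist. In Spark this becomes:
//   df.na.fill("empty", requested.filter(df.columns.contains))
val safeCols = requested.filter(dfColumns.contains)
```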
A Conversation with Software Engineering Daily2020-02-21T11:11:00+00:00https://sujithjay.com/Conversation-on-Software-Engineering-Daily<p><a href="https://softwareengineeringdaily.com/2020/02/06/nubank-data-engineering-with-sujith-nair/"><img src="/public/podcast/Nubank.png" alt="Nubank Data Engineering with Sujith Nair" /></a></p>
<p>I was recently on the Software Engineering Daily podcast to talk about Data Engineering at Nubank.</p>
<p>It turned out to be a great conversation on <a href="https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a">functional data engineering</a>, the importance of testability & reproducibility in data engineering (and our approach to achieving it at scale at Nubank), thinking of dataset quality in terms of dataset-as-a-service, and my take on the history of data engineering as <a href="https://twitter.com/suj1th/status/1195442753389387782">a rediscovery of the table abstraction</a>. Check it out <a href="https://softwareengineeringdaily.com/2020/02/06/nubank-data-engineering-with-sujith-nair/">here</a>.</p>
Why Are Computer Storage Units Called 'Memory'?2020-01-11T11:11:00+00:00https://sujithjay.com/Storage-Memory<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Why are storage units in computers called "memory"? It might seem natural today, but the anthropomorphism does seem odd after a bit of thought. As many concepts in computing, this is attributable to von Neumann. A short thread on the history of this naming. 1/5 <a href="https://twitter.com/hashtag/computinghistory?src=hash&ref_src=twsrc%5Etfw">#computinghistory</a></p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1216073067904094209?ref_src=twsrc%5Etfw">January 11, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">UPenn Electrical Engg. Department had designed & built ENIAC, the first programmable electronic computer, between 1943-1945. This team roped in von Neumann to help design a successor to ENIAC. The first output of this collaboration was "First Draft of a Report on the EDVAC". 2/5</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1216073069174960128?ref_src=twsrc%5Etfw">January 11, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<!--break-->
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">In the year (1944-45) that von Neumann was working on the "First Draft", he was introduced to a 1943 paper by McCulloch & Pitts called "A Logical Calculus of the Ideas Immanent in Nervous Activity". It described the similarities between mathematical logic & neural networks. 3/5</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1216073070760341505?ref_src=twsrc%5Etfw">January 11, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">The paper's idea of the seeming similarity between digital control circuits & the operations of the biological nervous system background is supposed to have inspired von Neumann's use of biological language in the "First Draft". 4/5</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1216073072274563072?ref_src=twsrc%5Etfw">January 11, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">Components in EDVAC were called "organs", and the storage of the first 'stored-program' computer was christened "memory". The use of the name "memory" to refer to computer storage stuck, while the use of "organs" & other such terms faded into oblivion. 5/5</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1216073073608339456?ref_src=twsrc%5Etfw">January 11, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
Mesos, in the light of Omega2019-11-24T11:11:00+00:00https://sujithjay.com/Notes-on-Mesos<p><img src="/public/mesos/ClusterKernel.png" alt="image-title-here" class="img-responsive" />
Mesos is a framework I have recently become acquainted with. We use it to manage resources for our Spark workloads. The other resource management framework for <a href="/tag/apache-spark/">Spark</a> I have prior experience with is <a href="/spark/with-yarn">Hadoop YARN</a>. In this article, I revisit the concept of cluster resource-management in general, and explain higher-level Mesos abstractions & concepts. To this end, I borrow heavily from the classification of cluster resource-management systems in the <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41684.pdf">Omega paper</a>.</p>
<p>The Omega system is considered one of the precursors to <a href="https://kubernetes.io/">Kubernetes</a>. There is a fine article in <a href="https://queue.acm.org/detail.cfm?id=2898444">ACM Queue</a> describing this history. Also, <a href="https://twitter.com/bgrant0607">Brian Grant</a> has some rare insights into the evolution of cluster managers at Google, from Omega to Kubernetes, in multiple tweet-storms, such as <a href="https://twitter.com/bgrant0607/status/1102292629465661440">this</a> and <a href="https://twitter.com/bgrant0607/status/1111469578603778048">this</a>.</p>
<!--break-->
<h2 id="overview">Overview</h2>
<p>I would like to start by defining the anatomy of the distributed systems Mesos caters to.</p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/mesos/DistributedSystems.png" alt="Fig. 1: Anatomy of Distributed Systems" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Fig. 1: Anatomy of Distributed Systems</span>
</div>
<p>This class of distributed systems has one (or multiple) co-ordinator(s)<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> co-ordinating the execution of a program across a bunch of workers distributed within a cluster of machines. The co-ordinators have a set of desirable properties:</p>
<ul>
<li>Distributed</li>
<li>Fault-tolerant</li>
<li>Elastic</li>
</ul>
<p><em>Distributed</em> refers to the simultaneous execution of program tasks<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> across multiple machines. This is nothing short of what is expected from a distributed system.</p>
<p><em>Fault-tolerance</em> is the capability of the system to handle failures of a subset of workers in the cluster, and the ability to reschedule tasks away from those failing workers.</p>
<p><em>Elasticity</em> is the capability of the system to optimise performance and resource utilisation in a cluster, given simultaneous workloads competing for resources. It could be considered the defining characteristic which differentiates cluster resource-management paradigms & systems; say, Mesos from <a href="/spark/with-yarn">YARN</a>, or Kubernetes. The next section explores elasticity in some depth.</p>
<h2 id="elasticity">Elasticity</h2>
<blockquote>
<p>An elastic system is able to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner, such that at each point in time the available resources match the current demand as closely as possible. <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
</blockquote>
<p>Elasticity, as defined above, is a scheduling problem. There are multiple design choices to consider to achieve elasticity and based on these choices, resource-management systems can be bucketed into categories. The table below lists the categories suggested in the paper, along with the design choices:</p>
<table>
<thead>
<tr>
<th>Approach</th>
<th>Resource Choice</th>
<th>Interference</th>
<th>Allocation Granularity</th>
<th>Fairness</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="language-plaintext highlighter-rouge">Monolithic</code></td>
<td>All Available</td>
<td>None (Serialized)</td>
<td>Global Policy</td>
<td>Strict Priority (Preemption)</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">Statically Partitioned</code></td>
<td>Fixed Subset</td>
<td>None (Partitioned)</td>
<td>Per-partition Policy</td>
<td>Scheduler-dependent</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">Two-Level [Mesos]</code></td>
<td>Dynamic Subset</td>
<td>Pessimistic</td>
<td>Hoarding</td>
<td>Strict Fairness (Dominant Resource Fairness)</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">Shared-State [Omega]</code></td>
<td>All Available</td>
<td>Optimistic</td>
<td>Per-scheduler Policy</td>
<td>Free-for-all, Priority Preemption</td>
</tr>
</tbody>
</table>
<p>The design choices can be described briefly as follows:</p>
<ul>
<li>Choice of Resources to Participating Workloads
<ul>
<li>Do participating workloads (effectively, schedulers of participating workloads) have a universal view of cluster state and universal access to cluster resources? Or is it a limited view and restricted access?</li>
<li>Preemptive scheduling vs. Non-preemptive scheduling.</li>
</ul>
</li>
<li>Interference
<ul>
<li>Pessimistic approach to resource sharing, vs. Optimistic concurrency with conflict resolution.</li>
</ul>
</li>
<li>Allocation Granularity
<ul>
<li>Gang-scheduling vs. Incremental allocation.</li>
</ul>
</li>
<li>Fairness
<ul>
<li>Strict, central enforcement of fairness policy, vs. Reliance on emergent behaviour with post-facto checks.</li>
</ul>
</li>
</ul>
<p>I refer the reader to the <em>Taxonomy</em> section of the paper for a more verbose discussion on design choices. The categories are summarised in the following diagram.</p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/mesos/SchedulerArchitecture.png" alt="Fig. 2: Scheduling Architectures. Note: Statically Partitioned Schedulers are considered Monolithic." /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Fig. 2: Scheduling Architectures. Note: Statically Partitioned Schedulers are considered Monolithic.</span>
</div>
<p>Mesos belongs to the category of <em>two-level scheduling</em>. The documentation for Mesos alludes to this fact by calling Mesos <em>a level of indirection for scheduler frameworks</em>. <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/mesos/Indirection.png" alt="Fig. 3: Mesos, a level of indirection" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Fig. 3: Mesos, a level of indirection</span>
</div>
<h2 id="two-level-scheduling">Two-Level Scheduling?</h2>
<p>The <em>two-level</em> scheduling provided by Mesos can be described thus: a scheduler exists at the framework level & another exists as part of Mesos (as a component of the Mesos <em>master</em>). In a cluster with heterogeneous workloads, multiple frameworks function together with a single Mesos scheduler.</p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/mesos/Interactions.png" alt="Fig. 4: Mesos Scheduler Interactions" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Fig. 4: Mesos Scheduler Interactions</span>
</div>
<div></div>
<p>The Mesos master dynamically partitions a cluster, allocating resources to the different framework schedulers. Resources are distributed to the frameworks in the form of offers, which contain only “available” resources – ones that are currently unused. <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup> An example of a resource offer is shown below.</p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/mesos/Offer.png" alt="Fig. 5: Mesos Resource Offer" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Fig. 5: Mesos Resource Offer</span>
</div>
<p>The events in the diagram are described as follows: <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></p>
<ol>
<li>Agent 1 reports to the master that it has 4 CPUs and 4 GB of memory free. The master then invokes the allocation policy module, which tells it that framework 1 should be offered all available resources.</li>
<li>The master sends a resource offer describing what is available on agent 1 to framework 1.</li>
<li>The framework’s scheduler replies to the master with information about two tasks to run on the agent, using &lt;2 CPUs, 1 GB RAM&gt; for the first task, and &lt;1 CPU, 2 GB RAM&gt; for the second task.</li>
<li>Finally, the master sends the tasks to the agent, which allocates appropriate resources to the framework’s executor, which in turn launches the two tasks (depicted with dotted-line borders in the figure). Because 1 CPU and 1 GB of RAM are still unallocated, the allocation module may now offer them to framework 2.</li>
</ol>
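<p>The arithmetic of this cycle is easy to check with a toy model of the offer flow (the names are illustrative, not the Mesos API): the master offers an agent’s free resources to one framework, deducts whatever tasks that framework launches, and can offer the remainder to the next framework.</p>

```scala
// Minimal model of a Mesos resource offer cycle (illustrative, not the API).
case class Resources(cpus: Int, memGb: Int) {
  def -(r: Resources): Resources = Resources(cpus - r.cpus, memGb - r.memGb)
  def fits(r: Resources): Boolean = r.cpus <= cpus && r.memGb <= memGb
}

// Steps 1-2: agent 1 reports 4 CPUs and 4 GB free, all offered to framework 1.
val offer = Resources(cpus = 4, memGb = 4)

// Step 3: framework 1 replies with two tasks.
val tasks = Seq(Resources(2, 1), Resources(1, 2))

// Step 4: the master deducts the launched tasks; the remaining
// 1 CPU and 1 GB may now be offered to framework 2.
val remainder = tasks.foldLeft(offer) { (free, task) =>
  require(free.fits(task), s"offer cannot cover $task")
  free - task
}
```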
<p>During offers, the master avoids conflicts by only offering a given resource to one framework at a time, and attempts to achieve dominant resource fairness (DRF) <sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup> by choosing the order and the sizes of its offers. Because only one framework is examining a resource at a time, it effectively holds a lock on that resource for the duration of a scheduling decision. In other words, concurrency control is pessimistic.<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup></p>
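<p>To make dominant resource fairness concrete, here is a progressive-filling sketch of the worked example from the DRF paper: a cluster of 9 CPUs and 18 GB, user A running tasks that each need 1 CPU and 4 GB (memory-dominant), user B running tasks that each need 3 CPUs and 1 GB (CPU-dominant). Repeatedly granting a task to the user with the lowest dominant share leaves A with 3 tasks and B with 2, their dominant shares equalised at 2/3. This is a sketch of the algorithm’s idea, not Mesos’s allocator code.</p>

```scala
// Progressive filling under Dominant Resource Fairness (DRF):
// always grant the next task to the user with the smallest dominant share.
val total   = Vector(9.0, 18.0)                 // cluster capacity: CPUs, GB
val demands = Map("A" -> Vector(1.0, 4.0),      // A's per-task demand
                  "B" -> Vector(3.0, 1.0))      // B's per-task demand

var used  = Vector(0.0, 0.0)
var tasks = Map("A" -> 0, "B" -> 0)

// A user's dominant share: the largest fraction of any single resource
// that their current allocation occupies.
def dominantShare(user: String): Double = {
  val d = demands(user)
  d.indices.map(i => tasks(user) * d(i) / total(i)).max
}

def fits(d: Vector[Double]): Boolean =
  d.indices.forall(i => used(i) + d(i) <= total(i))

var progressing = true
while (progressing) {
  // Order users by dominant share; grant to the first whose task still fits.
  demands.keys.toSeq.sortBy(dominantShare).find(u => fits(demands(u))) match {
    case Some(u) =>
      used   = used.zip(demands(u)).map { case (a, b) => a + b }
      tasks += (u -> (tasks(u) + 1))
    case None => progressing = false
  }
}
```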
<h2 id="a-note-on-yarn">A note on YARN</h2>
<p>To contrast the above discussion with YARN, resource requests from per-job application masters are sent to a single global scheduler in the resource master, which allocates resources on various machines, subject to application-specified constraints. The application masters provide job-management services, and no scheduling. So YARN is effectively a monolithic scheduler architecture.</p>
<h2 id="closing-comments">Closing Comments</h2>
<p>A concise overview of the functionalities of Mesos in comparison with other resource management systems is shown in the following diagram. <sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup></p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/mesos/Comparison.png" alt="Fig. 6: Comparison" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Fig. 6: Comparison</span>
</div>
<p>This diagram makes clear why the makers of Mesos call it <em>the distributed systems kernel</em>. Mesos does not try to provide every functionality a distributed system needs to function; it provides the minimum on which <em>frameworks</em> are expected to build. This explains the need for schedulers & orchestration services such as <a href="https://mesosphere.github.io/marathon/">Marathon</a>, <a href="http://aurora.apache.org/">Aurora</a> and <a href="https://eng.uber.com/peloton/">Peloton</a> to run your applications on Mesos. We will delve into them in a future post. Until next time!</p>
<p><strong>P.S.</strong> The title illustration is from <a href="https://www.slideshare.net/pacoid/strata-sc-2014-apache-mesos-as-an-sdk-for-building-distributed-frameworks/25">Apache Mesos as an SDK for Building Distributed Frameworks</a> by Paco Nathan.</p>
<h2 id="notes">Notes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>In resource management systems literature, the term <em>co-ordinator</em> is used interchangeably with <em>scheduler</em>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Tasks here refers to sub-units of the program which are independently schedulable. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Herbst, Nikolas; Samuel Kounev; Ralf Reussner (2013). “<a href="https://sdqweb.ipd.kit.edu/publications/pdfs/HeKoRe2013-ICAC-Elasticity.pdf">Elasticity in Cloud Computing: What It Is, and What It Is Not</a>” (PDF). Proceedings of the 10th International Conference on Autonomic Computing (ICAC 2013), San Jose, CA, June 24–28. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>A scheduler framework is the distributed system running on top of Mesos, eg. Spark, Storm, Hadoop. In other words, framework ≈ distributed system. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>This is referred to as <em>resource choice</em> in the Omega paper. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>From the official Mesos <a href="http://mesos.apache.org/documentation/latest/architecture/">documentation</a>. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p><a href="https://cs.stanford.edu/~matei/papers/2011/nsdi_drf.pdf">Dominant Resource Fairness</a> is an allocation algorithm for clusters with mixed workloads, which has its origins in the same UC Berkeley research group as Mesos. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p>This is termed as <em>pessimistic interference</em> in the Omega paper. <a href="#fnref:8" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:9" role="doc-endnote">
<p>This comparison is from the article on <a href="https://eng.uber.com/peloton/">Peloton</a>, Uber’s open source resource scheduler. <a href="#fnref:9" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Hanlon's Razor: Some Comments2019-11-16T11:11:00+00:00https://sujithjay.com/Hanlons-Razor<blockquote>
<blockquoted>Do not attribute to malice that which can be explained by the less criminal motives of ignorance and lethargy.</blockquoted>
</blockquote>
<p>An aphorism of utmost utility in my life is Hanlon’s Razor. I find it a liberating rule of thumb for weighing a lot of unavoidably unpleasant experiences in daily life. In a less formal & more terse form that I prefer, it reads:</p>
<blockquote>
<blockquoted> Stupid people abound; Malicious people, less so. </blockquoted>
</blockquote>
<p>There is a neat <a href="https://en.wikipedia.org/wiki/Hanlon%27s_razor">Wikipedia article</a> on it which focuses on its origin, and also introduced me to an earlier form of the aphorism by Goethe.</p>
<blockquote>
<blockquoted> Misunderstandings and lethargy perhaps produce more wrong in the world than deceit and malice do. At least the latter two are certainly rarer.
<cite> Johann Wolfgang von Goethe, in The Sorrows of Young Werther </cite></blockquoted>
</blockquote>
<!--break-->
<p>Every time a fellow passenger is curt to me in the underground; whenever colleagues at work are under-appreciative of a piece of work I poured my heart into; when I face those racial micro-aggressions on a bad day; when a loved one is being unreasonable & stubborn: I find myself reaching for the razor in an attempt to rationalise the situation in my mind.</p>
<p>Over time, as I have used it to great effect to calm a distressed me, I have found it to be a little deficient. Let me tell you a story:</p>
<blockquote>
<p>Tia is responsible for compiling a report at work, which she believes would be instrumental in reflecting on the process inefficiencies in her organisation. She is very motivated to make this report as astute as possible, and has a slightly selfish motive of leveraging the impact of this report in her next conversation for a work promotion.</p>
</blockquote>
<blockquote>
<p>She submits the report, and waits anxiously for feedback. Days pass without a murmur, and she comes to the realisation that the attitude of the management to her report is indifference. She blames her immediate manager for burying the report. He never really liked her in the first place, goes her reasoning.</p>
</blockquote>
<blockquote>
<p>Tia was a person with a developed sense of stoicism towards such predicaments. She tried to calm herself by believing the management was not being malicious, but rather lethargic to change. But the more she tried to calm herself, the more a strong sense of dislike for her manager engulfed her. To an extent, she was sure of her manager’s malice. Hanlon’s razor was not helping her.</p>
</blockquote>
<p>The above story is illustrative, but it describes a common dilemma we face. In the face of (partial) knowledge of malice, it is hard to give the other person the benefit of the doubt. So, the razor is moot in this case, right?</p>
<p>I would like to argue to the contrary. Yes, the purpose of a razor is to eliminate unlikely explanations for a particular phenomenon. But the part which remains unsaid in this definition of a razor is <em>‘to what end?’</em>. Let us explore this question in Tia’s case.</p>
<p>If the intent is to investigate the real reason for her report being treated indifferently, Tia should assume malice on her manager’s part and proceed accordingly. This has the downside of assuming that the root cause was entirely external, and usually does not lead to any self-improvement. But if the intent is to treat it as an opportunity to do things differently next time, the best course of action is to attribute it to lethargy, misunderstanding or plain stupidity. Of course, there is no right way to choose here. But I believe being aware of our options is in itself an empowerment.</p>
Prefer Unions over Or in Spark Joins2019-10-11T11:11:00+00:00https://sujithjay.com/spark/Prefer-Union-over-Or-in-Joins<p>A common anti-pattern in Spark workloads is the use of an <code class="language-plaintext highlighter-rouge">or</code> operator as part of a <code class="language-plaintext highlighter-rouge">join</code>. An example of this goes as follows:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="nv">resultDF</span> <span class="k">=</span> <span class="n">dataframe</span>
<span class="o">.</span><span class="py">join</span><span class="o">(</span><span class="n">anotherDF</span><span class="o">,</span> <span class="n">$</span><span class="s">"cID"</span> <span class="o">===</span> <span class="n">$</span><span class="s">"customerID"</span> <span class="o">||</span> <span class="n">$</span><span class="s">"cID"</span> <span class="o">===</span> <span class="n">$</span><span class="s">"contactID"</span><span class="o">,</span>
<span class="s">"left"</span><span class="o">)</span>
</code></pre></div></div>
<p>This looks straightforward. The use of an <code class="language-plaintext highlighter-rouge">or</code> within the join makes its semantics easy to understand. However, we should be aware of the pitfalls of such an approach.</p>
<p>The declarative SQL above is resolved within Spark into a physical plan which determines how this particular query gets executed. To view the query plan for the computation, we could do:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">resultDF</span><span class="o">.</span><span class="py">explain</span><span class="o">()</span>
<span class="cm">/* pass true if you are interested in the logical plan of the query as well */</span>
<span class="nv">resultDF</span><span class="o">.</span><span class="py">explain</span><span class="o">(</span><span class="kc">true</span><span class="o">)</span>
</code></pre></div></div>
<!--break-->
<p>For the purpose of our discussion we will stick to just the physical plan. For a more detailed understanding of query plans within Spark, I would recommend reading: <a href="https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html">Deep Dive into Spark SQL’s Catalyst Optimizer</a>.</p>
<p>In the physical plan of a join operation, Spark identifies the strategy it will use to perform the join. The most common types of join strategies are (more can be found <a href="https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala">here</a>):</p>
<ul>
<li><a href="https://sujithjay.com/spark/broadcast-joins">Broadcast Join</a></li>
<li><a href="https://sujithjay.com/spark/shuffle-hash-sort-merge-joins">Shuffle Hash Join</a></li>
<li><a href="https://sujithjay.com/spark/shuffle-hash-sort-merge-joins">Sort Merge Join</a></li>
<li><strong>BroadcastNestedLoopJoin</strong></li>
</ul>
<p>I have listed the four strategies above in decreasing order of performance. In all cases, you do not want your joins to be resolved into a <code class="language-plaintext highlighter-rouge">BroadcastNestedLoopJoin</code>, because it is just a fancy name for joining your data-frames with nested for-loops.</p>
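<p>To make that concrete, here is what a nested-loop join boils down to, as a minimal plain-Scala sketch (the names <code class="language-plaintext highlighter-rouge">leftRows</code>, <code class="language-plaintext highlighter-rouge">broadcastRows</code> and <code class="language-plaintext highlighter-rouge">cond</code> are illustrative, not Spark internals):</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Every left row is compared against every broadcast row:
// O(|left| * |right|) evaluations of the join condition.
def nestedLoopJoin[L, R](leftRows: Seq[L], broadcastRows: Seq[R])
                        (cond: (L, R) => Boolean): Seq[(L, R)] =
  for (l &lt;- leftRows; r &lt;- broadcastRows if cond(l, r)) yield (l, r)
</code></pre></div></div>
<p>With an <code class="language-plaintext highlighter-rouge">or</code> condition there is no single equi-join key for Spark to hash or sort on, which is why it falls back to this strategy.</p>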
<p>We now have enough background to understand the drawback of an <code class="language-plaintext highlighter-rouge">or</code> in a join clause. We will assume the data-frames in our example are of considerable size (:big-data:). Analyzing the physical plan of the join, you will see something similar to this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftOuter, ((cID#8 = customerID#23) || (cID#8 = contactID#24))
:- *(1) Project [_1#4 AS cID#8, _2#5 AS c2#9, _3#6 AS c3#10]
: +- *(1) SerializeFromObject [assertnotnull(input[0, scala.Tuple3, true])._1 AS _1#4, assertnotnull(input[0, scala.Tuple3, true])._2 AS _2#5, assertnotnull(input[0, scala.Tuple3, true])._3 AS _3#6]
: +- Scan[obj#3]
+- BroadcastExchange IdentityBroadcastMode
+- *(2) Project [_1#18 AS c1#22, _2#19 AS customerID#23, _3#20 AS contactID#24]
+- *(2) SerializeFromObject [assertnotnull(input[0, scala.Tuple3, true])._1 AS _1#18, assertnotnull(input[0, scala.Tuple3, true])._2 AS _2#19, assertnotnull(input[0, scala.Tuple3, true])._3 AS _3#20]
+- Scan[obj#17]
</code></pre></div></div>
<p>For many large workloads, a query plan involving a <code class="language-plaintext highlighter-rouge">BroadcastNestedLoopJoin</code> will result in a run-time exception similar to: <code class="language-plaintext highlighter-rouge">SparkException: Cannot broadcast the table that is larger than 8GB: 10 GB</code></p>
<p>So, how do we work around this? High-school boolean algebra to the rescue! Remember that an <code class="language-plaintext highlighter-rouge">or</code> over two sets results in their union. We can rewrite our example as follows:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="nv">resultPart1</span> <span class="k">=</span> <span class="nv">dataframe</span><span class="o">.</span><span class="py">join</span><span class="o">(</span><span class="n">anotherDF</span><span class="o">,</span> <span class="n">$</span><span class="s">"cID"</span> <span class="o">===</span> <span class="n">$</span><span class="s">"customerID"</span><span class="o">,</span> <span class="s">"left"</span><span class="o">)</span>
<span class="k">val</span> <span class="nv">resultPart2</span> <span class="k">=</span> <span class="nv">dataframe</span><span class="o">.</span><span class="py">join</span><span class="o">(</span><span class="n">anotherDF</span><span class="o">,</span> <span class="n">$</span><span class="s">"cID"</span> <span class="o">===</span> <span class="n">$</span><span class="s">"contactID"</span><span class="o">,</span> <span class="s">"left"</span><span class="o">)</span>
<span class="k">val</span> <span class="nv">resultDF</span> <span class="k">=</span> <span class="nv">resultPart1</span><span class="o">.</span><span class="py">unionByName</span><span class="o">(</span><span class="n">resultPart2</span><span class="o">)</span>
</code></pre></div></div>
<p>This produces the following physical plan:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>== Physical Plan ==
Union
:- SortMergeJoin [cID#8], [customerID#23], LeftOuter
: :- *(2) Sort [cID#8 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(cID#8, 200)
: : +- *(1) Project [_1#4 AS cID#8, _2#5 AS c2#9, _3#6 AS c3#10]
: : +- *(1) SerializeFromObject [assertnotnull(input[0, scala.Tuple3, true])._1 AS _1#4, assertnotnull(input[0, scala.Tuple3, true])._2 AS _2#5, assertnotnull(input[0, scala.Tuple3, true])._3 AS _3#6]
: : +- Scan[obj#3]
: +- *(4) Sort [customerID#23 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(customerID#23, 200)
: +- *(3) Project [_1#18 AS c1#22, _2#19 AS customerID#23, _3#20 AS contactID#24]
: +- *(3) SerializeFromObject [assertnotnull(input[0, scala.Tuple3, true])._1 AS _1#18, assertnotnull(input[0, scala.Tuple3, true])._2 AS _2#19, assertnotnull(input[0, scala.Tuple3, true])._3 AS _3#20]
: +- Scan[obj#17]
+- SortMergeJoin [cID#8], [contactID#24], LeftOuter
:- *(6) Sort [cID#8 ASC NULLS FIRST], false, 0
: +- ReusedExchange [cID#8, c2#9, c3#10], Exchange hashpartitioning(cID#8, 200)
+- *(8) Sort [contactID#24 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(contactID#24, 200)
+- *(7) Project [_1#18 AS c1#22, _2#19 AS customerID#23, _3#20 AS contactID#24]
+- *(7) SerializeFromObject [assertnotnull(input[0, scala.Tuple3, true])._1 AS _1#18, assertnotnull(input[0, scala.Tuple3, true])._2 AS _2#19, assertnotnull(input[0, scala.Tuple3, true])._3 AS _3#20]
+- Scan[obj#17]
</code></pre></div></div>
<p>As we see, the dreaded <code class="language-plaintext highlighter-rouge">BroadcastNestedLoopJoin</code> has been replaced by two <code class="language-plaintext highlighter-rouge">SortMergeJoin</code>s, which have much better performance guarantees.</p>
<p>It is also important to understand why <code class="language-plaintext highlighter-rouge">union</code> is an efficient operation worth embracing whenever we can use it: a union causes zero shuffling of data across executors; it is just a bookkeeping change for Spark.</p>
<p>I will leave you with a complete reproducible example so you can try this out in your notebook:</p>
<script src="https://gist.github.com/097b80f7799d5d8389d1df650df377eb.js?file=UnionOverOr.md"> </script>
<p><em>Edit</em>:
An important caveat to the above discussion is that we can use <code class="language-plaintext highlighter-rouge">union</code> instead of <code class="language-plaintext highlighter-rouge">or</code> only when the <code class="language-plaintext highlighter-rouge">or</code> conditions are mutually exclusive, i.e. no pair of rows satisfies more than one of the conditions.</p>
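<p>If the conditions can overlap, one possible mitigation (a sketch, not a universal fix) is to de-duplicate the union; the names below match the earlier example, and note that <code class="language-plaintext highlighter-rouge">distinct</code> introduces a shuffle of its own:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val resultPart1 = dataframe.join(anotherDF, $"cID" === $"customerID", "left")
val resultPart2 = dataframe.join(anotherDF, $"cID" === $"contactID", "left")
// distinct() drops rows emitted by both branches, at the cost of an
// extra shuffle -- measure before adopting this on large data.
val resultDF = resultPart1.unionByName(resultPart2).distinct()
</code></pre></div></div>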
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/unionor/reddit-comment.png" alt="" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> </span>
</div>
Integration of Large-Scale Data Processing Systems and Traditional Parallel Database Technology2019-09-14T11:11:00+00:00https://sujithjay.com/Integration-of-Large-Scale-Data-Processing-Systems-and-Traditional-Parallel-Database-Technology<p><a href="http://www.vldb.org/pvldb/vol12/p2290-abouzied.pdf">Integration of Large-Scale Data Processing Systems and Traditional Parallel Database Technology</a>
Abouzied, A., Abadi, D.J, Bajda-Pawlikowski, K., Silberschatz, A. (2019, August). Proceedings of the VLDB Vol. 12 (12).</p>
<p>HadoopDB was a prototype built in 2009 as a hybrid SQL system with the features from Hadoop MapReduce framework and parallel database management systems (Greenplum, Vertica, etc). This paper revisits the design choices for HadoopDB, and investigates its legacy in existing data systems. I felt it is a great review paper for the state of modern data analysis systems.</p>
<p>MapReduce is the most famous example of a class of systems that partition large amounts of data over a multitude of machines and provide a straightforward language in which to express complex transformations and analyses. The key feature of these systems is that they abstract fault tolerance and partitioning away from the user.</p>
<blockquote>
<p>MapReduce, along with other large-scale data processing systems such as Microsoft’s Dryad/LINQ project, were originally designed for processing unstructured data.</p>
</blockquote>
<blockquote>
<p>The success of these systems in processing unstructured data led to a natural desire to also use them for processing structured data. However, the final result was a major step backward relative to the decades of research in parallel database systems that provide similar capabilities of parallel query processing over structured data. <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
</blockquote>
<p>The MapReduce model of <code class="language-plaintext highlighter-rouge">Map -> Shuffle -> Reduce/Aggregate -> Materialize</code> is inefficient for parallel structured query processing.</p>
<!--break-->
<blockquote>
<p>1) .. database systems are most efficient when they can pipeline data between operators. The forced materialization of intermediate data by MapReduce - especially when data is replicated to a distributed file system after each Reduce function - is extremely inefficient and slows down query processing.</p>
</blockquote>
<blockquote>
<p>2) <mark>MapReduce naturally provides support for one type of distributed join operation: the partitioned hash join.</mark> In parallel database systems, broadcast joins and co-partitioned joins when eligible to be used are frequently chosen by
the query optimizer, since they can improve performance significantly. Unfortunately, no implementation of broadcast and co-partitioned joins fit naturally into the MapReduce programming model.</p>
</blockquote>
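<p>For intuition, the partitioned hash join maps onto the model in a few lines of plain Scala (an illustrative sketch, not the paper's code): the map phase tags each record with its source relation and keys it by the join attribute, the shuffle co-locates equal keys, and the reduce phase pairs the two sides.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Map: key by the join attribute, remember which relation the record came from.
def mapSide[K, V](key: K, value: V, tag: String): (K, (String, V)) =
  (key, (tag, value))

// Reduce: all records sharing a key arrive together; pair the two sides.
def reduceSide[V](values: Seq[(String, V)]): Seq[(V, V)] = {
  val left  = values.collect { case ("L", v) => v }  // build side
  val right = values.collect { case ("R", v) => v }  // probe side
  for (l &lt;- left; r &lt;- right) yield (l, r)
}
</code></pre></div></div>
<p>Broadcast and co-partitioned joins, by contrast, do not fall out of the Map/Shuffle/Reduce pipeline this naturally.</p>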
<blockquote>
<p>3) Optimizations for structured data at the storage level such as column-orientation, compression in formats that can be operated on directly (without decompression), and indexing were hard to leverage via the execution framework of the MapReduce model.</p>
</blockquote>
<p>In spite of these shortcomings, there are valid technical (and non-technical) reasons for the wide adoption of Hadoop for structured data processing.</p>
<ul>
<li>Fault-tolerance in Hadoop during run-time query processing.</li>
<li>Ability to handle heterogeneous clusters.</li>
<li>Ability to parallelize user-defined functions.</li>
</ul>
<p>HadoopDB was designed to take advantage of these technical strengths of Hadoop while addressing its shortcomings.</p>
<p>HadoopDB placed a local DBMS (PostgreSQL/VectorWise) on every node in the data-processing cluster. This enabled significant speedups in the Map tasks, as filtering, projection, transformation, certain joins, and partial aggregations were pushed into the local DBMS.</p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/hadoopdb/pushdown.png" alt="Fig. 1: Pushdown of Map Functions" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Fig. 1: Pushdown of Map Functions</span>
</div>
<p>The desirable properties of HadoopDB as a data processing framework were:</p>
<ul>
<li>Querying could be done in SQL, MapReduce or a combination thereof.</li>
<li>Ability to handle heterogeneous clusters; a trait derived from Hadoop.</li>
<li>Fault-tolerance; another trait derived from Hadoop.
<blockquote>
<p>HadoopDB leveraged Hadoop’s checkpointing of intermediate data to disk after Map tasks, along with the determinism of Map and Reduce tasks in the MapReduce model to implement mid-query fault tolerance and thereby scale to very large deployments.</p>
</blockquote>
</li>
<li>HadoopDB with VectorWise was able to consistently outperform Hive and a commercial DBMS.</li>
</ul>
<h3 id="research-contributions">Research Contributions</h3>
<h4 id="split-execution">Split Execution</h4>
<ul>
<li><em>Split MapReduce/Database Joins</em> : In case of broadcast joins, HadoopDB chooses either of two strategies: i) A Map-side <a href="/spark/broadcast-joins">broadcast hash join</a>, or ii) Insert the smaller table into the DBMS as a temporary table, and perform the join within the DBMS at each node.</li>
<li><em>Partial Aggregations</em> : Based on heuristics, partial aggregations were used in <code class="language-plaintext highlighter-rouge">join + aggregations</code> type of queries to either prevent unnecessary writes to HDFS, or to improve query performance.</li>
</ul>
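<p>The partial-aggregation idea is the classic combiner pattern. A plain-Scala sketch (illustrative names, counting per key) of the two phases:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Each node pre-aggregates locally, shrinking what must be shuffled
// or written to HDFS...
def partialAgg[K](records: Seq[(K, Long)]): Map[K, Long] =
  records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// ...and a final pass merges the per-node partial results.
def mergeParts[K](parts: Seq[Map[K, Long]]): Map[K, Long] =
  parts.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
</code></pre></div></div>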
<p>Other contributions listed include <strong>Invisible Loading</strong>, <strong>Sinew</strong>, and <strong>Automatic Schema Generation</strong>.</p>
<h3 id="review-of-sql-on-hadoop">Review of SQL-on-Hadoop</h3>
<p>HadoopDB demonstrated the performance benefits of columnar data storage in the Hadoop ecosystem. The Hadoop community followed with the introduction of columnar storage into HDFS file formats, namely Parquet and ORC.</p>
<blockquote>
<p>Parquet and ORCFile use PAX blocks for columnar storage. In PAX, data is kept in columns within blocks, but a given block may consist of multiple columns from the same table. This makes tuple reconstruction faster since all data needed to perform this operation can be found in the same block. On the other hand, PAX reduces scan performance compared to pure column stores since not all data for a given column is placed contiguously on disk. <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
</blockquote>
<p>The next wave of SQL-on-Hadoop systems leveraged the performance of columnar storage; these systems were architected as system-level integrations of parallel databases and large-scale data processing systems.</p>
<p><strong>Hive</strong> evolved from a language-level hybrid to a system-level hybrid, incorporating pluggable execution engines. Tez, similar to Dryad, was one of the execution engines born out of this effort. An additional processing layer called LLAP was introduced.</p>
<blockquote>
<p>LLAP (Live Long and Process) … introduced per-node daemons responsible for local query execution and caching hot data.
In essence, LLAP instances served a similar purpose in Hive as local DBMS servers in HadoopDB.</p>
</blockquote>
<p>Apache Calcite was incorporated into Hive to provide cost-based optimizations. ORC ACID provided transactional table support.</p>
<p><strong>Spark</strong>, which has similarities to Dryad and Tez, showed significant performance gains over MapReduce in iterative data processing. Spark SQL brought SQL capabilities to Spark. Delta is a transactional table storage layer for Spark, built on Parquet.</p>
<p><strong>Impala and HAWQ</strong>, like HadoopDB, include a specialized single-node query-execution engine on each node of a Hadoop cluster. They differ in that inter-node communication is not managed by MapReduce: they have a complete parallel database system to manage inter-node communication, and thus entire query plans can circumvent MapReduce. This comes at the cost of mid-query fault-tolerance.</p>
<p><strong>Presto</strong> is also a complete parallel database system. Like Impala, Presto fully pipelines relational operators, which means faster query execution but no support for mid-query fault-tolerance.</p>
<blockquote>
<p>By being complete implementations of parallel execution engines, Impala, HAWQ, and Presto are somewhat independent systems that integrate with Hadoop mostly at the storage level (although Impala and HAWQ both also integrate with Hadoop’s resource management tools). To that end, they provide similar (albeit more native) functionality to a large number of commercial parallel database systems that have “connectors” to Hadoop that enable them to read data from HDFS.</p>
</blockquote>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>DeWitt, David, and Michael Stonebraker. <a href="https://pdfs.semanticscholar.org/08d1/2e771d811bcd0d4bc81fa3993563efbaeadb.pdf">“MapReduce: A major step backwards.”</a> The Database Column. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>As noted by <a href="https://twitter.com/squarecog">Dmitriy Ryaboy</a> <a href="https://twitter.com/squarecog/status/1173067158982557696?s=20">here</a>, the combined effect of large block sizes (~256 MB in case of ORC and 512-1024MB for Parquet in standard deployments) and parallel reader processes diminishes the significance of the PAX “weaving pattern”. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Datomic with Rich Hickey2019-09-13T11:11:00+00:00https://sujithjay.com/Datomic-With-Rich-Hickey<div class="embed-container">
<iframe src="https://www.youtube.com/embed/9TYfcyvSpEQ" width="700" height="480" frameborder="0" allowfullscreen="">
</iframe>
</div>
<p>This talk is an introduction to Datomic, by its creator Rich Hickey. My notes on this talk are linked below:</p>
<iframe style="display:block" src="/public/pdf/Datomic.pdf" width="700" height="880"></iframe>
Open Core : How Did We Get Here?2019-07-14T11:11:00+00:00https://sujithjay.com/Open-Core<link rel="canonical" href="https://collectiive.github.io/essays/2" />
<p>Open source is considered an exemplar of the ‘private-collective’ model of innovation,<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> a compound model with elements from both the private investment & the collective action models.</p>
<p>This model was an attempt to rationalise and reason about the existence of the open source software industry, and answer the question: “why would thousands of top-notch programmers contribute, without apparent material incentives, to the provision of a public good?”.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<p>This essay revisits the assumptions of the private-collective model in the cloud-compute era, to understand the emergent phenomenon of the open core revenue model in the commercial open source software industry. This is of particular significance in view of the perceived siege of the open source model by cloud vendors.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
<!--break-->
<h3 id="private-collective-model">Private-Collective Model</h3>
<p>A public good produced under the private-collective model has two classes of users: innovators and free-riders. Innovators are users of the good and active contributors to its sustained development. Free-riders are passive consumers of the good, with no contribution to its advancement.</p>
<p>For the sustenance of an open source product, it is imperative that the innovator class has incentives beyond what is available to the free-rider class of users. It is a myth that the development of a public good can be sustained on the basis of pure altruism from the innovator class of users.</p>
<p>In case of individual innovators, the incentives, as postulated by the model, are:</p>
<ul>
<li>learning, which is an immediate incentive.</li>
<li>signaling incentive, which is a form of delayed incentive.
<ul>
<li>career concern incentive: in the form of future job offers, shares in commercial open source corporations, or future access to the venture capital market.</li>
<li>visibility incentive: in the form of peer recognition.</li>
</ul>
</li>
</ul>
<p>For an organisation to participate as an innovator, a different set of incentives could be at play. Unlike the case of an individual contributor, these incentives are harder to classify, and much less researched at scale. A majority of organisations have looked at the incentive structure of open source through the same prism as an individual contributor: open source provides organisations signaling incentives to attract talent, and visibility in the community (which, again, helps in hiring).</p>
<p>Other delayed incentives traditionally identified for initiator organisations of OSS can take the form of complementary services.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> That is, the company expects to boost its profit in a segment complementary to the open source project (for example, technical support for the OSS product). This is popularly known as the <em>Red Hat business model</em>. A caveat to this strategy is that the increase in profit in the proprietary complementary segment should offset the profit that would have been made in the primary segment, had it not been open sourced.</p>
<p>Very few incentives beyond these have been identified and utilised by organisations until recently. This explains why organisations do not feel compelled to open source proprietary products on which the company’s revenue directly depends.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></p>
<p>A recent wave of COSS (Commercial Open Source Software) organisations is trying to change this. One reason for the shift is the hypothesis, now common in the industry, that the open source process lets a small organisation use the diffusion networks associated with open source to take on a dominant player in the industry.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></p>
<h3 id="the-cloud-era">The Cloud Era</h3>
<p>The experimentation with open source as a business and product-delivery model by COSS organisations coincides with the near-monopolistic rise of cloud vendors. In the private-collective model, the existing cloud vendors would be classified as free-riders. The zero-sum (winner-takes-all) nature of the cloud model<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup> means cloud vendors have little motivation to contribute back to the community, while capturing a disproportionate share of the value generated by an open source product.</p>
<p>The confluence of these two trends explains the emergence of the open core model. The first generation of open core business models are simple tweaks to open-source licenses, designed to defend against cloud providers.<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup> To be clear, the use of licenses to defend proprietorship is not much of an innovation: MySQL dual licensing existed for a decade before the cloud era.</p>
<p>As with other business-model innovations happening in the COSS industry, open core is an emergent phenomenon, and its efficacy and implications are yet to be understood in full measure. Its efficacy will be judged by the degree of sustainability it brings to the COSS industry; its implications for the spirit of open source and the free sharing of software (and how far removed it is from that ideology) are up for debate.</p>
<p><strong>P.S.</strong> A draft of this essay was published <a href="https://collectiive.github.io/essays/2">here</a>, as part of the <a href="https://collectiive.github.io/essays/1">Collectiive</a> initiative.</p>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Hippel, Eric von, and Georg von Krogh. “Open source software and the “private-collective” innovation model: Issues for organisation science.” organisation science 14.2 (2003): 209-223. <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1410789">Link</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Lerner, Josh, and Jean Tirole. “Some simple economics of open source.” The journal of industrial economics 50.2 (2002): 197-234. <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.461.3373&rep=rep1&type=pdf">Link</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Dix, Paul. “The Open Source Business Model is Under Siege.” (2017). <a href="https://www.influxdata.com/blog/the-open-source-database-business-model-is-under-siege/">Link</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>The delayed incentives for a commercial open source organisation may not be limited to complementary services. The entire premise of the modern commercial open source industry is based on the type & form of delayed incentives an open source project can accrue. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>An argument against open sourcing products with direct impact on revenue is minimal product differentiation resulting in limited pricing power and corresponding lack of revenue. <a href="https://techcrunch.com/2014/02/13/please-dont-tell-me-you-want-to-be-the-next-red-hat/">Link</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>A prominent recent example of this phenomenon is <a href="https://www.hashicorp.com/">Hashicorp Inc</a>. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>A consumer of a cloud-vendor has every reason to use the cloud-vendor’s offering of a service (built on top of a open source product), rather than using it from a different organisation, even if that other organisation is the original initiator of the underlying open source product (and therefore, arguably, has more of an expertise in it). <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p>Examples include <a href="https://redis.io/topics/license">Redis</a>, <a href="https://about.gitlab.com/2014/02/11/gitlab-ee-license-change/">Gitlab</a>, <a href="https://github.com/elastic/elasticsearch/blob/master/LICENSE.txt">Elasticsearch</a> among others. <a href="#fnref:8" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Github Sponsors2019-05-24T11:11:00+00:00https://sujithjay.com/Github-Sponsors<blockquote class="twitter-tweet"><p lang="en" dir="ltr">1/ The recent announcement of <a href="https://twitter.com/github?ref_src=twsrc%5Etfw">@github</a> sponsors (<a href="https://t.co/OjxZU1t6UT">https://t.co/OjxZU1t6UT</a>) is an interesting development in OSS. I feel it is a great time to revisit my essay on the questions facing open-source software: <a href="https://t.co/UgK40taf9R">https://t.co/UgK40taf9R</a> <a href="https://twitter.com/hashtag/GitHubSponsors?src=hash&ref_src=twsrc%5Etfw">#GitHubSponsors</a> 1/N</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1131669064777125888?ref_src=twsrc%5Etfw">May 23, 2019</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">Q.1: Can we make the open-source movement self-sustaining? Open source survives on philanthropy: the altruism of the initiator of an open source project, the unpaid labour of the maintainer, and the monetary donations to foundations. Is there an alternative, self-sustaining way?</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1131669066047991829?ref_src=twsrc%5Etfw">May 23, 2019</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">Github Sponsors could, prima facie, remove the reliance of OSS projects on foundations. It would continue to be based on philanthropy. Of the sponsors. Is that an improvement over the present? I am not sure.</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1131669067339837440?ref_src=twsrc%5Etfw">May 23, 2019</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">Q.2: <br />a) Can we pay back for the effort of the maintainer and the individual contributor? <br />b) Can we provide economic incentives to the maintainers and contributors to help continued development? <br />c) How do we assign value to an open source project & contribution?</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1131669068807888897?ref_src=twsrc%5Etfw">May 23, 2019</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<!--break-->
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">Sponsor is a definite answer to (a) & (b). It introduces economic incentives for prolonged development & maintenance of projects. (c) is the one I love. Every user decides his value for a OSS, and chooses to pay for it. The unanswered part is how to value a single contribution.</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1131669070015811584?ref_src=twsrc%5Etfw">May 23, 2019</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">Q.3:<br />a) How can an entity be incentivised to give back a portion of the value it captures from an open source project back to the community?<br />b) How do we gauge the value captured by an entity from an open source project?</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1131669071320231946?ref_src=twsrc%5Etfw">May 23, 2019</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">Q.3:<br />a) How can an entity be incentivised to give back a portion of the value it captures from an open source project back to the community?<br />b) How do we gauge the value captured by an entity from an open source project?</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1131669071320231946?ref_src=twsrc%5Etfw">May 23, 2019</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">The incentive to pay is the dire scenario in which an OSS maintainer chooses to ditch the project to work on an another project which is 'sponsored'. So, if you rely on an OSS, and reap economic benefits from it, you are incentivised to pay, or maintain it on your own.</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1131669072557596672?ref_src=twsrc%5Etfw">May 23, 2019</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">Overall, Sponsor is an experiment worth watching out for. Eyes peeled for what's in store! N/N</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1131669073820037123?ref_src=twsrc%5Etfw">May 23, 2019</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
Catastrophic Forgetting2018-12-01T11:11:00+00:00https://sujithjay.com/Catastrophic-Forgetting<blockquote class="twitter-tweet"><p lang="en" dir="ltr">1/ Catastrophic Forgetting is a long-recognised problem in neural networks; and is of great interest in cognitive sciences. In plain words, it is the destructive interference effect of learning a new skill on pre-existing skills. <a href="https://twitter.com/hashtag/deeplearning?src=hash&ref_src=twsrc%5Etfw">#deeplearning</a></p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1068759316000210944?ref_src=twsrc%5Etfw">December 1, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">2/ Research in Deep Learning has had a particular focus on this problem in recent time, particularly in the realm of reinforcement learning. e.g. [Rolnick, Ahuja, Schwarz et al. 2018], [Shin, Lee, Kim et al. 2017] among others. <a href="https://twitter.com/hashtag/deeplearning?src=hash&ref_src=twsrc%5Etfw">#deeplearning</a></p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1068759317002604544?ref_src=twsrc%5Etfw">December 1, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">3/ A common thread in these research is the replay of past data to reinforce acquired skills from the past. Rolnick et al. (<a href="https://t.co/jluXqvN2Qr">https://t.co/jluXqvN2Qr</a>) choose a 50-50 split of replay vs. new task data. <a href="https://twitter.com/hashtag/deeplearning?src=hash&ref_src=twsrc%5Etfw">#deeplearning</a></p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1068759317942173696?ref_src=twsrc%5Etfw">December 1, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<!--break-->
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">4/ The use of replay using a limited buffer demonstrates a drastic reduction in catastrophic forgetting. The limited replay buffer result, in particular, is an exciting one; considering that deep learning models are such resource-hogs. <a href="https://twitter.com/hashtag/deeplearning?src=hash&ref_src=twsrc%5Etfw">#deeplearning</a></p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1068759319011704833?ref_src=twsrc%5Etfw">December 1, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">5/ This research, in cumulation with its predecessors, answers an important question about replay: a limited sample of past experiences produces a minimal difference in performance in comparison to an unlimited buffer. <a href="https://twitter.com/hashtag/deeplearning?src=hash&ref_src=twsrc%5Etfw">#deeplearning</a></p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1068759320098033664?ref_src=twsrc%5Etfw">December 1, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">6/ An important direction this experiment could take is via tweaks on the split of replay vs new tasks data, and uniform reservoir sampling on the replay buffer. Anybody who has tried to learn using spaced repetition knows that a uniform split at every point in the...</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1068759321062727681?ref_src=twsrc%5Etfw">December 1, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">7/ ...future does not guarantee optimality (or, so they say). A least-recently-used bias in sampling should help in achieving a spaced-repetition kind of behavior in replay. It would be interesting to see the effects of it on catastrophic forgetting. <a href="https://twitter.com/hashtag/deeplearning?src=hash&ref_src=twsrc%5Etfw">#deeplearning</a></p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1068759321951916033?ref_src=twsrc%5Etfw">December 1, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
Caveat to Open Source Disruption2018-10-20T11:11:00+00:00https://sujithjay.com/Caveat-to-Open-Source-Disruption<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I believe an important caveat exists to this postulate: the physical capital necessary for the production & innovation of the resource should have low-cost access & wide distribution. I will try & explore this caveat a bit. <a href="https://t.co/p57V70PiCr">https://t.co/p57V70PiCr</a></p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053725344249851910?ref_src=twsrc%5Etfw">October 20, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">I will use the case of the Pharmaceutical industry. Modern drug discovery is a patent-heavy process, which should make it a ripe candidate for open source disruption. But this has not been the case, yet.</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053725347718594560?ref_src=twsrc%5Etfw">October 20, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">My argument for why this is so is the concentrated nature of the physical asset (lab infrastructure, capital for clinical trials) needed for innovation in drug discovery - it is limited to large pharmaceutical firms and some university departments.</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053725351107551232?ref_src=twsrc%5Etfw">October 20, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">The concentrated nature of the physical asset ensures the opportunity cost of losing out on innovation that could have been garnered by the resource as a commons, is very low. This, in turn, reduces the effective implementation cost of property for the resource. Hence, patents!</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053725354064601088?ref_src=twsrc%5Etfw">October 20, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
Open Source Eats Patents2018-10-19T11:11:00+00:00https://sujithjay.com/Open-Source-Eats-Patents<blockquote class="twitter-tweet"><p lang="en" dir="ltr"><a href="https://twitter.com/asynchio?ref_src=twsrc%5Etfw">@asynchio</a> postulates that every patent-heavy industry will be dis-intermediated by open source. A thread on why this prediction could turn out to be true. 1/N</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053238415091814400?ref_src=twsrc%5Etfw">October 19, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">Demsetz' Theory on Property Rights models the emergence of property around a resource as a function of the cost of implementing & enforcing property rights.</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053238417981669377?ref_src=twsrc%5Etfw">October 19, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">A resource, managed as property, could evolve into commons when the implementation cost of property rights exceeds the value of the increase in the efficiency of utilisation of the resource caused by adoption of property rights.</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053238422264049665?ref_src=twsrc%5Etfw">October 19, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<!--break-->
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">The effective implementation cost of property for a resource in a patent-heavy industry is a combination of two factors:<br />1. the (nominal) cost of a patent.<br />2. the opportunity cost of losing out on innovation that could have been garnered by the resource as a commons.</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053238425351000064?ref_src=twsrc%5Etfw">October 19, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">OTOH, the efficiency of utilisation of a resource for a firm in a patent-heavy industry = the share of the value created by the resource which the firm can capture. This is lower for a commons resource vs a patented resource.</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053238428303847425?ref_src=twsrc%5Etfw">October 19, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">How much lower? And, is this loss covered for by relinquishing the cost of property implementation? A traditional view on this would answer NO.</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053238431323770880?ref_src=twsrc%5Etfw">October 19, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">The emergence & continued flourish of the COSS industry, however, is a proof to the contrary. We can draw parallels between proprietary software & patents (both are moats around information resources with the abject intent of increasing an entity's share of resource utilisation).</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053238434410549248?ref_src=twsrc%5Etfw">October 19, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">I believe a drift from patents to commons will result as the realisation of the high hidden cost to implementation of patents offsets the scepticism to the commercial viability of open source. (N/N)</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053238437283794944?ref_src=twsrc%5Etfw">October 19, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I believe an important caveat exists to this postulate: the physical capital necessary for the production & innovation of the resource should have low-cost access & wide distribution. I will try & explore this caveat a bit. <a href="https://t.co/p57V70PiCr">https://t.co/p57V70PiCr</a></p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053725344249851910?ref_src=twsrc%5Etfw">October 20, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">I will use the case of the Pharmaceutical industry. Modern drug discovery is a patent-heavy process, which should make it a ripe candidate for open source disruption. But this has not been the case, yet.</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053725347718594560?ref_src=twsrc%5Etfw">October 20, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">My argument for why this is so is the concentrated nature of the physical asset (lab infrastructure, capital for clinical trials) needed for innovation in drug discovery - it is limited to large pharmaceutical firms and some university departments.</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053725351107551232?ref_src=twsrc%5Etfw">October 20, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p lang="en" dir="ltr">The concentrated nature of the physical asset ensures the opportunity cost of losing out on innovation that could have been garnered by the resource as a commons, is very low. This, in turn, reduces the effective implementation cost of property for the resource. Hence, patents!</p>— Sujith Jay Nair (@suj1th) <a href="https://twitter.com/suj1th/status/1053725354064601088?ref_src=twsrc%5Etfw">October 20, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
Dynamo vs Cassandra : Systems Design of NoSQL Databases2018-10-02T11:11:00+00:00https://sujithjay.com/data-systems/Dynamo-vs-Cassandra<p>State-of-the-art distributed databases represent a distillation of years of research in distributed systems. The concepts underlying any distributed system can thus be overwhelming to comprehend. This is truer when you are dealing with databases without the strong consistency guarantee. Databases without strong consistency guarantees come in a range of flavours; but they are bunched under a category called <a href="http://www.christof-strauch.de/nosqldbs.pdf">NoSQL databases</a>.</p>
<p>NoSQL databases do not represent a single kind of data model, nor do they offer uniform guarantees regarding consistency and availability. However, they are built on very similar principles and ideas.</p>
<p>From a historical perspective, the advent of NoSQL databases was precipitated by the publication of <a href="https://courses.cs.washington.edu/courses/csep552/18wi/papers/decandia-dynamo.pdf">Dynamo</a> by Amazon<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> & <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">BigTable</a> by Google, and the emergence of a number of open-source distributed data stores, which were (improved?) clones of either (or both) of these systems. Bigtable-inspired NoSQL stores are referred to as column-stores (e.g. <a href="http://www.hypertable.com/documentation/">HyperTable</a>, <a href="https://hbase.apache.org/">HBase</a>), whereas Dynamo influenced most of the key/value-stores. We will term these systems loosely as Dynamo-family databases, which include <a href="http://docs.basho.com/">Riak</a>, <a href="https://www.aerospike.com/docs/">Aerospike</a>, <a href="http://www.project-voldemort.com/voldemort/">Project Voldemort</a>, and <a href="http://cassandra.apache.org/">Cassandra</a>.</p>
<p>I would like to focus on systems design ideas in Dynamo-family NoSQL databases in this article, with a particular focus on Cassandra. The approach of this article is to compare and contrast Cassandra with Dynamo; and in this process, touch upon the underlying ideas. Expect a lot of homework & further readings; I will have copious amounts of references throughout the article.</p>
<!--break-->
<h2 id="caveats">Caveats</h2>
<p>Cassandra, although heavily influenced by Dynamo, also borrows from <a href="https://cloud.google.com/bigtable/docs/">BigTable</a>.</p>
<p>A chunk of the differences between Cassandra & Dynamo stems from the fact that the data-model of Dynamo is a key-value store, while Cassandra is designed as a <a href="https://en.wikipedia.org/wiki/Standard_column_family">column-family</a> data store (a concept from BigTable in which the primary abstraction is a sparsely populated wide table).</p>
<p>For what it’s worth, Cassandra is a full-fledged database implementation, while Dynamo is a set of ideas which, when taken together, can form a highly available distributed data-store (Dynamo was a walled system, and its implementation was never released to the public domain).</p>
<p>We will focus on a comparison of the implementation choices in Cassandra & Dynamo which address the challenges of a distributed system; and will try to steer clear of contrasts arising from differences in the data models.</p>
<h2 id="meat-of-the-matter">Meat of the Matter</h2>
<p>The essence of this article can be summarized as below:</p>
<table>
<thead>
<tr>
<th>Problem</th>
<th style="text-align: center">Dynamo</th>
<th style="text-align: right">Cassandra</th>
</tr>
</thead>
<tbody>
<tr>
<td>High Availability for Writes</td>
<td style="text-align: center">Vector Clocks</td>
<td style="text-align: right">Last Write Wins</td>
</tr>
<tr>
<td>Temporary Failures</td>
<td style="text-align: center">Sloppy Quorum & Hinted Hand-offs</td>
<td style="text-align: right">Strict Quorum & Hinted Hand-offs</td>
</tr>
<tr>
<td>Partitioning</td>
<td style="text-align: center">Consistent Hashing</td>
<td style="text-align: right">Consistent Hashing</td>
</tr>
<tr>
<td>Permanent Failures</td>
<td style="text-align: center">Anti-entropy using Merkle trees</td>
<td style="text-align: right">Anti-entropy using Merkle trees</td>
</tr>
<tr>
<td>Failure Detection</td>
<td style="text-align: center">Gossip Protocols</td>
<td style="text-align: right">Gossip Protocols</td>
</tr>
</tbody>
</table>
<p>The remainder of this article is little more than a collection of introductions to the above concepts.</p>
<h2 id="high-availability-for-writes">High Availability for Writes</h2>
<p>Highly available writes in a distributed database with <a href="https://www.brianstorti.com/replication/">leaderless replication</a> (both Dynamo and Cassandra employ leaderless replication) require a heuristic for conflict resolution between concurrent writes. This is essential because every replica of the data is considered equal, and concurrent writes on the same record at two different replicas are perfectly valid.</p>
<p>The common heuristics for conflict resolution are <a href="#vector-clocks">vector clocks</a> or <a href="#last-write-wins">last-write-wins</a>.</p>
<h3 id="vector-clocks">Vector Clocks</h3>
<p>Vector Clocks are a mechanism to notify the actors about the occurrence of conflicts.</p>
<p>To illustrate the process, assume three actors with actor IDs Anu, Baba and Chandra respectively. Let an existing data-point in the data-store be <code class="language-plaintext highlighter-rouge">{"street" : "Lavelle", "city" : "Bangalore"}</code> with key ‘address’. We will call this version of ‘address’ <em>V0</em>. Anu updates the street such that the data now reads <code class="language-plaintext highlighter-rouge">{"street" : "Cubbon", "city" : "Bangalore"}</code>, which we will call <em>V1</em>. This is updated to a single replica. A concurrent update is performed by Baba, who changes the city such that the data now reads <code class="language-plaintext highlighter-rouge">{"street" : "Lavelle", "city" : "Bombay"}</code> (<em>V2</em> ). This is updated to another replica.</p>
<p>In a data-store using vector clocks, the data-store holds onto both <em>V1</em> and <em>V2</em> because they do not descend from each other. When an actor reads the data at a future point of time (for example, Chandra is reading the ‘address’ data), the data-store will hand back both values. The client decides on the merge strategy of the sibling data returned to it. Once descendancy can be calculated, values stored with vector clocks that have been succeeded will be removed.</p>
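<p>To make the mechanics concrete, here is a minimal vector-clock sketch in Python. This is an illustration of the idea, not Dynamo’s actual implementation, and all names are invented. A version’s clock maps actor IDs to counters; one version descends from another only if every counter is at least as large.</p>

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` is a (possibly equal) descendant of clock `b`."""
    return all(a.get(actor, 0) >= n for actor, n in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither clock descends from the other: a genuine conflict."""
    return not descends(a, b) and not descends(b, a)

def bump(clock: dict, actor: str) -> dict:
    """Return a copy of `clock` with `actor`'s counter incremented."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated

def merge(a: dict, b: dict) -> dict:
    """Pointwise maximum: the smallest clock descending from both."""
    return {x: max(a.get(x, 0), b.get(x, 0)) for x in set(a) | set(b)}

# Replaying the 'address' example: Anu and Baba update V0 concurrently.
v0 = {}
v1 = bump(v0, "Anu")    # Anu writes {"street": "Cubbon", ...}
v2 = bump(v0, "Baba")   # Baba writes {..., "city": "Bombay"}
assert concurrent(v1, v2)   # the store must keep both siblings

# Chandra reads both siblings, merges them, and writes back V3.
v3 = bump(merge(v1, v2), "Chandra")
assert descends(v3, v1) and descends(v3, v2)
```

<p>Because V3’s clock dominates both siblings, a later read returns a single value again; this is exactly the point at which the succeeded versions are removed.</p>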
<p>Vector Clocks were proposed as a conflict resolution mechanism in the original Dynamo paper; however, most Dynamo-family databases have last-write-wins as their conflict resolution mechanism.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<p>The maintainers of Riak have lucid explanations of how vector clocks work and the associated challenges in <a href="http://basho.com/posts/technical/why-vector-clocks-are-easy/">Why Vector Clocks Are Easy</a> and <a href="http://basho.com/posts/technical/why-vector-clocks-are-hard/">Why Vector Clocks Are Hard</a>.</p>
<h3 id="last-write-wins">Last Write Wins</h3>
<p>In last-write-wins, only the latest change to a data-point is retained. In the above example, <em>V2</em> would be the final version of the data in a last-write-wins data-store. Last-write-wins is thus a simplification over the vector clock approach, and it can lead to data loss: the update to ‘street’ introduced in <em>V1</em> is silently discarded.</p>
<p>The maintainers of Cassandra believe this simplification is justified because its data model differs from that of a key-value database: each row is broken up into columns which are updated independently. This fine-grained implementation of last-write-wins is argued to work well for Cassandra. In the case of timestamp ties for the same column, Cassandra has a <a href="http://cassandra.apache.org/doc/latest/faq/index.html#what-on-same-timestamp-update">rule-based, deterministic method to get a commutative result</a>. You can read more on why Cassandra doesn’t need vector clocks <a href="https://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks">here</a>.</p>
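<p>A sketch of per-column last-write-wins (hypothetical code, merely illustrating the scheme): each column value carries its own write timestamp, and a merge keeps, per column, the value with the higher timestamp, breaking ties deterministically.</p>

```python
def merge_rows(a: dict, b: dict) -> dict:
    """Rows map column -> (value, timestamp). The higher timestamp wins;
    on a timestamp tie, the lexically greater value wins, which makes
    the merge deterministic and commutative."""
    merged = {}
    for col in set(a) | set(b):
        candidates = [v for v in (a.get(col), b.get(col)) if v is not None]
        merged[col] = max(candidates, key=lambda v: (v[1], v[0]))
    return merged

# Anu's and Baba's concurrent writes, with per-column timestamps:
replica1 = {"street": ("Cubbon", 2), "city": ("Bangalore", 1)}
replica2 = {"street": ("Lavelle", 1), "city": ("Bombay", 2)}
merged = merge_rows(replica1, replica2)
assert merged == {"street": ("Cubbon", 2), "city": ("Bombay", 2)}
```

<p>Note that per-column resolution preserves both concurrent updates here; it is whole-row last-write-wins that would lose one of them.</p>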
<h2 id="temporary-failures">Temporary Failures</h2>
<p>Temporary failures occur when a replica is unavailable for read and/or write operations for a small duration of time. This could arise because of GC stalls, network and hardware outages, or maintenance shutdowns.</p>
<p><a href="#hinted-handoff">Hinted Handoff</a> is a common strategy in write paths to handle and repair temporary failures in systems with leaderless replication. In read paths, the approach for handling temporary failures could be either <a href="#strict--sloppy-quorum">strict or sloppy quorum</a>.</p>
<h3 id="hinted-handoff">Hinted Handoff</h3>
<p>In Hinted Handoff, when a write is performed and a replica node for the row is either known to be down ahead of time, or does not respond to the write request, the coordinator will store a hint locally. This hint is basically a wrapper around the mutation indicating that it needs to be replayed to the unavailable node(s).</p>
<p>Once a node discovers via gossip that a node for which it holds hints has recovered, it will send the data row corresponding to each hint to the target.</p>
<p>Hinted Handoff has two purposes:</p>
<ul>
<li>It allows the database to offer full write availability when consistency is not required.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></li>
<li>It improves response consistency after temporary outages.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></li>
</ul>
<p>The implications of hinted writes on read & write consistencies are covered in the next section on <a href="#strict--sloppy-quorum">strict & sloppy quorum</a>. Notes on Cassandra’s implementation of hinted handoff can be found <a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesHintedHandoff.html">here</a>.</p>
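<p>The mechanism above can be sketched as a toy model (invented names, not Cassandra’s internals): the coordinator applies the mutation on reachable replicas, stores hints for unreachable ones, and replays the hints when gossip reports recovery.</p>

```python
class Node:
    """A toy replica: a name plus a local key-value table."""
    def __init__(self, name):
        self.name, self.data = name, {}
    def apply(self, mutation: dict):
        self.data.update(mutation)

def coordinator_write(mutation, replicas, alive, hints):
    """Apply `mutation` on reachable replicas; store a hint for the rest.
    `hints` maps an unreachable node -> mutations to replay on recovery."""
    acks = 0
    for node in replicas:
        if node in alive:
            node.apply(mutation)
            acks += 1
        else:
            hints.setdefault(node, []).append(mutation)
    return acks

def on_node_recovered(node, hints):
    """Gossip reports `node` is back: replay its hints and discard them."""
    for mutation in hints.pop(node, []):
        node.apply(mutation)

a, b, c = Node("a"), Node("b"), Node("c")
hints = {}
acks = coordinator_write({"address": "Cubbon"}, [a, b, c], {a, b}, hints)
assert acks == 2 and c.data == {}       # c was down; its write is hinted
on_node_recovered(c, hints)
assert c.data == {"address": "Cubbon"}  # the hint was replayed
```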
<h3 id="strict--sloppy-quorum">Strict & Sloppy Quorum</h3>
<p>Quorum is the number of replicas which should acknowledge a particular read or write operation; it is closely associated with the replication factor. The use of hinted writes to meet consistency requirements in read paths decides whether a quorum is strict or sloppy.</p>
<p>In Dynamo, in scenarios where the number of available replicas is less than the total number of replicas, sloppy quorum is used to ensure read availability. Hinted writes stored on nodes other than the replicas count towards read consistency requirements, so reads can be served even when the number of available replicas falls below the consistency requirement.</p>
<p>In Cassandra, strict quorum is used. This means that hinted writes do not count towards read or write consistency requirements (with the exception of the <a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlConfigConsistency.html#Writeconsistencylevels">ANY write consistency level</a>).</p>
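<p>The distinction boils down to what counts towards the required acknowledgements. A back-of-the-envelope sketch (hypothetical, for illustration): with replication factor N, read quorum R and write quorum W, the classic condition R + W &gt; N guarantees overlap between read and write sets, and sloppy quorum additionally lets hinted stand-in writes count.</p>

```python
def quorum_met(replica_acks: int, hinted_acks: int,
               required: int, sloppy: bool) -> bool:
    """Strict quorum counts only acknowledgements from true replicas;
    sloppy quorum (Dynamo) also counts hinted writes on stand-in nodes."""
    counted = replica_acks + (hinted_acks if sloppy else 0)
    return counted >= required

# N = 3, quorum = 2, one replica down: 1 real ack + 1 hinted ack.
assert quorum_met(1, 1, required=2, sloppy=True)        # Dynamo: succeeds
assert not quorum_met(1, 1, required=2, sloppy=False)   # Cassandra: fails
```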
<h2 id="consistent-hashing">Consistent Hashing</h2>
<p>This article began by saying that distributed databases represent a distillation of years of research in distributed systems. No concept illustrates this better than consistent hashing (or ring hashing). Consistent hashing has been around <a href="https://www.akamai.com/es/es/multimedia/documents/technical-publication/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf">since 1997</a> <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>; it formed the basis of Akamai Technologies, and of the subsequent birth of the Content Distribution Network industry.</p>
<p>In consistent hashing, the output range of a hash function is treated as a fixed circular space or “ring” (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned multiple random values within this space. Each random value is called a <em>vnode</em> position; a single node is associated with multiple <em>vnodes</em>, and consequently with multiple positions on the ring.</p>
<p>Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first <em>vnode</em> with a position larger than the item’s position. The node associated with the <em>vnode</em> is the location of the data item.</p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/DHT-Dynamo.png" alt="Fig. 1: Consistent Hashing" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Fig. 1: Consistent Hashing</span>
</div>
<p>The principal advantage of consistent hashing is incremental stability: the departure or arrival of a node affects only its immediate neighbours on the ring; all other nodes remain unaffected.</p>
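<p>The ring walk described above can be sketched in a few lines (an illustrative toy, with an arbitrary choice of MD5 as the hash function):</p>

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each physical node takes `vnodes` pseudo-random ring positions.
        self._points = sorted((ring_hash(f"{node}:{i}"), node)
                              for node in nodes for i in range(vnodes))
        self._positions = [pos for pos, _ in self._points]

    def locate(self, item_key: str) -> str:
        """Walk clockwise to the first vnode past the item's position;
        the modulo implements the wrap-around of the ring."""
        i = bisect.bisect(self._positions, ring_hash(item_key))
        return self._points[i % len(self._points)][1]

ring = Ring(["node-a", "node-b", "node-c"])
assert ring.locate("address") in {"node-a", "node-b", "node-c"}
assert ring.locate("address") == ring.locate("address")  # deterministic
```

<p>Adding a fourth node inserts only its own vnode positions, so only the keys that now fall immediately before those positions move; everything else stays put.</p>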
<p>However, ring hashing is prone to uneven load distribution. The table below, taken from <a href="https://arxiv.org/pdf/1406.2294.pdf">the paper on Jump Hash</a> (an alternative algorithm to ring hash), shows the standard error in loads & the 99% confidence interval of bucket sizes as multiples of average loads.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Points per Bucket</th>
<th style="text-align: center">Standard Error</th>
<th style="text-align: center">Bucket Size 99% Confidence Interval</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">0.9979060</td>
<td style="text-align: center">(0.005, 5.25)</td>
</tr>
<tr>
<td style="text-align: center">10</td>
<td style="text-align: center">0.3151810</td>
<td style="text-align: center">(0.37, 1.98)</td>
</tr>
<tr>
<td style="text-align: center">100</td>
<td style="text-align: center">0.0996996</td>
<td style="text-align: center">(0.76, 1.28)</td>
</tr>
<tr>
<td style="text-align: center">1000</td>
<td style="text-align: center">0.0315723</td>
<td style="text-align: center">(0.92, 1.09)</td>
</tr>
</tbody>
</table>
<p>This table can be read in the following way: in an implementation with 10 <em>vnodes</em> per node, the standard error of the load is ≈ 0.32 times the average load, and the 99% confidence interval for bucket sizes lies between 0.37 and 1.98 times the average load. Such high variance can make capacity planning tricky (although I am yet to see this in practice with Cassandra). In Cassandra, the number of <em>vnodes</em> is controlled by the parameter <em><a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__num_tokens">num_tokens</a></em>.</p>
<p>Consistent hashing is also a part of the replication strategy in Dynamo-family databases. In Cassandra, <a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archDataDistributeReplication.html">two strategies exist</a>. In <em>SimpleStrategy</em>, a node is anointed as the location of the first replica by using the ring hashing partitioner, and subsequent replicas are placed on the next nodes clockwise on the ring. In <em>NetworkTopologyStrategy</em>, the same steps are performed per datacenter, with one difference: each subsequent replica is placed on the next node clockwise on the ring that belongs to a different rack than the previous replica.</p>
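<p>A sketch of the SimpleStrategy walk (a hypothetical helper, with the rack awareness of NetworkTopologyStrategy omitted): starting from the key’s ring position, collect the next distinct physical nodes clockwise.</p>

```python
import bisect
import hashlib

def replicas_for(key: str, points, rf: int):
    """`points` is the sorted list of (ring position, node) vnode pairs.
    Walk clockwise from the key's position, collecting `rf` distinct
    physical nodes (SimpleStrategy ignores racks and datacenters)."""
    positions = [pos for pos, _ in points]
    key_pos = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = bisect.bisect(positions, key_pos)
    chosen = []
    for i in range(len(points)):
        node = points[(start + i) % len(points)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == rf:
            break
    return chosen

# Tiny hand-built ring: two vnodes for "a", one each for "b" and "c".
# The key's MD5 position exceeds every point, so the walk wraps to the start.
points = [(10, "a"), (20, "b"), (30, "a"), (40, "c")]
assert replicas_for("address", points, rf=2) == ["a", "b"]
```

<p>NetworkTopologyStrategy would additionally skip candidates that share a rack with the previously chosen replica.</p>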
<p>The use of consistent hashing in <a href="http://www.project-voldemort.com/voldemort/design.html">Voldemort</a> and <a href="http://docs.basho.com/riak/kv/2.2.3/learn/concepts/replication/">Riak</a> follow the same above-illustrated pattern. An excellent primer on alternatives to ring hashing & their respective trade-offs can be found <a href="https://medium.com/@dgryski/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8">here</a>.</p>
<h2 id="anti-entropy-using-merkle-trees">Anti-Entropy using Merkle trees</h2>
<p>The distributed nature of data means data in a replica can become inconsistent with other replicas over time. Dynamo-family databases have a multi-pronged approach to deal with it. <a href="#hinted-handoff">Hinted Hand-offs</a> is a strategy in write-paths to handle and repair temporary failures. <a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html">Read repair</a> is a strategy to repair inconsistencies observed in the read-path.</p>
<p>The above two strategies work behind the scenes to repair data; but because they only repair data accessed during reads or writes, they cannot repair every data item in the replicas. They work best if system membership churn is low and node failures are transient. Hence, these databases provide a manual way to trigger repairs of data, called an anti-entropy repair. <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></p>
<p>Anti-entropy is a process of comparing the data of all replicas and updating each replica to the newest version. It relies on <a href="http://en.wikipedia.org/wiki/Merkle_tree">Merkle tree</a> hash exchanges between nodes. The following extract, from <a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesManualRepair.html">Cassandra documentation</a>, is a good overview of the use of Merkle trees in the anti-entropy process.</p>
<blockquote>
<p>Merkle trees are binary hash trees whose leaves are hashes of the individual key values. The leaf of a Cassandra Merkle tree is the hash of a row value. Each Parent node higher in the tree is a hash of its respective children. Because higher nodes in the Merkle tree represent data further down the tree, Cassandra can check each branch independently without requiring the coordinator node to download the entire data set.</p>
</blockquote>
<blockquote>
<p>After the initiating node receives the Merkle trees from the participating peer nodes, the initiating node compares every tree to every other tree. If a difference is detected, the differing nodes exchange data for the conflicting range(s), and the new data is written to SSTables. The comparison begins with the top node of the Merkle tree. If no difference is detected, then the data requires no repair. If any difference is detected, the process proceeds to the left child node and compares and then the right child node. When a node is found to differ, inconsistent data exists for the range that pertains to that node. All data that corresponds to the leaves below that Merkle tree node will be replaced with new data.</p>
</blockquote>
<p>The key difference in Cassandra’s implementation of anti-entropy from Dynamo is that the Merkle trees are built per column family, and they are not maintained for longer than it takes to send them to neighboring nodes.<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup> Instead, the trees are generated as snapshots of the dataset during major compactions: this means that excess data might be sent across the network, but it saves local disk IO, and is preferable for very large datasets.</p>
<p>Notes on Cassandra’s implementation of anti-entropy can be found <a href="https://wiki.apache.org/cassandra/ArchitectureAntiEntropy">here</a>. A more-detailed blog-entry on anti-entropy, with a similar take on comparing Riak, Cassandra & Dynamo as this article, can be found <a href="https://loveforprogramming.quora.com/Distributed-Systems-Part-3-Managing-Anti-Entropy-using-Merkle-Trees">here</a>.</p>
<h2 id="gossip-protocols">Gossip Protocols</h2>
<p>Gossip protocols are a class of peer-to-peer communication protocols inspired by information dissemination in real-life social networks. Dynamo uses a simple gossip-style protocol for decentralized failure detection, which enables each node in the system to learn about the arrival (or departure) of other nodes. Failure detection helps nodes avoid communicating with unresponsive peers during read and write operations. Dynamo’s failure detection protocol is based on <a href="http://www.ict.uom.gr/teaching/distrubutedSite/eceutexas/dist2/papers/On_scalable_and_efficient_distributed_failure_detectors.pdf">Gupta et al. (2001)</a>, although the exact implementation details are obscure.</p>
<p>The Cassandra implementation of gossip is very similar to Dynamo, and we have a lot more information on it via its <a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archGossipAbout.html">documentation</a>. The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster. The nodes exchange information about themselves and about the other nodes that they have gossiped about, so all nodes quickly learn about all other nodes in the cluster. A gossip message has a version associated with it, so that during a gossip exchange, older information is overwritten with the most current state for a particular node.</p>
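<p>The version-based overwrite during a gossip exchange can be sketched as follows (illustrative Python; the shape of the state map is an assumption of mine, not Cassandra's wire format):</p>

```python
def merge_gossip(local, incoming):
    """Merge an incoming gossip digest into local state.
    Both maps are {node: (version, state)}; for each node we keep
    whichever entry carries the higher version, so stale information
    can never overwrite newer information."""
    merged = dict(local)
    for node, (version, state) in incoming.items():
        if node not in merged or version > merged[node][0]:
            merged[node] = (version, state)
    return merged
```

Because the merge is per-node and version-guarded, it does not matter in which order exchanges happen or how many hops a piece of state travelled: every node converges on the newest known state for every other node.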
<p>The gossip process tracks state from other nodes both directly (nodes gossiping directly to it) and indirectly (nodes communicated about secondhand, third-hand, and so on). Rather than have a fixed threshold for marking failing nodes, Cassandra uses an accrual detection mechanism to calculate a per-node threshold that takes into account network performance, workload, and historical conditions. During gossip exchanges, every node maintains a sliding window of inter-arrival times of gossip messages from other nodes in the cluster. Configuring the <a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__phi_convict_threshold">phi_convict_threshold</a> property adjusts the sensitivity of the failure detector.</p>
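<p>A minimal sketch of accrual failure detection (illustrative Python; real implementations fit richer distributions, but an exponential model over the observed mean inter-arrival time captures the idea): phi grows the longer a node stays silent relative to its history, and the node is convicted once phi crosses the configured threshold.</p>

```python
import math
import statistics

class PhiAccrualDetector:
    """Track a sliding window of heartbeat inter-arrival times and
    compute phi = -log10(P(silence lasts at least this long))."""
    def __init__(self, window=100):
        self.window = window
        self.intervals = []
        self.last_arrival = None

    def heartbeat(self, now):
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
            self.intervals = self.intervals[-self.window:]
        self.last_arrival = now

    def phi(self, now):
        # Exponential model: P(no heartbeat for time t) = exp(-t / mean),
        # so phi = -log10(exp(-t / mean)) = t / (mean * ln 10).
        mean = statistics.mean(self.intervals)
        t = now - self.last_arrival
        return t / (mean * math.log(10))
```

With heartbeats arriving every second, a one-second silence yields phi of roughly 0.4, while a 27-second silence yields roughly 11.7, well past Cassandra's default <em>phi_convict_threshold</em> of 8.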
<h2 id="thats-all-folks">That’s All, Folks!</h2>
<p>The scope of topics covered in this article is vast, and I do not believe a single article can do justice to them. However, the attempt here was to compile a compendium on the ideas behind NoSQL distributed systems. A future article will attempt a similar take on <a href="https://en.wikipedia.org/wiki/NewSQL">NewSQL databases</a>, but that’s for another day. Hope you find this useful. Until next time!</p>
<h2 id="notes">Notes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>A fair amount of review-articles exists for the Dynamo paper: a few interesting takes can be found <a href="http://glinden.blogspot.com/2007/10/highly-available-distributed-hash.html">here</a> & <a href="http://muratbuffalo.blogspot.com/2010/11/dynamo-amazons-highly-available-key.html">here</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Riak has last-write-wins as its default mechanism for conflict resolution, in spite of a high chance of data loss. As its maintainers put it,</p>
<blockquote>
<p><a href="http://basho.com/posts/technical/why-vector-clocks-are-hard/">Vector clocks are hard: even with perfect implementation you can’t have perfect information about causality in an open system without unbounded information growth</a>.</p>
</blockquote>
<p><a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>The catch here is ‘when consistency is not required’. In Dynamo, which uses sloppy quorum, hinted writes count towards write consistency requirements, and hence hinted handoff helps offer full write availability.</p>
<p>In Cassandra, <a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlConfigConsistency.html#Writeconsistencylevels">for every write consistency level other than ANY</a>, hinted writes do not count towards write consistency requirements. Hence, full write availability is not guaranteed. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>In a similar vein to the note above, sloppy quorum in Dynamo means that hints stored in nodes other than the replicas count towards read consistency requirements, thus improving read availability.</p>
<p>Cassandra does not use hinted writes in computing <a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlConfigConsistency.html#Readconsistencylevels">read consistency requirements</a>; thus hinted handoffs do not improve read availability. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>This may not be entirely true. <a href="https://info.teradata.com/HTMLPubs/DB_TTU_16_00/index.html#page/Database_Management%2FB035-1094-160K%2Fpho1472240585397.html%23wwID0EEPUO">A form of consistent hashing</a> is claimed to have existed in Teradata since 1986. I could not find any source to back this claim. However, it is clear the <a href="https://www.akamai.com/es/es/multimedia/documents/technical-publication/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf">Akamai paper</a> was the first to name this technique as consistent hashing. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Anti-entropy processes need not always be manual. Riak periodically clears & regenerates the Merkle hash trees used in the anti-entropy process. They term it <a href="http://docs.basho.com/riak/kv/2.2.3/learn/concepts/active-anti-entropy/">active anti-entropy</a>. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>Cassandra does not persist the hash trees generated during anti-entropy.</p>
<p>On the other hand, Riak (which seems to have followed the original Dynamo approach on this) uses persistent, on-disk hash trees instead of in-memory hash trees. The advantages of this approach, <a href="http://docs.basho.com/riak/kv/2.2.3/learn/concepts/active-anti-entropy/">as stated by its maintainers</a>, are twofold:</p>
<p>a) Riak can run anti-entropy operations with a minimal impact on memory usage.</p>
<p>b) Riak nodes can be restarted without needing to rebuild hash trees. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
A Simple Dichotomy for Modeling Data-Intensive Systems2018-08-18T11:11:00+00:00https://sujithjay.com/data-systems/A-Simple-Dichotomy-for-Modelling-Data-Intensive-Systems<h2 id="cut-to-the-chase">Cut to the chase</h2>
<p>Large-scale data processing serves multiple purposes. At a 30,000-foot view, every purpose can be bucketed into two broad categories:</p>
<ul>
<li><strong>Maintaining Materialized Views</strong></li>
<li><strong>Processing Events</strong></li>
</ul>
<p>This categorization is a very high-level one I use to reason about data-system design, and its utility fades fast as we delve deeper into system nitty-gritty. Silos appear within & around each of these buckets as we descend into the implementation of systems, but the categorization remains a useful lens on data-intensive applications.</p>
<p>The basis of this categorization is captured in the following statement:</p>
<blockquote>
<p>Every data system has two variables: data & query. The defining feature of the system is in the temporal nature of these variables. In every data system, either data or query is transient and the other is persistent.</p>
</blockquote>
<p>In a data system maintaining materialized views, data (or more precisely, the view of data) is persistent, and query is a transient entity flowing into & out of the system.</p>
<p>In a data system processing events, query is persistent and transient data flows through the system.</p>
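<p>The two shapes can be sketched in a few lines of illustrative Python (the names here are mine, purely for exposition):</p>

```python
# Materialized view: data persists, queries come and go.
view = {}                                      # persistent state

def on_write(key, value):
    view[key] = value

def query(key):                                # transient: arrives, answers, leaves
    return view.get(key)

# Event processing: a standing query persists, data flows through.
def standing_query(event):                     # persistent predicate
    return event["amount"] > 100

def on_event(event, emit):                     # transient: each event is seen once
    if standing_query(event):
        emit(event)
```

In the first half, `view` outlives every call to `query`; in the second, `standing_query` outlives every event that flows past it.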
<!--break-->
<h2 id="i-like-examples">I like examples</h2>
<p>What are examples of systems which can be reasoned using this simple model?</p>
<p>Every database system can be looked at as a system maintaining materialized views. Data is persistent, by the very definition of a database. It provides a DSL (such as SQL) to query against this persistent data. These queries are transient; once an output is generated against the query, no record is kept of it (except logs of it, perhaps). Some queries mutate data, but that is all right. It still fits the model; we defined data to be persistent, not immutable.</p>
<p>Database triggers are systems processing events. A pattern is stored against a trigger, and every time a new data point satisfies this pattern, a trigger event is generated.</p>
<p>A class of systems which belong to the bucket of systems processing events are <a href="https://en.wikipedia.org/wiki/Complex_event_processing">CEP</a> (Complex Event Processing) systems. In fact, every system which belongs to the bucket of systems processing events can be called a CEP system.</p>
<p>An analytics system performing batch computations, or stream processing, or implementing some form of lambda architecture is an example of a system capable of being modeled as either. The model depends on the vantage point from where you observe the system.</p>
<p>Every statistic, metric, aggregation, and machine-learning model that the system computes is a materialized view into the source data. Thus, if we view the analytics system in conjunction with the system-component storing the materialized views, i.e, from the vantage point of a consumer of the materialized views, the system exhibits the property of persistent data & transient query.</p>
<p>On the other hand, when viewed in disjunction with the component storing the materialized views, it exhibits the property of persistent query and transient data.</p>
<h2 id="why-does-this-dichotomy-exist">Why does this dichotomy exist?</h2>
<p>Data in a system exists either as state or a stream. Martin Kleppmann has a loose analogy to connect states and streams <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. In this analogy, State is defined as the mathematical integration (a cumulative effect) of a stream.</p>
\[\mathsf{\mathbf{state(now) = \int_{t = 0}^{now} stream(t) dt }}\]
<p>Our dichotomy is a direct effect of the two forms of data, and which form is the primary concern of your system. Systems concerned with state fall into the bucket of systems maintaining materialized views; whereas systems concerned with stream are event processing systems. In this sense, we could very well rename our categories as state systems and stream systems (although I feel these names are too generic to have any recall value).</p>
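<p>In the discrete world, the integral becomes a fold: state is what you get by accumulating the stream from the beginning of time. A toy sketch (illustrative Python):</p>

```python
from functools import reduce

# A stream of account events: deposits and withdrawals over time.
stream = [+100, -30, +50, -20]

def state(events):
    """The discrete analogue of state(now) = integral of stream(t) dt:
    fold the stream's cumulative effect into a single value."""
    return reduce(lambda acc, event: acc + event, events, 0)
```

Replaying the stream up to any point in time reconstructs the state at that point, which is exactly the property that event-sourced and log-centric systems exploit.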
<h2 id="why-do-i-need-this-vague-dichotomy">Why do I need this vague dichotomy?</h2>
<p>This dichotomy could form a part of your <a href="https://en.wikipedia.org/wiki/Five_Ws">‘W’ questions</a> when you are designing a data-intensive system: more specifically, I believe it answers the ‘why’ question. Let us take a step back and have a brief look at each of the basic ‘W’ questions we need answered when designing a large-scale data-processing application.</p>
<ul>
<li>
<h3 id="what-is-the-input-to-your-system">What is the input to your system?</h3>
<p>At the outset, we need to define the properties of the input data along the following dimensions:</p>
<ul>
<li><strong>Bounded vs Unbounded</strong></li>
<li><strong>Order</strong></li>
<li><strong>Completeness</strong></li>
</ul>
<p>Tyler Akidau has a very lucid explanation of these concepts in his blog on <a href="https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101">The world beyond batch: Streaming 101</a>.</p>
</li>
<li>
<h3 id="how-is-the-computation-done">How is the computation done?</h3>
<p>Based on the answers of ‘what’, you could now make a choice of ‘how’ your computation will be performed. Two paradigms exist: <strong>Streaming and Batch</strong>. I refer the user again to the above blog by Akidau for a definition of these terms.</p>
</li>
<li>
<h3 id="who-is-the-consumer-of-the-output-of-your-system">Who is the consumer of the output of your system?</h3>
<p>Is your consumer interested in the aggregated state or the processed/enriched stream? The answer to this question seems to closely resemble our dichotomy. Multiple consumers interested in both stream & aggregated state will exist for your system; this is not incoherent. As we have observed in our examples, these multiple consumers are only placed at different vantage points with respect to your system. Thus, defining consumers is an exercise of defining the vantage points to your system.</p>
</li>
<li>
<h3 id="why-is-the-computation-performed">Why is the computation performed?</h3>
<p>The answer to this question is the raison d’être for your system. I believe our dichotomy captures a high-level answer to this question. Also, answering the ‘why’ encompasses every other ‘W’ question; so it helps to <a href="https://en.wikipedia.org/wiki/Start_With_Why">start with why</a>.</p>
</li>
</ul>
<h2 id="related-work">Related Work</h2>
<p>The central tenet of this dichotomy is an old idea; streams and databases have had separate handling and research attention since long. <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
<p>However, the synergy between the two cannot be overstated. Recent state-of-the-art systems, such as <a href="https://kafka.apache.org/">Kafka</a> and <a href="http://samza.apache.org/">Samza</a>, have blurred the distinction between them. Suggested readings include <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> and <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>, both by Jay Kreps, along with <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>, as examples of how stream systems are proving their utility as state systems.</p>
<h2 id="references">References</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Kleppmann, Martin. Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems. “ O’Reilly Media, Inc.”, 2017. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Abadiand, D., et al. “Aurora: A data stream management system.” Proc. ACM SIGMOD. 2003. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Aggarwal, Charu C., ed. Data streams: models and algorithms. Vol. 31. Springer Science & Business Media, 2007. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Kreps, Jay. “It’s Okay To Store Data In Kafka”. Confluent, 2018, <a href="https://www.confluent.io/blog/okay-store-data-apache-kafka/">Link</a>. Accessed 14 Aug 2018. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Kreps, Jay. “Why Local State Is A Fundamental Primitive In Stream Processing”. O’reilly Media, 2018, <a href="https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing">Link</a>. Accessed 15 Aug 2018. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Gray, Jim. “Queues are databases.” arXiv preprint cs/0701158 (2007). <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Understanding Apache Spark on YARN2018-07-24T11:11:00+00:00https://sujithjay.com/spark/Understanding-Apache-Spark-on-Yarn<h2 id="introduction">Introduction</h2>
<p>Apache Spark is a lot to digest; running it on YARN even more so. This article is an introductory reference to understanding Apache Spark on YARN. Since our data platform at <a href="http://logistimo.com">Logistimo</a> runs on this infrastructure, it is imperative you (my fellow engineer) have an understanding of it before you can contribute to it. This article assumes basic familiarity with Apache Spark concepts, and will not linger on discussing them.</p>
<!--break-->
<h2 id="overview-on-yarn">Overview on YARN</h2>
<p>YARN is a generic resource-management framework for distributed workloads; in other words, a cluster-level operating system. Although part of the Hadoop ecosystem, YARN can support a lot of varied compute-frameworks (such as Tez, and Spark) in addition to MapReduce.</p>
<p>The central theme of YARN is the division of resource-management functionalities into a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is the unit of scheduling on a YARN cluster; it is either a single job or a DAG of jobs (a job here could be a Spark job, a Hive query, or any similar construct).</p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/yarn/Yarn-Architecture.gif" alt="Fig. 1: YARN Architecture [1]" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Fig. 1: YARN Architecture [1]</span>
</div>
<p>The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler [1].</p>
<p>The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks [1].</p>
<h2 id="glossary">Glossary</h2>
<p>The first hurdle in understanding a Spark workload on YARN is understanding the various terminology associated with YARN and Spark, and see how they connect with each other. I will introduce and define the vocabulary below:</p>
<h4 id="application">Application</h4>
<p>A Spark application is the highest-level unit of computation in Spark. A Spark application can be used for a single batch job, an interactive session with multiple jobs, or a long-lived server continually satisfying requests. A Spark job can consist of more than just a single map and reduce. On the other hand, a YARN application is the unit of scheduling and resource-allocation. There is a one-to-one mapping between these two terms in the case of a Spark workload on YARN; i.e., a Spark application submitted to YARN translates into a YARN application.</p>
<h4 id="spark-driver">Spark Driver</h4>
<p>To understand the driver, let us divorce ourselves from YARN for a moment, since the notion of driver is universal across Spark deployments irrespective of the cluster manager used.</p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/yarn/Spark-Cluster-Overview.png" alt="Fig. 2: Spark Cluster Overview [4]" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Fig. 2: Spark Cluster Overview [4]</span>
</div>
<p>Spark applications are coordinated by the SparkContext (or SparkSession) object in the main program, which is called the Driver. In plain words, the code initialising SparkContext is your driver. The driver process manages the job flow and schedules tasks and is available the entire time the application is running (i.e, the driver program must listen for and accept incoming connections from its executors throughout its lifetime. As such, the driver program must be network addressable from the worker nodes) [4].</p>
<h4 id="yarn-client">YARN Client</h4>
<p>A program which submits an application to YARN is called a YARN client, as shown in the figure in the <a href="#overview-on-yarn">YARN</a> section. Simple enough.</p>
<p>The notion of driver and how it relates to the concept of client is important to understanding Spark interactions with YARN. In particular, the location of the driver w.r.t the client & the ApplicationMaster defines the deployment mode in which a Spark application runs: YARN client mode or YARN cluster mode.</p>
<p><em>Client mode:</em>
The driver program, in this mode, runs on the YARN client. Thus, the driver is not managed as part of the YARN cluster.</p>
<p>Take note that, since the driver is part of the client and, as mentioned above in the <a href="#spark-driver">Spark Driver</a> section, the driver program must listen for and accept incoming connections from its executors throughout its lifetime, the client cannot exit till application completion.</p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/yarn/Yarn-Client-Mode.png" alt="Fig. 3: YARN Client Mode [2]" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Fig. 3: YARN Client Mode [2]</span>
</div>
<p><em>Cluster mode:</em>
The driver program, in this mode, runs on the ApplicationMaster, which itself runs in a container on the YARN cluster. The YARN client just pulls status from the ApplicationMaster. In this case, the client could exit after application submission.</p>
<p><img class="img-responsive" style="padding-top: 30px; margin-left:auto; margin-right:auto;" src="/public/yarn/Yarn-Cluster-Mode.png" alt="Fig. 4: YARN Cluster Mode [2]" /></p>
<div style="text-align: center;font-family: sans-serif; font-size: 0.9em; padding-bottom: 30px">
<span class="img-responsive" align="center"> Fig. 4: YARN Cluster Mode [2]</span>
</div>
<h4 id="executor-and-container">Executor and Container</h4>
<p>The first fact to understand is: each Spark executor runs as a YARN container [2]. Because the set of executors for an application is fixed, as are the resources allotted to each executor, a Spark application takes up resources for its entire duration. This is in contrast with a MapReduce application, which constantly returns resources at the end of each task and is allotted them again at the start of the next task.</p>
<p>Also, since each Spark executor runs in a YARN container, YARN & Spark configurations have a slight interference effect. I will illustrate this in the next segment.</p>
<h2 id="configuration-and-resource-tuning">Configuration and Resource Tuning</h2>
<p>With our vocabulary and concepts set, let us shift focus to the knobs & dials we have to tune to get Spark running on YARN. We will be addressing only a few important configurations (both Spark and YARN), and the relations between them.</p>
<p>We will first focus on some YARN configurations, and understand their implications, independent of Spark.</p>
<ul>
<li>
<p><em>yarn.nodemanager.resource.memory-mb</em></p>
<p>It is the amount of physical memory, in MB, that can be allocated for containers in a node. This value has to be lower than the memory available on the node.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><name>yarn.nodemanager.resource.memory-mb</name>
<value>16384</value> <!-- 16 GB -->
</code></pre></div> </div>
</li>
<li>
<p><em>yarn.scheduler.minimum-allocation-mb</em></p>
<p>It is the minimum allocation for every container request at the ResourceManager, in MBs. In other words, the ResourceManager can allocate containers only in increments of this value. Thus, this provides guidance on how to split node resources into containers. Memory requests lower than this will throw an InvalidResourceRequestException.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value><!-- 2 GB -->
</code></pre></div> </div>
</li>
<li>
<p><em>yarn.scheduler.maximum-allocation-mb</em></p>
<p>The maximum allocation for every container request at the ResourceManager, in MBs. Memory requests higher than this will throw an InvalidResourceRequestException.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value><!-- 8 GB -->
</code></pre></div> </div>
</li>
</ul>
<p>Thus, in summary, the above configurations mean that the ResourceManager can allocate memory to containers only in increments of <em>yarn.scheduler.minimum-allocation-mb</em>, without exceeding <em>yarn.scheduler.maximum-allocation-mb</em>, and the total allocation on a node cannot be more than the node’s capacity, as defined by <em>yarn.nodemanager.resource.memory-mb</em>.</p>
<p>We will refer to the above statement in further discussions as the <em>Boxed Memory Axiom</em> (just a fancy name to ease the discussions). A similar axiom can be stated for cores as well, although we will not venture forth with it in this article.</p>
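<p>The Boxed Memory Axiom can be captured in a few lines (illustrative Python, using the sample values from the configuration snippets above):</p>

```python
import math

def yarn_container_mb(request_mb, min_alloc_mb=2048, max_alloc_mb=8192):
    """Memory the ResourceManager grants for a container request:
    requests outside (0, maximum-allocation-mb] are rejected, and
    grants are rounded up to a multiple of minimum-allocation-mb."""
    if not 0 < request_mb <= max_alloc_mb:
        raise ValueError("InvalidResourceRequestException")
    return math.ceil(request_mb / min_alloc_mb) * min_alloc_mb
```

A 3 GB request is thus granted a 4 GB container, and the sum of all grants on a node is further capped by <em>yarn.nodemanager.resource.memory-mb</em>.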
<p>Let us now move on to certain Spark configurations. In particular, we will look at these configurations from the viewpoint of running a Spark job within YARN.</p>
<ul>
<li>
<p><em>spark.executor.memory</em></p>
<p>Since every executor runs as a YARN container, it is bound by the Boxed Memory Axiom. However, a common source of confusion among developers is the assumption that the executor container uses a memory allocation equal to <em>spark.executor.memory</em>. In essence, the memory request is equal to the sum of <em>spark.executor.memory</em> + <em>spark.executor.memoryOverhead</em>. Thus, it is this value which is bound by our axiom.</p>
</li>
<li>
<p><em>spark.driver.memory</em></p>
<p>In cluster deployment mode, since the driver runs in the ApplicationMaster which in turn is managed by YARN, this property decides the memory available to the ApplicationMaster, and it is bound by the Boxed Memory Axiom. But as in the case of <em>spark.executor.memory</em>, the actual value which is bound is <em>spark.driver.memory</em> + <em>spark.driver.memoryOverhead</em>.</p>
<p>In client deployment mode, the driver memory is independent of YARN, and the axiom does not apply to it. Instead, it is the value <em>spark.yarn.am.memory</em> + <em>spark.yarn.am.memoryOverhead</em> which is bound by the Boxed Memory Axiom.</p>
</li>
</ul>
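<p>Putting <em>spark.executor.memory</em> and its overhead together (illustrative Python; the default overhead of max(384 MB, 10% of the executor memory) follows the Spark documentation, stated here as an assumption for the versions discussed):</p>

```python
def executor_yarn_request_mb(executor_memory_mb, overhead_mb=None):
    """The memory Spark actually requests from YARN per executor:
    spark.executor.memory plus spark.executor.memoryOverhead. It is
    this sum, not spark.executor.memory alone, that must satisfy the
    Boxed Memory Axiom."""
    if overhead_mb is None:
        # Assumed default: 10% of executor memory, floored at 384 MB.
        overhead_mb = max(384, int(0.10 * executor_memory_mb))
    return executor_memory_mb + overhead_mb
```

A 4 GB executor therefore asks YARN for roughly 4.4 GB, which is what gets rounded and bounded by the scheduler's allocation settings.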
<p>I hope this article serves as a concise compilation of common causes of confusions in using Apache Spark on YARN. More details can be found in the references below. Please leave a comment for suggestions, opinions, or just to say hello. Until next time!</p>
<h2 id="references">References</h2>
<p>[1] “Apache Hadoop 2.9.1 – Apache Hadoop YARN”. hadoop.apache.org, 2018, Available at: <a href="https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">Link</a>. Accessed 23 July 2018.</p>
<p>[2] Ryza, Sandy. “Apache Spark Resource Management And YARN App Models - Cloudera Engineering Blog”. Cloudera Engineering Blog, 2018, Available at: <a href="https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/">Link</a>. Accessed 22 July 2018.</p>
<p>[3] “Configuration - Spark 2.3.0 Documentation”. spark.apache.org, 2018, Available at: <a href="https://spark.apache.org/docs/2.3.0/configuration.html">Link</a>. Accessed 22 July 2018.</p>
<p>[4] “Cluster Mode Overview - Spark 2.3.0 Documentation”. spark.apache.org, 2018, Available at: <a href="https://spark.apache.org/docs/2.3.0/cluster-overview.html">Link</a>. Accessed 23 July 2018.</p>
Shuffle Hash and Sort Merge Joins in Apache Spark2018-06-28T00:00:00+00:00https://sujithjay.com/spark/Shuffle-Hash-and-Sort-Merge-Joins-in-Apache-Spark<h2 id="introduction">Introduction</h2>
<p>This post is the second in my series on Joins in Apache Spark SQL. The <a href="/spark-sql/2018/02/17/Broadcast-Hash-Joins-in-Apache-Spark/">first part</a> explored Broadcast Hash Join; this post will focus on Shuffle Hash Join & Sort Merge Join.</p>
<!--break-->
<p>Although Broadcast Hash Join is the most performant join strategy, it is applicable to a small set of scenarios. Shuffle Hash Join & Sort Merge Join are the true work-horses of Spark SQL; a majority of the use-cases involving joins you will encounter in Spark SQL will have a physical plan using either of these strategies.</p>
<h2 id="mcve"><a href="https://stackoverflow.com/help/mcve">MCVE</a></h2>
<p>Let us take an example to understand the join strategies better. This time we will be using the <a href="https://github.com/OSBI/foodmart-data">Mondrian Foodmart</a> dataset to write our queries against. For those unaware of it, the foodmart dataset is a popular test dataset for OLAP scenarios. It originated as part of the test suite of the Pentaho Mondrian OLAP engine. You can check out its schema layout <a href="https://github.com/julianhyde/foodmart-data-hsqldb/blob/master/foodmart-schema.png">here</a>. We will be concerned with only a couple of tables from the dataset: <em>sales_fact_98</em> & <em>customer</em>.</p>
<p>In Spark REPL, you can create the tables as shown below:
<script src="https://gist.github.com/1a33265eeb3598340722fca3e40fbba2.js?file=Initialise.md"> </script></p>
<p>The <em>explain</em> output on the join-table describes the physical plan of the join operation:
<script src="https://gist.github.com/1a33265eeb3598340722fca3e40fbba2.js?file=ExplainOutput.md"> </script></p>
<p>The Spark SQL planner chooses to implement the join operation using <em>‘SortMergeJoin’</em>. The <a href="https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L90-L120">precedence order</a> for equi-join implementations (as in Spark 2.2.0) is as follows:</p>
<ul>
<li><em><a href="/spark-sql/2018/02/17/Broadcast-Hash-Joins-in-Apache-Spark/">Broadcast Hash Join</a></em></li>
<li><em>Shuffle Hash Join</em>: if the average size of a single partition is small enough to build a hash table.</li>
<li><em>Sort Merge</em>: if the matching join keys are sortable.</li>
</ul>
<h2 id="pick-one-please">Pick One, Please</h2>
<p>There is some confusion over the choice between Shuffle Hash Join & Sort Merge Join, particularly after Spark 2.3. Part of the reason is the introduction of a new configuration, <a href="https://github.com/apache/spark/blob/v2.3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala?utf8=%E2%9C%93#L157-L161">spark.sql.join.preferSortMergeJoin</a>, which is internal and set to true by default.</p>
<p>This means that, whenever the join keys are sortable, Sort Merge is chosen over Shuffle Hash in Spark 2.3.0.</p>
<p>The preference of Sort Merge over Shuffle Hash in Spark is an ongoing discussion which has seen Shuffle Hash going in and out of Spark’s join implementations multiple times. It was first removed from Spark in version <a href="https://issues.apache.org/jira/browse/SPARK-11675">1.6.0</a>. It made a comeback in <a href="https://issues.apache.org/jira/browse/SPARK-13977">2.0.0</a>. In 2.3.0, it was again voted out in favour of Sort Merge.</p>
<p>A reason for the preference of Sort Merge is that it is considered the more robust implementation: Shuffle Hash Join requires the hashed table to fit in memory, whereas Sort Merge Join can spill to disk. But Shuffle Hash does have its benefits, particularly when the build side is much smaller than the stream side. In that case, building a hash table on the smaller side should be faster than sorting the bigger side. And given this clear benefit, I am sure Shuffle Hash will rise again from the ashes.</p>
<h2 id="deep-dive">Deep Dive</h2>
<p>I would like to spend this section on Sort Merge Join alone, since its presence is invariant across Spark versions. The implementation of Sort Merge Join in Spark is similar to that in any other SQL engine, except that it happens over partitions, because of the distributed nature of the data. This means this strategy performs best when the rows corresponding to the same join key are co-located. In every other case, it involves a shuffle operation to co-locate the data. We will have more to say on performance in the <a href="#caveats">Caveats</a> section.</p>
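<p>To make the merge phase concrete, here is a minimal, illustrative Scala sketch (my own, not Spark’s generated code) of merging two partitions that are already sorted on the join key:</p>

```scala
// Illustrative merge step of a sort-merge join over two partitions,
// each already sorted on the join key. Not Spark's actual generated code.
object MergeJoinSketch {
  def mergeJoin[K: Ordering, A, B](
      left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, A, B)] = {
    val ord = implicitly[Ordering[K]]
    val out = Seq.newBuilder[(K, A, B)]
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val c = ord.compare(left(i)._1, right(j)._1)
      if (c < 0) i += 1        // advance the side with the smaller key
      else if (c > 0) j += 1
      else {
        // Keys match: emit all pairs sharing this key, then advance both sides.
        val key = left(i)._1
        val ls = left.drop(i).takeWhile(_._1 == key)
        val rs = right.drop(j).takeWhile(_._1 == key)
        for ((_, a) <- ls; (_, b) <- rs) out += ((key, a, b))
        i += ls.length
        j += rs.length
      }
    }
    out.result()
  }
}
```

<p>Both inputs are consumed in a single forward pass, which is why the strategy needs the sort (and, in the distributed setting, the co-location) up front.</p>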
<p>I have below the code generated by Spark to perform the sort & merge operations. I give it to you without comment.</p>
<script src="https://gist.github.com/1a33265eeb3598340722fca3e40fbba2.js?file=Sort.md"> </script>
<script src="https://gist.github.com/1a33265eeb3598340722fca3e40fbba2.js?file=Merge.md"> </script>
<h2 id="caveats">Caveats</h2>
<p>The performance of Sort Merge Join, as with every distributed join strategy, is optimal under certain conditions and sub-par under certain others.</p>
<h4 id="best-case-scenarios-"><em>Best case scenarios</em> :</h4>
<p>In decreasing order of benefit,</p>
<ol>
<li>The Datasets have a known, shared partitioner; if the Datasets sharing the same partitioner are materialized by the same action, they will end up being co-located.</li>
<li>The Datasets are distributed evenly on the join columns.</li>
<li>The number of keys (combinations of join column values) is adequately large for the cluster and data-size at hand (since parallelism is proportional to the number of unique keys).</li>
</ol>
<h4 id="worst-case-scenarios-"><em>Worst case scenarios</em> :</h4>
<ol>
<li>Extremely uneven sharding of Datasets on the join columns.</li>
<li>A large Dataset is joined with another Dataset, such that a majority of the rows of the larger Dataset are not relevant to the join condition. (In this case, these non-relevant rows of the larger Dataset will still be shuffled across before being filtered out; hence, the performance hit.)</li>
</ol>
<p>Upcoming posts in this series will explore Cartesian Product, Broadcast Nested Loop Joins and others. Tune in for them. Please leave a comment for suggestions, opinions, or just to say hello. Until next time!</p>
How Conversations on StackOverflow Teach You2018-06-24T11:11:00+00:00https://sujithjay.com/How-Conversations-on-StackOverflow-Teach-You<p><img src="/public/ourjob.png" alt="image-title-here" class="img-responsive" /></p>
<p><strong>Note</strong>: This post has some concepts on Scala collections. Do not worry if you have little interest in Scala; the point I am trying to convey has significance beyond my choice of language. This is an exhortation to the engineering community at large to share our learnings more.</p>
<!--break-->
<p>I was working on an open-source ticket recently where I received the following review comment:</p>
<p><em>“IndexedSeq[_] (being backed by scala.collection.Vector) are horrifically inefficient, and we should replace that with a better IndexedSeq that’s just backed by an Array.”</em></p>
<p>The reviewer here is referring to a Collection class ‘<em>immutable.IndexedSeq</em>’. For those new to Scala, <em>IndexedSeq</em> is an interface which provides a Java <em>Array</em>-like API. And like any Java interface, it states this contract without constraints on the running time of its operations.</p>
<p>In other words, it defines methods like ‘<em>head()</em>’, ‘<em>next()</em>’, ‘<em>hasNext()</em>’ etc., but says nothing on how quick the implementations of these need to be. They could range from <em>O(1)</em> to <em>O(N)</em> in the various classes implementing the interface. The default implementation is backed by a <em>Vector</em>, which uses a tree-like structure to perform these operations. So, definitely not <em>O(1)</em>. And that is the point the reviewer is trying to make.</p>
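<p>For concreteness, an <em>Array</em>-backed <em>IndexedSeq</em> with <em>O(1)</em> operations is, in its most minimal form, only a few lines of Scala. This is an illustrative sketch of mine, not the implementation discussed later in the post; arriving at an idiomatic, fully-featured version is the harder problem:</p>

```scala
// A minimal immutable IndexedSeq backed by an Array: apply and length
// are O(1), unlike the default Vector-backed implementation.
// Illustrative sketch only; not the code from the linked PR.
final class ArrayBackedSeq[A](elems: Array[A])
    extends scala.collection.immutable.IndexedSeq[A] {
  private val backing = elems.clone() // defensive copy preserves immutability
  def apply(i: Int): A = backing(i)   // O(1) index
  def length: Int = backing.length    // O(1) size
}
```

<p>Everything else (<em>head</em>, <em>map</em>, iteration, and so on) comes for free from the collections library, but the derived collections fall back to generic builders, which is exactly where the copying problem alluded to below begins.</p>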
<p>I was not aware of the nuances of IndexedSeq and set about learning how it is implemented, and finding an <em>Array</em>-backed implementation. Although I learnt more and more about how <em>IndexedSeq</em> works, I was at a loss on how to provide an <em>Array</em>-backed implementation without copying the entire implementing class from <em>scala.collections</em> and making the relevant changes.</p>
<p>As a resort of the final sorts, I posted a question on StackOverflow. A few people responded via comments and answers; each offering a different world-view. But, what amazed me as I read and responded to their comments and answers is how little I knew on the topic and on some allied topics; how an exercise which lasted less than a few hours (between me posting the question and the final acceptance of an appropriate answer) can be so enriching.</p>
<p>But if we think of it, any conversation on matters we know little about (and would like to know more about) with a community which knows more will turn out to be enriching. Be it StackOverflow, Hacker News, Twitter, Reddit or any other forum. The best use of any such forum is when you cease to be just a consumer of information; but rather indulge in conversations. Tell them what you know; and let them teach you more.</p>
<p>Which brings me to my particular pet peeve. Why do we engineers not use the immediate community we have? Our teams are communities in themselves; each knowing something more about something than the other. How often do we share with each other things we learnt during that day? In this week? In this month? During last sprint or release? You must have learnt something! Something which surprised you; something which brought out an ‘Oh!’. Why would you not share it with everyone around you? Walk up to somebody today, and say ‘Did you know that …?’. And for the greater good, share it with the wider world as well. We would all like to know.</p>
<p>P.S. For those of you interested in the SO post and the original open-source ticket (which is still a work in progress), check out the links below:</p>
<ol>
<li><a href="https://stackoverflow.com/questions/49256315/how-to-implement-an-immutable-indexedseq-in-scala-backed-by-an-array">StackOverflow Post</a></li>
<li><a href="https://github.com/scalanlp/breeze/pull/695">ScalaNLP/Breeze PR</a></li>
</ol>
<p>Also, a thank you to the awesome team at <a href="http://www.commitstrip.com/en/">CommitStrip</a> for producing great comic strips, such as the one at the start of this post. The original strip can be found <a href="http://www.commitstrip.com/en/2018/04/24/thats-our-job/">here</a>.</p>
Concurrency and Parallelism2018-06-20T11:11:00+00:00https://sujithjay.com/Concurrency-vs-Parallelism<p><strong>TL; DR</strong>
This post explores the notion that the definitions of concurrency & parallelism are themselves not language-agnostic. Depending on the language & paradigms we subscribe to, the definitions change.</p>
<!--break-->
<h2 id="introduction">Introduction</h2>
<p>Concurrency is a much-overloaded term in computer science. Every domain of literature in computer science defines its own flavour of what concurrency means to it. To add to the confusion, “Parallelism” and “Multi-programming” are purported as synonyms to concurrency. So much for our simple minds to wrap around.</p>
<p>The intention of this article is to provide a bit of a preface on the notion of concurrency, and compare it with parallelism. Bear with me.</p>
<h2 id="concurrency--parallelism">Concurrency != Parallelism</h2>
<p>Concurrency is the use of multiple threads of control to achieve a program objective. Each thread is interleaved in (execution) space & time, and the interleaving is non-deterministic. Thus, concurrent programs are non-deterministic in general.</p>
<p>The programmer coerces determinism into the program using synchronisation. Concurrency does not imply parallel execution; it is possible even on a single CPU core.</p>
<p>Contrast that with parallelism, which is the condition where a program utilises the multiplicity of computational hardware (several CPU cores, for instance) [1]. The idea here is about computational speedup alone. It does not entail interaction among the program components, nor between the components and external agents (such as a UI component, or the database).</p>
<p>So where does the notion of concurrency == parallelism come from? I would like to quote <a href="https://github.com/simonmar">Simon Marlow</a>, co-developer of the Glasgow Haskell Compiler and author of “Parallel and Concurrent Programming in Haskell”, on this:</p>
<blockquote>
<p>It’s a natural consequence of languages with side-effects: when your language has side-effects everywhere, then any time you try to do more than one thing at a time you essentially have non-determinism caused by the interleaving of the effects from each operation. So in side-effecty languages, the only way to get parallelism is concurrency; it’s therefore not surprising that we often see the two conflated. <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
</blockquote>
<p>In other words, the notion of threads of control (as defined by concurrency) makes sense only in a language with side-effects. In a purely-functional language, there are no effects to observe, and the evaluation order is irrelevant. <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> Thus, parallelism can be achieved without concurrency.</p>
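<p>A small Scala illustration of parallelism without observable concurrency: because the function below is pure, the parallel evaluation of its halves is deterministic, no matter how the underlying threads interleave. (The sketch is mine, using plain <em>Future</em>s; dedicated deterministic-parallelism APIs, such as Haskell’s <em>Eval</em> monad, hide even this plumbing.)</p>

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object PureParallelism {
  // A pure function: no side-effects, so evaluation order is irrelevant.
  def sumOfSquares(xs: Seq[Int]): Long = xs.map(x => x.toLong * x).sum

  // Split the input and evaluate both halves in parallel. Because the
  // tasks share no mutable state, the combined result is deterministic:
  // parallelism without (observable) concurrency.
  def parSumOfSquares(xs: Seq[Int]): Long = {
    val (l, r) = xs.splitAt(xs.length / 2)
    val fl = Future(sumOfSquares(l))
    val fr = Future(sumOfSquares(r))
    Await.result(fl.zip(fr).map { case (a, b) => a + b }, 10.seconds)
  }
}
```

<p>Contrast this with two threads incrementing a shared counter: there, the interleaving of effects is exactly what makes the program concurrent, and non-deterministic without synchronisation.</p>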
<p>On the other hand, in languages with side-effects, parallelism becomes a subset of concurrency. Hence, for instance, concurrency is defined as follows in Java:</p>
<blockquote>
<p>A condition that exists when at least two threads are making progress. A more generalised form of parallelism that can include time-slicing as a form of virtual parallelism <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>
</blockquote>
<p>The understanding of this distinction, within the realm of your language of choice, is essential to develop and reason about concurrent and parallel programs. Until next time!</p>
<h2 id="references">References</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>GHC Mutterings. (2018). Parallelism /= Concurrency. Available at: <a href="https://ghcmutterings.wordpress.com/2009/10/06/parallelism-concurrency/">Link</a> [Accessed 15 Jun. 2018]. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Marlow, S. (2013). Parallel and Concurrent Programming in Haskell. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Oracle Docs. (2018). Defining Multithreading Terms (Multithreaded Programming Guide). Available at: <a href="https://docs.oracle.com/cd/E19455-01/806-5257/6je9h032b/index.html">Link</a> [Accessed 10 Jun. 2018]. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
The Assumption of Normality in Time Series2018-03-10T11:11:00+00:00https://sujithjay.com/The-Assumption-of-Normality-in-Time-Series<p>The notion of normality is oft-encountered in statistics as an underlying assumption to many proofs and results; it is normal to assume normality (pun strongly intended; always wanted to use this one). In many statistical works, the assumption of normality, even if inaccurate, is amortized and ameliorated by the existence of the Central Limit Theorem. Time-series analysis, sadly, does not enjoy this privilege. The assumption of independence, so core to the CLT and other Limit theorems, is poignantly absent in time-series.</p>
<p>This post tries to explain the use of Limit theorems in time-series analysis. As my intended audience comprises computer scientists/engineers, and not statisticians, this post has a long preface on the premise of the problem.</p>
<!--break-->
<h2 id="preface">Preface</h2>
<h3 id="limit-theorems">Limit Theorems</h3>
<p>Limit theorems are a class of theorems in statistics governing the behaviour of sums of stochastic variables. The most widely used limit theorems are the Central Limit Theorem (CLT) and the Law of Large Numbers (LLN). Our focus in this post is on the Central Limit Theorem. CLT, in itself, is a family of theorems rather than one single theorem. But in every form, CLT forms a set of weak-convergence rules around the sum of stochastic variables. We will have a little more to say on weak-convergence in our section on <a href="#convergence">Convergence</a>.</p>
<p>In its more generic form, CLT states that the sum of any number of random variables generated in a stochastic process will be asymptotically distributed according to a small set of <a href="https://en.wikipedia.org/wiki/Attractor">attractor</a> distributions. Notice that this general form of CLT makes no assumption on the independence of the variables; however, it is of scant use in this particular form.</p>
<p>In the case of <a href="https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables"><em>i.i.d.</em></a> (independent and identically distributed) stochastic variables with a finite variance, the attractor distribution is a normal distribution. In other words, if \( x_{i} \) is a sequence of <em>i.i.d.</em> random variables, with \(E[x_{i}] = \mu\), \(\mathrm{Var}(x_{i}) = \sigma^2\), CLT states that,</p>
\[\frac{\sqrt{n}}{\sigma}(\frac{1}{n}\sum_{i=1}^{n} x_{i} - \mu) \Rightarrow \mathcal{N}(0, 1)\]
<p>In practice, how would you use this fact? A clichéd, yet pedagogical, example would be an experiment involving a large number of coin flips. Suppose you choose a sample of <em>n</em> coin flips and count the number of heads you get within the sample. If this exercise of choosing a sample is done over and over, the count of heads you get will follow an approximate normal distribution. As you increase the sample size <em>n</em> asymptotically, the distribution will tend closer and closer to a normal distribution.</p>
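<p>The coin-flip experiment is easy to simulate. The Scala sketch below (the names are my own) draws repeated samples of <em>n</em> flips; by CLT, the resulting head-counts are approximately \(\mathcal{N}(n/2, n/4)\):</p>

```scala
import scala.util.Random

object CoinFlipCLT {
  // Count heads in one sample of n fair coin flips.
  def headsInSample(n: Int, rng: Random): Int =
    (1 to n).count(_ => rng.nextBoolean())

  // Repeat the sampling exercise and return all head-counts.
  // By the CLT, these counts are approximately N(n/2, n/4) for large n.
  def sampleHeadCounts(n: Int, trials: Int, seed: Long): Seq[Int] = {
    val rng = new Random(seed)
    Seq.fill(trials)(headsInSample(n, rng))
  }
}
```

<p>Plotting a histogram of these counts for, say, n = 100 and a few thousand trials gives the familiar bell curve centred at 50.</p>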
<h3 id="convergence">Convergence</h3>
<p>I make a short note here on convergence for completeness. The relevant types of convergence common in statistics and probability are,</p>
<ul>
<li>Almost surely, almost everywhere with probability one, \(w.p. 1\):</li>
</ul>
\[\mathsf{\mathbf{X_n} \xrightarrow{a.s} \mathbf{X} : \mathbb{P} \{ \omega : \lim \mathbf{X_n} = \mathbf{X}\} = 1}\]
<ul>
<li>In probability, in measure:</li>
</ul>
\[\mathsf{\mathbf{X_n} \xrightarrow{p} \mathbf{X} : \lim\limits_{n} \mathbb{P} \{ \omega : |\mathbf{X_n} - \mathbf{X}| > \epsilon\} = 0}\]
<ul>
<li>In distribution, weak convergence (this is the kind of convergence promised by CLT):</li>
</ul>
\[\mathsf{\mathbf{X_n} \xrightarrow{d} \mathbf{X} : \lim\limits_{n} \mathbb{P} ( \mathbf{X_n} \leq {x} ) = \mathbb{P} ( \mathbf{X} \leq {x} )}\]
<p>Convergence <em>almost surely</em> implies convergence <em>in probability</em> which, in turn, implies convergence <em>in distribution</em>. Convergence <em>in distribution</em> only implies convergence <em>in probability</em> if the distribution is a point mass.</p>
<h2 id="normality-assumptions-in-time-series">Normality Assumptions in Time Series</h2>
<p>To understand why the assumption of normality is important in modeling time-series, let us take the case of an <strong>AR(1)</strong> process, a linear first order <a href="https://en.wikipedia.org/wiki/Autoregressive_model">autoregressive process</a>. The following discussion can be extended to other common time-series structures as well. The <strong>AR(1)</strong> structure can be defined as:</p>
\[\mathsf{Y_t = {\phi}Y_{t-1} + Z_t} \tag{2.1}\]
<p>where \(\mathsf{\{Y_t\}}, t = 0, 1,..\) is a first order <a href="http://mathworld.wolfram.com/MarkovProcess.html">Markov process</a> on sample space \(\mathbf{Y} \subseteq \mathbb{R}\) with conditional (transition) density \(\mathsf{p(y_t \mid y_{t-1})}\). \(\mathsf{\phi}\) can take any <em>allowable</em> value such that \(\mathsf{Y} \subseteq \mathbb{R}\) when \(\mathsf{Y}_{t-1} \subseteq \mathbb{R}\). \(\mathsf{Z_t}\) is an <em>i.i.d.</em> sequence with mean \(\mathsf{\lambda}\). More on this can be found at [Grunwald].</p>
<p>The normal <strong>AR(1)</strong> process with mean \(\mathsf{\mu}\) is usually written in terms of a series of white noise variables \(\mathsf{\{E_t\}}\):</p>
\[\mathsf{Y_t - \mu = \phi(Y_{t-1} - \mu) + E_t} \tag{2.2}\]
<p>where \(\mathsf{E_t \sim \mathcal{N}(0, \sigma^2)}\) are <em>i.i.d.</em> and \(\mathsf{\mid \phi\mid < 1}\).</p>
<p>The question is why would you choose to model your time-series as (2.2) over (2.1), even in the face of a lack of evidence of normal behaviour in your data. The reason is convenience.</p>
<p>A feature of normal AR(1) processes is that the marginal distribution is also normal. Thus,</p>
\[\mathsf{Y_t \sim \mathcal{N}(\mu, \frac{\sigma^2}{1 - \phi^2})} \tag{2.3}\]
<p>It is uncommon, in the case of any other distribution, for the conditional and the marginal probabilities to have similar distributions. In fact, in the general case, they do not even possess relatively simple forms. The Normal distribution is a rare exception [Grunwald]. In addition, the parametric calculations corresponding to a particular time-series model using maximum likelihood estimators are simplified with the assumption of normality.</p>
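<p>This convenience is also easy to check numerically. The Scala sketch below (the names are my own) simulates a normal <strong>AR(1)</strong> path as in (2.2) and compares the sample variance against the closed form \(\sigma^2 / (1 - \phi^2)\) from (2.3):</p>

```scala
import scala.util.Random

object AR1Sketch {
  // Simulate n steps of Y_t = mu + phi * (Y_{t-1} - mu) + E_t,
  // with E_t ~ N(0, sigma^2), starting from the marginal mean.
  def simulate(n: Int, phi: Double, mu: Double, sigma: Double, seed: Long): Seq[Double] = {
    val rng = new Random(seed)
    var y = mu
    Seq.fill(n) {
      y = mu + phi * (y - mu) + sigma * rng.nextGaussian()
      y
    }
  }

  // Plain (biased) sample variance; adequate for long paths.
  def variance(xs: Seq[Double]): Double = {
    val m = xs.sum / xs.length
    xs.map(x => (x - m) * (x - m)).sum / xs.length
  }
}
```

<p>With \(\phi = 0.5\) and \(\sigma = 1\), the sample variance of a long path settles near \(4/3\), as (2.3) predicts.</p>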
<p>We will continue seeking the premise for the assumption of normality in time-series analysis in the next post, which will also elaborate on the extensions of Central Limit Theorem to dependent random variables. Until next time!</p>
<h2 id="references">References</h2>
<p>[1] Grunwald, G. K., R. J. Hyndman, and L. M. Tedesco. “A unified view of linear AR (1) models.” (1995). <a href="https://robjhyndman.com/papers/ar1.pdf">[Link]</a></p>
<p>[2] Asymptotic Distributions in Time Series, Statistics 910, Wharton School of the University of Pennsylvania. <a href="http://www-stat.wharton.upenn.edu/~stine/stat910/lectures/11_clt.pdf">[Link]</a></p>
<p>[3] Anna Mikusheva, course materials for 14.384 Time Series Analysis, Fall 2007. MIT OpenCourseWare,
Massachusetts Institute of Technology. <a href="https://ocw.mit.edu/courses/economics/14-384-time-series-analysis-fall-2013/lecture-notes/MIT14_384F13_lec2.pdf">[Link]</a></p>
Broadcast Hash Joins in Apache Spark2018-02-17T00:00:00+00:00https://sujithjay.com/spark/Broadcast-Hash-Joins-in-Apache-Spark<p><img src="/public/jointables.png" alt="image-title-here" class="img-responsive" /></p>
<h2 id="introduction">Introduction</h2>
<p>This post is part of my series on Joins in Apache Spark SQL. Joins are amongst the most computationally expensive operations in Spark SQL. As a distributed SQL engine, Spark SQL implements a host of strategies to tackle the common use-cases around joins.</p>
<p>In this post, we will delve deep and acquaint ourselves better with the most performant of the join strategies, Broadcast Hash Join.</p>
<!--break-->
<h2 id="the-30000-foot-view">The 30,000-foot View</h2>
<p>The Broadcast Hash Join (BHJ) is chosen when one of the Datasets participating in the join is known to be broadcastable. A Dataset is marked as broadcastable if its size is less than <code class="language-plaintext highlighter-rouge">spark.sql.autoBroadcastJoinThreshold</code>. We can explicitly mark a Dataset as broadcastable using broadcast hints (this would override <code class="language-plaintext highlighter-rouge">spark.sql.autoBroadcastJoinThreshold</code>, and hence could result in performance hits or OOMs if the Dataset is too large). In either case, it is broadcast to every executor machine.</p>
<p>Once the broadcasted Dataset is available on an executor machine, it is joined with each partition of the other Dataset. That is, for the values of the join columns for each row (in each partition) of the other Dataset, the corresponding row is fetched from the broadcasted Dataset and the join is performed.</p>
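<p>The build-and-probe mechanics can be sketched in plain Scala (an illustrative model of mine, not Spark’s <em>HashedRelation</em> machinery): the broadcast side is materialised as a hash table once, and each partition of the streamed side probes it locally, with no shuffle of the large side:</p>

```scala
// Illustrative model of the broadcast-hash-join build and probe phases;
// not Spark's actual HashedRelation machinery.
object BroadcastHashJoinSketch {
  // Build side: the broadcasted (small) Dataset, keyed for O(1) lookup.
  def buildHashTable[K, B](small: Seq[(K, B)]): Map[K, Seq[B]] =
    small.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }

  // Probe side: each partition of the large Dataset looks up matches
  // locally against the broadcasted table.
  def probe[K, A, B](partition: Seq[(K, A)], table: Map[K, Seq[B]]): Seq[(K, A, B)] =
    for {
      (k, a) <- partition
      b <- table.getOrElse(k, Seq.empty)
    } yield (k, a, b)
}
```

<p>The asymmetry is the whole point: only the small side is replicated to every executor, while the large side never moves.</p>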
<h2 id="lets-dive">Let’s Dive</h2>
<p>An example would be in order to understand the nuances of Broadcast Hash Join. What better example than the ‘Hello World!’ of SQL: <em>Employee</em> and <em>Department</em> tables. Our <em>Employee</em> table has <em>id</em>, <em>name</em>, and <em>did</em> columns; the <em>Department</em> table has <em>id</em> and <em>name</em> as columns. You know the drill; I will not bother to draw out the tables.</p>
<p>In Spark REPL, you can create the tables, and initiate a join on the tables as shown below:</p>
<script src="https://gist.github.com/6f1d012fe221e7b888f6246896af6bff.js?file=Initialise.md"> </script>
<p>As the action in the last command triggers the lazy evaluation, you will see Spark make a flurry of transformations on the data. We are interested in the bit where the join magic happens. Since the default value for <code class="language-plaintext highlighter-rouge">spark.sql.autoBroadcastJoinThreshold</code> is 10M and the sizes of our datasets are minuscule, BHJ is chosen as the join strategy even without us providing any hints.</p>
<p>The broadcasted object is of type <code class="language-plaintext highlighter-rouge">HashedRelation</code>: either a <code class="language-plaintext highlighter-rouge">LongHashedRelation</code> (when the join key is a Long or an Int) or an <code class="language-plaintext highlighter-rouge">UnsafeHashedRelation</code> (in other cases, such as a String or a Float). The <code class="language-plaintext highlighter-rouge">HashedRelation</code> subtypes are backed by a <code class="language-plaintext highlighter-rouge">LongToUnsafeRowMap</code> or a <code class="language-plaintext highlighter-rouge">BytesToBytesMap</code>, respectively. In our case, since our join column is of String type, an <code class="language-plaintext highlighter-rouge">UnsafeHashedRelation</code> is selected.</p>
<p>The broadcast object is physically sent over to the executor machines using <code class="language-plaintext highlighter-rouge">TorrentBroadcast</code>, which is a BitTorrent-like implementation of <code class="language-plaintext highlighter-rouge">org.apache.spark.broadcast.Broadcast</code>.</p>
<p>The broadcasted object, once available at the executors, is processed by the following generated code where the actual join takes place. I have annotated the code with relevant comments.</p>
<p>The important point to notice, in our case, is that since both our tables are broadcastable (< 10M), <em>Department</em> is chosen for broadcast since it has a smaller estimated physical size. The section on Caveats has an item (item #3) which points to a related case where broadcast hints are explicitly mentioned against both sides. The handling remains the same.</p>
<script src="https://gist.github.com/6f1d012fe221e7b888f6246896af6bff.js?file=BroadcastHashJoin.md"> </script>
<h2 id="caveats">Caveats</h2>
<p>Broadcast Hash Join is performed only under certain circumstances, due to the limitations of broadcasting complete datasets. One of the conditions is, of course, the configuration <code class="language-plaintext highlighter-rouge">spark.sql.autoBroadcastJoinThreshold</code> (which, as we know, can be overridden at one’s own risk). In addition, we have the following caveats:</p>
<ol>
<li>BHJ is not supported for full outer join.</li>
<li>For right outer join, Spark can only broadcast the left side. For left outer, left semi, left anti and the internal join type ExistenceJoin, Spark can only broadcast the right side.</li>
<li>If both sides have broadcast hints (only when the join type is inner-like join), the side with a smaller estimated physical size will be broadcast.</li>
</ol>
<p>All of these stand as of Spark 2.2.0.</p>
<h2 id="you-want-more">You want more?</h2>
<p>A reasonable way to understand the expected behaviour of Broadcast Hash Join is to peruse the test cases against it. You can find them <a href="https://github.com/apache/spark/blob/branch-2.2/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala">here</a>. Also, check out this <a href="https://databricks.com/session/optimizing-apache-spark-sql-joins">talk</a> on optimizing Spark SQL joins.</p>
<p>Upcoming posts in this series will explore Shuffle Hash Joins, Broadcast Nested Loop Joins and others. Tune in for them. Please leave a comment for suggestions, opinions, or just to say hello. Until next time!</p>
<p>P.S. A thank you to <a href="http://angriestprogrammer.com/">angriestprogrammer</a> for the strip at the start of this post. The original strip can be found <a href="http://angriestprogrammer.com/comic/sql_vaudeville">here</a>.</p>
An Early Employee's Field Guide to Workplace Arguments2018-02-03T00:00:00+00:00https://sujithjay.com/An-Early-Employees-Field-Guide-to-Workplace-Arguments<p><strong>TL; DR</strong>
Conflicts are common in an early-stage startup. This post lists a set of mental models an early employee can use to prevent, judge, defuse, and leverage conflicts.</p>
<!--break-->
<h2 id="preface"><em>Preface</em></h2>
<p>Every startup takes pride in its team. In many cases, the team is the only edge a startup possesses and needs. Peter Thiel defines a startup team as the largest collection of people who are privy to an important, contrarian truth. And this team has a shared passion to prove the value of the contrarian truth to the world (and capture some of that value, as they go).</p>
<p>Literature abounds on how to bring such a dream team together, and although still an imperfect science at best, mental frameworks do exist to guide a founder to assemble a passionate core-team. What is missing is a how-to on keeping your core-team together as a cohesive, high-functioning unit. Consider this post to be the first in a series exploring that idea.</p>
<h2 id="herding-cats"><em>Herding Cats</em></h2>
<p>The thing about early employees nobody mentions is they are eccentric. There, I said it! Unconventional at best, maladaptive at worst. But that is why they are here. To prove the world wrong about an important truth. To see a future very few dare imagine. With an unprincipled, yet definite, optimism about themselves, and the world in general.</p>
<p>Such personality traits make it difficult, nigh impossible, to work in cohesion with similarly strong-minded, vocal people. The shared passion does douse the fire (ironic!); but opinions do diverge, arguments do arise, and conflicts do happen. And all of these will (and should?) happen more often than not. Because ‘if everyone is thinking alike, then somebody isn’t thinking’.</p>
<p>Lesson number one in keeping your genius team together is learning to resolve and accept conflicts. Conflict management within startups does not fall squarely on the founders’ shoulders. Early employees are invested in it to a degree where they need to take corrective measures as well. The idea of this post is to provide a set of mental models (wrapped as commandments) to help you deal with conflicts.</p>
<p>In the first post, I will focus on the models an early employee can use to prevent, judge, defuse, and leverage such situations. Subsequent posts in this series will focus on other stakeholders in a startup. (The mental models I suggest are abstract enough to hold sway even in circumstances outside of a startup. Also, I expect a strong degree of overlap between the mental models for each set of stakeholders.)</p>
<p>I reiterate here that the theme of this post is not to discourage people from engaging in arguments, but to help them do it better.</p>
<h2 id="i-will-teach-thee-what-thou-shalt-do"><em>I will teach thee what thou shalt do</em></h2>
<p>Before we begin exploring the core ideas, a small side-note on the presentation style. Commandments are a peculiar way of presenting mental models. Mental models are frameworks to make sense of the world as a free-thinker. Commandments, in their traditional sense, are directives with little scope for free thinking. Still, I believe framing mental models in an imperative fashion adds to its recall value. Who doesn’t love a good catchphrase? So, here we go.</p>
<h3 id="you-are-not-your-idea"><strong>You are not your idea</strong></h3>
<p>Argumentum ad hominem is a well-known logical fallacy wherein a statement is argued against by questioning the motive or character of the entity making the statement, rather than questioning the substance of the statement. This fallacy, in spite of being easy to identify, is well-entrenched in our reasoning. I have seen myself use it often, inadvertently, only to identify it as such in retrospect.</p>
<p>However, here I make a case for its converse. Because this logical fallacy is so entrenched in us, we have a tendency to see it where it does not exist. Every attack on your idea, whether substantial or not, is seen as a personal attack. Rejection of an idea is perceived as a rejection of your entire, unique school of thought. Our ideas are core to our self-identity, and it is particularly painful to detach ourselves from it. However, it is important to consider counter-arguments with objectivity if you wish to gain from the argument. You are not your idea.</p>
<h3 id="praise-in-public-criticize-in-private"><strong>Praise in public, criticize in private</strong></h3>
<p>This is from Management 101, or even Parenting 101. The fact is when you share specifically what a team-mate did great and why you think it is admirable, it has more meaning to them and reinforces the behaviour in the larger team. The person is incentivized to continue the work, and encourage others in the same path. Classic feed-forward behaviour. And it works for you; you are viewed as genuine and downright cool.</p>
<p>On the other hand, private criticism is about being kind. As we will see in ‘Don’t be an asshole’, nobody works well with (you guessed it!) an asshole. Private criticism is also about correcting behaviours without pushing people into being defensive. The greater good lies in subtle course-correction, than in a public slugfest.</p>
<h3 id="dont-be-an-asshole"><strong>Don’t be an asshole</strong></h3>
<p>Robert I. Sutton penned a popular essay “More Trouble Than They’re Worth” in which he talks about “The No Asshole Rule” (This was later the title of his book based on the essay). The theme, which must be already apparent, is that toxic employees are detrimental to an organization in the long run, irrespective of their perceived value as an individual contributor.</p>
<p>Assholes abound, within the workplace and without; and it’s certainly not hard to identify them. A simple litmus test is mentioned in the book:</p>
<ol>
<li>After encountering the person, do people feel oppressed, humiliated or otherwise worse about themselves?</li>
<li>Does the person target people who are less powerful than him/her?</li>
</ol>
<p>Simple enough; because it is based on how another person makes you feel. But it’s hard to identify yourself as one. Who likes to call themselves a jerk? But, as Prof. Sutton says in his essay, ‘assholes are us’. Every one of us can be blamed for that behaviour at some point. The idea is to reflect and identify instances when you have been one, and correct yourself. Easier said than done. A simple trick I use is to note my behaviour when I am in a group where I feel far superior to the rest. If you act like a jerk in such a setting, chances are you are one. Don’t be one.</p>
<h3 id="give-credit"><strong>Give credit</strong></h3>
<p>Andy Grove talks about the proverbial Japanese company in which every employee is seated around a single table. I like to think of this, in the context of a startup, as every one ‘having a seat at the table’; in other words, every one having both the influence and power to make decisions and effect change. What this also entails is that a team of early employees is a flat hierarchy with near-complete visibility.</p>
<p>But complete visibility does not mean that you do not credit people for their ideas or their contributions. The fallacy here is in thinking that it is already public knowledge. As described in ‘Praise in public, criticize in private’, there are immense benefits to public credit. But more so, it is good practice to attribute ideas to their rightful owners even in conversations not involving the original owner. It will help your team now; and it will help your team more when they no longer fit around a single table, or even a single campus.</p>
<h3 id="you-are-not-your-team"><strong>You are not your team</strong></h3>
<p>Even pint-sized organizations are divided into functional teams. Engineering, Sales, and Operations divisions are among the common functional working groups you come across in tech startups. Conflicts within startups are not always inter-personal; functional teams engage in arguments regarding product and company direction, the primacy of their team’s objectives, and a host of other issues. In a way, such arguments thrust the organization forward (sometimes in a direction you, as a member of a particular working group, do not personally favour).</p>
<p>On the other hand, as early employees, everyone in the core team has a personal equation with every other member in it, across functional divides. This personal equation is fostered by the previously stated shared sense of definite optimism and passion.</p>
<p>When teams cross swords, you may find yourself pulled into the argument and forming a judgmental opinion about the other team and its members. I believe there is no harm in forming such opinions about the opposing team, because they form the basis of your team’s rivalry with the other. However, you should check yourself from forming opinions on the members of the other team. <strong>Teams could have orthogonal directions within the organization; employees do not.</strong> At the personal level, your shared passion should be the defining aspect of your relation with co-workers. You are not your team.</p>
<h3 id="call-them-stupid-not-evil"><strong>Call them stupid, not evil</strong></h3>
<p>Hanlon’s razor is a heuristic guide to eliminating unlikely explanations for a person’s behaviour. In the plainest words, it states: ‘Never attribute to malice that which can be adequately explained by stupidity, but don’t rule out malice.’</p>
<p>In many scenarios, the first instinct when trying to ascertain an opposing person’s behaviour (particularly when that behaviour is seen as detrimental to your objectives) is to attribute it to the person’s intentions. This fits our rhetoric since that person (or team) is already viewed as an adversary. This may not be true. Malice and trickery are far less frequent than stupidity and neglect. Attributing malice, where none existed, only leads to sour relations and absurd confusions. In short, call them stupid, not evil.</p>
<h2 id="endnote"><em>Endnote</em></h2>
<p>I do not believe this to be a comprehensive list, nor a coherent one. I think of these commandments as a work in progress, at best. Subsequent posts will look at conflicts within a startup from the perspective of other stakeholders. Still more posts will explore the underlying theme of how culture trumps strategy. You can find them <a href="http://sujithjay.com">here</a>.</p>
Multiple Parameter Lists in Scala2018-01-27T00:00:00+00:00https://sujithjay.com/Multiple-Parameter-Lists-In-Scala<p><strong>Note</strong>: I wrote this article as part of a contribution to Scala Documentation. The original post can be found <a href="http://docs.scala-lang.org/tour/multiple-parameter-lists.html">here</a>.</p>
<p>Methods may define multiple parameter lists. When a method is called with fewer parameter lists than it defines, the call yields a function taking the missing parameter lists as its arguments. This is formally known as <a href="https://en.wikipedia.org/wiki/Currying">currying</a>.</p>
<!--break-->
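<p>To see currying in its simplest form, consider this minimal sketch (the names <code class="language-plaintext highlighter-rouge">add</code> and <code class="language-plaintext highlighter-rouge">addTwo</code> are my own, not from the Scala library): supplying only the first parameter list yields a function awaiting the second.</p>

```scala
object CurryDemo {
  // A method with two parameter lists.
  def add(x: Int)(y: Int): Int = x + y

  def main(args: Array[String]): Unit = {
    // Supplying only the first list yields a function Int => Int.
    val addTwo: Int => Int = add(2) _
    println(addTwo(40)) // 42
  }
}
```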
<p>Here is an example, defined in the <a href="/overviews/collections/trait-traversable.html">Traversable</a> trait from the Scala collections:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foldLeft</span><span class="o">[</span><span class="kt">B</span><span class="o">](</span><span class="n">z</span><span class="k">:</span> <span class="kt">B</span><span class="o">)(</span><span class="n">op</span><span class="k">:</span> <span class="o">(</span><span class="kt">B</span><span class="o">,</span> <span class="kt">A</span><span class="o">)</span> <span class="k">=></span> <span class="n">B</span><span class="o">)</span><span class="k">:</span> <span class="kt">B</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">foldLeft</code> applies a binary operator <code class="language-plaintext highlighter-rouge">op</code> to an initial value <code class="language-plaintext highlighter-rouge">z</code> and all elements of this traversable, going left to right. Shown below is an example of its usage.</p>
<p>Starting with an initial value of 0, <code class="language-plaintext highlighter-rouge">foldLeft</code> here applies the function <code class="language-plaintext highlighter-rouge">(m, n) => m + n</code> to each element in the List and the previous accumulated value.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="nv">numbers</span> <span class="k">=</span> <span class="nc">List</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="mi">2</span><span class="o">,</span> <span class="mi">3</span><span class="o">,</span> <span class="mi">4</span><span class="o">,</span> <span class="mi">5</span><span class="o">,</span> <span class="mi">6</span><span class="o">,</span> <span class="mi">7</span><span class="o">,</span> <span class="mi">8</span><span class="o">,</span> <span class="mi">9</span><span class="o">,</span> <span class="mi">10</span><span class="o">)</span>
<span class="k">val</span> <span class="nv">res</span> <span class="k">=</span> <span class="nv">numbers</span><span class="o">.</span><span class="py">foldLeft</span><span class="o">(</span><span class="mi">0</span><span class="o">)((</span><span class="n">m</span><span class="o">,</span> <span class="n">n</span><span class="o">)</span> <span class="k">=></span> <span class="n">m</span> <span class="o">+</span> <span class="n">n</span><span class="o">)</span>
<span class="nf">print</span><span class="o">(</span><span class="n">res</span><span class="o">)</span> <span class="c1">// 55</span>
</code></pre></div></div>
<p>Multiple parameter lists have a more verbose invocation syntax, and hence should be used sparingly. Suggested use cases include:</p>
<h4 id="single-functional-parameter">Single functional parameter</h4>
<p>In the case of a single functional parameter, like <code class="language-plaintext highlighter-rouge">op</code> in <code class="language-plaintext highlighter-rouge">foldLeft</code> above, multiple parameter lists allow a concise syntax for passing an anonymous function to the method. Without multiple parameter lists, the code would look like this:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">numbers</span><span class="o">.</span><span class="py">foldLeft</span><span class="o">(</span><span class="mi">0</span><span class="o">,</span> <span class="o">{(</span><span class="n">m</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">n</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="k">=></span> <span class="n">m</span> <span class="o">+</span> <span class="n">n</span><span class="o">})</span>
</code></pre></div></div>
<p>Note that the use of multiple parameter lists here also allows us to take advantage of Scala type inference to make the code more concise, as shown below; this would not be possible with a non-curried definition.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">numbers</span><span class="o">.</span><span class="py">foldLeft</span><span class="o">(</span><span class="mi">0</span><span class="o">)(</span><span class="k">_</span> <span class="o">+</span> <span class="k">_</span><span class="o">)</span>
</code></pre></div></div>
<p>It also allows us to fix the parameter <code class="language-plaintext highlighter-rouge">z</code> and pass around a partially applied function for reuse, as shown below:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="nv">numbers</span> <span class="k">=</span> <span class="nc">List</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="mi">2</span><span class="o">,</span> <span class="mi">3</span><span class="o">,</span> <span class="mi">4</span><span class="o">,</span> <span class="mi">5</span><span class="o">,</span> <span class="mi">6</span><span class="o">,</span> <span class="mi">7</span><span class="o">,</span> <span class="mi">8</span><span class="o">,</span> <span class="mi">9</span><span class="o">,</span> <span class="mi">10</span><span class="o">)</span>
<span class="k">val</span> <span class="nv">numberFunc</span> <span class="k">=</span> <span class="nv">numbers</span><span class="o">.</span><span class="py">foldLeft</span><span class="o">(</span><span class="nc">List</span><span class="o">[</span><span class="kt">Int</span><span class="o">]())</span><span class="k">_</span>
<span class="k">val</span> <span class="nv">squares</span> <span class="k">=</span> <span class="nf">numberFunc</span><span class="o">((</span><span class="n">xs</span><span class="o">,</span> <span class="n">x</span><span class="o">)</span> <span class="k">=></span> <span class="n">xs</span><span class="o">:+</span> <span class="n">x</span><span class="o">*</span><span class="n">x</span><span class="o">)</span>
<span class="nf">print</span><span class="o">(</span><span class="nv">squares</span><span class="o">.</span><span class="py">toString</span><span class="o">())</span> <span class="c1">// List(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)</span>
<span class="k">val</span> <span class="nv">cubes</span> <span class="k">=</span> <span class="nf">numberFunc</span><span class="o">((</span><span class="n">xs</span><span class="o">,</span> <span class="n">x</span><span class="o">)</span> <span class="k">=></span> <span class="n">xs</span><span class="o">:+</span> <span class="n">x</span><span class="o">*</span><span class="n">x</span><span class="o">*</span><span class="n">x</span><span class="o">)</span>
<span class="nf">print</span><span class="o">(</span><span class="nv">cubes</span><span class="o">.</span><span class="py">toString</span><span class="o">())</span> <span class="c1">// List(1, 8, 27, 64, 125, 216, 343, 512, 729, 1000)</span>
</code></pre></div></div>
<h4 id="implicit-parameters">Implicit parameters</h4>
<p>To mark certain parameters as <code class="language-plaintext highlighter-rouge">implicit</code>, they must be placed in a parameter list of their own: the <code class="language-plaintext highlighter-rouge">implicit</code> keyword applies to the whole list, which must come last. An example of this is:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">execute</span><span class="o">(</span><span class="n">arg</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)(</span><span class="k">implicit</span> <span class="n">ec</span><span class="k">:</span> <span class="kt">ExecutionContext</span><span class="o">)</span> <span class="k">=</span> <span class="o">???</span>
</code></pre></div></div>
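<p>Since the snippet above leaves the method body unimplemented, here is a self-contained sketch of how resolution works (the <code class="language-plaintext highlighter-rouge">Greeting</code> type and <code class="language-plaintext highlighter-rouge">greet</code> method are invented for illustration): the compiler fills in the implicit parameter list from an <code class="language-plaintext highlighter-rouge">implicit</code> value in scope, and the caller may still supply it explicitly.</p>

```scala
object ImplicitDemo {
  case class Greeting(salutation: String)

  // The second, implicit parameter list is filled in by the compiler.
  def greet(name: String)(implicit g: Greeting): String =
    s"${g.salutation}, $name!"

  implicit val english: Greeting = Greeting("Hello")

  def main(args: Array[String]): Unit = {
    println(greet("Ada"))                  // resolved from `english`
    println(greet("Ada")(Greeting("Hei"))) // passed explicitly
  }
}
```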
Behaviour of ORDER BY in FROM: MariaDB vs MySQL2017-12-01T00:00:00+00:00https://sujithjay.com/Order-By-In-From-Subquery-MariaDB-MySQL<p><strong>TL; DR</strong>
In MariaDB, a query with <code class="language-plaintext highlighter-rouge">ORDER BY</code> in a <code class="language-plaintext highlighter-rouge">FROM</code> subquery produces an unordered result; in effect, <code class="language-plaintext highlighter-rouge">ORDER BY</code> is ignored in <code class="language-plaintext highlighter-rouge">FROM</code> subqueries. MySQL does not ignore <code class="language-plaintext highlighter-rouge">ORDER BY</code> in <code class="language-plaintext highlighter-rouge">FROM</code> subqueries.</p>
<!--break-->
<h2 id="longer-version">Longer Version</h2>
<p>Older versions of MariaDB (< 10.2.0) did not have window functions such as <code class="language-plaintext highlighter-rouge">rank()</code>, <code class="language-plaintext highlighter-rouge">dense_rank()</code>, <code class="language-plaintext highlighter-rouge">row_number()</code> among others. To understand where you would use such a function, <code class="language-plaintext highlighter-rouge">dense_rank()</code> for instance, consider the following example:</p>
<p><em>Given an Employee table and a Department table as shown below, find the employees who earn the top three salaries in each department.</em></p>
<p><strong>Employee Table</strong></p>
<table>
<thead>
<tr>
<th style="text-align: center">Id</th>
<th style="text-align: center">Name</th>
<th style="text-align: center">Salary</th>
<th style="text-align: center">DepartmentId</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">Joe</td>
<td style="text-align: center">70000</td>
<td style="text-align: center">1</td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">Henry</td>
<td style="text-align: center">80000</td>
<td style="text-align: center">2</td>
</tr>
<tr>
<td style="text-align: center">3</td>
<td style="text-align: center">Sam</td>
<td style="text-align: center">60000</td>
<td style="text-align: center">2</td>
</tr>
<tr>
<td style="text-align: center">4</td>
<td style="text-align: center">Max</td>
<td style="text-align: center">90000</td>
<td style="text-align: center">1</td>
</tr>
<tr>
<td style="text-align: center">5</td>
<td style="text-align: center">Janet</td>
<td style="text-align: center">69000</td>
<td style="text-align: center">1</td>
</tr>
<tr>
<td style="text-align: center">6</td>
<td style="text-align: center">Randy</td>
<td style="text-align: center">85000</td>
<td style="text-align: center">1</td>
</tr>
</tbody>
</table>
<p><br /> <br />
<strong>Department Table</strong></p>
<table>
<thead>
<tr>
<th style="text-align: center">Id</th>
<th style="text-align: center">Name</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">IT</td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">Sales</td>
</tr>
</tbody>
</table>
<p><br /></p>
<p>I list three approaches to solving this problem, starting with the easiest one which makes use of <code class="language-plaintext highlighter-rouge">dense_rank()</code>.</p>
<h3 id="the-dense-rank-version">The Dense-Rank Version</h3>
<p>Using <code class="language-plaintext highlighter-rouge">dense_rank()</code>, this can be accomplished using:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT * FROM
(
SELECT d.Name as Department, e.Name as Employee, Salary,
DENSE_RANK()
OVER (PARTITION BY DepartmentId ORDER BY Salary DESC) Rank
FROM Employee e JOIN Department d ON e.DepartmentId = d.id
) t WHERE rank <= 3
ORDER BY Department, Rank;
</code></pre></div></div>
<p><a href="http://dbfiddle.uk/?rdbms=mariadb_10.2&fiddle=cc9082f8d09b48dc7f374456ef817a73">Fiddle Link</a></p>
<p>(Homework Assignment: Why use <code class="language-plaintext highlighter-rouge">dense_rank()</code> instead of <code class="language-plaintext highlighter-rouge">rank()</code>? How does it affect your result?)</p>
<h3 id="no-window-functions">No Window Functions?!</h3>
<p>Things get slightly ugly when you do not have access to window functions. Of course, a workaround could be to use a join and sub-query as shown below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
d.Name AS 'Department', e1.Name AS 'Employee', e1.Salary
FROM
Employee e1 JOIN
Department d ON e1.DepartmentId = d.Id
WHERE
3 > (SELECT COUNT(DISTINCT e2.Salary)
FROM Employee e2
WHERE
e2.Salary > e1.Salary
AND e1.DepartmentId = e2.DepartmentId
);
</code></pre></div></div>
<p>But this is sub-optimal, and we can do better.</p>
<h3 id="session-variables">Session Variables</h3>
<p>Another way would be to use session variables in your queries. This is where the observed behaviours of MariaDB and MySQL part ways. I wrote the following query on a fiddle against MySQL 5.6 and expected it to work on MariaDB. Alas!</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set @did := NULL;
set @rn := 1;
set @sal := NULL;
select `Department`, `Employee`, `Salary` from (
select t.`Department`, t.`Employee`, t.Salary,
@rn:= if(@did = DepartmentId, if(@sal = Salary, @rn, @rn + 1), 1 ) as rank,
@did:= DepartmentId,
@sal:= Salary
from
(
select d.name as `Department`, e.Name as `Employee`, DepartmentId, Salary
from Employee e
inner join Department d on e.DepartmentId = d.Id
order by DepartmentId, Salary desc
) t
) f where rank <= 3;
</code></pre></div></div>
<p><a href="http://sqlfiddle.com/#!9/627639/697">Fiddle Link</a></p>
<p>As you can see, I make use of an ORDER BY clause inside a FROM subquery. MariaDB blatantly ignores it, while MySQL is more gracious. A few wasted hours and some googling thereafter, I realized this difference and more so, find out that this is not a bug on MariaDB; more of a deliberate feature. According to MariaDB <a href="https://mariadb.com/kb/en/library/why-is-order-by-in-a-from-subquery-ignored/">FAQ</a>,</p>
<blockquote>
<p>A “table” (and a subquery in the FROM clause too) is - according to the SQL standard - an unordered set of rows. Rows in a table (or in a subquery in the FROM clause) do not come in any specific order. That’s why the optimizer can ignore the ORDER BY clause that you have specified. In fact, the SQL standard does not even allow the ORDER BY clause to appear in this subquery (we allow it, because ORDER BY … LIMIT … changes the result, the set of rows, not only their order). <br /> <br />You need to treat the subquery in the FROM clause as a set of rows in some unspecified and undefined order, and put the ORDER BY on the top-level SELECT.</p>
</blockquote>
<p>It is a bit of a snag to work with this “feature”, and I am still trying to solve the original problem in MariaDB versions <10.2.0 using session variables. If you have a solution in mind, or have something more to add to this conversation, comment below.</p>
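<p>One avenue worth exploring: the FAQ quoted above notes that MariaDB does honour <code class="language-plaintext highlighter-rouge">ORDER BY</code> in a <code class="language-plaintext highlighter-rouge">FROM</code> subquery when it is paired with <code class="language-plaintext highlighter-rouge">LIMIT</code>, since <code class="language-plaintext highlighter-rouge">ORDER BY … LIMIT</code> changes the result set and not merely its order. The sketch below applies that idea to the session-variable query; treat it as an untested workaround, not a verified solution.</p>

```sql
set @did := NULL;
set @rn := 1;
set @sal := NULL;
select `Department`, `Employee`, `Salary` from (
  select t.`Department`, t.`Employee`, t.Salary,
    @rn := if(@did = DepartmentId, if(@sal = Salary, @rn, @rn + 1), 1) as rank,
    @did := DepartmentId,
    @sal := Salary
  from (
    select d.Name as `Department`, e.Name as `Employee`, DepartmentId, Salary
    from Employee e
    inner join Department d on e.DepartmentId = d.Id
    order by DepartmentId, Salary desc
    -- a very large LIMIT should stop the ORDER BY being optimized away
    limit 18446744073709551615
  ) t
) f where rank <= 3;
```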