Sujith Jay Nair Thinking Aloud

Data Systems Articles

Subscribe for an infrequent newsletter.

    S3 and HDFS

    Cluster storage systems have, over the past decade, moved their gold standards from directory-oriented file-systems such as HDFS to object-stores such as AWS S3. The two storage models have been dissected & compared over & again from multiple perspectives 1 2. Again, based on your use-case, you might be more interested in a certain cross-section of differences between S3 & HDFS than other differences. I am not trying here to repeat the analyses.

    I wrote this short, bullet-style compilation as a quick refresher for myself on ways S3 differs from HDFS; it is focused on the APIs & interactions Hadoop-like data-processing systems (such as Hadoop, Spark, or Flink ) might have with storage systems.

    .. Read More

    Providing Streaming Joins as a Service at Facebook

    Providing Streaming Joins as a Service at Facebook. Jacques-Silva, G., Lei, R., Cheng, L., et al. (2018). Proceedings of the VLDB Endowment, 11(12), 1809-1821.

    Stream-stream joins are a hard problem to solve at scale. “Providing Streaming Joins as a Service at Facebook” provides us the overview of systems within Facebook to support stream-stream joins.

    The key contributions of the paper are:

    1. a stream synchronization scheme based on event-time to pace the parsing of new data and reduce memory consumption,

    2. a query planner which produces streaming join plans that support application updates, and

    3. a stream time estimation scheme that handles the variations on the distribution of event-times observed in real-world streams and achieves high join accuracy.

    Trade-offs in Stream-Stream Joins

    Stream-stream joins have a 3-way trade-off of output latency, join accuracy, and memory footprint. One extreme of this trade-off is to provide best-effort (in terms of join accuracy) processing-time joins. Another extreme is to persist metadata associated with every joinable event on a replicated distributed store to ensure that all joinable events get matched. This approach provides excellent guarantees on output latency & join accuracy, with memory footprint sky-rocketing for large, time-skewed streams.

    The approach of the paper is in the middle: it is best-effort with a facility to improve join accuracy by pacing the consumption of the input streams based on dynamically estimated watermarks on event-time.

    .. Read More

    A Conversation with Software Engineering Daily

    Nubank Data Engineering with Sujith Nair

    I was recently on the Software Engineering Daily podcast to talk about Data Engineering at Nubank.

    It turned to be a great conversation on functional data engineering, the importance of testability & reproducibility in data engineering (and our approach to achieving it at scale at Nubank), thinking of dataset quality in terms of dataset-as-a-service, and my take on the history of data engineering as a rediscovery of the table abstraction. Check it out here.

    Mesos, in the light of Omega

    image-title-here Mesos is a framework I have had recent acquaintance with. We use it to manage resources for our Spark workloads. The other resource management framework for Spark I have prior experience with is Hadoop YARN. In this article, I revisit the concept of cluster resource-management in general, and explain higher-level Mesos abstractions & concepts. To this end, I borrow heavily the classification of cluster resource-management systems from the Omega paper.

    The Omega system is considered one of the precusors to Kubernetes. There is a fine article in ACM Queue describing this history. Also, Brian Grant has some rare insights into the evolution of cluster managers in Google from Omega to Kubernetes in multiple tweet-storms, such as this and this.

    .. Read More

    Integration of Large-Scale Data Processing Systems and Traditional Parallel Database Technology

    Integration of Large-Scale Data Processing Systems and Traditional Parallel Database Technology Abouzied, A., Abadi, D.J, Bajda-Pawlikowski, K., Silberschatz, A. (2019, August). Proceedings of the VLDB Vol. 12 (12).

    HadoopDB was a prototype built in 2009 as a hybrid SQL system with the features from Hadoop MapReduce framework and parallel database management systems (Greenplum, Vertica, etc). This paper revisits the design choices for HadoopDB, and investigates its legacy in existing data systems. I felt it is a great review paper for the state of modern data analysis systems.

    MapReduce is the most famous example in a class of systems which partition large amounts of data over multitude of machines, and provide a straightforward language in which to express complex transformations and analyses. The key feature of these systems is how they abstract out fault-tolerance and partitioning from the user.

    MapReduce, along with other large-scale data processing systems such as Microsoft’s Dryad/LINQ project, were originally designed for processing unstructured data.

    The success of these systems in processing unstructured data led to a natural desire to also use them for processing structured data. However, the final result was a major step backward relative to the decades of research in parallel database systems that provide similar capabilities of parallel query processing over structured data. 1

    The MapReduce model of Map -> Shuffle -> Reduce/Aggregate -> Materialize is inefficient for parallel structured query processing.

    .. Read More

    Datomic with Rich Hickey

    This talk is an introduction to Datomic, by its creator Rich Hickey. My notes on this talk are linked below:

    Dynamo vs Cassandra : Systems Design of NoSQL Databases

    State-of-the-art distributed databases represent a distillation of years of research in distributed systems. The concepts underlying any distributed system can thus be overwhelming to comprehend. This is truer when you are dealing with databases without the strong consistency guarantee. Databases without strong consistency guarantees come in a range of flavours; but they are bunched under a category called NoSQL databases.

    NoSQL databases do not represent a single kind of data model, nor do they offer uniform guarantees regarding consistency and availability. However, they are built on very similar principles and ideas.

    From a historical perspective, the advent of NoSQL databases was precipitated by the publication of Dynamo by Amazon1 & BigTable by Google, and the emergence of a number of open-source distributed data stores, which were (improved?) clones of either (or both) of these systems. Bigtable-inspired NoSQL stores are referred to as column-stores (e.g. HyperTable, HBase), whereas Dynamo influenced most of the key/value-stores. We will term these systems loosely as Dynamo-family databases, which include Riak, Aerospike, Project Voldemort, and Cassandra.

    I would like to focus on systems design ideas in Dynamo-family NoSQL databases in this article, with a particular focus on Cassandra. The approach of this article is to compare and contrast Cassandra with Dynamo; and in this process, touch upon the underlying ideas. Expect a lot of homework & further readings; I will have copious amounts of references throughout the article.

    .. Read More

    A Simple Dichotomy for Modeling Data-Intensive Systems

    Cut to the chase

    Large-scale data processing serves multiple purposes. At a 30,000-feet view, every purpose can be bucketed into two broad categories:

    • Maintaining Materialized Views
    • Processing Events

    This categorization is a high, high level one I use to reason about data system design, and its utility fades fast as we delve deeper into system nitty-gritty. Silos appear within & around each of these buckets as we descend into implementation of systems, but it is still a useful one to reason about data-intensive applications.

    The basis of this categorization is captured in the following statement:

    Every data system has two variables: data & query. The defining feature of the system is in the temporal nature of these variables. In every data system, either data or query is transient and the other is persistent.

    In a data system maintaining materialized views, data (or more precisely, the view of data) is persistent, and query is a transient entity flowing into & out of the system.

    In a data system processing events, query is persistent and transient data flows through the system.

    .. Read More