Sujith Jay Nair Thinking Aloud

Reading List on Data Systems

A list of papers, articles, and online resources I have found essential to understanding data-intensive systems and building new ones. The list is curated and maintained by Sujith Jay Nair (@sujithjay). If you think a paper should be part of this list, please submit a pull request here. I will add it once I have read the paper. Please make sure the subject matter of the paper falls within either i) understanding data systems, or ii) building data systems.

Data systems are defined to include:

  • Database systems
  • Data processing systems

This list is inspired by Reynold Xin’s list on Database Readings, and is a work in progress.

Table of Contents

  1. Consistency and Consensus
  2. Query Processing
  3. State and Stream
  4. Database Design

Consistency and Consensus

Query Processing

State and Stream

  • Data in Flight (2010): Introduces a model of streams as a superset of the relational model. Streams add a notion of time (processing time, in my opinion) to the relational model. I explore a similar idea in this post. In a relational table, the data is persistent and the query is transient; in a stream, the query is persistent and the data is transient.
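
The table/stream contrast above can be sketched in a few lines. This is only an illustration under my own assumptions: the function names (`run_query`, `continuous_query`, `is_large`) and the sample values are hypothetical and do not come from the article.

```python
from typing import Callable, Iterable, Iterator, List

Predicate = Callable[[int], bool]


def run_query(table: List[int], predicate: Predicate) -> List[int]:
    """Table side: the data is stored; the query runs once and is discarded."""
    return [row for row in table if predicate(row)]


def continuous_query(stream: Iterable[int], predicate: Predicate) -> Iterator[int]:
    """Stream side: the query is installed once; each record is seen once and not stored."""
    for record in stream:
        if predicate(record):
            yield record


def is_large(value: int) -> bool:
    return value > 5


# Persistent data, transient query.
table = [3, 8, 1, 9]
snapshot = run_query(table, is_large)

# Persistent query, transient data: records are consumed as they arrive.
arrivals = iter([3, 8, 1, 9])
standing_query = continuous_query(arrivals, is_large)
streamed = list(standing_query)
```

Both calls select the same records, but `table` can be queried again afterwards, while `arrivals` is exhausted once the standing query has seen it.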

Database Design

  • Dynamo: Amazon’s Highly Available Key-value Store (2007): This paper on Dynamo (not to be confused with DynamoDB, which is ‘built on the principles of Dynamo’) is an excellent primer on the concepts behind highly available storage systems, such as consistent hashing, sloppy quorums, anti-entropy processes, and gossip.

  • Cassandra - A Decentralized Structured Storage System (2009): Cassandra is one of many data storage systems heavily influenced by Dynamo. However, important differences exist. I have written about them in this post.
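
Consistent hashing, the first concept named in the Dynamo entry above, can be sketched as a sorted ring of hashed positions. This is a minimal illustration, not Dynamo's or Cassandra's implementation: the `HashRing` class, its method names, and the choice of 8 virtual nodes per physical node are my own assumptions.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def ring_position(label: str) -> int:
    """Map a string to a position on the ring using an MD5 digest."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 8) -> None:
        self._ring: List[Tuple[int, str]] = []  # kept sorted by position
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node: str, vnodes: int = 8) -> None:
        # Each physical node occupies several positions to even out load.
        for i in range(vnodes):
            bisect.insort(self._ring, (ring_position(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        """Return the node at the first position clockwise from the key."""
        if not self._ring:
            raise ValueError("ring has no nodes")
        positions = [pos for pos, _ in self._ring]
        index = bisect.bisect_right(positions, ring_position(key))
        return self._ring[index % len(self._ring)][1]  # wrap around the ring


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(200)]
owners_before = {key: ring.lookup(key) for key in keys}
ring.remove_node("node-c")
owners_after = {key: ring.lookup(key) for key in keys}
```

The property worth noticing is that removing `node-c` only reassigns the keys it owned; every key owned by `node-a` or `node-b` stays where it was.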