Data Systems Articles
- Maintaining Materialized Views
- Processing Events
Integration of Large-Scale Data Processing Systems and Traditional Parallel Database Technology Abouzied, A., Abadi, D.J, Bajda-Pawlikowski, K., Silberschatz, A. (2019, August). Proceedings of the VLDB Vol. 12 (12).
HadoopDB was a prototype built in 2009 as a hybrid SQL system with the features from Hadoop MapReduce framework and parallel database management systems (Greenplum, Vertica, etc). This paper revisits the design choices for HadoopDB, and investigates its legacy in existing data systems. I felt it is a great review paper for the state of modern data analysis systems.
MapReduce is the most famous example in a class of systems which partition large amounts of data over multitude of machines, and provide a straightforward language in which to express complex transformations and analyses. The key feature of these systems is how they abstract out fault-tolerance and partitioning from the user.
MapReduce, along with other large-scale data processing systems such as Microsoft’s Dryad/LINQ project, were originally designed for processing unstructured data.
The success of these systems in processing unstructured data led to a natural desire to also use them for processing structured data. However, the final result was a major step backward relative to the decades of research in parallel database systems that provide similar capabilities of parallel query processing over structured data. 1
The MapReduce model of
Map -> Shuffle -> Reduce/Aggregate -> Materialize is inefficient for parallel structured query processing.
This talk is an introduction to Datomic, by its creator Rich Hickey. My notes on this talk are linked below:
Mesos is a framework I have had recent acquaintance with. The only other resource management framework I have prior experience with is Hadoop YARN. These are my notes as I try to make friends with Mesos (he’s a good boy!), and revisit the concept of cluster resource-management in general... Read More
State-of-the-art distributed databases represent a distillation of years of research in distributed systems. The concepts underlying any distributed system can thus be overwhelming to comprehend. This is truer when you are dealing with databases without the strong consistency guarantee. Databases without strong consistency guarantees come in a range of flavours; but they are bunched under a category called NoSQL databases.
NoSQL databases do not represent a single kind of data model, nor do they offer uniform guarantees regarding consistency and availability. However, they are built on very similar principles and ideas.
From a historical perspective, the advent of NoSQL databases was precipitated by the publication of Dynamo by Amazon1 & BigTable by Google, and the emergence of a number of open-source distributed data stores, which were (improved?) clones of either (or both) of these systems. Bigtable-inspired NoSQL stores are referred to as column-stores (e.g. HyperTable, HBase), whereas Dynamo influenced most of the key/value-stores. We will term these systems loosely as Dynamo-family databases, which include Riak, Aerospike, Project Voldemort, and Cassandra.
I would like to focus on systems design ideas in Dynamo-family NoSQL databases in this article, with a particular focus on Cassandra. The approach of this article is to compare and contrast Cassandra with Dynamo; and in this process, touch upon the underlying ideas. Expect a lot of homework & further readings; I will have copious amounts of references throughout the article... Read More
Cut to the chase
Large-scale data processing serves multiple purposes. At a 30,000-feet view, every purpose can be bucketed into two broad categories:
This categorization is a high, high level one I use to reason about data system design, and its utility fades fast as we delve deeper into system nitty-gritty. Silos appear within & around each of these buckets as we descend into implementation of systems, but it is still a useful one to reason about data-intensive applications.
The basis of this categorization is captured in the following statement:
Every data system has two variables: data & query. The defining feature of the system is in the temporal nature of these variables. In every data system, either data or query is transient and the other is persistent.
In a data system maintaining materialized views, data (or more precisely, the view of data) is persistent, and query is a transient entity flowing into & out of the system.
In a data system processing events, query is persistent and transient data flows through the system... Read More