Skew Strikes Back
02 Sep 2020 •NOTES
Introduction
Suppose that one is given a graph with N edges, how many distinct triangles can there be in the graph?
The two conceptual ideas of the paper:
 a theoretical basis for skew based on geometry: and a family of algorithms which provide optimal ways of avoiding skew.
 to challenge the database dogma of doing ‘one join at a time’: there are classes of query for which any project joinproject plan will be slower than the optimal runtime by a polynomial factor in data size.
History of Join Processing
The major approaches to join processing are: using structural information of the query and using cardinality information; the AGM bounds bring together both sets of information. ^{1}
Structural Approaches
 Algorithms rely on a structural property of a query such as acyclicity or bounded width to process joins in polynomial time.
 For example, in acyclic query (acyclic iff it has a join tree. GYOreduction), Yannakakis algorithm runs in ~linear time in input + output size. ^{2}
 Width of a query (qw) is a measure of how far a query is from being acyclic. Acyclic queries have qw=1.
 A family of algorithms exist which state that if the width is bounded by a constant, the query is tractable (a polynomialtime algorithm exists which can evaluate it).
Cardinalitybased Approaches
(Structural) decomposition methods focus “only” on structural features, while they completely disregard “quantitative” aspects of the query, that may dramatically affect the queryevaluation time.
 The run time of structural approaches is \({O(N^{w+1} log N)}\), where N is the input size and w is the width measure.
 Commercial databases place little emphasis on the structural property of the queries and tremendous emphasis on the cardinality aspect; an example is how commercial databases treat complex multiway joins as a series of pairwise joins. Such any joinproject plans are destined to be slower than the optimal plans.
Bridging the Gap
AGM defines a tight bound on the output size of a join query as a function of input sizes. This width, in turn, leads to the notion of fractional query number
and fractional hypertree width
(fhw
). fhw
is proven to be less than or equal to every other width measure. Leapfrog Triejoin is an example implementation of this join.
Triangle Query
Notes

The standard in database theory is to measure run time of join algorithms in terms of input data, assuming the input query is fixed. ↩