a stream synchronization scheme based on event-time to pace the parsing of new data and reduce memory consumption,
a query planner which produces streaming join plans that support application updates, and
a stream time estimation scheme that handles the variations on the distribution of event-times observed in real-world streams and achieves high join accuracy.
Providing Streaming Joins as a Service at Facebook. Jacques-Silva, G., Lei, R., Cheng, L., et al. (2018). Proceedings of the VLDB Endowment, 11(12), 1809-1821.
Stream-stream joins are a hard problem to solve at scale. “Providing Streaming Joins as a Service at Facebook” provides us the overview of systems within Facebook to support stream-stream joins.
The key contributions of the paper are:
Trade-offs in Stream-Stream Joins
Stream-stream joins have a 3-way trade-off of output latency, join accuracy, and memory footprint. One extreme of this trade-off is to provide best-effort (in terms of join accuracy) processing-time joins. Another extreme is to persist metadata associated with every joinable event on a replicated distributed store to ensure that all joinable events get matched. This approach provides excellent guarantees on output latency & join accuracy, with memory footprint sky-rocketing for large, time-skewed streams.
The approach of the paper is in the middle: it is best-effort with a facility to improve join accuracy by pacing the consumption of the input streams based on dynamically estimated watermarks on event-time... Read More
A common anti-pattern in Spark workloads is the use of an
or operator as part of a
join. An example of this goes as follows:
val resultDF = dataframe .join(anotherDF, $"cID" === $"customerID" || $"cID" === $"contactID", "left")
This looks straight-forward. The use of an
or within the join makes its semantics easy to understand. However, we should be aware of the pitfalls of such an approach.
The declarative SQL above is resolved within Spark into a physical plan which determines how this particular query gets executed. To view the query plan for the computation, we could do:
resultDF.explain() /* pass true if you are interested in the logical plan of the query as well */ resultDF.explain(true)
This post is the second in my series on Joins in Apache Spark SQL. The first part explored Broadcast Hash Join; this post will focus on Shuffle Hash Join & Sort Merge Join... Read More
This post is part of my series on Joins in Apache Spark SQL. Joins are amongst the most computationally expensive operations in Spark SQL. As a distributed SQL engine, Spark SQL implements a host of strategies to tackle the common use-cases around joins.
In this post, we will delve deep and acquaint ourselves better with the most performant of the join strategies, Broadcast Hash Join... Read More