luohao
5/14/2019 - 7:04 PM

Presto query cache discussion

Problem definition:

  • [Storage]: determining how to store and index query results.
  • [Maintenance]: efficiently invalidating query results in the cache when the underlying data changes.
  • [Exploitation]: making efficient use of cached query results to speed up query processing (with optimizer rules that rewrite the plan to use a cached result).

Martin's 4 steps to query result caching

1. Compute a digest of the plan. Clients can then do a conditional execution (similar to a conditional HTTP GET with an ETag), so they can do caching on their own
2. Cache results on the Presto side
3. Cache partial queries (say, results or aggregations or anything that could be materialized and saved)
4. Full support for materialized query tables and rewrites
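
A minimal sketch of step 1, assuming the plan can be serialized to a canonical string. PlanDigest, digest and executeIfChanged are hypothetical names, not Presto APIs. The digest covers only the plan text, so detecting data changes (the [Maintenance] problem above) is not handled here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names are Presto APIs.
final class PlanDigest {
    private PlanDigest() {}

    // SHA-256 over a canonical text form of the plan, hex-encoded.
    // How the plan gets canonicalized is assumed to be solved elsewhere.
    static String digest(String canonicalPlan) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] hash = sha256.digest(canonicalPlan.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
        catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // Conditional execution, in the spirit of an HTTP conditional GET:
    // an empty result means "not modified", so the client can reuse its own copy.
    static Optional<String> executeIfChanged(
            String canonicalPlan, String clientDigest, Supplier<String> execute) {
        if (digest(canonicalPlan).equals(clientDigest)) {
            return Optional.empty();
        }
        return Optional.of(execute.get());
    }
}
```

A client would keep the digest returned with its first result and send it back on the next request. The query is executed again only when the digests differ.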

My thoughts on how to proceed

1. Support materialized query result tables
2. Support optimizer rewrites with materialized query result tables
3. Support materialized view and optimizer rewrites

Materialized query result tables

  1. use a connector to store the results
  2. rewrite the plan in PlanFragmenter, insert a pair of TableScan and TableWriter below Output.

Support optimizer rewrites with materialized query result tables

  1. introduce a new connector SPI that checks whether the data scanned by a TableScan has changed
  2. add an optimizer rule that rewrites the plan when it finds a match (plan matches && no changes in data)
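
A rough sketch of the match condition in the two steps above (plan matches && no changes in data). DataVersionProvider is a hypothetical stand-in for the proposed connector SPI, and ResultCacheMatcher, remember and match are illustrative names only. It assumes each scanned table exposes some version number that changes whenever its data changes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; not Presto's optimizer or SPI.
final class ResultCacheMatcher {
    // Stand-in for the proposed SPI: reports a version that changes with the data.
    interface DataVersionProvider {
        long currentVersion(String tableName);
    }

    private static final class CachedResult {
        final String resultTable;
        final Map<String, Long> versionsAtWrite;

        CachedResult(String resultTable, Map<String, Long> versionsAtWrite) {
            this.resultTable = resultTable;
            this.versionsAtWrite = versionsAtWrite;
        }
    }

    private final Map<String, CachedResult> resultsByPlanKey = new HashMap<>();
    private final DataVersionProvider versions;

    ResultCacheMatcher(DataVersionProvider versions) {
        this.versions = versions;
    }

    // Record a materialized result together with the versions of the tables it scanned.
    void remember(String planKey, String resultTable, List<String> scannedTables) {
        Map<String, Long> snapshot = new HashMap<>();
        for (String table : scannedTables) {
            snapshot.put(table, versions.currentVersion(table));
        }
        resultsByPlanKey.put(planKey, new CachedResult(resultTable, snapshot));
    }

    // Return the result table only if the plan matches and no scanned table changed.
    Optional<String> match(String planKey) {
        CachedResult cached = resultsByPlanKey.get(planKey);
        if (cached == null) {
            return Optional.empty();
        }
        for (Map.Entry<String, Long> recorded : cached.versionsAtWrite.entrySet()) {
            long now = versions.currentVersion(recorded.getKey());
            if (now != recorded.getValue().longValue()) {
                return Optional.empty();
            }
        }
        return Optional.of(cached.resultTable);
    }
}
```

On a hit, the rule would replace the matched subplan with a TableScan over the returned result table. On a miss, the plan is left unchanged.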

Support materialized view and optimizer rewrites

  1. add syntax support
  2. add SPI to support storage of materialized view
  3. modify the optimizer rules to support materialized view

Design choices

global vs. local to a cluster/coordinator

how do we store the cached query info (e.g., plan)

build a new connector vs. use an existing connector (e.g., hive)

how do we implement the optimizer rule to match the plan with stored materialized query result tables / materialized views

equivalent substitution vs. containment substitution

when the cached query result contains all rows required by the requested plan, we can build a filter/aggregation on top of the cached result. The final rewrite is still an equivalent rewrite.
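
A toy sketch of the two kinds of substitution, assuming the only predicate is a closed range on a single column and plans are written as plain strings. ContainmentRewrite, Range and rewrite are hypothetical names, and a real rule would have to handle arbitrary predicates and aggregations.

```java
import java.util.Optional;

// Hypothetical sketch; plans are plain strings purely for illustration.
final class ContainmentRewrite {
    // Closed range predicate on one column: low <= column <= high.
    static final class Range {
        final long low;
        final long high;

        Range(long low, long high) {
            this.low = low;
            this.high = high;
        }

        boolean sameAs(Range other) {
            return low == other.low && high == other.high;
        }

        boolean contains(Range other) {
            return low <= other.low && other.high <= high;
        }
    }

    private ContainmentRewrite() {}

    static Optional<String> rewrite(String cachedTable, String column, Range cached, Range requested) {
        String scan = "TableScan(" + cachedTable + ")";
        if (cached.sameAs(requested)) {
            // Equivalent substitution: the cached result is exactly what was asked for.
            return Optional.of(scan);
        }
        if (cached.contains(requested)) {
            // Containment substitution: the cached rows are a superset, so filter them down.
            return Optional.of("Filter(" + column + " BETWEEN " + requested.low
                    + " AND " + requested.high + ", " + scan + ")");
        }
        // The cached result is missing rows, so it cannot be used.
        return Optional.empty();
    }
}
```

For example, with a cached range of 0 to 100 on price, a request for 10 to 20 becomes a Filter over the cached table, while a request for 50 to 150 cannot be answered from the cache.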