5/14/2019 - 7:04 PM

Presto query cache discussion


Problem definition:

[Storage]: determining how to store and index query results.
[Maintenance]: efficiently invalidating query results in the cache when the underlying data changes.
[Exploitation]: making efficient use of cached queries to speed up query processing (with optimizer rules that rewrite the plan to use cached results).

Martin's 4 steps to query result caching

1. Compute a digest of the plan. Clients can then do a conditional execution (similar to a conditional HTTP GET with an ETag), so they can cache results on their own
2. Cache results on the Presto side
3. Cache partial queries (say, results or aggregations or anything that could be materialized and saved)
4. Full support for materialized query tables and rewrites
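
Step 1 can be sketched roughly as follows. This is a minimal illustration, not Presto code: the plan is assumed to be a JSON-serializable dict, and `plan_digest`, `execute_conditionally`, and the `run_query` callback are hypothetical names. Note that the digest covers only the plan; whether the underlying data has changed is a separate check (the [Maintenance] problem).

```python
import hashlib
import json


def plan_digest(plan: dict) -> str:
    """Hash a canonical JSON form of the plan so key order does not matter."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def execute_conditionally(plan: dict, client_digest, run_query):
    """Skip execution when the client already holds a result for this digest.

    Mirrors a conditional HTTP GET: the client sends the digest it cached
    (like If-None-Match), and gets "not_modified" back if it still matches.
    """
    digest = plan_digest(plan)
    if client_digest == digest:
        return {"status": "not_modified", "digest": digest, "rows": None}
    return {"status": "ok", "digest": digest, "rows": run_query(plan)}
```

A client would keep the returned digest next to its cached rows and send it back on the next request for the same query.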

My thoughts about progressing

1. Support materialized query result tables
2. Support optimizer rewrites with materialized query result tables
3. Support materialized view and optimizer rewrites

Materialized query result tables

use a connector to store the results
rewrite the plan in PlanFragmenter: insert a TableScan and TableWriter pair below Output.
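
One way to picture the rewrite, using a toy plan tree rather than Presto's real plan classes. `PlanNode` and `materialize_below_output` are hypothetical, and the exact arrangement is an assumption: here the TableWriter persists the original subplan into the result table, and the TableScan reads that table back to feed Output. The scan is modelled as the writer's parent only to express ordering; a real planner would likely place them in separate fragments.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    kind: str                                   # e.g. "Output", "TableScan"
    children: List["PlanNode"] = field(default_factory=list)
    table: str = ""                             # used by scan/writer nodes


def materialize_below_output(root: PlanNode, result_table: str) -> PlanNode:
    """Return a new plan: Output -> TableScan -> TableWriter -> original subplan."""
    if root.kind != "Output":
        raise ValueError("expected an Output node at the root")
    # The writer stores the rows produced by the original subplan.
    writer = PlanNode("TableWriter", children=root.children, table=result_table)
    # The scan reads the stored rows back so Output still sees the same result.
    scan = PlanNode("TableScan", children=[writer], table=result_table)
    return PlanNode("Output", children=[scan])
```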

Support optimizer rewrites with materialized query result tables

introduce a new connector SPI that validates whether the data scanned by a TableScan has changed
add an optimizer rule to rewrite the plan if it finds a match (plan matches && no changes in data)
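
The match condition (plan matches && no changes in data) could look roughly like this. Everything here is an assumption made for illustration: the cache is a dict keyed by plan digest, each entry records a per-table version token captured when the result was stored, and `current_version` stands in for the proposed connector SPI call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CacheEntry:
    result_table: str               # where the materialized result lives
    table_versions: Dict[str, str]  # source table -> version token at store time


def rewrite_with_cache(
    digest: str,
    cache: Dict[str, CacheEntry],
    current_version: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the result table to scan instead, or None if no rewrite applies."""
    entry = cache.get(digest)
    if entry is None:
        return None  # no stored result for this plan
    for table, stored in entry.table_versions.items():
        if current_version(table) != stored:
            return None  # source data changed since the result was stored
    return entry.result_table
```

Returning `None` leaves the plan untouched, so a stale or missing entry costs only the lookup.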

Support materialized view and optimizer rewrites

add syntax support
add SPI to support storage of materialized view
modify the optimizer rules to support materialized view

Design choices

global vs. local to a cluster/coordinator

how do we store the cached query info (e.g., plan)

build a new connector vs. use an existing connector (e.g., hive)

how do we implement the optimizer rule to match the plan with stored materialized query result tables / materialized views

equivalent substitution vs. containment substitution

When the cached query result contains all rows required by the requested plan, we can build a filter/aggregation on top of the cached result. The final rewrite is still an equivalent rewrite.
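
A small worked case of containment, assuming both the cached and the requested query are of the form `price > bound` over the same table. The function name and the row layout are hypothetical; real predicate containment is much more general than comparing one bound.

```python
from typing import Dict, List, Optional


def answer_from_cache(
    cached_bound: float,
    cached_rows: List[Dict],
    requested_bound: float,
) -> Optional[List[Dict]]:
    """Answer `price > requested_bound` from rows cached for `price > cached_bound`."""
    if requested_bound < cached_bound:
        return None  # the cache is missing rows the request needs
    # Containment holds, so a residual filter on the cached rows gives
    # exactly the rows the original query would have returned.
    return [row for row in cached_rows if row["price"] > requested_bound]
```

The residual filter is what keeps the rewrite equivalent: the cached result is a superset, and the filter trims it back to the requested rows.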
