Optimizers: The Low-Key MVP

119 points by tosh 8 days ago | 13 comments

adrian17 8 days ago |
> The SQL above results in a plan similar to the DuckDB optimized plan, but it is wordier and more error-prone to write, which can potentially lead to bugs.
FWIW, aside from manual filter pushdown, I consider the JOIN variant the canonical / "default" way to merge multiple tables; it keeps all the join-related logic in one place, while mixing both joining conditions and filtering conditions in WHERE always felt more error-prone to me.
Sesse__ 8 days ago |
It is also the only way to represent join conditions for outer joins.
camgunz 7 days ago |
Same here; I've always intuited that this would limit the generated tuples. I'm too lazy to do it now, but I wonder if other DB engines also perform this optimization that effectively makes filtering in JOIN conditions equivalent to filtering in WHERE clauses. I'd also be interested in some example queries that were hand-optimized to the point of obvious obscurity--my guess is it's harder to do this in SQL than in something like C.
Sesse__ 7 days ago |
Any non-toy SQL engine is going to do condition pullup and pushdown. (Well, I know at least one that only does explicit pullup, but does pushdown more implicitly because it doesn't use a tree. The effect is largely the same, modulo some shenanigans around outer joins.)
Pushdown is actually a more subtle problem than people give it credit for; it inhabits this weird space where things are too trivial to be covered in papers but is too hard to cover properly in textbooks (e.g., no coverage of multiple equalities or outer joins).
ikesau 8 days ago |
> This means your optimizations need to be applied by hand, which is sustainable if your data starts changing.
Seems like a missing "un" here
Compelling article! I've already found DuckDB to be the most ergonomic tool for quick and dirty wrangling, it's good to know it can handle massive jobs too.
Nihilartikel 8 days ago |
I regularly use duckdb on datasets of 1B+ rows, with nasty strong columns that may be over 10MB per value in the outliers. Mostly it just works, and fast too! When it doesn't, I'll usually just dump to parquet and hit it with sparksql, but that is the exception rather than the rule.
unwind 8 days ago |
Meta: I think the title would be better here if it came out and said query optimizers.
That gives a less subtle clue that it's about databases than looking at the domain.
pstoll 7 days ago |
Yep - i thought / hoped / assumed it was about Optimizers aka Sat solvers, etc
A la https://github.com/google/or-tools
lmeyerov 8 days ago |
Always a fan of query plan articles!
Note: the dig at dataframe libs is worth some care in case you think that means duckdb can optimize and they cannot
Dask, Polars, and others pick a lazy default in order to make distribution and other optimizations easier. When staying in their pure fragments ('vectorized'), the same scheduler rewriting opportunity is here.
This is a subtle but important distinction when looking at these frameworks. We are making our new graph query language 'gfql' to be dataframe-native so it can run naturally & natively as a step of pipelines people are already doing, but also to ensure we automatically run as optimized CPU/GPU columnar opts. At the same time, because of the intent to allow room for query plan optimization, we are staying declarative / lazy, even if the generated & interpreted code uses an eager DF runtime . I'm optimistic about output target lazy DF systems doing query planner work for us long-term here, but for the eager framework targets, the query planning has to be on our side.

kwillets 7 days ago |

One possible hand optimization is to push the aggregation below the joins, which makes the latter a few hundred rows.

kwillets 7 days ago |

To follow up, it's about 3x faster:

  SELECT 
      pickup.zone AS pickup_zone,
      dropoff.zone AS dropoff_zone,
      cnt AS num_trips
  FROM
      (select pickup_location_id, dropoff_location_id, count(*) as cnt from taxi_data_2019 group  by 1,2) data
  INNER JOIN
      (SELECT * FROM zone_lookups WHERE Borough = 'Manhattan') pickup
      ON pickup.LocationID = data.pickup_location_id
  INNER JOIN
      (SELECT * FROM zone_lookups WHERE Borough = 'Manhattan') dropoff
      ON dropoff.LocationID = data.dropoff_location_id
  
  ORDER BY num_trips desc
  LIMIT 5;
  ┌───────────────────────┬───────────────────────┬───────────┐
  │      pickup_zone      │     dropoff_zone      │ num_trips │
  │        varchar        │        varchar        │   int64   │
  ├───────────────────────┼───────────────────────┼───────────┤
  │ Upper East Side South │ Upper East Side North │    536621 │
  │ Upper East Side North │ Upper East Side South │    455954 │
  │ Upper East Side North │ Upper East Side North │    451805 │
  │ Upper East Side South │ Upper East Side South │    435054 │
  │ Upper West Side South │ Upper West Side North │    236737 │
  └───────────────────────┴───────────────────────┴───────────┘
  Run Time (s): real 0.304 user 1.791931 sys 0.132745

(unedited query is similar to theirs, about .9s)

And, doing the same to the optimized query drops it to about .265s.

Sesse__ 7 days ago |
AFAIK there are rather few query optimizers that can do this; it's not an easy problem, in part because your search space explodes. There's a 2014 paper that addresses this (https://madoc.bib.uni-mannheim.de/37228/1/main.pdf) and some other papers that combine grouping and join into a “groupjoin” operation; the main problem is that it only really seems to scale to planning small queries (less than eight joins or so, IIRC).
kwillets 7 days ago |
I've never run across it, but I would describe the job of an analytics engineer as doing this over and over for analysts -- it's probably semantically clearer to do the join first and then aggregate, so I end up pushing it down for them.
Thanks for the reference; this question has been on my mind for some time.